<h1>Using Hugging Face Transformers Course: Source-Language M2M100 Tokenization</h1>
<p>With the development of deep learning, Natural Language Processing (NLP) has undergone significant changes. In particular, Hugging Face's transformers library has established itself as a powerful tool for NLP tasks. In this course, we will walk through multilingual translation and tokenization using the M2M100 model.</p>
<h2>Overview of the M2M100 Model</h2>
<p>M2M100 (Many-to-Many multilingual translation) is a model that supports direct translation between more than 100 languages. Earlier translation systems typically used an indirect approach: text was first translated from the source language into an intermediate pivot language (usually English) and then into the target language. M2M100 removes this limitation by translating directly between language pairs, which significantly improves translation between non-English language pairs.</p>
<h2>What is Tokenization?</h2>
<p>Tokenization is the process of splitting input text into smaller units called tokens, each of which is then mapped to a unique integer index in the model's vocabulary. Tokenization is an essential step in NLP: text data must pass through it before it can be fed into the model.</p>
<h2>Environment Setup</h2>
<p>Before proceeding with the course, you need to install the required libraries, specifically <code>transformers</code> and <code>torch</code>.
You can install them with the following command:</p>
<pre><code>pip install transformers torch</code></pre>
<h2>Loading the Tokenizer</h2>
<p>To load the tokenizer for the M2M100 model, we use the <code>M2M100Tokenizer</code> class provided by the <code>transformers</code> library.</p>
<pre><code>
from transformers import M2M100Tokenizer

# Load the tokenizer for the M2M100 model
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
</code></pre>
<h2>Tokenization Process</h2>
<p>Now we are ready to tokenize text. Below is an example of tokenizing the sentence "Hello, everyone!".</p>
<pre><code>
# Input text
text = "Hello, everyone!"

# Tokenize the text and return PyTorch tensors
encoded_input = tokenizer(text, return_tensors="pt")

# Print the tokens and their indices
print("Tokens:", tokenizer.convert_ids_to_tokens(encoded_input["input_ids"][0]))
print("Token indices:", encoded_input["input_ids"])
</code></pre>
<h3>Tokenization Result</h3>
<p>The output of the code above shows how the input text has been split into tokens and mapped to indices. You can recover the string form of each token with the <code>convert_ids_to_tokens</code> method.</p>
<h2>Multilingual Translation</h2>
<p>Using the tokenized data, we can now perform multilingual translation.</p>
<p>The example below translates Korean to English using the M2M100 model. Note that the tokenizer's <code>src_lang</code> must be set to the source language before encoding, and the target language is selected by passing <code>forced_bos_token_id</code> to <code>generate</code>.</p>
<pre><code>
from transformers import M2M100ForConditionalGeneration

# Load the M2M100 model
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

# Korean text ("Hello, everyone!")
text = "안녕하세요, 여러분!"
tokenizer.src_lang = "ko"
encoded_input = tokenizer(text, return_tensors="pt")

# Translate, forcing English as the target language
translated_tokens = model.generate(**encoded_input, forced_bos_token_id=tokenizer.get_lang_id("en"))

# Decode the translation
translated_text = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)
print("Translation result:", translated_text[0])
</code></pre>
<h3>Interpretation of the Translation Result</h3>
<p>Running the code above, you can check whether the Korean sentence has been accurately translated into English. The <code>generate</code> method produces the translated token sequence from the encoded input, and <code>batch_decode</code> turns it back into text.</p>
<h2>Conclusion</h2>
<p>In this course, we explored multilingual tokenization and translation using Hugging Face's M2M100 model. The field of natural language processing will continue to progress, and tools like these enable better communication across languages.</p>
<p>I hope that interest in and research on NLP and deep learning will continue in the future.</p>
<h2>References</h2>
<ul>
<li><a href="https://huggingface.co/docs/transformers/index" target="_blank" rel="noopener">Hugging Face Transformers Documentation</a></li>
<li><a href="https://github.com/huggingface/transformers" target="_blank" rel="noopener">Transformers GitHub Repository</a></li>
<li><a href="https://arxiv.org/abs/2010.11125" target="_blank" rel="noopener">Beyond English-Centric Multilingual Machine Translation (the M2M-100 paper)</a></li>
</ul>