{"id":36149,"date":"2024-11-01T09:46:08","date_gmt":"2024-11-01T09:46:08","guid":{"rendered":"http:\/\/atmokpo.com\/w\/?p=36149"},"modified":"2024-11-01T09:46:08","modified_gmt":"2024-11-01T09:46:08","slug":"using-hugging-face-transformers-preparing-korean-text-for-m2m100-translation-source","status":"publish","type":"post","link":"https:\/\/atmokpo.com\/w\/36149\/","title":{"rendered":"Using Hugging Face Transformers, Preparing Korean Text for M2M100 Translation Source"},"content":{"rendered":"<p><body><\/p>\n<h2>1. Introduction<\/h2>\n<p>Recent advancements in deep learning have brought significant changes to the field of Natural Language Processing (NLP). In particular, Hugging Face&#8217;s Transformers library provides various language models that greatly assist NLP researchers and developers. In this course, we will explain in detail how to prepare data for Korean text translation using the M2M100 model.<\/p>\n<h2>2. Overview of Hugging Face Transformers<\/h2>\n<p>Hugging Face Transformers is a library that makes it easy to use state-of-the-art language models. It offers numerous pre-trained models, including BERT, GPT-2, T5, and M2M100, so users can run common NLP tasks without training a model from scratch. M2M100 in particular is designed for multilingual translation and handles many language pairs with a single model.<\/p>\n<h2>3. Introduction to the M2M100 Model<\/h2>\n<p>M2M100 is a &#8220;many-to-many&#8221; multilingual translation model that supports translation between 100 languages. It is trained on parallel data covering many language pairs, so it can translate directly between a source and a target language without using English as an intermediate step. Here are the main features of M2M100:<\/p>\n<ul>\n<li>Supports 100 languages<\/li>\n<li>Can translate directly between any supported source and target language<\/li>\n<li>Released as pre-trained checkpoints, such as <code>facebook\/m2m100_418M<\/code>, that can be loaded through Transformers<\/li>\n<\/ul>\n<h2>4. 
Environment Setup<\/h2>\n<p>This course will utilize Python and the Hugging Face Transformers library. You can set up your environment using the following procedures.<\/p>\n<h3>4.1. Installing Python<\/h3>\n<p>You need to install the latest version of Python. It can be downloaded and installed from the official website.<\/p>\n<h3>4.2. Installing Required Libraries<\/h3>\n<p>Install Hugging Face&#8217;s Transformers library and other necessary libraries. Use the following command to do so:<\/p>\n<pre><code>pip install transformers torch<\/code><\/pre>\n<h2>5. Preparing Korean Text<\/h2>\n<p>To perform translation tasks using the M2M100 model, an appropriate dataset is required. Here, we will describe how to prepare Korean text.<\/p>\n<h3>5.1. Data Collection<\/h3>\n<p>You can obtain Korean text data from various sources. Text can be crawled from news articles, blogs, websites, etc. Text preprocessing is also crucial during this process.<\/p>\n<h3>5.2. Data Preprocessing<\/h3>\n<p>The collected data must go through deduplication, removal of unnecessary symbols, and refinement processes. The basic preprocessing steps are as follows:<\/p>\n<pre><code>import re\n\ndef preprocess_text(text):\n    # Convert to lowercase\n    text = text.lower()\n    # Remove unnecessary symbols\n    text = re.sub(r'[^\uac00-\ud7a3A-Za-z0-9\\s]', '', text)\n    return text\n\nsample_text = \"Hello! Welcome to the deep learning course.\"\ncleaned_text = preprocess_text(sample_text)\nprint(cleaned_text)<\/code><\/pre>\n<h3>5.3. Example of Korean Data<\/h3>\n<p>Typically, you prepare several sentences to translate to create a dataset. For example:<\/p>\n<pre><code>korean_sentences = [\n    \"I love deep learning.\",\n    \"The advancement of artificial intelligence is amazing.\",\n    \"Hugging Face is a really useful library.\"\n]<\/code><\/pre>\n<h2>6. Translating with M2M100<\/h2>\n<p>Once the Korean dataset is prepared, it&#8217;s time to perform translation using the M2M100 model. 
We will translate Korean sentences into English using the code below.<\/p>\n<pre><code>from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer\n\n# Load model and tokenizer\nmodel_name = \"facebook\/m2m100_418M\"\ntokenizer = M2M100Tokenizer.from_pretrained(model_name)\nmodel = M2M100ForConditionalGeneration.from_pretrained(model_name)\n\ndef translate_text(text, source_lang=\"ko\", target_lang=\"en\"):\n    # Tell the tokenizer the source language, then tokenize the text\n    tokenizer.src_lang = source_lang\n    encoded_text = tokenizer(text, return_tensors=\"pt\")\n    \n    # Generate the translation, forcing the target-language token as the first output token\n    generated_tokens = model.generate(**encoded_text, forced_bos_token_id=tokenizer.get_lang_id(target_lang))\n    \n    # Return the decoded translation\n    return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]\n\n# Perform translation\nfor sentence in korean_sentences:\n    translated_sentence = translate_text(sentence)\n    print(f\"Original: {sentence}\\nTranslation: {translated_sentence}\\n\")<\/code><\/pre>\n<h2>7. Conclusion<\/h2>\n<p>In this course, we explained how to prepare Korean text data and translate it with the M2M100 model. With Hugging Face&#8217;s Transformers library, loading the model and generating a translation takes only a few lines of code. We hope this course enhances your understanding of natural language processing and lays the foundation for applying it to real projects.<\/p>\n<h2>8. References<\/h2>\n<ul>\n<li><a href=\"https:\/\/huggingface.co\/transformers\/\">Hugging Face Transformers Documentation<\/a><\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/2010.11125\">Beyond English-Centric Multilingual Machine Translation (the M2M-100 paper)<\/a><\/li>\n<\/ul>\n<p><\/body><\/p>\n","protected":false},"excerpt":{"rendered":"<p>1. Introduction Recent advancements in deep learning have brought significant changes to the field of Natural Language Processing (NLP). 
In particular, Hugging Face&#8217;s Transformers library provides various language models that greatly assist NLP researchers and developers. In this course, we will explain in detail how to prepare data for Korean text translation using the M2M100 &hellip; <a href=\"https:\/\/atmokpo.com\/w\/36149\/\" class=\"more-link\">\ub354 \ubcf4\uae30<span class=\"screen-reader-text\"> &#8220;Using Hugging Face Transformers, Preparing Korean Text for M2M100 Translation Source&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[108],"tags":[],"class_list":["post-36149","post","type-post","status-publish","format-standard","hentry","category---en"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Using Hugging Face Transformers, Preparing Korean Text for M2M100 Translation Source - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/atmokpo.com\/w\/36149\/\" \/>\n<meta property=\"og:locale\" content=\"ko_KR\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Using Hugging Face Transformers, Preparing Korean Text for M2M100 Translation Source - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\" \/>\n<meta property=\"og:description\" content=\"1. Introduction Recent advancements in deep learning have brought significant changes to the field of Natural Language Processing (NLP). In particular, Hugging Face&#8217;s Transformers library provides various language models that greatly assist NLP researchers and developers. 
In this course, we will explain in detail how to prepare data for Korean text translation using the M2M100 &hellip; \ub354 \ubcf4\uae30 &quot;Using Hugging Face Transformers, Preparing Korean Text for M2M100 Translation Source&quot;\" \/>\n<meta property=\"og:url\" content=\"https:\/\/atmokpo.com\/w\/36149\/\" \/>\n<meta property=\"og:site_name\" content=\"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\" \/>\n<meta property=\"article:published_time\" content=\"2024-11-01T09:46:08+00:00\" \/>\n<meta name=\"author\" content=\"root\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@bebubo4\" \/>\n<meta name=\"twitter:site\" content=\"@bebubo4\" \/>\n<meta name=\"twitter:label1\" content=\"\uae00\uc4f4\uc774\" \/>\n\t<meta name=\"twitter:data1\" content=\"root\" \/>\n\t<meta name=\"twitter:label2\" content=\"\uc608\uc0c1 \ub418\ub294 \ud310\ub3c5 \uc2dc\uac04\" \/>\n\t<meta name=\"twitter:data2\" content=\"3\ubd84\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/atmokpo.com\/w\/36149\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/atmokpo.com\/w\/36149\/\"},\"author\":{\"name\":\"root\",\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/person\/91b6b3b138fbba0efb4ae64b1abd81d7\"},\"headline\":\"Using Hugging Face Transformers, Preparing Korean Text for M2M100 Translation Source\",\"datePublished\":\"2024-11-01T09:46:08+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/atmokpo.com\/w\/36149\/\"},\"wordCount\":458,\"publisher\":{\"@id\":\"https:\/\/atmokpo.com\/w\/#organization\"},\"articleSection\":[\"Using Hugging Face\"],\"inLanguage\":\"ko-KR\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/atmokpo.com\/w\/36149\/\",\"url\":\"https:\/\/atmokpo.com\/w\/36149\/\",\"name\":\"Using Hugging Face Transformers, Preparing Korean Text for M2M100 Translation Source - 
\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\",\"isPartOf\":{\"@id\":\"https:\/\/atmokpo.com\/w\/#website\"},\"datePublished\":\"2024-11-01T09:46:08+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/atmokpo.com\/w\/36149\/#breadcrumb\"},\"inLanguage\":\"ko-KR\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/atmokpo.com\/w\/36149\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/atmokpo.com\/w\/36149\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"\ud648\",\"item\":\"https:\/\/atmokpo.com\/w\/en\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Using Hugging Face Transformers, Preparing Korean Text for M2M100 Translation Source\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/atmokpo.com\/w\/#website\",\"url\":\"https:\/\/atmokpo.com\/w\/\",\"name\":\"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/atmokpo.com\/w\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/atmokpo.com\/w\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"ko-KR\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/atmokpo.com\/w\/#organization\",\"name\":\"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\",\"url\":\"https:\/\/atmokpo.com\/w\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/atmokpo.com\/w\/wp-content\/uploads\/2024\/11\/logo.png\",\"contentUrl\":\"https:\/\/atmokpo.com\/w\/wp-content\/uploads\/2024\/11\/logo.png\",\"width\":400,\"height\":400,\"caption\":\"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\"},\"image\":{\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/bebubo4\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/person\/91b6b3b138fbb
a0efb4ae64b1abd81d7\",\"name\":\"root\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/708197b41fc6435a7ce22d951b25d4a47e9e904270cb1f04682d4f025066f80c?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/708197b41fc6435a7ce22d951b25d4a47e9e904270cb1f04682d4f025066f80c?s=96&d=mm&r=g\",\"caption\":\"root\"},\"sameAs\":[\"http:\/\/atmokpo.com\/w\"],\"url\":\"https:\/\/atmokpo.com\/w\/author\/root\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Using Hugging Face Transformers, Preparing Korean Text for M2M100 Translation Source - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/atmokpo.com\/w\/36149\/","og_locale":"ko_KR","og_type":"article","og_title":"Using Hugging Face Transformers, Preparing Korean Text for M2M100 Translation Source - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","og_description":"1. Introduction Recent advancements in deep learning have brought significant changes to the field of Natural Language Processing (NLP). In particular, Hugging Face&#8217;s Transformers library provides various language models that greatly assist NLP researchers and developers. 
In this course, we will explain in detail how to prepare data for Korean text translation using the M2M100 &hellip; \ub354 \ubcf4\uae30 \"Using Hugging Face Transformers, Preparing Korean Text for M2M100 Translation Source\"","og_url":"https:\/\/atmokpo.com\/w\/36149\/","og_site_name":"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","article_published_time":"2024-11-01T09:46:08+00:00","author":"root","twitter_card":"summary_large_image","twitter_creator":"@bebubo4","twitter_site":"@bebubo4","twitter_misc":{"\uae00\uc4f4\uc774":"root","\uc608\uc0c1 \ub418\ub294 \ud310\ub3c5 \uc2dc\uac04":"3\ubd84"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/atmokpo.com\/w\/36149\/#article","isPartOf":{"@id":"https:\/\/atmokpo.com\/w\/36149\/"},"author":{"name":"root","@id":"https:\/\/atmokpo.com\/w\/#\/schema\/person\/91b6b3b138fbba0efb4ae64b1abd81d7"},"headline":"Using Hugging Face Transformers, Preparing Korean Text for M2M100 Translation Source","datePublished":"2024-11-01T09:46:08+00:00","mainEntityOfPage":{"@id":"https:\/\/atmokpo.com\/w\/36149\/"},"wordCount":458,"publisher":{"@id":"https:\/\/atmokpo.com\/w\/#organization"},"articleSection":["Using Hugging Face"],"inLanguage":"ko-KR"},{"@type":"WebPage","@id":"https:\/\/atmokpo.com\/w\/36149\/","url":"https:\/\/atmokpo.com\/w\/36149\/","name":"Using Hugging Face Transformers, Preparing Korean Text for M2M100 Translation Source - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","isPartOf":{"@id":"https:\/\/atmokpo.com\/w\/#website"},"datePublished":"2024-11-01T09:46:08+00:00","breadcrumb":{"@id":"https:\/\/atmokpo.com\/w\/36149\/#breadcrumb"},"inLanguage":"ko-KR","potentialAction":[{"@type":"ReadAction","target":["https:\/\/atmokpo.com\/w\/36149\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/atmokpo.com\/w\/36149\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"\ud648","item":"https:\/\/atmokpo.com\/w\/en\/"},{"@type":"ListItem","position":2,"name":"Using Hugging Face 
Transformers, Preparing Korean Text for M2M100 Translation Source"}]},{"@type":"WebSite","@id":"https:\/\/atmokpo.com\/w\/#website","url":"https:\/\/atmokpo.com\/w\/","name":"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","description":"","publisher":{"@id":"https:\/\/atmokpo.com\/w\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/atmokpo.com\/w\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"ko-KR"},{"@type":"Organization","@id":"https:\/\/atmokpo.com\/w\/#organization","name":"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","url":"https:\/\/atmokpo.com\/w\/","logo":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/atmokpo.com\/w\/#\/schema\/logo\/image\/","url":"https:\/\/atmokpo.com\/w\/wp-content\/uploads\/2024\/11\/logo.png","contentUrl":"https:\/\/atmokpo.com\/w\/wp-content\/uploads\/2024\/11\/logo.png","width":400,"height":400,"caption":"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8"},"image":{"@id":"https:\/\/atmokpo.com\/w\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/bebubo4"]},{"@type":"Person","@id":"https:\/\/atmokpo.com\/w\/#\/schema\/person\/91b6b3b138fbba0efb4ae64b1abd81d7","name":"root","image":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/atmokpo.com\/w\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/708197b41fc6435a7ce22d951b25d4a47e9e904270cb1f04682d4f025066f80c?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/708197b41fc6435a7ce22d951b25d4a47e9e904270cb1f04682d4f025066f80c?s=96&d=mm&r=g","caption":"root"},"sameAs":["http:\/\/atmokpo.com\/w"],"url":"https:\/\/atmokpo.com\/w\/author\/root\/"}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack-related-posts":[],"_links":{"self":[{"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts\/36149","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:
\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/comments?post=36149"}],"version-history":[{"count":1,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts\/36149\/revisions"}],"predecessor-version":[{"id":36150,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts\/36149\/revisions\/36150"}],"wp:attachment":[{"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/media?parent=36149"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/categories?post=36149"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/tags?post=36149"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}