{"id":36231,"date":"2024-11-01T09:46:51","date_gmt":"2024-11-01T09:46:51","guid":{"rendered":"http:\/\/atmokpo.com\/w\/?p=36231"},"modified":"2024-11-01T09:46:51","modified_gmt":"2024-11-01T09:46:51","slug":"using-hugging-face-transformers-tokenizing-and-encoding","status":"publish","type":"post","link":"https:\/\/atmokpo.com\/w\/36231\/","title":{"rendered":"Using Hugging Face Transformers, Tokenizing and Encoding"},"content":{"rendered":"<p><body><\/p>\n<p>Deep learning and natural language processing have advanced rapidly in recent years, and the Hugging Face Transformers library has become one of the most popular tools in the field. In this course, we explain the concepts of tokenizing and encoding in detail using the Hugging Face Transformers library and show how to implement them in Python code.<\/p>\n<h2>1. What is a Transformer Model?<\/h2>\n<p>A transformer model is a deep learning model built on the attention mechanism that performs strongly on language processing tasks. It was first introduced in the 2017 paper &#8220;Attention is All You Need.&#8221; Through self-attention, the model relates every token in the input sequence to every other token, which lets it capture contextual information effectively.<\/p>\n<h2>2. Hugging Face and Its Basics<\/h2>\n<p>Hugging Face is a platform that offers many pre-trained transformer models for free. This allows researchers and developers to perform a variety of NLP tasks (e.g., question answering, text generation, sentiment analysis) without training a model from scratch. Hugging Face&#8217;s <code>transformers<\/code> library exposes these models through a consistent Python API.<\/p>\n<h2>3. Tokenizing<\/h2>\n<p>Tokenizing is the process of breaking down text into individual units (tokens). For example, a sentence can be split into words, or words can be broken down into subwords. 
This process is essential for transforming data into a form that models can understand.<\/p>\n<h3>3.1 Why is Tokenizing Important?<\/h3>\n<p>Transformer models cannot operate on raw text; the input must first be converted into a sequence of tokens drawn from a fixed vocabulary. A tokenizer that matches the one used during pre-training represents the input the way the model expects, which directly affects the model&#8217;s performance.<\/p>\n<h3>3.2 Using Hugging Face&#8217;s Tokenizer<\/h3>\n<p>The Hugging Face Transformers library includes several types of tokenizers. Each is paired with a specific model, so the tokenizers used for BERT (WordPiece), GPT-2 (byte-level BPE), and T5 (SentencePiece) differ.<\/p>\n<h4>Example: Tokenizing with BERT Model<\/h4>\n<pre><code>from transformers import BertTokenizer\n\n# Initialize BERT Tokenizer\ntokenizer = BertTokenizer.from_pretrained('bert-base-uncased')\n\n# Text Input\ntext = \"Welcome to the NLP course using Hugging Face Transformers!\"\n\n# Tokenizing Text\ntokens = tokenizer.tokenize(text)\nprint(tokens)\n<\/code><\/pre>\n<p>When you run the code above, you can see the input text split into individual tokens. However, the tokens are still strings and must be converted into a numerical format before the model can use them.<\/p>\n<h2>4. Encoding<\/h2>\n<p>After tokenizing, we need to convert the tokens into numbers that the model can take as input. This process is called encoding: each token is mapped to its index (ID) in the tokenizer&#8217;s vocabulary. For BERT, <code>tokenizer.encode<\/code> also adds the special tokens <code>[CLS]<\/code> and <code>[SEP]<\/code> by default.<\/p>\n<h4>Example: Encoding with BERT Model<\/h4>\n<pre><code># Text Encoding\nencoded_input = tokenizer.encode(text, return_tensors='pt')\nprint(encoded_input)\n<\/code><\/pre>\n<p>Here, <code>return_tensors='pt'<\/code> tells the tokenizer to return a PyTorch tensor, which can be passed directly to a PyTorch model.<\/p>\n<h2>5. Integrated Example: Data Preprocessing for Text Classification<\/h2>\n<p>Now, let&#8217;s integrate the tokenizing and encoding processes we&#8217;ve learned so far into a single example. 
Here, we will look at how to preprocess data for a simple text classification model.<\/p>\n<h3>5.1 Data Preparation<\/h3>\n<p>First, we need to prepare data for simple text classification. We will create a list of texts and a matching list of labels.<\/p>\n<pre><code>texts = [\n    \"I really like Hugging Face.\",\n    \"Deep learning is hard but interesting.\",\n    \"AI technology is changing our lives.\",\n    \"The transformer model is really powerful.\",\n    \"This text is about cats.\"\n]\nlabels = [1, 1, 1, 1, 0]  # 1: Positive, 0: Negative\n<\/code><\/pre>\n<h3>5.2 Data Tokenizing and Encoding<\/h3>\n<p>Next, we will write the code to tokenize and encode the data. We will use a loop to encode each text; <code>tokenizer.encode<\/code> performs tokenization and ID conversion in a single call.<\/p>\n<pre><code># Initialize an empty list to hold all processed data\nencoded_texts = []\n\n# Perform tokenization and encoding for each text\nfor text in texts:\n    encoded_text = tokenizer.encode(text, return_tensors='pt')\n    encoded_texts.append(encoded_text)\n\nprint(encoded_texts)\n<\/code><\/pre>\n<h3>5.3 Input into the Model<\/h3>\n<p>Now we can input the encoded texts into the model to perform prediction tasks. For example, a text classification model can be used as follows.<\/p>\n<pre><code>from transformers import BertForSequenceClassification\nimport torch\n\n# Load BERT Model\nmodel = BertForSequenceClassification.from_pretrained('bert-base-uncased')\n\n# Switch model to evaluation mode\nmodel.eval()\n\n# Perform prediction for each encoded text\nfor encoded_text in encoded_texts:\n    with torch.no_grad():\n        outputs = model(encoded_text)\n        predictions = torch.argmax(outputs.logits, dim=-1)\n        print(f\"Predicted Label: {predictions.item()}\")\n<\/code><\/pre>\n<p>In the code above, we run the model on each encoded text and take the class with the highest logit as the predicted label. Note that <code>bert-base-uncased<\/code> does not include a trained classification head, so the head is randomly initialized and the predicted labels are not meaningful until the model is fine-tuned on labeled data. Once fine-tuned, the same steps can be used to classify new texts.<\/p>\n<h2>6. 
Conclusion<\/h2>\n<p>In this course, we learned why tokenizing and encoding matter and how to perform them with Hugging Face&#8217;s Transformers library. We also walked through preprocessing text and passing it to a BERT model for a simple text classification example. Hugging Face provides a powerful API along with various pre-trained models, making it easier to perform NLP tasks. I hope you continue learning about deep learning and natural language processing!<\/p>\n<h2>7. References<\/h2>\n<ul>\n<li><a href=\"https:\/\/huggingface.co\/docs\/transformers\/index\">Hugging Face Transformers Documentation<\/a><\/li>\n<li><a href=\"https:\/\/arxiv.org\/abs\/1706.03762\">Attention is All You Need (Paper)<\/a><\/li>\n<\/ul>\n<p><\/body><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Deep learning and natural language processing have rapidly advanced in recent years, and during this process, the Hugging Face Transformer library has become one of the popular tools. In this course, we will deeply explain the concepts of tokenizing and encoding using the Hugging Face Transformer library, and learn how to implement these concepts in &hellip; <a href=\"https:\/\/atmokpo.com\/w\/36231\/\" class=\"more-link\">\ub354 \ubcf4\uae30<span class=\"screen-reader-text\"> &#8220;Using Hugging Face Transformers, Tokenizing and Encoding&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[108],"tags":[],"class_list":["post-36231","post","type-post","status-publish","format-standard","hentry","category---en"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Using Hugging Face Transformers, Tokenizing and Encoding - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8<\/title>\n<meta name=\"robots\" content=\"index, follow, 
max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/atmokpo.com\/w\/36231\/\" \/>\n<meta property=\"og:locale\" content=\"ko_KR\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Using Hugging Face Transformers, Tokenizing and Encoding - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\" \/>\n<meta property=\"og:description\" content=\"Deep learning and natural language processing have rapidly advanced in recent years, and during this process, the Hugging Face Transformer library has become one of the popular tools. In this course, we will deeply explain the concepts of tokenizing and encoding using the Hugging Face Transformer library, and learn how to implement these concepts in &hellip; \ub354 \ubcf4\uae30 &quot;Using Hugging Face Transformers, Tokenizing and Encoding&quot;\" \/>\n<meta property=\"og:url\" content=\"https:\/\/atmokpo.com\/w\/36231\/\" \/>\n<meta property=\"og:site_name\" content=\"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\" \/>\n<meta property=\"article:published_time\" content=\"2024-11-01T09:46:51+00:00\" \/>\n<meta name=\"author\" content=\"root\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@bebubo4\" \/>\n<meta name=\"twitter:site\" content=\"@bebubo4\" \/>\n<meta name=\"twitter:label1\" content=\"\uae00\uc4f4\uc774\" \/>\n\t<meta name=\"twitter:data1\" content=\"root\" \/>\n\t<meta name=\"twitter:label2\" content=\"\uc608\uc0c1 \ub418\ub294 \ud310\ub3c5 \uc2dc\uac04\" \/>\n\t<meta name=\"twitter:data2\" content=\"4\ubd84\" \/>\n<script type=\"application\/ld+json\" 
class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/atmokpo.com\/w\/36231\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/atmokpo.com\/w\/36231\/\"},\"author\":{\"name\":\"root\",\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/person\/91b6b3b138fbba0efb4ae64b1abd81d7\"},\"headline\":\"Using Hugging Face Transformers, Tokenizing and Encoding\",\"datePublished\":\"2024-11-01T09:46:51+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/atmokpo.com\/w\/36231\/\"},\"wordCount\":594,\"publisher\":{\"@id\":\"https:\/\/atmokpo.com\/w\/#organization\"},\"articleSection\":[\"Using Hugging Face\"],\"inLanguage\":\"ko-KR\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/atmokpo.com\/w\/36231\/\",\"url\":\"https:\/\/atmokpo.com\/w\/36231\/\",\"name\":\"Using Hugging Face Transformers, Tokenizing and Encoding - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\",\"isPartOf\":{\"@id\":\"https:\/\/atmokpo.com\/w\/#website\"},\"datePublished\":\"2024-11-01T09:46:51+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/atmokpo.com\/w\/36231\/#breadcrumb\"},\"inLanguage\":\"ko-KR\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/atmokpo.com\/w\/36231\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/atmokpo.com\/w\/36231\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"\ud648\",\"item\":\"https:\/\/atmokpo.com\/w\/en\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Using Hugging Face Transformers, Tokenizing and 
Encoding\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/atmokpo.com\/w\/#website\",\"url\":\"https:\/\/atmokpo.com\/w\/\",\"name\":\"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/atmokpo.com\/w\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/atmokpo.com\/w\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"ko-KR\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/atmokpo.com\/w\/#organization\",\"name\":\"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\",\"url\":\"https:\/\/atmokpo.com\/w\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/atmokpo.com\/w\/wp-content\/uploads\/2024\/11\/logo.png\",\"contentUrl\":\"https:\/\/atmokpo.com\/w\/wp-content\/uploads\/2024\/11\/logo.png\",\"width\":400,\"height\":400,\"caption\":\"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\"},\"image\":{\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/bebubo4\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/person\/91b6b3b138fbba0efb4ae64b1abd81d7\",\"name\":\"root\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/708197b41fc6435a7ce22d951b25d4a47e9e904270cb1f04682d4f025066f80c?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/708197b41fc6435a7ce22d951b25d4a47e9e904270cb1f04682d4f025066f80c?s=96&d=mm&r=g\",\"caption\":\"root\"},\"sameAs\":[\"http:\/\/atmokpo.com\/w\"],\"url\":\"https:\/\/atmokpo.com\/w\/author\/root\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Using Hugging Face Transformers, Tokenizing and Encoding - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/atmokpo.com\/w\/36231\/","og_locale":"ko_KR","og_type":"article","og_title":"Using Hugging Face Transformers, Tokenizing and Encoding - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","og_description":"Deep learning and natural language processing have rapidly advanced in recent years, and during this process, the Hugging Face Transformer library has become one of the popular tools. In this course, we will deeply explain the concepts of tokenizing and encoding using the Hugging Face Transformer library, and learn how to implement these concepts in &hellip; \ub354 \ubcf4\uae30 \"Using Hugging Face Transformers, Tokenizing and Encoding\"","og_url":"https:\/\/atmokpo.com\/w\/36231\/","og_site_name":"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","article_published_time":"2024-11-01T09:46:51+00:00","author":"root","twitter_card":"summary_large_image","twitter_creator":"@bebubo4","twitter_site":"@bebubo4","twitter_misc":{"\uae00\uc4f4\uc774":"root","\uc608\uc0c1 \ub418\ub294 \ud310\ub3c5 \uc2dc\uac04":"4\ubd84"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/atmokpo.com\/w\/36231\/#article","isPartOf":{"@id":"https:\/\/atmokpo.com\/w\/36231\/"},"author":{"name":"root","@id":"https:\/\/atmokpo.com\/w\/#\/schema\/person\/91b6b3b138fbba0efb4ae64b1abd81d7"},"headline":"Using Hugging Face Transformers, Tokenizing and Encoding","datePublished":"2024-11-01T09:46:51+00:00","mainEntityOfPage":{"@id":"https:\/\/atmokpo.com\/w\/36231\/"},"wordCount":594,"publisher":{"@id":"https:\/\/atmokpo.com\/w\/#organization"},"articleSection":["Using Hugging 
Face"],"inLanguage":"ko-KR"},{"@type":"WebPage","@id":"https:\/\/atmokpo.com\/w\/36231\/","url":"https:\/\/atmokpo.com\/w\/36231\/","name":"Using Hugging Face Transformers, Tokenizing and Encoding - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","isPartOf":{"@id":"https:\/\/atmokpo.com\/w\/#website"},"datePublished":"2024-11-01T09:46:51+00:00","breadcrumb":{"@id":"https:\/\/atmokpo.com\/w\/36231\/#breadcrumb"},"inLanguage":"ko-KR","potentialAction":[{"@type":"ReadAction","target":["https:\/\/atmokpo.com\/w\/36231\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/atmokpo.com\/w\/36231\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"\ud648","item":"https:\/\/atmokpo.com\/w\/en\/"},{"@type":"ListItem","position":2,"name":"Using Hugging Face Transformers, Tokenizing and Encoding"}]},{"@type":"WebSite","@id":"https:\/\/atmokpo.com\/w\/#website","url":"https:\/\/atmokpo.com\/w\/","name":"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","description":"","publisher":{"@id":"https:\/\/atmokpo.com\/w\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/atmokpo.com\/w\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"ko-KR"},{"@type":"Organization","@id":"https:\/\/atmokpo.com\/w\/#organization","name":"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","url":"https:\/\/atmokpo.com\/w\/","logo":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/atmokpo.com\/w\/#\/schema\/logo\/image\/","url":"https:\/\/atmokpo.com\/w\/wp-content\/uploads\/2024\/11\/logo.png","contentUrl":"https:\/\/atmokpo.com\/w\/wp-content\/uploads\/2024\/11\/logo.png","width":400,"height":400,"caption":"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8"},"image":{"@id":"https:\/\/atmokpo.com\/w\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/bebubo4"]},{"@type":"Person","@id":"https:\/\/atmokpo.com\/w\/#\/schema\/person\/91b6b3b138fbba0efb4ae64b1abd
81d7","name":"root","image":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/atmokpo.com\/w\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/708197b41fc6435a7ce22d951b25d4a47e9e904270cb1f04682d4f025066f80c?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/708197b41fc6435a7ce22d951b25d4a47e9e904270cb1f04682d4f025066f80c?s=96&d=mm&r=g","caption":"root"},"sameAs":["http:\/\/atmokpo.com\/w"],"url":"https:\/\/atmokpo.com\/w\/author\/root\/"}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack-related-posts":[],"_links":{"self":[{"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts\/36231","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/comments?post=36231"}],"version-history":[{"count":1,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts\/36231\/revisions"}],"predecessor-version":[{"id":36232,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts\/36231\/revisions\/36232"}],"wp:attachment":[{"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/media?parent=36231"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/categories?post=36231"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/tags?post=36231"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}