{"id":36095,"date":"2024-11-01T09:45:43","date_gmt":"2024-11-01T09:45:43","guid":{"rendered":"http:\/\/atmokpo.com\/w\/?p=36095"},"modified":"2024-11-01T09:45:43","modified_gmt":"2024-11-01T09:45:43","slug":"using-hugging-face-transformers-bigbird-tokenization-and-encoding","status":"publish","type":"post","link":"https:\/\/atmokpo.com\/w\/36095\/","title":{"rendered":"Using Hugging Face Transformers, BigBird Tokenization and Encoding"},"content":{"rendered":"<p><body><\/p>\n<p>\n        In the field of deep learning, natural language processing (NLP) is one of the greatest success stories of machine learning and AI. Many researchers and companies use NLP technologies to process data, understand text, and create conversational AI systems. In this article, we will explore tokenization and encoding for the BigBird model using the Hugging Face Transformers library.\n    <\/p>\n<h2>1. Introduction to the Hugging Face Transformers Library<\/h2>\n<p>\n        Hugging Face is a company well known for libraries such as Transformers, which give users easy access to natural language processing (NLP) models, datasets, and tools. Through the Transformers library, we can use a wide range of pre-trained models to perform NLP tasks. One of its main advantages is that diverse NLP models can be loaded and fine-tuned through a consistent interface.\n    <\/p>\n<h2>2. Overview of the BigBird Model<\/h2>\n<p>\n        BigBird is a Transformer-based model developed by Google, designed to overcome the input length limitations of traditional Transformer models. In standard Transformer models, the memory and computational costs of full self-attention grow quadratically with the input length. 
BigBird addresses this issue by introducing a sparse attention mechanism.\n    <\/p>\n<p>\n        The main features of BigBird are as follows:<\/p>\n<ul>\n<li>Low memory consumption: Reduces memory usage through sparse attention.<\/li>\n<li>Long input processing: Capable of handling long inputs such as whole documents.<\/li>\n<li>Performance improvements on various NLP tasks: Exhibits excellent performance in tasks like document classification, summarization, and question answering.<\/li>\n<\/ul>\n<h2>3. BigBird Tokenizer<\/h2>\n<p>\n        To use the BigBird model, we first need to tokenize the data. Tokenization is the process of splitting text into individual tokens. The Hugging Face Transformers library provides a tokenizer matched to each model.\n    <\/p>\n<h3>3.1. Installing the BigBird Tokenizer<\/h3>\n<p>\n        To use the BigBird tokenizer, you must first install the necessary packages. The tokenizer is based on SentencePiece, and the later examples return PyTorch tensors, so <code>sentencepiece<\/code> and <code>torch<\/code> are installed alongside <code>transformers<\/code>. You can run the following command in a notebook cell (drop the leading <code>!<\/code> when running it in a terminal):\n    <\/p>\n<pre><code>!pip install transformers sentencepiece torch<\/code><\/pre>\n<h3>3.2. How to Use the BigBird Tokenizer<\/h3>\n<p>\n        Once the installation is complete, you can initialize the BigBird tokenizer and tokenize text data using the following code:\n    <\/p>\n<pre><code>\nfrom transformers import BigBirdTokenizer\n\n# Initialize BigBird tokenizer\ntokenizer = BigBirdTokenizer.from_pretrained('google\/bigbird-roberta-base')\n\n# Example text\ntext = \"Deep learning and natural language processing are very interesting fields.\"\n\n# Tokenizing the text\ntokens = tokenizer.tokenize(text)\nprint(\"Tokenization result:\", tokens)\n    <\/code><\/pre>\n<h2>4. BigBird Encoding<\/h2>\n<p>\n        After tokenization, the tokens need to be encoded into a format suitable for model input. The encoding process converts tokens into integer IDs and, in the same step, adds padding and builds the attention mask.\n    <\/p>\n<h3>4.1. 
How to Use BigBird Encoding<\/h3>\n<p>\n        You can encode the text using the following code:\n    <\/p>\n<pre><code>\n# Encoding the text\nencoded_input = tokenizer.encode_plus(\n    text,\n    padding='max_length',  # Pad to the model's maximum length\n    truncation=True,      # Truncate inputs longer than the maximum length\n    return_tensors='pt'  # Return in PyTorch tensor format\n)\n\nprint(\"Encoding result:\", encoded_input)\n# Example output: {'input_ids': ..., 'attention_mask': ...}\n    <\/code><\/pre>\n<h2>5. Example Using the Model<\/h2>\n<p>\n        Now, let&#8217;s feed the encoded input into the BigBird model and inspect the results. The following example code shows how to generate embeddings for the input text using the pre-trained BigBird model.\n    <\/p>\n<pre><code>\nfrom transformers import BigBirdModel\n\n# Initialize BigBird model\nmodel = BigBirdModel.from_pretrained('google\/bigbird-roberta-base')\n\n# Pass the encoded input to the model and receive the output\noutput = model(**encoded_input)\n\n# Model output embeddings\nprint(\"Model output:\", output)\n    <\/code><\/pre>\n<h2>6. Application Example: Text Classification<\/h2>\n<p>\n        Let\u2019s examine an example of text classification using the BigBird model. The process includes preparing the dataset, training the model, and making predictions. The sample texts here are short for readability; the same steps apply to long documents.\n    <\/p>\n<h3>6.1. Preparing the Dataset<\/h3>\n<p>\n        The dataset should be prepared in a consistent format, with one text and one label per example. You can generate simple sample data using the code below:\n    <\/p>\n<pre><code>\nimport pandas as pd\n\n# Create sample data\ndata = {\n    'text': [\n        \"This is a positive review.\",\n        \"I was completely disappointed. I would never recommend it.\",\n        \"This product is really good.\",\n        \"Not good.\",\n    ],\n    'label': [1, 0, 1, 0]  # Positive is 1, Negative is 0\n}\n\ndf = pd.DataFrame(data)\nprint(df)\n    <\/code><\/pre>\n<h3>6.2. 
Data Preprocessing<\/h3>\n<p>\n        Before passing the data to the model, you need to encode and pad it. The tokenizer can be called directly on a list of texts, and the labels are converted to a PyTorch tensor:\n    <\/p>\n<pre><code>\nimport torch\n\n# Encoding all text data\nencodings = tokenizer(df['text'].tolist(), padding=True, truncation=True, return_tensors='pt')\nlabels = torch.tensor(df['label'].tolist())\n    <\/code><\/pre>\n<h3>6.3. Model Training<\/h3>\n<p>\n        The training process allows the model to learn from the data. The base <code>BigBirdModel<\/code> has no classification head and returns no loss, so we load <code>BigBirdForSequenceClassification<\/code> instead and pass the labels to it. This simple example runs three epochs over the whole sample as a single batch with a fixed learning rate; it omits batching, a validation split, and learning-rate scheduling.\n    <\/p>\n<pre><code>\nfrom transformers import BigBirdForSequenceClassification\n\n# Load BigBird with a classification head (2 labels: negative and positive)\nmodel = BigBirdForSequenceClassification.from_pretrained('google\/bigbird-roberta-base', num_labels=2)\n\n# Optimizer settings (AdamW from PyTorch)\noptimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)\n\n# Training loop\nfor epoch in range(3):  # 3 epochs\n    model.train()\n    outputs = model(**encodings, labels=labels)  # Passing labels makes the model return a loss\n    loss = outputs.loss\n    loss.backward()\n    optimizer.step()\n    optimizer.zero_grad()\n    print(f\"EPOCH {epoch + 1} \/ 3: Loss: {loss.item()}\")\n    <\/code><\/pre>\n<h3>6.4. Model Evaluation<\/h3>\n<p>\n        To check the model&#8217;s predictions, we run the fine-tuned model in evaluation mode. For brevity, this example reuses the training samples; in practice, evaluate on a held-out test set.\n    <\/p>\n<pre><code>\nmodel.eval()\nwith torch.no_grad():\n    test_output = model(**encodings)\n    predictions = test_output.logits.argmax(dim=1)\n    \nprint(\"Prediction results:\", predictions)\n    <\/code><\/pre>\n<h2>7. Conclusion and Additional References<\/h2>\n<p>\n        In this article, we examined the tokenization and encoding processes of the BigBird model using the Hugging Face Transformers library. BigBird overcomes the input length limitation of standard Transformer architectures and shows improved performance on NLP tasks involving long documents.\n    <\/p>\n<p>\n        For more information and examples, please refer to the official documentation of <a href=\"https:\/\/huggingface.co\/docs\/transformers\/index\">Hugging Face<\/a>. 
I hope this article helps you dive deeper into the world of deep learning and natural language processing.\n    <\/p>\n<p><\/body><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the field of deep learning, natural language processing (NLP) is one of the greatest success stories of machine learning and AI. Many researchers and companies are utilizing NLP technologies to process data, understand text, and create conversational AI systems. In this article, we will explore tokenization and encoding methods based on the BigBird model &hellip; <a href=\"https:\/\/atmokpo.com\/w\/36095\/\" class=\"more-link\">\ub354 \ubcf4\uae30<span class=\"screen-reader-text\"> &#8220;Using Hugging Face Transformers, BigBird Tokenization and Encoding&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[108],"tags":[],"class_list":["post-36095","post","type-post","status-publish","format-standard","hentry","category---en"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Using Hugging Face Transformers, BigBird Tokenization and Encoding - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/atmokpo.com\/w\/36095\/\" \/>\n<meta property=\"og:locale\" content=\"ko_KR\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Using Hugging Face Transformers, BigBird Tokenization and Encoding - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\" \/>\n<meta property=\"og:description\" content=\"In the field of deep learning, natural language processing (NLP) is one of the greatest success stories of machine learning and 
AI. Many researchers and companies are utilizing NLP technologies to process data, understand text, and create conversational AI systems. In this article, we will explore tokenization and encoding methods based on the BigBird model &hellip; \ub354 \ubcf4\uae30 &quot;Using Hugging Face Transformers, BigBird Tokenization and Encoding&quot;\" \/>\n<meta property=\"og:url\" content=\"https:\/\/atmokpo.com\/w\/36095\/\" \/>\n<meta property=\"og:site_name\" content=\"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\" \/>\n<meta property=\"article:published_time\" content=\"2024-11-01T09:45:43+00:00\" \/>\n<meta name=\"author\" content=\"root\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@bebubo4\" \/>\n<meta name=\"twitter:site\" content=\"@bebubo4\" \/>\n<meta name=\"twitter:label1\" content=\"\uae00\uc4f4\uc774\" \/>\n\t<meta name=\"twitter:data1\" content=\"root\" \/>\n\t<meta name=\"twitter:label2\" content=\"\uc608\uc0c1 \ub418\ub294 \ud310\ub3c5 \uc2dc\uac04\" \/>\n\t<meta name=\"twitter:data2\" content=\"4\ubd84\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/atmokpo.com\/w\/36095\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/atmokpo.com\/w\/36095\/\"},\"author\":{\"name\":\"root\",\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/person\/91b6b3b138fbba0efb4ae64b1abd81d7\"},\"headline\":\"Using Hugging Face Transformers, BigBird Tokenization and Encoding\",\"datePublished\":\"2024-11-01T09:45:43+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/atmokpo.com\/w\/36095\/\"},\"wordCount\":606,\"publisher\":{\"@id\":\"https:\/\/atmokpo.com\/w\/#organization\"},\"articleSection\":[\"Using Hugging Face\"],\"inLanguage\":\"ko-KR\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/atmokpo.com\/w\/36095\/\",\"url\":\"https:\/\/atmokpo.com\/w\/36095\/\",\"name\":\"Using Hugging Face Transformers, BigBird 
Tokenization and Encoding - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\",\"isPartOf\":{\"@id\":\"https:\/\/atmokpo.com\/w\/#website\"},\"datePublished\":\"2024-11-01T09:45:43+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/atmokpo.com\/w\/36095\/#breadcrumb\"},\"inLanguage\":\"ko-KR\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/atmokpo.com\/w\/36095\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/atmokpo.com\/w\/36095\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"\ud648\",\"item\":\"https:\/\/atmokpo.com\/w\/en\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Using Hugging Face Transformers, BigBird Tokenization and Encoding\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/atmokpo.com\/w\/#website\",\"url\":\"https:\/\/atmokpo.com\/w\/\",\"name\":\"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/atmokpo.com\/w\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/atmokpo.com\/w\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"ko-KR\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/atmokpo.com\/w\/#organization\",\"name\":\"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\",\"url\":\"https:\/\/atmokpo.com\/w\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/atmokpo.com\/w\/wp-content\/uploads\/2024\/11\/logo.png\",\"contentUrl\":\"https:\/\/atmokpo.com\/w\/wp-content\/uploads\/2024\/11\/logo.png\",\"width\":400,\"height\":400,\"caption\":\"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\"},\"image\":{\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/bebubo4\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/person\/91b
6b3b138fbba0efb4ae64b1abd81d7\",\"name\":\"root\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/708197b41fc6435a7ce22d951b25d4a47e9e904270cb1f04682d4f025066f80c?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/708197b41fc6435a7ce22d951b25d4a47e9e904270cb1f04682d4f025066f80c?s=96&d=mm&r=g\",\"caption\":\"root\"},\"sameAs\":[\"http:\/\/atmokpo.com\/w\"],\"url\":\"https:\/\/atmokpo.com\/w\/author\/root\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Using Hugging Face Transformers, BigBird Tokenization and Encoding - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/atmokpo.com\/w\/36095\/","og_locale":"ko_KR","og_type":"article","og_title":"Using Hugging Face Transformers, BigBird Tokenization and Encoding - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","og_description":"In the field of deep learning, natural language processing (NLP) is one of the greatest success stories of machine learning and AI. Many researchers and companies are utilizing NLP technologies to process data, understand text, and create conversational AI systems. 
In this article, we will explore tokenization and encoding methods based on the BigBird model &hellip; \ub354 \ubcf4\uae30 \"Using Hugging Face Transformers, BigBird Tokenization and Encoding\"","og_url":"https:\/\/atmokpo.com\/w\/36095\/","og_site_name":"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","article_published_time":"2024-11-01T09:45:43+00:00","author":"root","twitter_card":"summary_large_image","twitter_creator":"@bebubo4","twitter_site":"@bebubo4","twitter_misc":{"\uae00\uc4f4\uc774":"root","\uc608\uc0c1 \ub418\ub294 \ud310\ub3c5 \uc2dc\uac04":"4\ubd84"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/atmokpo.com\/w\/36095\/#article","isPartOf":{"@id":"https:\/\/atmokpo.com\/w\/36095\/"},"author":{"name":"root","@id":"https:\/\/atmokpo.com\/w\/#\/schema\/person\/91b6b3b138fbba0efb4ae64b1abd81d7"},"headline":"Using Hugging Face Transformers, BigBird Tokenization and Encoding","datePublished":"2024-11-01T09:45:43+00:00","mainEntityOfPage":{"@id":"https:\/\/atmokpo.com\/w\/36095\/"},"wordCount":606,"publisher":{"@id":"https:\/\/atmokpo.com\/w\/#organization"},"articleSection":["Using Hugging Face"],"inLanguage":"ko-KR"},{"@type":"WebPage","@id":"https:\/\/atmokpo.com\/w\/36095\/","url":"https:\/\/atmokpo.com\/w\/36095\/","name":"Using Hugging Face Transformers, BigBird Tokenization and Encoding - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","isPartOf":{"@id":"https:\/\/atmokpo.com\/w\/#website"},"datePublished":"2024-11-01T09:45:43+00:00","breadcrumb":{"@id":"https:\/\/atmokpo.com\/w\/36095\/#breadcrumb"},"inLanguage":"ko-KR","potentialAction":[{"@type":"ReadAction","target":["https:\/\/atmokpo.com\/w\/36095\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/atmokpo.com\/w\/36095\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"\ud648","item":"https:\/\/atmokpo.com\/w\/en\/"},{"@type":"ListItem","position":2,"name":"Using Hugging Face Transformers, BigBird Tokenization and 
Encoding"}]},{"@type":"WebSite","@id":"https:\/\/atmokpo.com\/w\/#website","url":"https:\/\/atmokpo.com\/w\/","name":"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","description":"","publisher":{"@id":"https:\/\/atmokpo.com\/w\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/atmokpo.com\/w\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"ko-KR"},{"@type":"Organization","@id":"https:\/\/atmokpo.com\/w\/#organization","name":"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","url":"https:\/\/atmokpo.com\/w\/","logo":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/atmokpo.com\/w\/#\/schema\/logo\/image\/","url":"https:\/\/atmokpo.com\/w\/wp-content\/uploads\/2024\/11\/logo.png","contentUrl":"https:\/\/atmokpo.com\/w\/wp-content\/uploads\/2024\/11\/logo.png","width":400,"height":400,"caption":"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8"},"image":{"@id":"https:\/\/atmokpo.com\/w\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/bebubo4"]},{"@type":"Person","@id":"https:\/\/atmokpo.com\/w\/#\/schema\/person\/91b6b3b138fbba0efb4ae64b1abd81d7","name":"root","image":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/atmokpo.com\/w\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/708197b41fc6435a7ce22d951b25d4a47e9e904270cb1f04682d4f025066f80c?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/708197b41fc6435a7ce22d951b25d4a47e9e904270cb1f04682d4f025066f80c?s=96&d=mm&r=g","caption":"root"},"sameAs":["http:\/\/atmokpo.com\/w"],"url":"https:\/\/atmokpo.com\/w\/author\/root\/"}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack-related-posts":[],"_links":{"self":[{"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts\/36095","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts"}],"about":[{"
href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/comments?post=36095"}],"version-history":[{"count":1,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts\/36095\/revisions"}],"predecessor-version":[{"id":36096,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts\/36095\/revisions\/36096"}],"wp:attachment":[{"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/media?parent=36095"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/categories?post=36095"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/tags?post=36095"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}