{"id":36189,"date":"2024-11-01T09:46:30","date_gmt":"2024-11-01T09:46:30","guid":{"rendered":"http:\/\/atmokpo.com\/w\/?p=36189"},"modified":"2024-11-01T09:46:30","modified_gmt":"2024-11-01T09:46:30","slug":"using-hugging-face-transformers-creating-dataset-class","status":"publish","type":"post","link":"https:\/\/atmokpo.com\/w\/36189\/","title":{"rendered":"Using Hugging Face Transformers, Creating Dataset Class"},"content":{"rendered":"<p><body><\/p>\n<p>In this course, we will cover how to create a dataset class using the <strong>Hugging Face<\/strong> <strong>Transformers<\/strong> library. Hugging Face Transformers is one of the essential tools for natural language processing (NLP) tasks, providing various pre-trained models and datasets. To effectively utilize these models, it is important to create custom datasets.<\/p>\n<h2>1. What are Hugging Face Transformers?<\/h2>\n<p>The Hugging Face Transformers library is one of the most widely used libraries for natural language processing, and it includes well-known models such as BERT, GPT-2, and T5. It simplifies model training, fine-tuning, and inference.<\/p>\n<h2>2. What is a Dataset Class?<\/h2>\n<p>A dataset class defines how the data used for model training and evaluation is stored and accessed. With a dataset class, you can load and preprocess custom data in one place. Hugging Face also provides the separate <strong>datasets<\/strong> library for loading and processing data.<\/p>\n<h2>3. How to Create a Dataset Class<\/h2>\n<p>In this section, we will discuss how to create a dataset class using Python. Specifically, we will explain how to subclass <code>torch.utils.data.Dataset<\/code> to create a custom dataset class and integrate it with Hugging Face Transformers.<\/p>\n<h3>3.1 Getting Started<\/h3>\n<p>First, install and import the required libraries. 
Use the command below to install the <strong>transformers<\/strong>, <strong>datasets<\/strong>, and <strong>torch<\/strong> libraries. The leading <code>!<\/code> is only needed in notebook environments such as Jupyter or Colab.<\/p>\n<pre><code>!pip install transformers datasets torch<\/code><\/pre>\n<h3>3.2 Creating a Custom Dataset Class<\/h3>\n<p>Here, we will show you how to create a dataset class.<\/p>\n<pre><code>import torch\nfrom torch.utils.data import Dataset\n\nclass MyDataset(Dataset):\n    def __init__(self, texts, labels, tokenizer, max_length=512):\n        self.texts = texts\n        self.labels = labels\n        self.tokenizer = tokenizer\n        self.max_length = max_length\n\n    def __len__(self):\n        # Number of samples in the dataset\n        return len(self.texts)\n\n    def __getitem__(self, idx):\n        text = self.texts[idx]\n        label = self.labels[idx]\n\n        # Tokenize the text and convert the tokens to input IDs\n        encoding = self.tokenizer(\n            text,\n            add_special_tokens=True,\n            max_length=self.max_length,\n            return_token_type_ids=False,\n            padding='max_length',\n            truncation=True,\n            return_attention_mask=True,\n            return_tensors='pt'\n        )\n\n        # return_tensors='pt' adds a batch dimension of size 1; flatten() removes it\n        return {\n            'input_ids': encoding['input_ids'].flatten(),\n            'attention_mask': encoding['attention_mask'].flatten(),\n            'labels': torch.tensor(label, dtype=torch.long)\n        }<\/code><\/pre>\n<p>The above class takes texts and labels as input, tokenizes each text when it is accessed, and returns the input IDs, attention mask, and label as tensors. It inherits from the <code>torch.utils.data.Dataset<\/code> class and implements the <code>__len__<\/code> and <code>__getitem__<\/code> methods.<\/p>\n<h3>3.3 Using the Dataset<\/h3>\n<p>Now let&#8217;s look at how to use the custom dataset. 
Here\u2019s an example of how to prepare data and create a data loader.<\/p>\n<pre><code>from transformers import AutoTokenizer\n\n# Prepare the tokenizer\ntokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')\n\n# Prepare the data\ntexts = [\"Hello, how are you?\", \"I am fine, thank you.\"]\nlabels = [0, 1] # Example labels\n\n# Create a dataset instance\ndataset = MyDataset(texts, labels, tokenizer)\n\n# Create a data loader\nfrom torch.utils.data import DataLoader\n\ndataloader = DataLoader(dataset, batch_size=2, shuffle=True)\n\nfor batch in dataloader:\n    print(batch)\n<\/code><\/pre>\n<p>The above code creates a small dataset and then builds a data loader that returns the data in batches. Because <code>shuffle=True<\/code> is set, the data loader reshuffles the samples every epoch before batching them.<\/p>\n<h2>4. Extending the Dataset Class<\/h2>\n<p>Next, we will extend the custom dataset class with more features. For example, you can include additional data preprocessing steps or handle multiple input formats.<\/p>\n<h3>4.1 Data Preprocessing<\/h3>\n<p>Data preprocessing is a crucial step in improving model performance. If necessary, you can add a preprocessing method to the dataset class.<\/p>\n<pre><code>    def preprocess(self, text):\n        # Add preprocessing logic here\n        return text.lower().strip()\n<\/code><\/pre>\n<p>You can call this method in <code>__getitem__<\/code>, for example as <code>text = self.preprocess(text)<\/code>, to preprocess the text before it is tokenized.<\/p>\n<h3>4.2 Handling Multiple Input Formats<\/h3>\n<p>If the dataset needs to handle various input formats, you can use conditional statements in <code>__getitem__<\/code> to process them differently. For example, the check below joins a list of strings into a single string before tokenization.<\/p>\n<pre><code>if isinstance(text, list):\n    text = \" \".join(text)  # Join list texts\n<\/code><\/pre>\n<h2>5. Conclusion<\/h2>\n<p>In this course, we learned how to create and use dataset classes with Hugging Face Transformers. 
Custom datasets are an essential part of training and evaluating models. With them, we can efficiently process data in various formats and train models the way we want.<\/p>\n<p>Going forward, try using Hugging Face to solve more natural language processing problems, and create your own dataset classes to build your skills. Thank you!<\/p>\n<p><\/body><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this course, we will cover how to create a dataset class using the Hugging Face Transformers library. Hugging Face Transformers is one of the essential tools for natural language processing (NLP) tasks, providing various pre-trained models and datasets. To effectively utilize these models, it is important to create custom datasets. 1. What are Hugging &hellip; <a href=\"https:\/\/atmokpo.com\/w\/36189\/\" class=\"more-link\">\ub354 \ubcf4\uae30<span class=\"screen-reader-text\"> &#8220;Using Hugging Face Transformers, Creating Dataset Class&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[108],"tags":[],"class_list":["post-36189","post","type-post","status-publish","format-standard","hentry","category---en"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Using Hugging Face Transformers, Creating Dataset Class - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/atmokpo.com\/w\/36189\/\" \/>\n<meta property=\"og:locale\" content=\"ko_KR\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Using Hugging Face Transformers, Creating Dataset Class - 
\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\" \/>\n<meta property=\"og:description\" content=\"In this course, we will cover how to create a dataset class using the Hugging Face Transformers library. Hugging Face Transformers is one of the essential tools for natural language processing (NLP) tasks, providing various pre-trained models and datasets. To effectively utilize these models, it is important to create custom datasets. 1. What are Hugging &hellip; \ub354 \ubcf4\uae30 &quot;Using Hugging Face Transformers, Creating Dataset Class&quot;\" \/>\n<meta property=\"og:url\" content=\"https:\/\/atmokpo.com\/w\/36189\/\" \/>\n<meta property=\"og:site_name\" content=\"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\" \/>\n<meta property=\"article:published_time\" content=\"2024-11-01T09:46:30+00:00\" \/>\n<meta name=\"author\" content=\"root\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@bebubo4\" \/>\n<meta name=\"twitter:site\" content=\"@bebubo4\" \/>\n<meta name=\"twitter:label1\" content=\"\uae00\uc4f4\uc774\" \/>\n\t<meta name=\"twitter:data1\" content=\"root\" \/>\n\t<meta name=\"twitter:label2\" content=\"\uc608\uc0c1 \ub418\ub294 \ud310\ub3c5 \uc2dc\uac04\" \/>\n\t<meta name=\"twitter:data2\" content=\"4\ubd84\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/atmokpo.com\/w\/36189\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/atmokpo.com\/w\/36189\/\"},\"author\":{\"name\":\"root\",\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/person\/91b6b3b138fbba0efb4ae64b1abd81d7\"},\"headline\":\"Using Hugging Face Transformers, Creating Dataset Class\",\"datePublished\":\"2024-11-01T09:46:30+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/atmokpo.com\/w\/36189\/\"},\"wordCount\":503,\"publisher\":{\"@id\":\"https:\/\/atmokpo.com\/w\/#organization\"},\"articleSection\":[\"Using Hugging 
Face\"],\"inLanguage\":\"ko-KR\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/atmokpo.com\/w\/36189\/\",\"url\":\"https:\/\/atmokpo.com\/w\/36189\/\",\"name\":\"Using Hugging Face Transformers, Creating Dataset Class - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\",\"isPartOf\":{\"@id\":\"https:\/\/atmokpo.com\/w\/#website\"},\"datePublished\":\"2024-11-01T09:46:30+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/atmokpo.com\/w\/36189\/#breadcrumb\"},\"inLanguage\":\"ko-KR\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/atmokpo.com\/w\/36189\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/atmokpo.com\/w\/36189\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"\ud648\",\"item\":\"https:\/\/atmokpo.com\/w\/en\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Using Hugging Face Transformers, Creating Dataset Class\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/atmokpo.com\/w\/#website\",\"url\":\"https:\/\/atmokpo.com\/w\/\",\"name\":\"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/atmokpo.com\/w\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/atmokpo.com\/w\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"ko-KR\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/atmokpo.com\/w\/#organization\",\"name\":\"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\",\"url\":\"https:\/\/atmokpo.com\/w\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/atmokpo.com\/w\/wp-content\/uploads\/2024\/11\/logo.png\",\"contentUrl\":\"https:\/\/atmokpo.com\/w\/wp-content\/uploads\/2024\/11\/logo.png\",\"width\":400,\"height\":400,\"caption\":\"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\"},\"image\":{\"@id\":
\"https:\/\/atmokpo.com\/w\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/bebubo4\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/person\/91b6b3b138fbba0efb4ae64b1abd81d7\",\"name\":\"root\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/708197b41fc6435a7ce22d951b25d4a47e9e904270cb1f04682d4f025066f80c?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/708197b41fc6435a7ce22d951b25d4a47e9e904270cb1f04682d4f025066f80c?s=96&d=mm&r=g\",\"caption\":\"root\"},\"sameAs\":[\"http:\/\/atmokpo.com\/w\"],\"url\":\"https:\/\/atmokpo.com\/w\/author\/root\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Using Hugging Face Transformers, Creating Dataset Class - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/atmokpo.com\/w\/36189\/","og_locale":"ko_KR","og_type":"article","og_title":"Using Hugging Face Transformers, Creating Dataset Class - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","og_description":"In this course, we will cover how to create a dataset class using the Hugging Face Transformers library. Hugging Face Transformers is one of the essential tools for natural language processing (NLP) tasks, providing various pre-trained models and datasets. To effectively utilize these models, it is important to create custom datasets. 1. 
What are Hugging &hellip; \ub354 \ubcf4\uae30 \"Using Hugging Face Transformers, Creating Dataset Class\"","og_url":"https:\/\/atmokpo.com\/w\/36189\/","og_site_name":"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","article_published_time":"2024-11-01T09:46:30+00:00","author":"root","twitter_card":"summary_large_image","twitter_creator":"@bebubo4","twitter_site":"@bebubo4","twitter_misc":{"\uae00\uc4f4\uc774":"root","\uc608\uc0c1 \ub418\ub294 \ud310\ub3c5 \uc2dc\uac04":"4\ubd84"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/atmokpo.com\/w\/36189\/#article","isPartOf":{"@id":"https:\/\/atmokpo.com\/w\/36189\/"},"author":{"name":"root","@id":"https:\/\/atmokpo.com\/w\/#\/schema\/person\/91b6b3b138fbba0efb4ae64b1abd81d7"},"headline":"Using Hugging Face Transformers, Creating Dataset Class","datePublished":"2024-11-01T09:46:30+00:00","mainEntityOfPage":{"@id":"https:\/\/atmokpo.com\/w\/36189\/"},"wordCount":503,"publisher":{"@id":"https:\/\/atmokpo.com\/w\/#organization"},"articleSection":["Using Hugging Face"],"inLanguage":"ko-KR"},{"@type":"WebPage","@id":"https:\/\/atmokpo.com\/w\/36189\/","url":"https:\/\/atmokpo.com\/w\/36189\/","name":"Using Hugging Face Transformers, Creating Dataset Class - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","isPartOf":{"@id":"https:\/\/atmokpo.com\/w\/#website"},"datePublished":"2024-11-01T09:46:30+00:00","breadcrumb":{"@id":"https:\/\/atmokpo.com\/w\/36189\/#breadcrumb"},"inLanguage":"ko-KR","potentialAction":[{"@type":"ReadAction","target":["https:\/\/atmokpo.com\/w\/36189\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/atmokpo.com\/w\/36189\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"\ud648","item":"https:\/\/atmokpo.com\/w\/en\/"},{"@type":"ListItem","position":2,"name":"Using Hugging Face Transformers, Creating Dataset 
Class"}]},{"@type":"WebSite","@id":"https:\/\/atmokpo.com\/w\/#website","url":"https:\/\/atmokpo.com\/w\/","name":"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","description":"","publisher":{"@id":"https:\/\/atmokpo.com\/w\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/atmokpo.com\/w\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"ko-KR"},{"@type":"Organization","@id":"https:\/\/atmokpo.com\/w\/#organization","name":"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","url":"https:\/\/atmokpo.com\/w\/","logo":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/atmokpo.com\/w\/#\/schema\/logo\/image\/","url":"https:\/\/atmokpo.com\/w\/wp-content\/uploads\/2024\/11\/logo.png","contentUrl":"https:\/\/atmokpo.com\/w\/wp-content\/uploads\/2024\/11\/logo.png","width":400,"height":400,"caption":"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8"},"image":{"@id":"https:\/\/atmokpo.com\/w\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/bebubo4"]},{"@type":"Person","@id":"https:\/\/atmokpo.com\/w\/#\/schema\/person\/91b6b3b138fbba0efb4ae64b1abd81d7","name":"root","image":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/atmokpo.com\/w\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/708197b41fc6435a7ce22d951b25d4a47e9e904270cb1f04682d4f025066f80c?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/708197b41fc6435a7ce22d951b25d4a47e9e904270cb1f04682d4f025066f80c?s=96&d=mm&r=g","caption":"root"},"sameAs":["http:\/\/atmokpo.com\/w"],"url":"https:\/\/atmokpo.com\/w\/author\/root\/"}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack-related-posts":[],"_links":{"self":[{"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts\/36189","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts"}],"about":[{"hre
f":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/comments?post=36189"}],"version-history":[{"count":1,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts\/36189\/revisions"}],"predecessor-version":[{"id":36190,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts\/36189\/revisions\/36190"}],"wp:attachment":[{"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/media?parent=36189"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/categories?post=36189"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/tags?post=36189"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}