{"id":36187,"date":"2024-11-01T09:46:29","date_gmt":"2024-11-01T09:46:29","guid":{"rendered":"http:\/\/atmokpo.com\/w\/?p=36187"},"modified":"2024-11-01T09:46:29","modified_gmt":"2024-11-01T09:46:29","slug":"how-to-use-hugging-face-transformers-preparing-datasets","status":"publish","type":"post","link":"https:\/\/atmokpo.com\/w\/36187\/","title":{"rendered":"How to Use Hugging Face Transformers, Preparing Datasets"},"content":{"rendered":"<p><body><\/p>\n<p>The world of deep learning and natural language processing (NLP) is rapidly evolving, and within it, the <strong>Hugging Face Transformers<\/strong> library has become an essential tool for many researchers and developers. In this article, we will detail how to prepare a dataset using the Hugging Face Transformers library. Dataset preparation is the first step in model training, and high-quality data is crucial for achieving good results.<\/p>\n<h2>1. What is Hugging Face Transformers?<\/h2>\n<p>The <strong>Transformers<\/strong> library from Hugging Face is an open-source library designed to make it easy to use natural language processing models. This library provides a variety of pre-trained models and datasets, giving researchers a foundation to design and experiment with new models. It has a significant advantage in that it allows access to the latest NLP models at a low cost.<\/p>\n<h2>2. The Importance of Dataset Preparation<\/h2>\n<p>The performance of a model largely depends on the quality of the dataset. A well-structured dataset facilitates the training process of the model, and the diversity and quantity of the data significantly affect the model&#8217;s ability to generalize. 
Therefore, during the dataset preparation phase, keep the following points in mind:<\/p>\n<ul>\n<li><strong>Data Quality:<\/strong> Use data with minimal duplicates and noise.<\/li>\n<li><strong>Data Diversity:<\/strong> The data should cover a wide range of situations and cases so the model performs well in real-world environments.<\/li>\n<li><strong>Data Size:<\/strong> In general, the more data available, the better the model generalizes during training.<\/li>\n<\/ul>\n<h2>3. Downloading and Preparing the Dataset<\/h2>\n<p>Hugging Face provides a variety of public datasets, which give easy access to the data needed for model training. Let&#8217;s look at how to load and preprocess a dataset.<\/p>\n<h3>3.1. Installing the Hugging Face Datasets Library<\/h3>\n<p>First, install the <strong>Datasets<\/strong> library from Hugging Face:<\/p>\n<pre><code>pip install datasets<\/code><\/pre>\n<h3>3.2. Loading the Dataset<\/h3>\n<p>Now, let&#8217;s load a Hugging Face dataset in Python. As an example, we will use the <strong>IMDB movie reviews dataset<\/strong>.<\/p>\n<pre><code>from datasets import load_dataset\n\n# Load IMDB dataset\ndataset = load_dataset(\"imdb\")\n\nprint(dataset)<\/code><\/pre>\n<p>Running the code above loads the dataset already split into training and test sets. Next, check the structure of the dataset:<\/p>\n<pre><code># Print the first item of the dataset\nprint(dataset['train'][0])<\/code><\/pre>\n<h3>3.3. Preprocessing the Dataset<\/h3>\n<p>After loading the dataset, it needs to be preprocessed into a format suitable for model training. Preprocessing mainly includes data cleaning, tokenization, and padding.<\/p>\n<p>In the case of the IMDB dataset, each review is raw text with a positive or negative label. 
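<\/p>\n<p>Of the preprocessing steps just mentioned, the tokenization code below handles truncation, while padding is usually deferred to batch collation time (for example with <code>DataCollatorWithPadding<\/code>). As a minimal, library-independent sketch of what dynamic padding does (the pad id and the token ids here are illustrative assumptions, not real tokenizer output):<\/p>

```python
# Minimal sketch of dynamic (per-batch) padding. Illustrative only:
# in practice, transformers' DataCollatorWithPadding performs this step.
PAD_ID = 0  # assumption for illustration; BERT's [PAD] token id happens to be 0

def pad_batch(batch):
    """Pad each sequence to the length of the longest sequence in the batch."""
    max_len = max(len(seq) for seq in batch)
    padded = [seq + [PAD_ID] * (max_len - len(seq)) for seq in batch]
    # Attention mask: 1 marks real tokens, 0 marks padding
    masks = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    return padded, masks

# Two hypothetical token-id sequences of unequal length
padded, masks = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2003, 102]])
print(padded)  # [[101, 2023, 102, 0, 0], [101, 2023, 3185, 2003, 102]]
print(masks)   # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

<p>Padding each sequence only to the longest length in its batch, rather than to a single global maximum, saves memory and computation.<\/p>\n<p>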
To feed this data into the model, the text must first be tokenized and converted into numeric input IDs.<\/p>\n<pre><code>from transformers import AutoTokenizer\n\n# Load tokenizer for BERT\ntokenizer = AutoTokenizer.from_pretrained(\"bert-base-uncased\")\n\ndef preprocess_function(examples):\n    return tokenizer(examples['text'], truncation=True)\n\n# Apply preprocessing\ntokenized_datasets = dataset['train'].map(preprocess_function, batched=True)<\/code><\/pre>\n<p>The code above tokenizes the data with the BERT tokenizer. The <code>truncation=True<\/code> parameter ensures that input exceeding the model&#8217;s maximum token length is truncated. Through this process, each review is converted into a format the model can understand.<\/p>\n<h3>3.4. Reviewing the Dataset<\/h3>\n<p>After preprocessing, let&#8217;s review the dataset and check how it has been transformed:<\/p>\n<pre><code># Print the first item of the transformed dataset\nprint(tokenized_datasets[0])<\/code><\/pre>\n<h2>4. Splitting and Saving the Dataset<\/h2>\n<p>Before starting actual model training, it is essential to split the data into training and validation sets. This provides a basis for evaluating the model&#8217;s generalization performance.<\/p>\n<pre><code>split = dataset['train'].train_test_split(test_size=0.2)\ntrain_dataset = split['train']\nvalidation_dataset = split['test']\n\n# Save datasets\ntrain_dataset.save_to_disk(\"train_dataset\")\nvalidation_dataset.save_to_disk(\"validation_dataset\")<\/code><\/pre>\n<p>The code above assigns 20% of the training data to the validation set and saves the training and validation sets separately.<\/p>\n<h2>5. Examples of the Dataset<\/h2>\n<p>Now we are ready to proceed with training using the dataset we created. Here are some examples from the prepared IMDB dataset:<\/p>\n<pre class=\"example\"><code>This movie is great. <strong>Positive<\/strong>\nThis movie is really terrible. 
<strong>Negative<\/strong><\/code><\/pre>\n<p>Through examples like these, the model learns to distinguish positive from negative reviews. Since tokenization is completed during the preprocessing phase, the data can be used directly for model training.<\/p>\n<h2>6. Conclusion<\/h2>\n<p>In this article, we explored the overall process of preparing a dataset with the Hugging Face Transformers library. Data preparation is a foundational step in training deep learning models, and assembling a high-quality dataset is essential. Future posts will cover training an actual model on the prepared dataset.<\/p>\n<p>As deep learning and NLP continue to advance, Hugging Face can make your dataset preparation process much easier. Through continuous learning and experimentation, we encourage you to develop your own models.<\/p>\n<h3>References<\/h3>\n<ul>\n<li>Hugging Face Documentation: <a href=\"https:\/\/huggingface.co\/docs\/transformers\/index\" target=\"_blank\" rel=\"noopener\">https:\/\/huggingface.co\/docs\/transformers\/index<\/a><\/li>\n<li>Natural Language Processing with Transformers by Lewis Tunstall, Leandro von Werra, and Thomas Wolf<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>The world of deep learning and natural language processing (NLP) is rapidly evolving, and within it, the Hugging Face Transformers library has become an essential tool for many researchers and developers. In this article, we will detail how to prepare a dataset using the Hugging Face Transformers library. 
Dataset preparation is the first step in &hellip; <a href=\"https:\/\/atmokpo.com\/w\/36187\/\" class=\"more-link\">Read more<span class=\"screen-reader-text\"> &#8220;How to Use Hugging Face Transformers, Preparing Datasets&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[108],"tags":[],"class_list":["post-36187","post","type-post","status-publish","format-standard","hentry","category---en"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack-related-posts":[],"_links":{"self":[{"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts\/36187","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts"}],"about":[{"
href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/comments?post=36187"}],"version-history":[{"count":1,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts\/36187\/revisions"}],"predecessor-version":[{"id":36188,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts\/36187\/revisions\/36188"}],"wp:attachment":[{"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/media?parent=36187"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/categories?post=36187"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/tags?post=36187"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}