{"id":36247,"date":"2024-11-01T09:46:59","date_gmt":"2024-11-01T09:46:59","guid":{"rendered":"http:\/\/atmokpo.com\/w\/?p=36247"},"modified":"2024-11-01T09:46:59","modified_gmt":"2024-11-01T09:46:59","slug":"hugging-face-transformers-practical-course-learning-and-validation-dataset-split","status":"publish","type":"post","link":"https:\/\/atmokpo.com\/w\/36247\/","title":{"rendered":"Hugging Face Transformers Practical Course, Learning and Validation Dataset Split"},"content":{"rendered":"<p>\n    The importance of Natural Language Processing (NLP) in the fields of Artificial Intelligence (AI) and Machine Learning is increasing day by day. At the center of this is the <strong>Hugging Face<\/strong> Transformers library. This library makes it easy to use a wide variety of NLP models and, in particular, to apply pre-trained models with very little code. In this course, we will teach you how to split training and validation datasets using the Hugging Face Transformers library.\n<\/p>\n<h2>1. Preparing the Dataset<\/h2>\n<p>\n    The first step is to prepare the dataset. Generally, a labeled dataset is required to solve supervised NLP problems. In this example, we will use the <strong>IMDb Movie Reviews Dataset<\/strong> to train a model that classifies positive and negative reviews. This dataset is widely used and consists of the text of movie reviews and their corresponding labels (positive\/negative).\n<\/p>\n<h3>1.1 Downloading the Dataset<\/h3>\n<pre><code>from datasets import load_dataset\n\ndataset = load_dataset(\"imdb\")\n<\/code><\/pre>\n<p>\n    You can download the IMDb dataset using the above code. The <code>load_dataset<\/code> function is part of the Hugging Face datasets library and lets you download many public datasets with a single call.\n<\/p>\n<h3>1.2 Checking the Dataset Structure<\/h3>\n<pre><code>print(dataset)\n<\/code><\/pre>\n<p>\n    You can check the structure of the downloaded dataset. 
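It can also help to peek at a single record before splitting. Each IMDb example is a dict with a <code>text<\/code> string and an integer <code>label<\/code> (0 = negative, 1 = positive). A minimal sketch (the sample record below is invented for illustration; with the real dataset you would index it as <code>dataset[\"train\"][0]<\/code>):

```python
# An IMDb record has two fields: the review text and an integer label,
# where 0 means a negative review and 1 means a positive review.
# This sample dict is a hypothetical stand-in for dataset["train"][0].
sample = {"text": "A wonderful little production with believable acting.", "label": 1}

preview = sample["text"][:40]  # short preview of the review text
sentiment = "positive" if sample["label"] == 1 else "negative"
print(preview, "->", sentiment)
```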
The dataset contains training (train), test (test), and unlabeled (unsupervised) splits; note that it does not ship with a separate validation split.\n<\/p>\n<h2>2. Splitting the Dataset<\/h2>\n<p>\n    In machine learning, it is important to split the data into several parts before training a model. Typically, the data is divided into training and validation sets: the training data is used to fit the model, and the validation data is used to evaluate its performance. Since IMDb provides no validation split, we will extract a portion of the training data to use as validation data.\n<\/p>\n<h3>2.1 Splitting Training and Validation Data<\/h3>\n<pre><code>from sklearn.model_selection import train_test_split\n\ntrain_data = dataset['train']\ntrain_texts = train_data['text']\ntrain_labels = train_data['label']\n\ntrain_texts, val_texts, train_labels, val_labels = train_test_split(\n    train_texts,\n    train_labels,\n    test_size=0.1,  # Using 10% as validation set\n    random_state=42,\n    stratify=train_labels  # keep the positive\/negative ratio in both splits\n)\n<\/code><\/pre>\n<p>\n    The above code uses the <code>train_test_split<\/code> function to split the training data 90\/10. Since <code>test_size=0.1<\/code> is set, 10% of the original training data is held out as validation data. Fixing the <code>random_state<\/code> parameter makes the split reproducible, and <code>stratify=train_labels<\/code> keeps the class ratio the same in both splits.\n<\/p>\n<h3>2.2 Checking the Split Data<\/h3>\n<pre><code>print(\"Number of training samples:\", len(train_texts))\nprint(\"Number of validation samples:\", len(val_texts))\n<\/code><\/pre>\n<p>\n    You can now check the number of training and validation samples, which helps confirm that the data has been split as intended.\n<\/p>\n<h2>3. Preparing the Hugging Face Transformer Model<\/h2>\n<p>\n    After splitting the dataset, we need to prepare the model. 
Hugging Face&#8217;s <strong>Transformers library<\/strong> provides a variety of pre-trained models, allowing us to choose one suitable for our needs.\n<\/p>\n<h3>3.1 Selecting a Pre-trained Model<\/h3>\n<pre><code>from transformers import BertTokenizer, BertForSequenceClassification\n\nmodel_name = \"bert-base-uncased\"\ntokenizer = BertTokenizer.from_pretrained(model_name)\nmodel = BertForSequenceClassification.from_pretrained(model_name)\n<\/code><\/pre>\n<p>\n    We prepare the BERT model using <code>BertTokenizer<\/code> and <code>BertForSequenceClassification<\/code>. This model is well suited to text classification tasks; here we load the pre-trained checkpoint &#8220;bert-base-uncased.&#8221;\n<\/p>\n<h3>3.2 Tokenizing the Data<\/h3>\n<pre><code>train_encodings = tokenizer(train_texts, truncation=True, padding=True, return_tensors='pt')\nval_encodings = tokenizer(val_texts, truncation=True, padding=True, return_tensors='pt')\n<\/code><\/pre>\n<p>\n    We tokenize the training and validation data using the <code>tokenizer<\/code>. <code>truncation=True<\/code> cuts off inputs that exceed the model&#8217;s maximum length (512 tokens for BERT), and <code>padding=True<\/code> pads all inputs to the same length.\n<\/p>\n<h2>4. Training the Model<\/h2>\n<p>\n    To train the model, we can process the data in batches using PyTorch&#8217;s <strong>DataLoader<\/strong>. 
We will also set up the optimizer so we can train the model.\n<\/p>\n<h3>4.1 Preparing the Data Loader<\/h3>\n<pre><code>import torch\nfrom torch.utils.data import DataLoader, Dataset\n\nclass IMDbDataset(Dataset):\n    def __init__(self, encodings, labels):\n        self.encodings = encodings\n        self.labels = labels\n\n    def __getitem__(self, idx):\n        item = {key: val[idx] for key, val in self.encodings.items()}\n        item['labels'] = torch.tensor(self.labels[idx])\n        return item\n\n    def __len__(self):\n        return len(self.labels)\n\ntrain_dataset = IMDbDataset(train_encodings, train_labels)\nval_dataset = IMDbDataset(val_encodings, val_labels)\n\ntrain_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)\nval_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)\n<\/code><\/pre>\n<p>\n    We define a dataset class that inherits from <code>Dataset<\/code> and use <code>DataLoader<\/code> for batch processing, with a batch size of 16.\n<\/p>\n<h3>4.2 Setting Up Model Training<\/h3>\n<pre><code>from torch.optim import AdamW\n\noptimizer = AdamW(model.parameters(), lr=5e-5)\n\nmodel.train()\nfor epoch in range(3):  # Total 3 epochs\n    total_loss = 0\n    for batch in train_loader:\n        optimizer.zero_grad()\n        outputs = model(**batch)\n        loss = outputs.loss\n        total_loss += loss.item()\n        loss.backward()\n        optimizer.step()\n    print(f\"Epoch: {epoch + 1}, Loss: {total_loss \/ len(train_loader)}\")\n<\/code><\/pre>\n<p>\n    We train the model using the AdamW optimizer, imported from <code>torch.optim<\/code> (the copy that used to ship with transformers is deprecated). Because each batch includes a <code>labels<\/code> key, the model computes the loss itself, and the average loss per batch is printed after each epoch. In this example, training runs for 3 epochs.\n<\/p>\n<h2>5. Evaluating the Model<\/h2>\n<p>\n    After training the model, we need to evaluate its performance on the validation data. 
This will help us determine how well the model generalizes.\n<\/p>\n<h3>5.1 Defining the Model Evaluation Function<\/h3>\n<pre><code>from sklearn.metrics import accuracy_score\n\ndef evaluate_model(model, val_loader):\n    model.eval()\n    all_labels = []\n    all_preds = []\n\n    with torch.no_grad():\n        for batch in val_loader:\n            outputs = model(**batch)\n            preds = outputs.logits.argmax(dim=-1)\n            all_labels.extend(batch['labels'].tolist())\n            all_preds.extend(preds.tolist())\n\n    accuracy = accuracy_score(all_labels, all_preds)\n    return accuracy\n\naccuracy = evaluate_model(model, val_loader)\nprint(\"Validation Accuracy:\", accuracy)\n<\/code><\/pre>\n<p>\n    We define the <code>evaluate_model<\/code> function to assess the model&#8217;s performance: it runs the model on the validation batches without computing gradients and reports the validation accuracy.\n<\/p>\n<h2>6. Conclusion<\/h2>\n<p>\n    In this course, we learned how to work with the IMDb movie reviews dataset using Hugging Face&#8217;s Transformers library, covering the entire process of splitting the dataset, training the model, and evaluating its performance. We hope this gave you a solid foundation in NLP; the same techniques apply to many other language models and can help you achieve better results.\n<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The importance of Natural Language Processing (NLP) in the fields of Artificial Intelligence (AI) and Machine Learning is increasing day by day. At the center of this is the Hugging Face Transformer library. This library makes it easy to use various NLP models, especially with the advantage of being able to easily apply pre-trained models. 
&hellip; <a href=\"https:\/\/atmokpo.com\/w\/36247\/\" class=\"more-link\">\ub354 \ubcf4\uae30<span class=\"screen-reader-text\"> &#8220;Hugging Face Transformers Practical Course, Learning and Validation Dataset Split&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[108],"tags":[],"class_list":["post-36247","post","type-post","status-publish","format-standard","hentry","category---en"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.2 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Hugging Face Transformers Practical Course, Learning and Validation Dataset Split - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/atmokpo.com\/w\/36247\/\" \/>\n<meta property=\"og:locale\" content=\"ko_KR\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Hugging Face Transformers Practical Course, Learning and Validation Dataset Split - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\" \/>\n<meta property=\"og:description\" content=\"The importance of Natural Language Processing (NLP) in the fields of Artificial Intelligence (AI) and Machine Learning is increasing day by day. At the center of this is the Hugging Face Transformer library. This library makes it easy to use various NLP models, especially with the advantage of being able to easily apply pre-trained models. 
&hellip; \ub354 \ubcf4\uae30 &quot;Hugging Face Transformers Practical Course, Learning and Validation Dataset Split&quot;\" \/>\n<meta property=\"og:url\" content=\"https:\/\/atmokpo.com\/w\/36247\/\" \/>\n<meta property=\"og:site_name\" content=\"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\" \/>\n<meta property=\"article:published_time\" content=\"2024-11-01T09:46:59+00:00\" \/>\n<meta name=\"author\" content=\"root\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@bebubo4\" \/>\n<meta name=\"twitter:site\" content=\"@bebubo4\" \/>\n<meta name=\"twitter:label1\" content=\"\uae00\uc4f4\uc774\" \/>\n\t<meta name=\"twitter:data1\" content=\"root\" \/>\n\t<meta name=\"twitter:label2\" content=\"\uc608\uc0c1 \ub418\ub294 \ud310\ub3c5 \uc2dc\uac04\" \/>\n\t<meta name=\"twitter:data2\" content=\"5\ubd84\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/atmokpo.com\/w\/36247\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/atmokpo.com\/w\/36247\/\"},\"author\":{\"name\":\"root\",\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/person\/91b6b3b138fbba0efb4ae64b1abd81d7\"},\"headline\":\"Hugging Face Transformers Practical Course, Learning and Validation Dataset Split\",\"datePublished\":\"2024-11-01T09:46:59+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/atmokpo.com\/w\/36247\/\"},\"wordCount\":641,\"publisher\":{\"@id\":\"https:\/\/atmokpo.com\/w\/#organization\"},\"articleSection\":[\"Using Hugging Face\"],\"inLanguage\":\"ko-KR\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/atmokpo.com\/w\/36247\/\",\"url\":\"https:\/\/atmokpo.com\/w\/36247\/\",\"name\":\"Hugging Face Transformers Practical Course, Learning and Validation Dataset Split - 
\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\",\"isPartOf\":{\"@id\":\"https:\/\/atmokpo.com\/w\/#website\"},\"datePublished\":\"2024-11-01T09:46:59+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/atmokpo.com\/w\/36247\/#breadcrumb\"},\"inLanguage\":\"ko-KR\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/atmokpo.com\/w\/36247\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/atmokpo.com\/w\/36247\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"\ud648\",\"item\":\"https:\/\/atmokpo.com\/w\/en\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Hugging Face Transformers Practical Course, Learning and Validation Dataset Split\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/atmokpo.com\/w\/#website\",\"url\":\"https:\/\/atmokpo.com\/w\/\",\"name\":\"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/atmokpo.com\/w\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/atmokpo.com\/w\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"ko-KR\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/atmokpo.com\/w\/#organization\",\"name\":\"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\",\"url\":\"https:\/\/atmokpo.com\/w\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/atmokpo.com\/w\/wp-content\/uploads\/2024\/11\/logo.png\",\"contentUrl\":\"https:\/\/atmokpo.com\/w\/wp-content\/uploads\/2024\/11\/logo.png\",\"width\":400,\"height\":400,\"caption\":\"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8\"},\"image\":{\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/bebubo4\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/person\/91b6b3b138fbba0e
fb4ae64b1abd81d7\",\"name\":\"root\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"ko-KR\",\"@id\":\"https:\/\/atmokpo.com\/w\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/708197b41fc6435a7ce22d951b25d4a47e9e904270cb1f04682d4f025066f80c?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/708197b41fc6435a7ce22d951b25d4a47e9e904270cb1f04682d4f025066f80c?s=96&d=mm&r=g\",\"caption\":\"root\"},\"sameAs\":[\"http:\/\/atmokpo.com\/w\"],\"url\":\"https:\/\/atmokpo.com\/w\/author\/root\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Hugging Face Transformers Practical Course, Learning and Validation Dataset Split - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/atmokpo.com\/w\/36247\/","og_locale":"ko_KR","og_type":"article","og_title":"Hugging Face Transformers Practical Course, Learning and Validation Dataset Split - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","og_description":"The importance of Natural Language Processing (NLP) in the fields of Artificial Intelligence (AI) and Machine Learning is increasing day by day. At the center of this is the Hugging Face Transformer library. This library makes it easy to use various NLP models, especially with the advantage of being able to easily apply pre-trained models. 
&hellip; \ub354 \ubcf4\uae30 \"Hugging Face Transformers Practical Course, Learning and Validation Dataset Split\"","og_url":"https:\/\/atmokpo.com\/w\/36247\/","og_site_name":"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","article_published_time":"2024-11-01T09:46:59+00:00","author":"root","twitter_card":"summary_large_image","twitter_creator":"@bebubo4","twitter_site":"@bebubo4","twitter_misc":{"\uae00\uc4f4\uc774":"root","\uc608\uc0c1 \ub418\ub294 \ud310\ub3c5 \uc2dc\uac04":"5\ubd84"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/atmokpo.com\/w\/36247\/#article","isPartOf":{"@id":"https:\/\/atmokpo.com\/w\/36247\/"},"author":{"name":"root","@id":"https:\/\/atmokpo.com\/w\/#\/schema\/person\/91b6b3b138fbba0efb4ae64b1abd81d7"},"headline":"Hugging Face Transformers Practical Course, Learning and Validation Dataset Split","datePublished":"2024-11-01T09:46:59+00:00","mainEntityOfPage":{"@id":"https:\/\/atmokpo.com\/w\/36247\/"},"wordCount":641,"publisher":{"@id":"https:\/\/atmokpo.com\/w\/#organization"},"articleSection":["Using Hugging Face"],"inLanguage":"ko-KR"},{"@type":"WebPage","@id":"https:\/\/atmokpo.com\/w\/36247\/","url":"https:\/\/atmokpo.com\/w\/36247\/","name":"Hugging Face Transformers Practical Course, Learning and Validation Dataset Split - \ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","isPartOf":{"@id":"https:\/\/atmokpo.com\/w\/#website"},"datePublished":"2024-11-01T09:46:59+00:00","breadcrumb":{"@id":"https:\/\/atmokpo.com\/w\/36247\/#breadcrumb"},"inLanguage":"ko-KR","potentialAction":[{"@type":"ReadAction","target":["https:\/\/atmokpo.com\/w\/36247\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/atmokpo.com\/w\/36247\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"\ud648","item":"https:\/\/atmokpo.com\/w\/en\/"},{"@type":"ListItem","position":2,"name":"Hugging Face Transformers Practical Course, Learning and Validation Dataset 
Split"}]},{"@type":"WebSite","@id":"https:\/\/atmokpo.com\/w\/#website","url":"https:\/\/atmokpo.com\/w\/","name":"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","description":"","publisher":{"@id":"https:\/\/atmokpo.com\/w\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/atmokpo.com\/w\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"ko-KR"},{"@type":"Organization","@id":"https:\/\/atmokpo.com\/w\/#organization","name":"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8","url":"https:\/\/atmokpo.com\/w\/","logo":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/atmokpo.com\/w\/#\/schema\/logo\/image\/","url":"https:\/\/atmokpo.com\/w\/wp-content\/uploads\/2024\/11\/logo.png","contentUrl":"https:\/\/atmokpo.com\/w\/wp-content\/uploads\/2024\/11\/logo.png","width":400,"height":400,"caption":"\ub77c\uc774\ube0c\uc2a4\ub9c8\ud2b8"},"image":{"@id":"https:\/\/atmokpo.com\/w\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/bebubo4"]},{"@type":"Person","@id":"https:\/\/atmokpo.com\/w\/#\/schema\/person\/91b6b3b138fbba0efb4ae64b1abd81d7","name":"root","image":{"@type":"ImageObject","inLanguage":"ko-KR","@id":"https:\/\/atmokpo.com\/w\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/708197b41fc6435a7ce22d951b25d4a47e9e904270cb1f04682d4f025066f80c?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/708197b41fc6435a7ce22d951b25d4a47e9e904270cb1f04682d4f025066f80c?s=96&d=mm&r=g","caption":"root"},"sameAs":["http:\/\/atmokpo.com\/w"],"url":"https:\/\/atmokpo.com\/w\/author\/root\/"}]}},"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack-related-posts":[],"_links":{"self":[{"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts\/36247","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts"}],"about":[{"hre
f":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/comments?post=36247"}],"version-history":[{"count":1,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts\/36247\/revisions"}],"predecessor-version":[{"id":36248,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/posts\/36247\/revisions\/36248"}],"wp:attachment":[{"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/media?parent=36247"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/categories?post=36247"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/atmokpo.com\/w\/wp-json\/wp\/v2\/tags?post=36247"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}