Using Hugging Face Transformers, Creating a Dataset Class

In this course, we will cover how to create a dataset class for use with the Hugging Face Transformers library. Hugging Face Transformers is one of the essential tools for natural language processing (NLP), providing access to a wide range of pre-trained models and datasets. To fine-tune these models on your own data, it is important to know how to build custom datasets.

1. What are Hugging Face Transformers?

The Hugging Face Transformers library is one of the most widely used libraries in natural language processing, covering state-of-the-art models such as BERT, GPT-2, and T5. It makes model training, fine-tuning, and prediction straightforward to implement.

2. What is a Dataset Class?

A dataset class defines the structure of the data used for model training and evaluation, and makes it easy to load and preprocess custom data. Alongside Transformers, Hugging Face also provides the datasets library for convenient data handling.

3. How to Create a Dataset Class

In this section, we will discuss how to create a dataset class using Python. Specifically, we will explain how to inherit from the torch.utils.data.Dataset class to create a custom dataset class and integrate it with Hugging Face Transformers.

3.1 Getting Started

First, install the required libraries. Use the command below to install the transformers, datasets, and torch packages.

!pip install transformers datasets torch

3.2 Creating a Custom Dataset Class

Here, we will show you how to create a dataset class.

import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=512):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]

        # Tokenization and index conversion
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

The above class takes texts and labels as input, tokenizes each text, and returns the input IDs, attention mask, and label as tensors. It inherits from the torch.utils.data.Dataset class and implements the __len__ and __getitem__ methods.

3.3 Using the Dataset

Now let’s look at how to use the custom dataset. Here’s an example of how to prepare data and create a data loader.

from transformers import AutoTokenizer

# Prepare the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Prepare the data
texts = ["Hello, how are you?", "I am fine, thank you."]
labels = [0, 1] # Example labels

# Create a dataset instance
dataset = MyDataset(texts, labels, tokenizer)

# Create a data loader
from torch.utils.data import DataLoader

dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

for batch in dataloader:
    print(batch)

The above code creates a small dataset and wraps it in a data loader that returns the data in batches. Because shuffle=True is set, samples are drawn in a random order during each training epoch.
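
To see how these batch dictionaries feed a model, here is a minimal sketch (not part of the original example; it assumes bert-base-uncased with two labels, matching the binary labels above):

from transformers import AutoModelForSequenceClassification

# Load a BERT model with a two-class classification head
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Take one batch from the data loader and run a forward pass
batch = next(iter(dataloader))
outputs = model(input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                labels=batch['labels'])

print(outputs.loss)          # cross-entropy loss for this batch
print(outputs.logits.shape)  # (batch_size, num_labels)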

4. Extending the Dataset Class

Next, let's look at how to extend the custom dataset class with additional features, for example extra data preprocessing steps or support for multiple input formats.

4.1 Data Preprocessing

Data preprocessing is a crucial step in improving model performance. If necessary, you can add a preprocessing method to the dataset class, for example:

def preprocess(self, text):
    # Add preprocessing logic here (e.g., lowercasing and stripping whitespace)
    return text.lower().strip()

You can call this method in __getitem__ to perform preprocessing before returning the data.
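
For example, here is a minimal sketch of such an extension (MyCleanDataset is a hypothetical name and the cleaning logic is only illustrative):

import torch

class MyCleanDataset(MyDataset):
    """Extends MyDataset so that each text is cleaned before tokenization."""

    def preprocess(self, text):
        # Basic cleaning: lowercase and strip surrounding whitespace
        return text.lower().strip()

    def __getitem__(self, idx):
        text = self.preprocess(self.texts[idx])  # clean before tokenizing
        label = self.labels[idx]

        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }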

4.2 Handling Multiple Input Formats

If the dataset needs to handle various input formats, you can use conditional statements to process them differently. Just add conditions based on the format of the input text.

# Inside __getitem__ (or in preprocess), normalize list inputs into a single string
if isinstance(text, list):
    text = " ".join(text)  # Join a list of text segments into one string

5. Conclusion

In this course, we learned how to create and use dataset classes in Hugging Face Transformers. Custom datasets are essential elements in training and evaluating models. Through this, we can efficiently process various formatted data and train models in our desired manner.

In the future, make sure to utilize Hugging Face to solve more natural language processing problems. Also, try creating your own dataset class to build your skills. Thank you!

How to Use Hugging Face Transformers, Preparing Datasets

The world of deep learning and natural language processing (NLP) is rapidly evolving, and within it, the Hugging Face Transformers library has become an essential tool for many researchers and developers. In this article, we will detail how to prepare a dataset using the Hugging Face Transformers library. Dataset preparation is the first step in model training, and high-quality data is crucial for achieving good results.

1. What is Hugging Face Transformers?

The Transformers library from Hugging Face is an open-source library designed to make natural language processing models easy to use. It provides a variety of pre-trained models and datasets, giving researchers a foundation to design and experiment with new models. A major advantage is that it offers free, open-source access to state-of-the-art NLP models.

2. The Importance of Dataset Preparation

The performance of a model largely depends on the quality of the dataset. A well-structured dataset facilitates the training process of the model, and the diversity and quantity of the data significantly affect the model’s ability to generalize. Therefore, during the dataset preparation phase, the following considerations should be made:

  • Data Quality: It is important to use data with minimal duplicates and noise.
  • Data Diversity: The model must include various situations and cases to perform well in real-world environments.
  • Data Size: The more data available, the higher the model’s ability to generalize during training.

3. Downloading and Preparing the Dataset

Hugging Face provides various public datasets. Using these datasets allows for easy access to the data needed for model training. Now, let’s look at how to load and preprocess the dataset.

3.1. Installing the Hugging Face Datasets Library

First, you need to install the Datasets library from Hugging Face:

pip install datasets

3.2. Loading the Dataset

Now, let’s learn how to load Hugging Face datasets in Python. For example, we will use the IMDB movie reviews dataset.

from datasets import load_dataset

# Load IMDB dataset
dataset = load_dataset("imdb")

print(dataset)

Running the above code will load the dataset split into training and test sets. Next, here is how to check the structure of the dataset:

# Print the first item of the dataset
print(dataset['train'][0])

3.3. Preprocessing the Dataset

After loading the dataset, it needs to be preprocessed into a format suitable for model training. The preprocessing process mainly includes data cleaning, tokenization, and padding.

In the case of the IMDB dataset, each review is in text format and has a positive or negative label. To feed this data into the model, the text needs to be tokenized and converted into numerical token IDs.

from transformers import AutoTokenizer

# Load tokenizer for BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True)

# Apply preprocessing
tokenized_datasets = dataset['train'].map(preprocess_function, batched=True)

The code above tokenizes the data according to the BERT model. The truncation=True parameter ensures that if the input data exceeds the maximum token length, it will be truncated. Through this process, each review is converted into a format understandable by the model.
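
Since the map call above only truncates, the tokenized sequences still have different lengths. One common way to handle padding (a sketch, not part of the original text) is to pad dynamically per batch with DataCollatorWithPadding:

from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

# Pad each batch to the length of its longest sequence at loading time
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Keep only the columns the model expects and return PyTorch tensors
train_data = tokenized_datasets.remove_columns(["text"]).rename_column("label", "labels")
train_data.set_format("torch")

train_loader = DataLoader(train_data, batch_size=16, shuffle=True, collate_fn=data_collator)

batch = next(iter(train_loader))
print(batch["input_ids"].shape)  # (16, longest sequence length in this batch)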

3.4. Reviewing the Dataset

After completing the preprocessing steps, let’s review the dataset. We can check how it has been transformed:

# Print the first item of the transformed dataset
print(tokenized_datasets[0])

4. Splitting and Saving the Dataset

Before starting actual model training, it is essential to split the data into training and validation sets. This allows for setting a basis to evaluate the model’s generalization performance.

train_test_split = dataset['train'].train_test_split(test_size=0.2)
train_dataset = train_test_split['train']
test_dataset = train_test_split['test']

# Save datasets
train_dataset.save_to_disk("train_dataset")
test_dataset.save_to_disk("test_dataset")

The code above holds out 20% of the training data as a validation set (stored here in test_dataset) and saves the two splits to disk separately.
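
When you come back to this data in a later session, the saved splits can be reloaded with load_from_disk (a short sketch using the directory names saved above):

from datasets import load_from_disk

# Reload the splits saved earlier
train_dataset = load_from_disk("train_dataset")
test_dataset = load_from_disk("test_dataset")

print(train_dataset)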

5. Examples of the Dataset

Now we are ready to proceed with training using the dataset we created. Here are some examples from the prepared IMDB dataset:

  • "This movie is great." → Positive
  • "This movie is really terrible." → Negative

Through these examples, the model will learn to distinguish between positive and negative reviews. Additionally, since tokenization is completed during the preprocessing phase, it can be directly used for model training.

6. Conclusion

In this article, we explored the overall process of preparing a dataset using the Hugging Face Transformers library. Data preparation is a foundational step in training deep learning models, emphasizing the importance of assembling high-quality datasets. Future posts will cover the process of training an actual model using the prepared dataset.

In line with advancements in deep learning and NLP, Hugging Face will make your dataset preparation process much easier. Through continuous learning and experimentation, we encourage you to develop your own model.

Using Hugging Face Transformers, Loading Pre-trained BERT Model for Multi-class Classification

BERT (Bidirectional Encoder Representations from Transformers) is a natural language processing model proposed by Google that utilizes a bidirectional Transformer architecture for contextual understanding. BERT can be applied to various natural language processing tasks through pre-training and fine-tuning stages. In this tutorial, we will introduce how to load a pre-trained BERT model using the Hugging Face Transformers library to solve a multi-class classification problem.

1. Environment Setup

This tutorial requires the following libraries:

  • transformers
  • torch (PyTorch)
  • numpy
  • pandas

You can install the required libraries using the following command:

!pip install transformers torch numpy pandas

2. Preparing the Data

First, we need to prepare a dataset for the multi-class classification problem. As an example, let’s create a simple dataframe with text and labels.

import pandas as pd

data = {
    'text': [
        'I like natural language processing.',
        'PyTorch and TensorFlow are popular.',
        'Deep learning is a field of machine learning.',
        'Conversational AI is gaining a lot of attention.',
        'Text classification is an important task.'
    ],
    'label': [0, 1, 1, 2, 0]
}

df = pd.DataFrame(data)

3. Data Preprocessing

Prepare the data in the format required by the BERT model. We use the BERT Tokenizer to tokenize the text and generate input IDs and attention masks.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenization and generating input IDs and attention masks
def encode_data(text):
    return tokenizer(text, padding='max_length', truncation=True, max_length=128, return_tensors='pt')

# Encode each text once, then collect input IDs and attention masks
encodings = [encode_data(text) for text in df['text']]
encoded_texts = [enc['input_ids'] for enc in encodings]
attention_masks = [enc['attention_mask'] for enc in encodings]

4. Splitting the Dataset

We split the data into training and validation sets. Here, we will use 80% of the data for training and 20% for validation.

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    df['text'],
    df['label'],
    test_size=0.2,
    random_state=42
)

5. Creating Data Loaders

Using PyTorch’s DataLoader, we create data loaders for batch processing.

import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = encode_data(self.texts[idx])
        return {
            'input_ids': text['input_ids'].squeeze(),
            'attention_mask': text['attention_mask'].squeeze(),
            'labels': torch.tensor(self.labels[idx])
        }

train_dataset = TextDataset(X_train.tolist(), y_train.tolist())
val_dataset = TextDataset(X_val.tolist(), y_val.tolist())

train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=2, shuffle=False)

6. Loading the Model

Load the pre-trained BERT model from Hugging Face's Transformers library. Passing num_labels=3 adds a freshly initialized classification head with three output classes on top of the pre-trained encoder, which is what we need for this multi-class problem.

from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)

7. Training the Model

To train the model, we set up a loss function and optimization algorithm, and create a simple training loop.

from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch implementation
from tqdm import tqdm

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

optimizer = AdamW(model.parameters(), lr=1e-5)

# Model training
model.train()
for epoch in range(3):  # Number of epochs
    total_loss = 0.0
    for batch in tqdm(train_loader):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    print(f'Epoch {epoch+1} Average Loss: {total_loss / len(train_loader):.4f}')

8. Validation and Performance Evaluation

We evaluate the model’s performance using the validation data. Here we measure the accuracy.

model.eval()
correct = 0
total = 0

with torch.no_grad():
    for batch in val_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids, attention_mask=attention_mask)
        _, predicted = torch.max(outputs.logits, dim=1)

        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = correct / total
print(f'Accuracy: {accuracy:.2f}') 
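
After evaluation, the fine-tuned model can also be used to classify new sentences. The snippet below is a minimal sketch (the example sentence is illustrative):

# Classify a new sentence with the fine-tuned model
model.eval()
sample = "Natural language processing is an interesting field."
inputs = tokenizer(sample, padding='max_length', truncation=True, max_length=128, return_tensors='pt')
inputs = {key: value.to(device) for key, value in inputs.items()}

with torch.no_grad():
    logits = model(**inputs).logits

predicted_class = torch.argmax(logits, dim=1).item()
print(f'Predicted class: {predicted_class}')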

9. Conclusion

In this tutorial, we learned how to utilize a pre-trained BERT model for multi-class classification problems using the Hugging Face Transformers library. BERT demonstrates powerful performance, making it applicable to many natural language processing problems you may want to analyze. In real projects, you should achieve optimal results through various experiments and tuning processes. Transformer models are rapidly advancing, so continuous learning is necessary.

If you have any further questions, feel free to ask!

Hugging Face Transformers Practical Course, Google Colab Environment Setup

With the advances in deep learning and natural language processing (NLP), efficient and powerful transformer models have emerged. One of the key tools for working with them is the Hugging Face Transformers library. In this course, we will explain how to use Hugging Face's Transformers library in the Google Colab environment, along with basic examples and practical code.

1. What is Hugging Face Transformers?

The Hugging Face Transformers library is an open-source library that provides various state-of-the-art natural language processing (NLP) models. Models such as BERT, GPT-2, RoBERTa, and T5 can be easily used, and these models are pre-trained, allowing for high performance even with limited data. This library supports two deep learning frameworks: PyTorch and TensorFlow.

2. Overview of Google Colab

Google Colaboratory is a cloud-based Jupyter notebook service. It provides free GPU resources, making it a very useful environment for training and executing deep learning models. Through this course, we will learn how to use Hugging Face’s Transformers library by leveraging Google Colab.

3. Setting Up Google Colab Environment

3.1 Accessing Google Colab

To access Google Colab, visit https://colab.research.google.com in your web browser. Logging in with your Google account will bring up a screen where you can create a new notebook.

3.2 Creating a New Notebook

Click the ‘New Notebook’ button in the upper right corner to create a new Jupyter notebook. Set a name for the notebook to distinguish your work.

3.3 Setting Runtime Type

Google Colab allows you to train models using a GPU. To do this, select Runtime -> Change runtime type from the top menu. Choose ‘GPU’ under ‘Hardware accelerator’ and then click the Save button.
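
To confirm that the GPU runtime is active and visible from the notebook, you can run a quick check (PyTorch comes preinstalled in Colab):

import torch

# True if the selected GPU runtime is visible to PyTorch
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))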

4. Installing Hugging Face Transformers Library

Now, we need to install the Hugging Face Transformers library in the Colab environment. Enter and run the code below to install the library.

!pip install transformers

5. Basic Usage Example

Once the installation is complete, we will perform a text classification task using Hugging Face’s Transformers library.

5.1 Importing the Library and Initializing the Model

import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Initialize BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

5.2 Tokenizing Input Sentence and Performing Prediction

Define the sentence to be input into the model, tokenize it, and perform the prediction.

# Define input sentence
input_sentence = "I love programming with Python!"

# Tokenize the sentence
inputs = tokenizer(input_sentence, return_tensors="pt")

# Model prediction
with torch.no_grad():
    logits = model(**inputs).logits

# Output prediction results
predicted_class = torch.argmax(logits, dim=1)
print(f"Predicted class: {predicted_class.item()}")

6. Conclusion

In this course, we learned how to use Hugging Face’s Transformers library in Google Colab. We went through all the processes from setting up the environment to basic text classification examples, learning how to easily utilize NLP models. The Hugging Face Transformers library offers many other features, allowing for various projects to be undertaken based on it.

Transformers Course by Hugging Face, Sentiment Analysis

In this article, we will learn how to perform sentiment analysis using the Hugging Face Transformers library, which is frequently used in Natural Language Processing (NLP). Sentiment analysis is a technique for extracting emotions or sentiments from text data and is widely used in various fields.

1. What are Hugging Face Transformers?

The Hugging Face Transformers library is a Python library that allows easy access to various pre-trained Natural Language Processing models. It supports multiple types of models, including BERT, GPT-2, T5, and is especially easy to fine-tune, enabling the adjustment of models for various tasks.

2. Overview of Sentiment Analysis

Sentiment analysis primarily includes tasks such as:

  • The overall emotional state of a document (positive, negative, neutral)
  • Detailed sentiments of product reviews
  • Tracking emotions in social media posts

Sentiment analysis can be implemented using machine learning and deep learning techniques, and the quality and quantity of training data greatly influence the results.

3. Setting Up the Environment

We will install the necessary libraries to proceed with this tutorial. Use the following command to install the transformers and torch libraries.

pip install transformers torch

4. Preparing the Dataset

We will use the famous IMDb Movie Review Dataset as our dataset for sentiment analysis. This dataset contains positive and negative reviews about movies.

from datasets import load_dataset

# Load the IMDb movie review dataset (25,000 labeled training reviews)
data = load_dataset('imdb', split='train')
texts, labels = data['text'], data['label']

5. Data Preprocessing

We will organize the data to prepare it for input into the model. In this dataset the labels are already encoded as integers (0 for negative, 1 for positive), so no label conversion is needed.

import pandas as pd

df = pd.DataFrame({'text': texts, 'label': labels})
# Labels are already integers: 0 = negative, 1 = positive
# Optionally subsample to keep the demo fast, e.g. df = df.sample(n=2000, random_state=42)
texts = df['text'].tolist()
labels = df['label'].tolist()

6. Loading the Model

We will load a pre-trained BERT checkpoint for sentiment analysis. Note that the nlptown model used below was originally trained to predict 1-5 star ratings (five output classes); the fine-tuning that follows repurposes classes 0 and 1 as negative and positive. We will also tokenize the text and convert it into a format suitable for model input.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'nlptown/bert-base-multilingual-uncased-sentiment'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

7. Text Tokenization

We tokenize the text so that it can be input into the model. This process involves transforming each review into an appropriate format for the model.

encodings = tokenizer(texts, truncation=True, padding=True, max_length=128, return_tensors="pt")

8. Building the Dataset and DataLoader

To train the model, we need to perform fine-tuning on the given data. Now, we will set up the dataset using PyTorch’s data loader.

import torch
from torch.utils.data import DataLoader, Dataset

class SentimentDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

dataset = SentimentDataset(encodings, labels)
train_loader = DataLoader(dataset, batch_size=16, shuffle=True)

9. Model Training

We define the loss function and optimizer to train the model, proceeding with training over multiple epochs.

from torch.optim import AdamW  # transformers.AdamW is deprecated; use the PyTorch implementation

optimizer = AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):
    for batch in train_loader:
        optimizer.zero_grad()
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        print(f'Epoch: {epoch}, Loss: {loss.item()}')

10. Model Evaluation

We will evaluate the model to check its performance by measuring accuracy. For simplicity, the loop below reuses the training loader; in a real project you should evaluate on a held-out validation set that the model did not see during training.

model.eval()
correct = 0
total = 0

with torch.no_grad():
    for batch in train_loader:
        outputs = model(**batch)
        predictions = outputs.logits.argmax(dim=-1)
        correct += (predictions == batch['labels']).sum().item()
        total += batch['labels'].size(0)

accuracy = correct / total
print(f'Accuracy: {accuracy}')

11. Making Predictions

Once the model is trained, we can make predictions on new data. Below is an example code to make actual predictions.

def predict_sentiment(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    prediction = outputs.logits.argmax(dim=-1)
    return 'Positive' if prediction.item() == 1 else 'Negative'

test_text = "This movie was really enjoyable!"
print(f'Prediction: {predict_sentiment(test_text)}')

12. Conclusion

In this article, we explored the entire process of performing sentiment analysis using the Hugging Face Transformers library. Through fine-tuning the model and predicting real data, we were able to verify the potential applications of deep learning models. We can expect to apply Hugging Face Transformers to various Natural Language Processing tasks in the future.
