Hugging Face Transformers Practical Course: Training and Validation Dataset Split

Natural Language Processing (NLP) plays an increasingly important role in Artificial Intelligence (AI) and Machine Learning, and the Hugging Face Transformers library sits at the center of this trend. The library makes a wide range of NLP models easy to use, in particular because pre-trained models can be applied with very little code. In this course, we will show you how to split a dataset into training and validation sets when working with the Hugging Face Transformers library.

1. Preparing the Dataset

The first step is to prepare the dataset. Labeled data is generally required to solve supervised NLP problems. In this example, we will use the IMDb Movie Reviews Dataset to train a model that classifies reviews as positive or negative. This widely used dataset consists of movie review texts and their corresponding labels (positive/negative).

1.1 Downloading the Dataset

python
from datasets import load_dataset

dataset = load_dataset("imdb")

You can download the IMDb dataset with the code above. The load_dataset function from the Hugging Face datasets library lets you download many public datasets with a single call.
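
If you only need one split, load_dataset also accepts a split argument; the snippet below is a minimal sketch that loads just the training split as a single Dataset instead of a DatasetDict.

python
from datasets import load_dataset

# Load only the training split as a single Dataset object
train_only = load_dataset("imdb", split="train")
print(train_only)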

1.2 Checking the Dataset Structure

python
print(dataset)

You can check the structure of the downloaded dataset. The IMDb dataset is divided into a training (train) split, a test (test) split, and an unlabeled (unsupervised) split; it does not come with a separate validation split, which is why we will create one ourselves in the next section.
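
As a quick sanity check, you can also iterate over the DatasetDict and print each split's size and columns; the row counts mentioned in the comment follow the IMDb dataset card, so verify them against your own download.

python
# Inspect the available splits, their sizes, and their columns
for split_name, split in dataset.items():
    print(split_name, len(split), split.column_names)
# The dataset card lists 25,000 rows each for train and test,
# plus 50,000 unlabeled rows in the 'unsupervised' split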

2. Splitting the Dataset

In machine learning, it is important to split the data into several parts before training a model. Typically, the data is divided into training data, used to fit the model, and validation data, used to evaluate its performance during development. Since the IMDb dataset has no ready-made validation split, we will extract a portion of the training data to use as validation data; the sketch below shows how the datasets library can do this directly, while the rest of this course uses scikit-learn as in the next subsection.
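
A minimal sketch of the datasets-native alternative, using the train_test_split method that Dataset objects provide:

python
# Alternative: split within the datasets library itself
split = dataset["train"].train_test_split(test_size=0.1, seed=42)
train_ds = split["train"]  # 90% of the original training data
val_ds = split["test"]     # the remaining 10%, used as a validation set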

2.1 Splitting Training and Validation Data

python
from sklearn.model_selection import train_test_split

train_data = dataset['train']
train_texts = train_data['text']
train_labels = train_data['label']

train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_texts,
    train_labels,
    test_size=0.1,  # Using 10% as validation set
    random_state=42
)

The above code uses the train_test_split function to split the training data 90/10. Because test_size=0.1 is set, 10% of the original training data becomes the validation data, and random_state fixes the random seed so the split is reproducible. If preserving the class balance matters, you can additionally pass the stratify parameter, as sketched below.
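
A minimal sketch of the stratified variant; it only adds the stratify parameter to the call above and can be used in its place:

python
# Stratified variant: both subsets keep roughly the same label distribution
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_texts,
    train_labels,
    test_size=0.1,
    random_state=42,
    stratify=train_labels,
)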

2.2 Checking the Split Data

python
print("Number of training samples:", len(train_texts))
print("Number of validation samples:", len(val_texts))

You can now check the number of training and validation samples, which confirms that the data was split in the expected 90/10 proportions.
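
With the standard 25,000-review IMDb training split and test_size=0.1, you should see 22,500 training and 2,500 validation samples; a quick sanity check that nothing was dropped:

python
# The two subsets together should account for every original training example
assert len(train_texts) + len(val_texts) == len(train_data)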

3. Preparing the Hugging Face Transformer Model

After splitting the dataset, we need to prepare the model. Hugging Face's Transformers library provides a variety of pre-trained models, so we can choose one that suits our task.

3.1 Selecting a Pre-trained Model

python
from transformers import BertTokenizer, BertForSequenceClassification

model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

We load the tokenizer and model with BertTokenizer and BertForSequenceClassification, using the pre-trained checkpoint "bert-base-uncased". BertForSequenceClassification places a classification head on top of BERT, which makes it suitable for text classification tasks. The same checkpoint can also be loaded through the generic Auto classes, as sketched below.
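
A minimal sketch using AutoTokenizer and AutoModelForSequenceClassification, which resolve the correct architecture from the checkpoint name:

python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Same checkpoint as above, loaded via the model-agnostic Auto classes;
# num_labels=2 makes the binary classification head explicit
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)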

3.2 Tokenizing the Data

python
# Tokenize once up front; truncation and padding yield fixed-size PyTorch tensors
train_encodings = tokenizer(train_texts, truncation=True, padding=True, return_tensors='pt')
val_encodings = tokenizer(val_texts, truncation=True, padding=True, return_tensors='pt')

We tokenize the training and validation data with the tokenizer. truncation=True cuts off inputs that exceed the model's maximum sequence length (512 tokens for BERT), and padding=True pads all inputs to the same length so they can be batched together. You can inspect the resulting tensors as sketched below.
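
For a BERT tokenizer the encodings typically contain input_ids, token_type_ids, and attention_mask; a quick inspection:

python
# The encodings behave like a dict of tensors; each row corresponds to one review
print(train_encodings.keys())
print(train_encodings['input_ids'].shape)  # (number of training samples, padded sequence length)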

4. Training the Model

To train the model, we feed the data to it in batches using PyTorch's DataLoader. We also need to choose an optimizer; the loss itself is computed inside BertForSequenceClassification whenever labels are passed to the forward call.

4.1 Preparing the Data Loader

python
import torch
from torch.utils.data import DataLoader, Dataset

class IMDbDataset(Dataset):
    """Wraps the tokenizer output and labels so a DataLoader can batch them."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Slice one example out of each pre-computed tensor and attach its label
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)

We define a small dataset class that inherits from PyTorch's Dataset and wrap it in DataLoader objects for batch processing, using a batch size of 16; only the training loader shuffles the data. You can sanity-check one batch as sketched below.
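
A minimal sketch that pulls a single batch and prints the tensor shapes:

python
# Grab one batch and print the shape of every tensor it contains
batch = next(iter(train_loader))
print({key: value.shape for key, value in batch.items()})
# Expect (16, padded sequence length) for the token tensors and (16,) for the labels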

4.2 Setting Up Model Training

python
from torch.optim import AdamW  # the AdamW formerly exported by transformers is deprecated

# Move the model to a GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

optimizer = AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):  # Total of 3 epochs
    total_loss = 0
    for batch in train_loader:
        # Move the batch to the same device as the model
        batch = {key: value.to(device) for key, value in batch.items()}
        optimizer.zero_grad()
        outputs = model(**batch)  # the loss is computed internally because 'labels' is present
        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
    print(f"Epoch: {epoch + 1}, Loss: {total_loss / len(train_loader)}")

We train the model with the AdamW optimizer from torch.optim (the AdamW class that used to be exported by transformers is deprecated). The average loss over the training batches is printed at the end of each epoch; in this example we train for 3 epochs. After training, you can save the fine-tuned weights as sketched below.
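
A minimal sketch using the standard save_pretrained method; the directory name "imdb-bert-finetuned" is just an example:

python
# Persist the fine-tuned model and its tokenizer for later reuse
model.save_pretrained("imdb-bert-finetuned")
tokenizer.save_pretrained("imdb-bert-finetuned")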

5. Evaluating the Model

After training the model, we need to evaluate its performance on the validation data. This tells us how well the model generalizes to data it did not see during training.

5.1 Defining the Model Evaluation Function

python
from sklearn.metrics import accuracy_score

def evaluate_model(model, val_loader):
    model.eval()
    device = next(model.parameters()).device  # evaluate on the same device the model lives on
    all_labels = []
    all_preds = []

    with torch.no_grad():
        for batch in val_loader:
            # Move the batch to the model's device before the forward pass
            batch = {key: value.to(device) for key, value in batch.items()}
            outputs = model(**batch)
            preds = outputs.logits.argmax(dim=-1)
            all_labels.extend(batch['labels'].tolist())
            all_preds.extend(preds.tolist())

    accuracy = accuracy_score(all_labels, all_preds)
    return accuracy

accuracy = evaluate_model(model, val_loader)
print("Validation Accuracy:", accuracy)

We define the evaluate_model function to assess the model's performance and print its accuracy on the validation data. Accuracy is often complemented by per-class precision, recall, and F1, as sketched below.
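
A minimal sketch using scikit-learn's classification_report, assuming evaluate_model has been modified to also return the collected all_labels and all_preds lists:

python
from sklearn.metrics import classification_report

# Assumes evaluate_model was extended to return (accuracy, all_labels, all_preds)
accuracy, all_labels, all_preds = evaluate_model(model, val_loader)
print(classification_report(all_labels, all_preds, target_names=["negative", "positive"]))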

6. Conclusion

In this course, we learned how to work with the IMDb movie reviews dataset using Hugging Face's Transformers library. We walked through the entire process of splitting the data into training and validation sets, training the model, and evaluating its performance. We hope this gave you a solid grounding in the basics of NLP with Transformers; the same techniques can be applied to many other language models to achieve even better results.