The importance of Natural Language Processing (NLP) within Artificial Intelligence (AI) and Machine Learning grows by the day, and the Hugging Face Transformers library sits at the center of this trend. The library makes it easy to work with a wide range of NLP models, especially because pre-trained models can be applied with very little code. In this course, we will show you how to split a dataset into training and validation sets using the Hugging Face ecosystem.
1. Preparing the Dataset
The first step is to prepare the dataset to be used. Generally, a labeled dataset is required to solve NLP problems. In this example, we will use the IMDb Movie Reviews Dataset to train a model that classifies positive and negative reviews. This dataset is widely used and consists of the text of movie reviews and their corresponding labels (positive/negative).
1.1 Downloading the Dataset
python
from datasets import load_dataset
dataset = load_dataset("imdb")
You can download the IMDb dataset using the code above. The load_dataset function is provided by the Hugging Face datasets library and lets you easily download a wide range of public datasets.
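If you only want to run a quick experiment before training on the full dataset, the datasets library also supports split slicing. The snippet below is a small optional sketch and is not needed for the rest of the course.
python
# Optional: load only the first 10% of the training split for fast experiments.
from datasets import load_dataset

small_train = load_dataset("imdb", split="train[:10%]")
print(small_train)  # a Dataset object with roughly 2,500 rows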
1.2 Checking the Dataset Structure
python
print(dataset)
You can check the structure of the downloaded dataset. The IMDb dataset is divided into training (train), test (test), and unlabeled (unsupervised) splits; it does not ship with a separate validation split, which is why we will carve one out of the training data in the next step.
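For a closer look, you can list the split names and print a single raw example. This is a small optional check to confirm the fields we will rely on later.
python
# Optional inspection of the dataset structure.
print(dataset.keys())       # expected: dict_keys(['train', 'test', 'unsupervised'])
print(dataset['train'][0])  # a dict with 'text' (the review) and 'label' (0 or 1)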
2. Splitting the Dataset
In machine learning, it is important to split the data into several parts when training a model. Typically, the data is divided into training and validation sets: the training data is used to fit the model, and the validation data is used to evaluate its performance. Since the IMDb dataset has no validation split, we will extract a portion of the training data to use as validation data.
2.1 Splitting Training and Validation Data
python
from sklearn.model_selection import train_test_split

train_data = dataset['train']
train_texts = train_data['text']
train_labels = train_data['label']

train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_texts,
    train_labels,
    test_size=0.1,   # Use 10% of the training data as the validation set
    random_state=42
)
The above code uses the train_test_split function to split the training data 90/10. Because test_size=0.1 is set, 10% of the original training data is held out as validation data, and the random_state parameter makes the split reproducible.
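As an aside, the datasets library has a built-in train_test_split method of its own, so the same 90/10 split can be produced without converting to plain Python lists. The snippet below is an optional alternative (with hypothetical variable names hf_train and hf_val) and is not used in the rest of this course.
python
# Alternative sketch: split directly on the Dataset object.
split = dataset['train'].train_test_split(test_size=0.1, seed=42)
hf_train, hf_val = split['train'], split['test']  # 'test' here plays the role of our validation set
print(hf_train.num_rows, hf_val.num_rows)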
2.2 Checking the Split Data
python
print("Number of training samples:", len(train_texts))
print("Number of validation samples:", len(val_texts))
You can now check the number of training and validation samples and confirm that the data was split in the intended 90/10 ratio.
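Beyond the raw counts, it can help to confirm that positive and negative reviews remain roughly balanced after the split. The check below is a small optional sketch.
python
from collections import Counter

# Optional sanity check: label distribution in each split (0 = negative, 1 = positive).
print("Train label counts:", Counter(train_labels))
print("Validation label counts:", Counter(val_labels))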
3. Preparing the Hugging Face Transformer Model
After splitting the dataset, we need to prepare the model. Hugging Face's Transformers library provides a variety of pre-trained models, so we can choose one that suits our task.
3.1 Selecting a Pre-trained Model
python
from transformers import BertTokenizer, BertForSequenceClassification
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)
We prepare the BERT model using BertTokenizer and BertForSequenceClassification. This model is suitable for text classification tasks and uses the pre-trained checkpoint called “bert-base-uncased.”
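If you later want to try a different checkpoint, Transformers' Auto classes resolve the correct tokenizer and model classes from the checkpoint name. The sketch below is an optional, equivalent way to set up the same model; num_labels=2 simply makes the binary classification head explicit.
python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Equivalent setup via the Auto classes.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)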
3.2 Tokenizing the Data
python
train_encodings = tokenizer(train_texts, truncation=True, padding=True, return_tensors='pt')
val_encodings = tokenizer(val_texts, truncation=True, padding=True, return_tensors='pt')
We tokenize the training and validation data using the tokenizer. truncation=True cuts off inputs that exceed the model's maximum sequence length, and padding=True pads all inputs to equal length.
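To see what the tokenizer produced, you can inspect the returned keys and tensor shapes, and decode part of one example back to text. This is an optional check and is not required for training.
python
# Optional inspection of the encoded training data.
print(train_encodings.keys())                      # e.g. input_ids, token_type_ids, attention_mask
print(train_encodings['input_ids'].shape)          # (num_examples, max_sequence_length)
print(tokenizer.decode(train_encodings['input_ids'][0][:20]))  # first 20 tokens of one review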
4. Training the Model
To train the model, we load the data in batches using PyTorch's DataLoader and set up an optimizer; the loss itself is computed inside the model whenever labels are passed in.
4.1 Preparing the Data Loader
python
import torch
from torch.utils.data import DataLoader, Dataset

class IMDbDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Return one example as a dict of tensors, including its label.
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)
A new dataset class is defined by inheriting from PyTorch's Dataset class, and DataLoader handles the batch processing with a batch size of 16.
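Before training, it can be useful to pull a single batch from the loader and confirm that the tensor shapes match what the model expects. The snippet below is a small optional sanity check.
python
# Optional sanity check: shapes of one batch from the training loader.
batch = next(iter(train_loader))
for key, value in batch.items():
    print(key, value.shape)  # input_ids / attention_mask: (16, seq_len); labels: (16,)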
4.2 Setting Up Model Training
python
from torch.optim import AdamW  # the AdamW class in transformers is deprecated; use PyTorch's optimizer

# Move the model to a GPU if one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

optimizer = AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):  # Total of 3 epochs
    total_loss = 0
    for batch in train_loader:
        batch = {key: val.to(device) for key, val in batch.items()}  # keep the batch on the model's device
        optimizer.zero_grad()
        outputs = model(**batch)  # the 'labels' key makes the model return a loss
        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
    print(f"Epoch: {epoch + 1}, Loss: {total_loss / len(train_loader)}")
We train the model using the AdamW optimizer (imported here from torch.optim, since the version bundled with transformers is deprecated). The model and each batch are moved to the same device, and the average loss per batch is printed at the end of every epoch. In this example, training runs for 3 epochs.
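In practice, a linear learning-rate schedule with warmup is often added on top of AdamW when fine-tuning BERT. The sketch below shows how this could be wired in with Transformers' get_linear_schedule_with_warmup; it is an optional refinement and is not used in the loop above.
python
from transformers import get_linear_schedule_with_warmup

# Optional refinement: linearly decay the learning rate over all training steps.
num_epochs = 3
num_training_steps = num_epochs * len(train_loader)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
# Inside the training loop, call scheduler.step() right after optimizer.step().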
5. Evaluating the Model
After training the model, we need to evaluate its performance on the validation data. This will help us determine how well the model generalizes.
5.1 Defining the Model Evaluation Function
python
from sklearn.metrics import accuracy_score

def evaluate_model(model, val_loader):
    model.eval()
    all_labels = []
    all_preds = []
    with torch.no_grad():
        for batch in val_loader:
            batch = {key: val.to(device) for key, val in batch.items()}  # same device as the model
            outputs = model(**batch)
            preds = outputs.logits.argmax(dim=-1)
            all_labels.extend(batch['labels'].cpu().tolist())
            all_preds.extend(preds.cpu().tolist())
    accuracy = accuracy_score(all_labels, all_preds)
    return accuracy

accuracy = evaluate_model(model, val_loader)
print("Validation Accuracy:", accuracy)
We define the evaluate_model function to assess the model's performance, then print the accuracy on the validation data to gauge how well the model generalizes.
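Accuracy alone can hide class-specific behavior, so you may also want per-class precision, recall, and F1. The sketch below is an optional extension using sklearn's classification_report; it reuses the device variable from section 4.2, defines a hypothetical helper called detailed_report, and assumes the IMDb label mapping of 0 = negative and 1 = positive.
python
from sklearn.metrics import classification_report

# Optional sketch: per-class precision/recall/F1 on the validation set.
def detailed_report(model, val_loader):
    model.eval()
    all_labels, all_preds = [], []
    with torch.no_grad():
        for batch in val_loader:
            batch = {key: val.to(device) for key, val in batch.items()}
            outputs = model(**batch)
            preds = outputs.logits.argmax(dim=-1)
            all_labels.extend(batch['labels'].cpu().tolist())
            all_preds.extend(preds.cpu().tolist())
    print(classification_report(all_labels, all_preds, target_names=["negative", "positive"]))

detailed_report(model, val_loader)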
6. Conclusion
In this course, we learned how to work with the IMDb movie reviews dataset using Hugging Face's Transformers library. We walked through the entire process, from splitting the dataset into training and validation sets to training the model and evaluating its performance. We hope this gave you a solid foundation in NLP; the same techniques can be applied to many other language models to achieve even better results.