This course will explain how to perform natural language processing (NLP) tasks using the Hugging Face Transformers library and discuss how to evaluate the accuracy of the models. Hugging Face provides various pre-trained models that can be easily utilized for NLP tasks.
1. What is Hugging Face Transformers?
The Hugging Face Transformers library is a Python library that offers a variety of state-of-the-art NLP models. These models are pre-trained on large text corpora (typically via self-supervised learning) and can be fine-tuned on labeled data, which makes transfer learning with architectures such as BERT, GPT, and T5 straightforward.
1.1. Key Features
- Easy to download pre-trained models.
- Compatible with PyTorch and TensorFlow.
- Provides a simple API for various NLP tasks (see the pipeline sketch below).
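As a quick illustration of that API, the snippet below uses the pipeline helper with the library's default sentiment-analysis model (exactly which model it downloads is a library default and may change between versions):
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default pre-trained model
print(classifier("Hugging Face makes NLP easy!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.9998}]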
2. Setting Up the Environment
Install the necessary libraries to run the code. Use the command below to install transformers and related libraries:
pip install transformers torch pandas scikit-learn
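To confirm the installation succeeded, a quick version check (a minimal sanity check, nothing more) can be run:
import transformers
import torch

# Print the installed versions
print(transformers.__version__, torch.__version__)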
3. Preparing the Dataset
Now let’s prepare the data for simple sentiment analysis. The data consists of positive and negative reviews.
import pandas as pd
data = {
    "text": ["This movie was really fun!", "It was the worst movie.", "Amazing storyline!", "I never want to see it again."],
    "label": [1, 0, 1, 0]  # 1: positive, 0: negative
}
df = pd.DataFrame(data)
print(df)
4. Loading the Model and Data Preprocessing
We will load the pre-trained BERT tokenizer from the Hugging Face library and use it to preprocess the data (the classification model itself is loaded in the next section).
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
def tokenize_function(examples):
    # Tokenize one row; pad to a fixed length so all examples align
    return tokenizer(examples['text'], padding="max_length", truncation=True, max_length=128)
tokenized_data = df.apply(tokenize_function, axis=1)
print(tokenized_data.head())
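To see what the tokenizer produced, you can inspect a single example; the exact token ids depend on the BERT vocabulary:
sample = tokenized_data.iloc[0]
print(sample['input_ids'][:10])  # the first ten token ids
print(tokenizer.convert_ids_to_tokens(sample['input_ids'][:10]))  # their string forms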
5. Training the Model
We will set up a training loop in PyTorch to fine-tune the model.
import torch
from transformers import BertForSequenceClassification
from torch.utils.data import DataLoader, Dataset
class ReviewDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Convert the tokenizer's Python lists to tensors so the DataLoader
        # can collate them into correctly shaped batches
        return {
            'input_ids': torch.tensor(self.texts[idx]['input_ids']),
            'attention_mask': torch.tensor(self.texts[idx]['attention_mask']),
            'labels': torch.tensor(self.labels[idx])
        }
# Create dataset
dataset = ReviewDataset(tokenized_data.tolist(), df['label'].tolist())
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
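# Optional sanity check (a sketch): pull one batch and confirm its shape
batch = next(iter(dataloader))
print(batch['input_ids'].shape)  # torch.Size([2, 128]) with the settings above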
# Load the model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
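# Note: loading prints a warning that the classification head weights are newly
# initialized. That is expected: those weights are learned during fine-tuning below.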
# Set up optimizer for training
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
# Train the model
model.train()
for epoch in range(3):  # Number of epochs
    for batch in dataloader:
        optimizer.zero_grad()
        outputs = model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'], labels=batch['labels'])
        loss = outputs.loss
        loss.backward()
        optimizer.step()
print("Model training complete!")
6. Evaluating Accuracy
To evaluate the model’s accuracy, we will prepare a test dataset and perform predictions.
from sklearn.metrics import accuracy_score
# Test dataset (small hand-written examples)
test_data = {
    "text": ["This movie exceeded my expectations!", "It was too boring and a sad story."],
    "label": [1, 0]
}
test_df = pd.DataFrame(test_data)
test_tokenized = test_df.apply(tokenize_function, axis=1)
# Perform predictions
model.eval()
predictions = []
with torch.no_grad():
    for test_input in test_tokenized:
        # Convert the token id lists to tensors and add a batch dimension
        input_ids = torch.tensor(test_input['input_ids']).unsqueeze(0)
        attention_mask = torch.tensor(test_input['attention_mask']).unsqueeze(0)
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions.append(torch.argmax(outputs.logits, dim=-1).item())
# Calculate accuracy
accuracy = accuracy_score(test_df['label'], predictions)
print(f"Model accuracy: {accuracy * 100:.2f}%")
7. Conclusion
With the Hugging Face Transformers library, you can perform natural language processing (NLP) tasks quickly and with very little code. In particular, pre-trained models let you achieve good performance even on small datasets. Evaluating accuracy and understanding how a model performs are important parts of any deep learning workflow.