In this article, we will learn how to perform sentiment analysis using the Hugging Face Transformers library, one of the most widely used toolkits in Natural Language Processing (NLP). Sentiment analysis is the task of extracting the emotional tone (for example, positive or negative) from text data, and it is applied across many fields.
1. What are Hugging Face Transformers?
The Hugging Face Transformers library is a Python library that allows easy access to various pre-trained Natural Language Processing models. It supports multiple types of models, including BERT, GPT-2, T5, and is especially easy to fine-tune, enabling the adjustment of models for various tasks.
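For example, a ready-made sentiment model can be used in just a few lines via the library's pipeline API (this downloads a default English sentiment checkpoint on first use):
from transformers import pipeline

classifier = pipeline('sentiment-analysis')
print(classifier('I love this library!'))
# e.g. [{'label': 'POSITIVE', 'score': 0.9998...}]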
2. Overview of Sentiment Analysis
Sentiment analysis primarily includes tasks such as:
- Classifying the overall sentiment of a document (positive, negative, neutral)
- Extracting fine-grained sentiments from product reviews
- Tracking emotions in social media posts
Sentiment analysis can be implemented using machine learning and deep learning techniques, and the quality and quantity of training data greatly influence the results.
3. Setting Up the Environment
We will install the necessary libraries to proceed with this tutorial. Use the following command to install the transformers and torch libraries:
pip install transformers torch
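If the installation succeeded, the following quick check should print the installed versions without errors:
import transformers
import torch

print(transformers.__version__)
print(torch.__version__)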
4. Preparing the Dataset
We will use the famous IMDb movie review dataset for sentiment analysis. This dataset contains positive and negative reviews about movies. Here we load it through the Hugging Face datasets library (install it with pip install datasets if needed):
from datasets import load_dataset

imdb = load_dataset('imdb')
texts = imdb['train']['text']
labels = imdb['train']['label']  # 0 = negative, 1 = positive
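Before moving on, it is worth glancing at a raw example to see what the data looks like:
print(texts[0][:200])  # first 200 characters of the first review
print(labels[0])       # 0 = negative, 1 = positive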
5. Data Preprocessing
We will lightly preprocess the data before feeding it to the model. The labels in this dataset are already numeric (0 = negative, 1 = positive), and the train split is ordered by label, so we shuffle the examples and, to keep this tutorial fast, work with a small random subset:
import pandas as pd

df = pd.DataFrame({'text': texts, 'label': labels})
df = df.sample(n=2000, random_state=42)  # shuffle and subsample for a quick demo
texts = df['text'].tolist()
labels = df['label'].tolist()
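As a quick sanity check on the sampling above, we can confirm that the subset is roughly balanced between the two classes:
from collections import Counter

print(Counter(labels))  # e.g. Counter({0: ~1000, 1: ~1000})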
6. Loading the Model
We will load a pre-trained BERT model and attach a classification head for our two sentiment classes. Passing num_labels=2 tells the library to initialize a fresh binary classifier on top of the pre-trained encoder; any BERT-style checkpoint would work, and here we start from the base English model and let fine-tuning teach it sentiment.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
7. Text Tokenization
We tokenize the text so that it can be input into the model. This process involves transforming each review into an appropriate format for the model.
encodings = tokenizer(texts, truncation=True, padding=True, max_length=128, return_tensors="pt")
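A quick look at the tokenizer's output shows the tensors the model will receive (padding=True pads every example to the longest sequence in the set, capped at 128 tokens):
print(encodings['input_ids'].shape)       # (num_examples, sequence_length)
print(encodings['attention_mask'].shape)  # same shape; 1 = real token, 0 = padding
print(tokenizer.decode(encodings['input_ids'][0]))  # first review, including [CLS]/[SEP]/[PAD]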
8. Building the Dataset and DataLoader
To fine-tune the model on our data, we wrap the tokenized encodings and labels in a PyTorch Dataset and feed it to a DataLoader.
import torch
from torch.utils.data import DataLoader, Dataset
class SentimentDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Collect the tokenized inputs for one example and attach its label
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
dataset = SentimentDataset(encodings, labels)
train_loader = DataLoader(dataset, batch_size=16, shuffle=True)
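To confirm the loader produces what the model expects, we can inspect a single batch:
batch = next(iter(train_loader))
print(batch['input_ids'].shape)  # (16, sequence_length)
print(batch['labels'].shape)     # (16,)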
9. Model Training
We define the optimizer and fine-tune the model over several epochs. When labels are included in the batch, the model computes the cross-entropy loss internally, so no separate loss function is needed.
from torch.optim import AdamW

# Use a GPU if one is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

optimizer = AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):
    for batch in train_loader:
        batch = {key: val.to(device) for key, val in batch.items()}
        optimizer.zero_grad()
        outputs = model(**batch)  # loss is computed because 'labels' is in the batch
        loss = outputs.loss
        loss.backward()
        optimizer.step()
    print(f'Epoch: {epoch}, Loss: {loss.item()}')
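As an aside, the same fine-tuning can be expressed with the library's high-level Trainer API instead of a manual loop; a minimal sketch, where the argument values mirror the manual loop above and output_dir is an arbitrary choice:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',            # where checkpoints are saved
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
)
trainer = Trainer(model=model, args=training_args, train_dataset=dataset)
# trainer.train()  # runs the equivalent of the manual loop above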
10. Model Evaluation
We will evaluate the model to check its performance. For simplicity, we measure accuracy on the same data loader we trained with; in a real project, you would hold out a separate validation split for this step.
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for batch in train_loader:
        batch = {key: val.to(device) for key, val in batch.items()}
        outputs = model(**batch)
        predictions = outputs.logits.argmax(dim=-1)
        correct += (predictions == batch['labels']).sum().item()
        total += batch['labels'].size(0)
accuracy = correct / total
print(f'Accuracy: {accuracy:.4f}')
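Accuracy alone can hide class-specific weaknesses, so per-class precision and recall are often more informative; a minimal sketch using scikit-learn:
from sklearn.metrics import classification_report

all_preds, all_labels = [], []
with torch.no_grad():
    for batch in train_loader:
        batch = {key: val.to(device) for key, val in batch.items()}
        logits = model(**batch).logits
        all_preds.extend(logits.argmax(dim=-1).cpu().tolist())
        all_labels.extend(batch['labels'].cpu().tolist())
print(classification_report(all_labels, all_preds, target_names=['negative', 'positive']))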
11. Making Predictions
Once the model is trained, we can make predictions on new data. Below is an example code to make actual predictions.
def predict_sentiment(text):
    # Tokenize the input text and move the tensors to the same device as the model
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=128)
    inputs = {key: val.to(device) for key, val in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    prediction = outputs.logits.argmax(dim=-1)
    return 'Positive' if prediction.item() == 1 else 'Negative'
test_text = "This movie was really enjoyable!"
print(f'Prediction: {predict_sentiment(test_text)}')
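Once the results look good, the fine-tuned model and tokenizer can be saved and reloaded later with the same from_pretrained mechanism used above (the directory name here is an arbitrary choice):
save_dir = './sentiment-model'
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

# Reload later:
# model = AutoModelForSequenceClassification.from_pretrained(save_dir)
# tokenizer = AutoTokenizer.from_pretrained(save_dir)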
12. Conclusion
In this article, we walked through the entire process of performing sentiment analysis with the Hugging Face Transformers library: loading a dataset, fine-tuning a pre-trained BERT model, evaluating it, and making predictions on new text. The same workflow can be applied to many other Natural Language Processing tasks.
13. References
- Hugging Face Documentation: https://huggingface.co/docs/transformers
- IMDb Dataset: https://www.imdb.com/interfaces/
- PyTorch Documentation: https://pytorch.org/docs/stable/index.html