Using Hugging Face Transformers: Text Classification and a Classification Report

In recent years, the field of Natural Language Processing (NLP) has made significant advances, driven largely by deep learning and Transformer models. Hugging Face's Transformers library sits at the center of this shift and is widely used by researchers and developers alike. In this article, we will explore how to train and evaluate a text classification model using Hugging Face's Transformers library.

1. Introduction to Hugging Face Transformers Library

Hugging Face's Transformers library is an open-source library that makes it easy to use a wide range of pre-trained transformer models and fine-tune them on your own data. It includes models such as BERT, GPT-2, and RoBERTa, and its API is intuitive and easy to use.
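
For instance, even before any fine-tuning, the high-level pipeline API gives you a working classifier in a couple of lines. This is a minimal sketch; the exact model downloaded by default can vary with the library version:

python
from transformers import pipeline

# A ready-made sentiment classifier backed by a default pre-trained model
classifier = pipeline('sentiment-analysis')
print(classifier('Hugging Face makes transformers easy to use.'))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]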

2. Definition of Text Classification Problem

Text classification is the task of assigning given text data to one or more class labels. For example, it involves determining whether an email is spam or not, or classifying a movie review as positive or negative. In this course, we will build a model that classifies IMDB movie reviews as positive or negative using a simple example.

3. Data Loading and Basic Preprocessing

First, we will install the necessary libraries and load the IMDB dataset. The IMDB dataset includes movie reviews and their corresponding sentiment labels.

python
# Install necessary libraries
!pip install transformers torch datasets

# Import libraries
from datasets import load_dataset

# Load IMDB dataset
dataset = load_dataset('imdb')

print(dataset)

When the above code is executed, you will see that the IMDB dataset is loaded with train, test, and unsupervised splits. The train and test splits each contain 25,000 movie reviews paired with sentiment labels (0 = negative, 1 = positive).
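
You can peek at a single example to confirm the structure; a quick sanity check (output truncated here for brevity):

python
# Inspect one training example: raw review text plus its integer label
sample = dataset['train'][0]
print(sample['text'][:200])   # first 200 characters of the review
print(sample['label'])        # 0 = negative, 1 = positive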

4. Data Preprocessing

To input data into the model, text tokenization and encoding are needed. We will process the text using the tokenizer provided by Hugging Face’s transformer models.

python
from transformers import AutoTokenizer

# Set model name
model_name = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Check sample data
sample_text = dataset['train'][0]['text']
encoded_input = tokenizer(sample_text, padding='max_length', truncation=True, return_tensors='pt')

print(encoded_input)

The above code tokenizes the first movie review with the DistilBERT tokenizer, padding to the model's maximum input length (512 tokens for distilbert-base-uncased), truncating anything longer, and printing the encoded tensors.
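
The sample above encodes a single review. To train on the full dataset, the same tokenization must be applied to every example. A minimal sketch using datasets' map method (the tokenize_function and tokenized_datasets names are our own and are reused in the training step below):

python
# Apply the tokenizer to every example; batched=True processes chunks at a time
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)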

5. Model Definition and Training

Now we will define the model and proceed with training. Hugging Face's Trainer API makes the training loop straightforward.

python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
import numpy as np

# Load a pre-trained model with a fresh 2-class classification head
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

# Accuracy metric, so that evaluate() later reports eval_accuracy
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {'accuracy': (preds == labels).mean()}

# Create Trainer object with the tokenized datasets
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

The above code defines a DistilBERT-based text classification model and fine-tunes it on the IMDB dataset using the Trainer API. During training, checkpoints are periodically written to the ./results folder.
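
Beyond the intermediate checkpoints, you can explicitly save the final fine-tuned model and tokenizer for later reuse; the directory name below is just an example:

python
# Persist the fine-tuned model and its tokenizer to a directory of your choice
trainer.save_model('./imdb-distilbert')
tokenizer.save_pretrained('./imdb-distilbert')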

6. Model Evaluation

After training the model, we will evaluate its performance on the test dataset. Accuracy, computed by the compute_metrics function we passed to the Trainer, serves as the evaluation metric.

python
# Evaluate with test dataset
results = trainer.evaluate()

print(f"Accuracy: {results['eval_accuracy']:.2f}")

This prints the model's accuracy on the test set, giving a quick first check of its performance; the classification report in the next section breaks the results down per class.

7. Predictions and Classification Report

Now we can use the trained model to generate predictions. The following code runs the model over the test set and prints a classification report.

python
from sklearn.metrics import classification_report
import numpy as np

# Run the fine-tuned model over the tokenized test set
predictions = trainer.predict(tokenized_datasets['test'])
preds = np.argmax(predictions.predictions, axis=1)

# Print classification report (per-class precision, recall, F1)
report = classification_report(predictions.label_ids, preds,
                               target_names=['negative', 'positive'])
print(report)

The above code performs predictions on the test dataset and uses sklearn’s classification_report to output metrics such as Precision, Recall, and F1-Score. This report provides detailed information about the model’s performance.
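
The report covers the held-out test set; to classify genuinely new text, you can run the fine-tuned model directly. A minimal sketch (the example review is our own, and labels follow IMDB's 0 = negative, 1 = positive convention):

python
import torch

# Classify a brand-new review with the fine-tuned model
model.eval()  # disable dropout for deterministic inference
new_review = "An absolute delight from start to finish."
inputs = tokenizer(new_review, truncation=True, return_tensors='pt').to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits
pred = logits.argmax(dim=-1).item()
print('positive' if pred == 1 else 'negative')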

8. Conclusion and Next Steps

In this course, we explored how to build and evaluate a simple text classification model using Hugging Face's Transformers library. To improve performance further, you can apply richer techniques during the data preprocessing stage or tune hyperparameters such as the learning rate, batch size, and number of epochs, as sketched below.
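
For example, a hyperparameter sweep might vary the settings below; these values are illustrative starting points, not recommendations:

python
# One illustrative hyperparameter variation to try
tuned_args = TrainingArguments(
    output_dir='./results_tuned',
    num_train_epochs=2,
    per_device_train_batch_size=16,
    learning_rate=3e-5,   # Trainer's default is 5e-5
    warmup_ratio=0.1,     # warm up over the first 10% of steps
    weight_decay=0.01,
)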

In future posts, I plan to cover more natural language processing problems and advanced uses of the Hugging Face Transformers library, so I appreciate your interest. Thank you!