Advances in deep learning and natural language processing (NLP) are among the key drivers of today's technological innovation. While several libraries and frameworks are available, the Hugging Face Transformers library stands out for being intuitive and user-friendly. This article discusses how to build a document classification model with Hugging Face Transformers and how to evaluate its performance.
1. What is Hugging Face Transformers?
The Hugging Face Transformers library supports a wide range of model architectures and makes it easy to use pre-trained models. Transformers are attention-based models that have revolutionized natural language processing. They are pre-trained on large corpora and can then be fine-tuned for specific downstream tasks.
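As a quick illustration of how little code a pre-trained model requires, the following sketch runs a ready-made sentiment-analysis pipeline. Note that the exact model downloaded by default can vary with your transformers version, and the printed score is only an example.

from transformers import pipeline

# Load a default pre-trained sentiment-analysis model (downloaded on first use)
classifier = pipeline("sentiment-analysis")
print(classifier("Hugging Face makes NLP remarkably accessible."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]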
2. Installing Required Libraries
We will install the libraries needed to use Hugging Face Transformers. The primary libraries are transformers, torch, and datasets; we also install scikit-learn, which we will use later to compute evaluation metrics. Use the following command to install them:
!pip install transformers torch datasets scikit-learn
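After installation, it can be worth verifying which versions you are running, since some argument names in the Trainer API have changed across transformers releases (a point we return to in section 7):

# Print the installed library versions for reference
import transformers
import torch

print(transformers.__version__)
print(torch.__version__)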
3. Preparing the Dataset
We will prepare the dataset for document classification. Here, we will use the AG News dataset. AG News is a dataset for news article classification, which has four classes:
- World
- Sports
- Business
- Science/Technology
Running the following code downloads the dataset, which already comes split into training and test sets.
from datasets import load_dataset
dataset = load_dataset("ag_news")
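Once downloaded, the object is a DatasetDict with train and test splits. A quick inspection confirms the split sizes and recovers the human-readable class names from the label feature:

print(dataset)  # shows the train/test splits and their sizes

# The 'label' column is a ClassLabel feature; .names maps label ids to class names
label_names = dataset['train'].features['label'].names
print(label_names)  # ['World', 'Sports', 'Business', 'Sci/Tech']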
4. Data Preprocessing
After loading the data, we separate the texts and labels and perform any necessary preprocessing. The following code pulls out the text and label columns and prints a sample article together with its label.
# Separate the texts and integer labels for the train and test splits
train_texts = dataset['train']['text']
train_labels = dataset['train']['label']
test_texts = dataset['test']['text']
test_labels = dataset['test']['label']

print("Sample news article:", train_texts[0])
print("Label:", train_labels[0])
5. Preparing the Model and Tokenizer
Now we load the pre-trained model and tokenizer from the transformers library. Here we use BertForSequenceClassification, which places a classification head on top of BERT; num_labels=4 matches the four AG News classes.
from transformers import BertTokenizer, BertForSequenceClassification
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=4)
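To see what the tokenizer produces, you can run it on a single sample. It returns input_ids, token_type_ids, and an attention_mask, which is exactly the input format the model consumes:

# Encode one article to inspect the tokenizer's output
sample = tokenizer(train_texts[0], padding="max_length", truncation=True)
print(sample.keys())             # dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
print(sample['input_ids'][:10])  # first ten token ids of the encoded article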
6. Data Tokenization
We tokenize each text so that it matches the input format the BERT model expects. The following code pads every example to the model's maximum input length and truncates longer ones, so that all batches have a uniform shape.
def tokenize_function(examples):
    # Pad/truncate every example to the model's maximum input length
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# Apply the tokenizer to the whole DatasetDict in batches
tokenized_datasets = dataset.map(tokenize_function, batched=True)
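Fine-tuning on the full 120,000-article training set takes a while on a single GPU. If you first want to verify the pipeline end to end, you can optionally train on small random subsets; the sizes below are arbitrary choices for a quick run, and you would pass these to the Trainer in the next section in place of the full splits.

# Optional: small random subsets for a fast end-to-end test
small_train = tokenized_datasets['train'].shuffle(seed=42).select(range(2000))
small_test = tokenized_datasets['test'].shuffle(seed=42).select(range(500))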
7. Training the Model
We use the Trainer class to train the model; it handles the training and evaluation loops automatically. The following code sets up the training arguments and runs the fine-tuning.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",   # note: renamed to eval_strategy in recent transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test'],
)

trainer.train()
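By default the Trainer only reports the evaluation loss. If you would also like accuracy printed at every epoch, you can pass a compute_metrics function when constructing the Trainer; a minimal sketch using scikit-learn:

import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair produced by the Trainer
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds)}

# Pass it when building the Trainer: Trainer(..., compute_metrics=compute_metrics)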
8. Evaluating the Model
After training the model, we measure its performance on the test set. We use scikit-learn's accuracy_score function to calculate accuracy.
import numpy as np
from sklearn.metrics import accuracy_score

# trainer.predict returns (logits, label_ids, metrics) for the given dataset
predictions, label_ids, _ = trainer.predict(tokenized_datasets['test'])

# Take the highest-scoring class for each example
preds = np.argmax(predictions, axis=1)

accuracy = accuracy_score(label_ids, preds)
print("Classification accuracy:", accuracy)
9. Conclusion
We learned how to load a dataset and perform text classification with a pre-trained model using Hugging Face Transformers, and in the process saw how useful transformer models are for natural language processing tasks. Further performance gains are possible through hyperparameter tuning or by trying different model architectures.
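As one concrete direction, the Auto classes make it easy to swap in a different checkpoint; for example, here is a sketch using DistilBERT, a smaller and faster model, while everything else in the tutorial stays the same:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Swap in a smaller, faster checkpoint; the rest of the pipeline is unchanged
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)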
10. References
For readers looking for more information and examples, the following resources are recommended:
- Hugging Face Transformers documentation: https://huggingface.co/docs/transformers
- Hugging Face Datasets documentation: https://huggingface.co/docs/datasets