Hugging Face Transformers Tutorial: Measuring Classification Accuracy

The advancement of deep learning and natural language processing (NLP) is one of the key elements driving today’s technological innovation. Among the many libraries and frameworks available, the Hugging Face Transformers library stands out for being intuitive and user-friendly. This article walks through building a document classification model with Hugging Face Transformers and evaluating its performance.

1. What is Hugging Face Transformers?

The Hugging Face Transformers library supports a wide range of model architectures and makes it easy to use pre-trained models. Transformer models, built on the attention mechanism, have revolutionized natural language processing. They are pre-trained on large datasets and can be fine-tuned for specific tasks.

2. Installing Required Libraries

We will install the necessary libraries to use Hugging Face Transformers. The primary libraries we will use are transformers, torch, and datasets. Use the following command to install them:

!pip install transformers torch datasets

3. Preparing the Dataset

We will prepare the dataset for document classification. Here, we will use the AG News dataset, a news article classification dataset with four classes:

  • World
  • Sports
  • Business
  • Science/Technology

Running the following code downloads the dataset, which already comes split into training and test sets.

from datasets import load_dataset

dataset = load_dataset("ag_news")
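
It can be worth confirming what the loaded dataset looks like before moving on. A quick sanity check (the standard ag_news configuration ships with separate train and test splits):

# Inspect the splits and their sizes
print(dataset)
print("Training examples:", len(dataset["train"]))
print("Test examples:", len(dataset["test"]))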

4. Data Preprocessing

After loading the data, we separate the texts from the labels and perform any necessary preprocessing. The following code extracts them and prints a sample article along with its label.

train_texts = dataset['train']['text']
train_labels = dataset['train']['label']

test_texts = dataset['test']['text']
test_labels = dataset['test']['label']

print("Sample news article:", train_texts[0])
print("Label:", train_labels[0])

5. Preparing the Model and Tokenizer

Now, we will load the pre-trained model and tokenizer using the transformers library. Here, we will use the BertForSequenceClassification model.

from transformers import BertTokenizer, BertForSequenceClassification

model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=4)
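
As a side note, the same checkpoint can also be loaded through the Auto classes, which infer the right architecture from the model name; for this tutorial the two approaches are interchangeable:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Equivalent to the BERT-specific classes above, but works with any checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)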

6. Data Tokenization

We tokenize each text with the BERT tokenizer so it can be fed to the model. The following code pads and truncates every example to a fixed length, which makes batch processing straightforward.

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
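
To confirm the tokenization worked, we can inspect one example. With the settings above, each example gains input_ids, token_type_ids, and attention_mask fields, padded to the model's maximum length (512 tokens for bert-base-uncased):

# Each tokenized example now contains input_ids, token_type_ids, and attention_mask
sample = tokenized_datasets['train'][0]
print(sample.keys())
print("Sequence length:", len(sample['input_ids']))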

7. Training the Model

We use the Trainer class to train the model; it handles the training and evaluation loops automatically. The following code sets up the training arguments and starts training.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['test']
)

trainer.train()
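
Because evaluation_strategy="epoch" runs an evaluation pass at the end of every epoch, you can optionally have the Trainer report accuracy during training by passing a compute_metrics function when constructing it. A minimal sketch using scikit-learn:

import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair produced by the Trainer during evaluation
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds)}

# Pass compute_metrics=compute_metrics to Trainer(...) above to log accuracy each epoch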

8. Evaluating the Model

After training the model, we can measure its performance on the test set. We will use scikit-learn's accuracy_score to compute the classification accuracy from the Trainer's predictions.

import numpy as np
from sklearn.metrics import accuracy_score

predictions, label_ids, _ = trainer.predict(tokenized_datasets['test'])  # returns (logits, label ids, metrics)
preds = np.argmax(predictions, axis=1)  # pick the highest-scoring class for each example

accuracy = accuracy_score(label_ids, preds)
print("Classification accuracy:", accuracy)

9. Conclusion

We learned how to load a dataset and perform text classification using a pre-trained model with Hugging Face Transformers. Along the way, we saw how useful transformer models are for natural language processing tasks. Further gains can be achieved by tuning hyperparameters or trying other models.

10. References

For readers looking for more information and examples, the official Hugging Face Transformers documentation and the datasets library documentation are good starting points.