Using Hugging Face Transformers: Fine-Tuning Pre-trained Models with the Trainer Class

Deep learning is now used across many fields, and natural language processing (NLP) is one of its fastest-developing areas. Hugging Face is well known as a platform that provides libraries for working with these deep learning models easily. In this course, we will explain in detail how to fine-tune pre-trained models with the Trainer class from Hugging Face’s Transformers library, along with practical Python code examples.

1. Introduction to Hugging Face Transformers

Hugging Face Transformers is a library that gives easy access to a wide range of natural language processing models based on the transformer architecture. It provides many pre-trained models such as BERT, GPT-2, RoBERTa, and T5, so we can perform natural language processing tasks conveniently without implementing complex models from scratch.
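To get a feel for how little code this takes, here is a minimal sketch using the high-level pipeline API; the pipeline downloads a default pre-trained sentiment model, and the example text is arbitrary:

from transformers import pipeline

# A ready-made sentiment-analysis pipeline built on a pre-trained model
classifier = pipeline("sentiment-analysis")
print(classifier("Hugging Face Transformers makes NLP easy!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]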

2. Overview of the Trainer Class

The Trainer class is a high-level API provided by the Hugging Face Transformers library that makes model training and evaluation easy. By using the Trainer class, you can train a model without writing a custom training loop. To use it, you provide the model, the training arguments, and the training and evaluation datasets.

2.1. Installing Required Libraries

First, you need to install the required libraries. You can run the following command to install Transformers together with the datasets library.

!pip install transformers datasets

2.2. Preparing to Use the Trainer Class

The preparations needed to use the Trainer class are as follows:

  • Loading the Model: Load the desired model from Hugging Face’s model hub.
  • Setting Up the Tokenizer: Set up the tokenizer that converts input text into the token IDs the model expects.
  • Preparing the Dataset: Prepare the dataset for training and evaluation purposes.
  • Setting Training Arguments: Set various arguments to be used during the training process.

3. Preparing the Dataset

We will use the IMDb movie review dataset to train a model that classifies positive and negative reviews. For this purpose, we will download the IMDb dataset using Hugging Face’s datasets library.

from datasets import load_dataset

dataset = load_dataset("imdb")
train_dataset = dataset["train"]
test_dataset = dataset["test"]
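Each IMDb example has a text field containing the review and a label field (0 for negative, 1 for positive). If you want to confirm this, a quick check might look like the following:

# Inspect one training example: raw review text and its label (0 = negative, 1 = positive)
print(train_dataset[0]["text"][:200])
print(train_dataset[0]["label"])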

4. Setting Up the Model and Tokenizer

We will use BERT, loading the ‘bert-base-uncased’ model provided by Hugging Face. At the same time, we need to set up the tokenizer that matches this model.

from transformers import BertTokenizer, BertForSequenceClassification

model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)
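Note that BertForSequenceClassification places a newly initialized classification head on top of the pre-trained encoder, so you may see a warning that some weights are newly initialized; this head is what gets trained during fine-tuning. If you want to be explicit about the two IMDb classes, you can pass the label configuration when loading the model (the label names below are just an illustrative choice):

model = BertForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,                             # binary classification: negative vs. positive
    id2label={0: "NEGATIVE", 1: "POSITIVE"},  # illustrative label names
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
)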

5. Data Preprocessing

We need to preprocess the dataset into the format the model expects. We will tokenize the text data and pad (or truncate) each review to a fixed length of 512 tokens.

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

train_tokenized = train_dataset.map(preprocess_function, batched=True)
test_tokenized = test_dataset.map(preprocess_function, batched=True)
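Tokenizing and training on the full IMDb split (25,000 reviews each for train and test) at max_length=512 can take quite a while on a single GPU. If you just want to verify the whole pipeline end to end first, one option is to work on small random subsets (the sizes below are arbitrary):

# Optional: small random subsets for a quick end-to-end test run
small_train = train_tokenized.shuffle(seed=42).select(range(2000))
small_test = test_tokenized.shuffle(seed=42).select(range(500))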

6. Setting Up the Trainer Class

Now we need to define the training arguments to set up the Trainer class. Training arguments define the hyperparameters of the training process.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=test_tokenized,
)

7. Training the Model

Start model training. You can perform the training with the code below.

trainer.train()
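Once training finishes, you will usually want to keep the fine-tuned weights and the tokenizer so you can reload them later. A minimal sketch (the output directory name is just an example):

# Save the fine-tuned model and tokenizer for later reuse
trainer.save_model("./imdb-bert")
tokenizer.save_pretrained("./imdb-bert")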

8. Evaluating the Model

After training, you can evaluate the model’s performance on the test set. Note that by default trainer.evaluate() reports the evaluation loss and timing statistics.

trainer.evaluate()
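If you also want metrics such as accuracy, you can pass a compute_metrics function when constructing the Trainer (i.e., Trainer(..., compute_metrics=compute_metrics)) and then call evaluate() again. A minimal sketch using NumPy:

import numpy as np

def compute_metrics(eval_pred):
    # eval_pred holds the model logits and the true labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}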

9. Predicting with the Model

Now, you can use the trained model to make predictions on new data.

import torch

def predict(texts):
    # Tokenize the inputs and move them to the same device as the model
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)
    return predictions

sample_texts = ["I love this movie!", "This is the worst film ever."]
predictions = predict(sample_texts)
print(predictions)
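The returned predictions are class indices. Since the model was fine-tuned on IMDb, index 0 corresponds to a negative review and index 1 to a positive one, so you can map them back to readable labels; a small illustrative helper:

# Map predicted class indices to human-readable labels (IMDb: 0 = negative, 1 = positive)
label_names = ["negative", "positive"]
for text, pred in zip(sample_texts, predictions):
    print(f"{text} -> {label_names[pred.item()]}")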

10. Conclusion

In this course, we learned how to fine-tune pre-trained models using the Trainer class in the Hugging Face Transformers library. Hugging Face provides a wide range of pre-trained models that make many natural language processing tasks much easier to tackle. We hope this example has shown how straightforward it is to train and evaluate a model, and we encourage you to keep exploring the possibilities of Hugging Face and deep learning.

Thank you!