In recent years, deep learning has driven rapid progress in natural language processing (NLP), and one of the most influential models to emerge from it is BERT. BERT (Bidirectional Encoder Representations from Transformers) is a model introduced by Google that can understand context in both directions. In this course, we will look at the BERT model in detail, explain how to improve performance by ensembling several BERT models, and show how to fine-tune them using the Hugging Face Transformers library.
1. What is BERT?
BERT stands for ‘Bidirectional Encoder Representations from Transformers’ and is a pre-trained model with a strong ability to understand context. Traditional NLP models typically read text in a single direction, but BERT, built on the Transformer encoder architecture, gathers contextual information from both directions at once. This allows it to resolve the meaning of a word from its full surrounding context.
1.1. Features of BERT
- Bidirectionality: BERT considers the context on both sides of the input text simultaneously.
- Pre-trained: After being pre-trained using a large-scale corpus, it can be fine-tuned for specific tasks.
- Layered Structure: Composed of multiple Transformer layers, it effectively handles complex contexts.
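To make the bidirectionality concrete, the small sketch below uses the fill-mask pipeline from the Hugging Face Transformers library (introduced in the next section) with bert-base-uncased. The prediction for the masked token depends on the words on both its left and its right; the sentence is purely an illustrative example.

from transformers import pipeline

# BERT predicts the masked word using context from both sides of [MASK]
fill_mask = pipeline('fill-mask', model='bert-base-uncased')

for pred in fill_mask('The man went to the [MASK] to buy some milk.'):
    print(f"{pred['token_str']:>10}  (score={pred['score']:.3f})")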
2. Hugging Face Transformers Library
Hugging Face Transformers is a library that makes a wide range of pre-trained NLP models, including BERT, easy to use. With this library, you can perform many NLP tasks without complicated implementations. It provides a simple, intuitive API that makes training and fine-tuning straightforward.
2.1. Installation Method
!pip install transformers
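Once the installation finishes, loading a pre-trained tokenizer or model takes only a line or two. The quick check below is a minimal sketch to confirm everything is in place (bert-base-uncased is the checkpoint used throughout this course):

import transformers
from transformers import AutoTokenizer

print(transformers.__version__)  # confirm the library is installed

# Loading a pre-trained tokenizer is a one-liner
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
print(tokenizer.tokenize('BERT reads context in both directions.'))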
3. BERT Ensemble Techniques
An ensemble technique combines multiple models to achieve better performance than any single one. The reason for ensembling BERT models is that diversity among the models can reduce overfitting and improve generalization. Used well, ensembling lets you make the most of the strengths of the individual BERT models.
3.1. Ensemble Methodologies
There are various strategies, but the two most commonly used methods are hard voting and soft voting; a small sketch of both follows the list below.
- Hard Voting: Adopts the most frequently selected label among each model’s predicted class labels as the result.
- Soft Voting: Averages the predicted class probabilities from each model and adopts the class with the highest probability as the result.
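As a minimal illustration of the difference, the sketch below uses made-up logits for three hypothetical models on four examples (these numbers are not outputs of any model trained later in this course):

import torch
import torch.nn.functional as F

# Made-up logits: 3 hypothetical models x 4 examples x 2 classes
logits = torch.tensor([
    [[2.0, 0.1], [0.3, 1.5], [1.2, 1.1], [0.2, 2.2]],   # model 1
    [[1.5, 0.4], [1.1, 0.9], [0.2, 1.8], [0.1, 2.5]],   # model 2
    [[0.9, 1.0], [0.4, 1.6], [1.4, 0.3], [0.3, 1.9]],   # model 3
])

# Hard voting: each model votes with its argmax label, the majority wins
hard_votes = logits.argmax(dim=-1)            # shape (3 models, 4 examples)
hard_pred = hard_votes.mode(dim=0).values     # majority label per example

# Soft voting: average the class probabilities, then take the argmax
probs = F.softmax(logits, dim=-1)
soft_pred = probs.mean(dim=0).argmax(dim=-1)

print(hard_pred, soft_pred)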
4. Fine-tuning BERT
Now, let’s learn how to fine-tune the BERT model. We will proceed by setting up the BERT model step by step and discussing how to ensemble it.
4.1. Preparing the Dataset
First, we prepare the dataset to be used. In the example below, we will use the IMDB movie review data, which is categorized into positive and negative reviews.
4.1.1. Loading the Dataset
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the IMDB dataset; the CSV is assumed to contain 'text' and 'label' columns
data = pd.read_csv('imdb_reviews.csv')

# Hold out 20% of the reviews as a test set
train_data, test_data = train_test_split(data, test_size=0.2)
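Depending on how the CSV was exported, the sentiment column may contain strings rather than 0/1 integers; the preprocessing step later builds a torch.tensor from the label column, which requires numeric values. A hedged sketch (the column name and label strings are assumptions about your file), to be run before the split above:

# Map string sentiment labels to integers if necessary
if data['label'].dtype == object:
    data['label'] = data['label'].map({'negative': 0, 'positive': 1})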
4.2. Loading the BERT Model
Now, we will load the BERT model using the Hugging Face Transformers library.
from transformers import BertTokenizer, BertForSequenceClassification
# Load BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
4.3. Data Preprocessing
To input data into the BERT model, it must be preprocessed. We tokenize the text and generate input IDs and attention masks.
import torch

def preprocess_data(data):
    # Tokenize the text and build input IDs and attention masks as PyTorch tensors
    inputs = tokenizer(data['text'].tolist(), padding=True, truncation=True, return_tensors="pt", max_length=512)
    labels = torch.tensor(data['label'].tolist())
    return inputs, labels

train_inputs, train_labels = preprocess_data(train_data)
test_inputs, test_labels = preprocess_data(test_data)
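As a quick sanity check, you can inspect the shapes of the resulting tensors; each input tensor has shape (number of examples, sequence length) and the labels are one-dimensional:

print(train_inputs['input_ids'].shape, train_inputs['attention_mask'].shape, train_labels.shape)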
4.4. Training the Model
To train the model, we use PyTorch’s DataLoader and set up the AdamW optimizer.
from torch.utils.data import DataLoader, TensorDataset
from torch.optim import AdamW  # transformers' own AdamW re-export is deprecated

train_dataset = TensorDataset(train_inputs['input_ids'], train_inputs['attention_mask'], train_labels)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

optimizer = AdamW(model.parameters(), lr=1e-5)

# Train the model
model.train()
for epoch in range(3):
    for batch in train_loader:
        optimizer.zero_grad()
        input_ids, attention_mask, labels = batch
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
    print(f'Epoch: {epoch}, Loss: {loss.item()}')
4.5. Evaluation and Ensemble
Evaluate the trained model, train additional BERT models in the same way, and then ensemble them: collect the predictions from each model and apply hard or soft voting to obtain the final prediction.
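One simple way to obtain diverse ensemble members is to fine-tune the same checkpoint several times with different random seeds, reusing the training loop above. The sketch below is one possible recipe, not the only one; the seed values are arbitrary.

# Hypothetical ensemble members: same checkpoint, different random seeds
models = []
for seed in (0, 1, 2):
    torch.manual_seed(seed)  # varies the classification-head initialization and data shuffling
    m = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
    # ... fine-tune m with the same DataLoader / optimizer loop shown above ...
    models.append(m)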
# Build a DataLoader for the test set (inputs only; the labels stay aside for scoring)
test_dataset = TensorDataset(test_inputs['input_ids'], test_inputs['attention_mask'])
test_loader = DataLoader(test_dataset, batch_size=16)

# Model evaluation and ensemble (hard voting)
def evaluate_and_ensemble(models, dataloader):
    ensemble_preds = []
    for model in models:
        model.eval()
        preds = []
        for batch in dataloader:
            input_ids, attention_mask = batch
            with torch.no_grad():
                outputs = model(input_ids, attention_mask=attention_mask)
            preds.append(torch.argmax(outputs.logits, dim=1))
        ensemble_preds.append(torch.cat(preds, dim=0))
    # Hard voting: the most frequent label across models wins
    final_preds = torch.mode(torch.stack(ensemble_preds), dim=0)[0]
    return final_preds

final_predictions = evaluate_and_ensemble([model], test_loader)  # pass several fine-tuned models for a real ensemble
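For soft voting, average the softmax probabilities instead of the hard labels. A minimal variant of the function above (same assumed DataLoader layout) might look like this:

import torch.nn.functional as F

def evaluate_and_ensemble_soft(models, dataloader):
    ensemble_probs = []
    for model in models:
        model.eval()
        probs = []
        for batch in dataloader:
            input_ids, attention_mask = batch
            with torch.no_grad():
                outputs = model(input_ids, attention_mask=attention_mask)
                probs.append(F.softmax(outputs.logits, dim=1))
        ensemble_probs.append(torch.cat(probs, dim=0))
    # Soft voting: average the class probabilities, then take the most probable class
    avg_probs = torch.stack(ensemble_probs).mean(dim=0)
    return torch.argmax(avg_probs, dim=1)

Since the test labels were kept aside during preprocessing, the predictions can be scored directly:

soft_predictions = evaluate_and_ensemble_soft([model], test_loader)
accuracy = (soft_predictions == test_labels).float().mean().item()
print(f'Soft-voting accuracy: {accuracy:.4f}')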
5. Conclusion
In this course, we explored how to use the Hugging Face Transformers library to fine-tune and ensemble BERT models to improve performance. Building on BERT’s strong language understanding, we showed that appropriate data preprocessing combined with ensemble techniques can deliver high performance on NLP tasks. We hope you will apply these techniques to a wide range of natural language processing problems.