Today, we will learn how to use BERT (Bidirectional Encoder Representations from Transformers), one of the most widely used language models, in ensemble training. Along the way, we will load a pre-trained BERT model with the Hugging Face Transformers library and build an ensemble model on top of it.
How does BERT work?
BERT understands context bidirectionally: for each word in the input sentence, it considers the words on both its left and right at the same time, which lets it capture the meaning of a word within its full sentence context. BERT is pre-trained in a self-supervised manner on large amounts of unlabeled text and can then be fine-tuned for a variety of downstream tasks.
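To get a feel for this bidirectionality, here is a minimal sketch (it assumes the Transformers library installed in the next section and uses an arbitrary example sentence) that asks BERT to fill in a masked word; the words on both sides of the mask influence its guesses.
from transformers import pipeline
# Fill-mask pipeline with a pre-trained BERT model
fill_mask = pipeline('fill-mask', model='bert-base-uncased')
# Both the left context ("The bank by the") and the right context
# ("was flooded after the storm") shape the predictions for [MASK]
for prediction in fill_mask("The bank by the [MASK] was flooded after the storm."):
    print(prediction['token_str'], round(prediction['score'], 3))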
Installing the Hugging Face Library
The Hugging Face Transformers library makes it easy to use various pre-trained models like BERT. To install the library, you can run the command below:
pip install transformers torch
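To confirm that the installation succeeded, you can print the installed library versions (the exact version numbers will vary):
import transformers
import torch
# Print the versions to verify the setup
print(transformers.__version__)
print(torch.__version__)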
Loading a Pre-trained BERT Model
Now we will load a pre-trained BERT model using the Hugging Face Transformers library. The code below is a simple example that loads the BERT model and its tokenizer.
from transformers import BertTokenizer, BertModel
# Loading BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Test sentence
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors='pt')
# Inputting into BERT model and getting output
outputs = model(**inputs)
print(outputs)
Code Explanation
- from transformers import BertTokenizer, BertModel: Imports the BERT tokenizer and model classes from Hugging Face's Transformers.
- tokenizer = BertTokenizer.from_pretrained('bert-base-uncased'): Loads a pre-trained BERT tokenizer.
- model = BertModel.from_pretrained('bert-base-uncased'): Loads a pre-trained BERT model.
- inputs = tokenizer(text, return_tensors='pt'): Tokenizes the input sentence and converts it to PyTorch tensors.
- outputs = model(**inputs): Feeds the encoded inputs to the model and returns its output.
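As a quick sanity check, you can inspect the main tensors in the returned output. For bert-base-uncased, each token is represented by a 768-dimensional vector (the exact sequence length depends on how the test sentence is tokenized):
# last_hidden_state: one 768-dimensional vector per input token
print(outputs.last_hidden_state.shape)  # torch.Size([1, number_of_tokens, 768])
# pooler_output: a single 768-dimensional vector summarizing the whole sentence
print(outputs.pooler_output.shape)      # torch.Size([1, 768])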
Overview of Ensemble Training
Ensemble learning is a method that improves final prediction performance by combining the predictions of multiple models. By merging the strengths of different models, it produces more reliable results than any single model on its own. Common ensemble techniques include Bagging and Boosting; in this post we use a simple voting scheme that combines each model's predictions, as illustrated below.
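As a minimal illustration of the voting idea (with made-up predictions from three hypothetical models), each sample's final label is simply the class that the majority of models choose:
import numpy as np
# Hard labels predicted by three hypothetical models for five samples
preds_a = np.array([1, 0, 1, 1, 0])
preds_b = np.array([1, 1, 1, 0, 0])
preds_c = np.array([0, 0, 1, 1, 0])
# Majority vote: a sample is labeled 1 if at least two of the three models say 1
votes = preds_a + preds_b + preds_c
majority = (votes >= 2).astype(int)
print(majority)  # [1 0 1 1 0]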
Building Ensemble Models with BERT
Now let's construct an ensemble model using BERT. We will train multiple BERT models and combine their predictions to derive the final prediction.
Model Overview
We will construct the ensemble model with the following structure:
- Creating and training multiple BERT models
- Collecting the prediction results of each model
- Combining the prediction results to generate the final prediction
Preparing the Dataset
We will use a simple text classification problem; for example, consider email spam filtering. First, we prepare a small example dataset as follows.
import pandas as pd
# Creating an example dataset
data = {'text': ["Free money now", "Hello friend, how are you?", "Limited time offer", "Nice to see you"],
'label': [1, 0, 1, 0]} # 1: Spam, 0: Regular mail
df = pd.DataFrame(data)
Training the Model and Performing Ensemble
Now let’s proceed to train each BERT model. The trained models will be saved for ensemble purposes.
from sklearn.model_selection import train_test_split
import torch
# Splitting the data
train_texts, test_texts, train_labels, test_labels = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)
# Preparing data for BERT model
train_encodings = tokenizer(list(train_texts), truncation=True, padding=True, return_tensors='pt')
test_encodings = tokenizer(list(test_texts), truncation=True, padding=True, return_tensors='pt')
class BERTClassifier(torch.nn.Module):
    def __init__(self):
        super(BERTClassifier, self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.classifier = torch.nn.Linear(self.bert.config.hidden_size, 2)  # 2 classes (spam, non-spam)

    def forward(self, input_ids, attention_mask):
        # Use the pooled [CLS] representation as the sentence-level embedding
        output = self.bert(input_ids, attention_mask=attention_mask).pooler_output
        return self.classifier(output)
# Declaring model and optimizer
model1 = BERTClassifier()
model2 = BERTClassifier() # Example for the second model
optimizer = torch.optim.Adam(model1.parameters(), lr=5e-5)
# Simple training loop (for demonstration, the whole training set is processed as a single batch)
model1.train()
for epoch in range(3):  # 3 epochs
    optimizer.zero_grad()
    outputs = model1(input_ids=train_encodings['input_ids'], attention_mask=train_encodings['attention_mask'])
    loss = torch.nn.CrossEntropyLoss()(outputs, torch.tensor(train_labels.values))
    loss.backward()
    optimizer.step()
    print(f'Epoch {epoch + 1}, Loss: {loss.item()}')
# Train model2 in the same way
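# A minimal sketch of training model2 the same way (an assumption: same data and hyperparameters;
# in practice you would vary the seed, data subsets, or hyperparameters so the ensemble members differ)
optimizer2 = torch.optim.Adam(model2.parameters(), lr=5e-5)
model2.train()
for epoch in range(3):
    optimizer2.zero_grad()
    outputs = model2(input_ids=train_encodings['input_ids'], attention_mask=train_encodings['attention_mask'])
    loss = torch.nn.CrossEntropyLoss()(outputs, torch.tensor(train_labels.values))
    loss.backward()
    optimizer2.step()
    print(f'[model2] Epoch {epoch + 1}, Loss: {loss.item()}')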
# Saving the model
torch.save(model1.state_dict(), 'bert_model1.pth')
torch.save(model2.state_dict(), 'bert_model2.pth')
Performing Predictions and Ensemble Results
Once model training is complete, we can combine the prediction results of each model to generate the final prediction values.
# Defining the prediction function
def predict(model, encodings):
    model.eval()
    with torch.no_grad():
        outputs = model(input_ids=encodings['input_ids'], attention_mask=encodings['attention_mask'])
        return torch.argmax(outputs, dim=1)
# Loading models
model1.load_state_dict(torch.load('bert_model1.pth'))
model2.load_state_dict(torch.load('bert_model2.pth'))
# Individual model predictions
preds_model1 = predict(model1, test_encodings)
preds_model2 = predict(model2, test_encodings)
# Ensemble prediction: average the two models' hard labels
final_preds = (preds_model1 + preds_model2) / 2
final_preds = (final_preds > 0.5).int()  # With the 0.5 threshold, the ensemble outputs 1 only when both models predict spam
print(f'Final Prediction: {final_preds}')
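If you prefer not to let disagreements between the two models default to class 0, a common alternative is soft voting: average the models' class probabilities instead of their hard labels and then take the argmax. Below is a minimal sketch under the same setup (same models and test_encodings as above).
import torch.nn.functional as F

def predict_proba(model, encodings):
    # Return class probabilities instead of hard labels
    model.eval()
    with torch.no_grad():
        logits = model(input_ids=encodings['input_ids'], attention_mask=encodings['attention_mask'])
        return F.softmax(logits, dim=1)

# Average the probabilities of both models, then pick the most likely class
avg_probs = (predict_proba(model1, test_encodings) + predict_proba(model2, test_encodings)) / 2
soft_preds = torch.argmax(avg_probs, dim=1)
print(f'Soft-voting Prediction: {soft_preds}')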
Conclusion
Today, we explored how to load a pre-trained BERT model using the Hugging Face Transformers library and examined a simple ensemble training method based on it. BERT shows excellent performance in complex natural language processing tasks, and we can expect even better performance when utilizing ensemble techniques. In the future, consider applying these techniques to various tasks to achieve better results.