Hugging Face Transformers Practical Course, Fine-tuning and BERT Classification

Deep learning and Natural Language Processing (NLP) play a crucial role in modern artificial intelligence. In particular, BERT (Bidirectional Encoder Representations from Transformers), a powerful language model developed by Google, demonstrates outstanding performance on many NLP tasks. In this course, we take a detailed look at how to fine-tune BERT using Hugging Face’s Transformers library and perform a text classification task.

1. Introduction to Hugging Face and BERT

Hugging Face provides the Transformers library and a model hub that make natural language processing models easily accessible. In particular, it allows easy use of transformer-based models such as BERT. Because BERT considers context from both directions, it achieves a deeper understanding of language, which is why it can outperform traditional RNN- or LSTM-based models.

2. Basic Structure of the BERT Model

BERT is built from the encoder part of the Transformer architecture (the decoder is not used). The main features of BERT are as follows:

  • Bidirectional Attention: BERT can learn bidirectional contexts, allowing a richer understanding of the meaning of specific words.
  • Masked Language Model: During training, some words are masked and the model learns to predict them (a short fill-mask example follows this list).
  • Next Sentence Prediction: Given two sentences, it predicts whether the two sentences are actually consecutive.
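
To make the masked language model objective concrete, here is a minimal sketch (a supplement to the course text) that uses the Transformers fill-mask pipeline with a pre-trained BERT checkpoint to predict a masked word:

from transformers import pipeline

# Load a fill-mask pipeline backed by pre-trained BERT
fill_mask = pipeline('fill-mask', model='bert-base-uncased')

# BERT's mask token is [MASK]; the pipeline returns the most likely replacements
for prediction in fill_mask("The movie was absolutely [MASK]."):
    print(prediction['token_str'], round(prediction['score'], 3))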

3. Installing Hugging Face Transformers

First, you need to install Hugging Face’s Transformers library. You can use the following command to install it:

pip install transformers

4. Preparing the Dataset

To train a deep learning model, an appropriate dataset is required. In this course, we will use the IMDB movie review dataset for a simple text classification task. This dataset consists of positive and negative reviews.


import glob
import pandas as pd
from sklearn.model_selection import train_test_split

# Download and extract the IMDB dataset
url = 'https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
!wget {url}
!tar -xvf aclImdb_v1.tar.gz

# Read the raw review files (pandas cannot read a glob pattern directly,
# so we open each text file ourselves)
def read_reviews(path):
    reviews = []
    for file_name in glob.glob(f'{path}/*.txt'):
        with open(file_name, encoding='utf-8') as f:
            reviews.append(f.read())
    return reviews

pos_reviews = read_reviews('aclImdb/train/pos')
neg_reviews = read_reviews('aclImdb/train/neg')

# Prepare the data (keep labels as integers: 1 = positive, 0 = negative,
# which is the format the model expects later)
positive = [(1, review) for review in pos_reviews]
negative = [(0, review) for review in neg_reviews]

data = positive + negative
df = pd.DataFrame(data, columns=['label', 'review'])

# Split into training and testing data
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

5. Data Preprocessing

To use the BERT model, we need to preprocess the data into an appropriate format. We will use the BERT tokenizer provided by Hugging Face’s Transformers library.


from transformers import BertTokenizer

# Load BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the data
def tokenize_data(data):
    return tokenizer(data['review'].tolist(), padding=True, truncation=True, return_tensors='pt')

train_encodings = tokenize_data(train_df)
test_encodings = tokenize_data(test_df)

6. Creating the Dataset

We convert the tokenized data into a PyTorch Dataset using the Dataset class.


import torch

class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_df['label'].tolist())
test_dataset = IMDbDataset(test_encodings, test_df['label'].tolist())

7. Model Setup and Fine-tuning

Now we will load the BERT model and proceed with fine-tuning for the classification task.


from transformers import BertForSequenceClassification, Trainer, TrainingArguments

# Load BERT model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Set training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Create trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
    
# Train the model
trainer.train()

8. Evaluation and Prediction

After training the model, we will evaluate its performance on the test dataset and make predictions.


# Evaluate the model
trainer.evaluate()

# Predictions
predictions = trainer.predict(test_dataset)
predicted_labels = predictions.predictions.argmax(-1)

9. Interpreting Results

We calculate the accuracy by comparing the predicted labels with the true labels and look for areas where the model can be improved. A confusion matrix visualizes the detailed results.


from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

confusion_mtx = confusion_matrix(test_df['label'].tolist(), predicted_labels)
plt.figure(figsize=(10,7))
sns.heatmap(confusion_mtx, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()
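
The accuracy mentioned above can be computed directly from the same labels, for example with scikit-learn (a small supplementary snippet, assuming the variables from the previous cells):

from sklearn.metrics import accuracy_score

# Compare predicted labels against the true test labels
accuracy = accuracy_score(test_df['label'].tolist(), predicted_labels)
print(f'Test accuracy: {accuracy:.4f}')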

10. Conclusion

In this course, we explored how to fine-tune the BERT model using the Hugging Face Transformers library and perform text classification tasks. Using pre-trained models like BERT saves time and resources while significantly improving performance. We hope to achieve better performance in various NLP tasks using models like BERT in the future.


Introduction to Using Hugging Face Transformers, BERT Classification without Fine-Tuning

In this course, we will explore how to perform classification tasks using the BERT model, which is widely used in the fields of deep learning and natural language processing, without fine-tuning. BERT (Bidirectional Encoder Representations from Transformers) is an innovative model developed by Google that excels at understanding context. This course focuses on how to easily utilize the BERT model using the Hugging Face library.

1. What is Hugging Face Transformers?

The Hugging Face Transformers library is a Python library that provides a wide range of pre-trained NLP models. With it, you can easily use models such as BERT, GPT-2, and T5, and you can also adapt them to specific tasks through transfer learning (fine-tuning).
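
As a quick illustration of how little code the library requires (a supplementary sketch, not part of the original course text), a pre-trained sentiment-analysis pipeline can be loaded in a couple of lines:

from transformers import pipeline

# Downloads a default pre-trained sentiment model on first use
classifier = pipeline('sentiment-analysis')
print(classifier("Hugging Face makes NLP much easier."))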

2. Basic Concepts of BERT

BERT stands for Bidirectional Encoder Representations from Transformers and has the ability to leverage bidirectional contextual information. Unlike traditional RNNs or LSTMs, BERT is based on the Transformer architecture, understanding context by considering all words in the input data simultaneously.

3. Classifying with BERT without Fine-Tuning

It is common to fine-tune the BERT model for a specific task. However, text classification can also be performed without fine-tuning by using BERT purely as a feature extractor. Below is a step-by-step guide on how to do this.

3.1 Installing the Library

First, you need to install the necessary libraries. Use the command below to install Hugging Face’s Transformers library together with PyTorch.

!pip install transformers torch

3.2 Preparing Data

Next, prepare the data you will use. For example, you could use a dataset that distinguishes between positive and negative sentences. Below is a simple example.

data = [
        {"text": "This movie was really enjoyable.", "label": "positive"},
        {"text": "It's the worst movie.", "label": "negative"},
        {"text": "The acting was truly excellent.", "label": "positive"},
        {"text": "This is a waste of time.", "label": "negative"},
    ]

3.3 Loading the BERT Model and Tokenizer

You can load the pre-trained BERT model and tokenizer using the Hugging Face library. Use the code below to load them.

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

3.4 Preprocessing Text Data

Now, you need to preprocess the data by tokenizing the text and creating input tensors. Transform the data to match the input format of the BERT model.

inputs = tokenizer([d['text'] for d in data], padding=True, truncation=True, return_tensors="pt")
labels = [d['label'] for d in data]

3.5 Extracting Model Outputs

You can generate the output vectors for each text using the BERT model. These vectors will be used for the classification task in the next step.

import torch

with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state[:, 0, :].numpy()  # Use the [CLS] token embedding

3.6 Implementing a Text Classifier

Now, you can build a simple classifier using the embeddings output by the model. For example, you could use logistic regression.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Vectorizing the embeddings and labels
X = embeddings
y = [1 if label == "positive" else 0 for label in labels]

# Splitting the data (with only four example sentences, this split is purely illustrative)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training the logistic regression model
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

# Prediction
y_pred = classifier.predict(X_test)

# Evaluating accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")  # Output the model's accuracy.

4. Conclusion

In this course, we learned how to use the BERT model without fine-tuning using the Hugging Face Transformers library. We confirmed that simple classification tasks can be performed using the pre-trained embeddings of BERT, which helps lay the foundation for natural language processing. We can see that the BERT model can be effectively used in various application areas that require natural language processing.

Transformers Course Using Hugging Face, Visualization of Fine-tuning BERT Model Training Process

With the advancement of deep learning, many innovations are being made in the field of Natural Language Processing (NLP). In particular, the BERT (Bidirectional Encoder Representations from Transformers) model has gained much popularity due to its performance and efficiency. In this article, we will detail how to fine-tune the BERT model using the Hugging Face library and visualize the process and results.

1. Overview of the BERT Model

BERT is a pre-trained text representation model developed by Google, utilizing a Bidirectional Attention Mechanism to understand the context of words in both directions. BERT is pre-trained through two main tasks: Masked Language Modeling and Next Sentence Prediction. Through this process, BERT exhibits very high performance in natural language understanding and generation tasks.

2. Environment Setup

To fine-tune the BERT model, we first need to install the necessary packages. Use the code below to install Hugging Face’s Transformers and other required libraries.

!pip install transformers torch datasets matplotlib seaborn

3. Preparing the Dataset

In this example, we will work on a binary classification problem where we classify movie reviews as positive or negative using the IMDB movie review dataset. The Hugging Face datasets library allows us to easily load the data.

from datasets import load_dataset

dataset = load_dataset('imdb')
train_dataset = dataset['train']
test_dataset = dataset['test']

4. Data Preprocessing

To feed text into the BERT model, tokenization is required. We will prepare the data using Hugging Face’s tokenizer. Note that BERT’s maximum input length is 512 tokens.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)

tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_test = test_dataset.map(tokenize_function, batched=True)

5. Data Loader Setup

This is the process of loading data and splitting it into batches. We use PyTorch’s DataLoader to create batches necessary for training and validation.

import torch

# Convert the tokenized datasets to PyTorch tensors so the DataLoader returns batches of tensors
tokenized_train.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
tokenized_test.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

train_loader = torch.utils.data.DataLoader(tokenized_train, batch_size=16, shuffle=True)
test_loader = torch.utils.data.DataLoader(tokenized_test, batch_size=16)

6. Model Setup

Now we set up the BERT model and prepare to fine-tune it. Hugging Face makes it easy to load the BERT model.

from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

7. Preparing for Training

We set up the optimizer for fine-tuning. For classification, BertForSequenceClassification computes the cross-entropy loss internally whenever labels are passed to it, so only the optimizer needs to be configured here.

from torch.optim import AdamW  # transformers' own AdamW is deprecated; use the PyTorch implementation

optimizer = AdamW(model.parameters(), lr=5e-5)

8. Model Training

Now we proceed with model training. During each epoch we iterate over the training data, record the average loss, and print it so that progress can be monitored. Evaluation on the test data follows in the next section, and the recorded losses are visualized in Section 10.

from tqdm import tqdm

# Move the model to the GPU if one is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

losses = []  # average training loss per epoch, used for visualization in Section 10

model.train()
for epoch in range(3):
    epoch_losses = []
    for batch in tqdm(train_loader):
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        epoch_losses.append(loss.item())

    avg_loss = sum(epoch_losses) / len(epoch_losses)
    losses.append(avg_loss)
    print(f"Epoch {epoch}, Loss: {avg_loss:.4f}")

9. Performance Evaluation

After training is complete, we evaluate the model’s performance using the test data. We assess the model using metrics such as accuracy, precision, and recall.

from sklearn.metrics import accuracy_score

model.eval()
predictions, true_labels = [], []
with torch.no_grad():
    for batch in test_loader:
        input_ids = batch['input_ids'].to(model.device)
        attention_mask = batch['attention_mask'].to(model.device)
        
        outputs = model(input_ids, attention_mask=attention_mask)
        preds = torch.argmax(outputs.logits, dim=-1)
        
        predictions.extend(preds.cpu().numpy())
        true_labels.extend(batch['label'].numpy())

accuracy = accuracy_score(true_labels, predictions)
print(f'Accuracy: {accuracy}') 
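
The precision and recall mentioned above can be obtained from the same predictions, for example with scikit-learn’s classification report (a supplementary snippet using the predictions and true_labels lists collected above):

from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 score for the two sentiment classes
print(classification_report(true_labels, predictions, target_names=['negative', 'positive']))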

10. Visualizing the Training Process

Visualizing the training process and performance of the model is crucial for understanding and tuning it. We will plot the training loss with Matplotlib.

import matplotlib.pyplot as plt

def plot_loss(losses):
    plt.plot(losses, label='Training Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.title('Training Loss Over Epochs')
    plt.legend()
    plt.show()

plot_loss(losses)  # 'losses' holds the average loss per epoch recorded during training

Conclusion

In this article, we explored the entire process of fine-tuning the BERT model using the Hugging Face library. We demonstrated that fine-tuning the BERT model can be achieved effectively through various steps including dataset preparation, model setup, training process, performance evaluation, and visualization.

Successful implementation of deep learning models requires appropriate data preprocessing, hyperparameter tuning, and result analysis. Remember that pre-trained models like BERT can efficiently solve natural language processing problems.

Using Hugging Face Transformers, PyTorch Pre-training

With the advancement of deep learning, significant innovations are also occurring in the field of Natural Language Processing (NLP). Among them, the Transformers library provided by Hugging Face makes it easy to utilize pre-trained language models. In this article, we will introduce the basic concepts and usage of Hugging Face Transformers, and provide a detailed explanation of how to leverage pre-trained models based on PyTorch.

1. What is the Hugging Face Transformers Library?

The Hugging Face Transformers library includes various pre-trained models for different natural language processing tasks. These models have been trained to perform a wide range of NLP tasks, such as translation, text classification, summarization, and question answering. The transformer architecture demonstrates excellent performance in understanding context by calculating the importance of each word in the input sequence through a self-attention mechanism.
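
To make the self-attention idea concrete, here is a minimal scaled dot-product attention sketch in PyTorch (an illustrative supplement, not the library's internal implementation):

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)  # importance of each word with respect to every other word
    return weights @ value, weights

# Toy example: a "sentence" of 4 tokens, each represented by an 8-dimensional vector
x = torch.randn(1, 4, 8)
output, weights = scaled_dot_product_attention(x, x, x)
print(weights.shape)  # (1, 4, 4): one attention weight per pair of tokens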

1.1 Installation

To install the Hugging Face Transformers library, you first need to set up a Python environment. You can do this with the following command.

pip install transformers torch

2. Understanding Pre-trained Models

Pre-trained models are trained on large datasets and can be fine-tuned to improve performance for specific tasks. Examples include models such as BERT, GPT-2, and RoBERTa. These models are designed for basic language understanding and have undergone pre-training on various datasets.

2.1 BERT Model

The BERT (Bidirectional Encoder Representations from Transformers) model is a bidirectional encoder that captures the meaning of each word by taking its surrounding context into account. It shows outstanding performance across numerous natural language processing tasks because it considers the context on both sides of a word simultaneously.

2.2 GPT-2 Model

GPT-2 (Generative Pre-trained Transformer 2) is a model primarily used for text generation tasks, excelling in understanding and generating context through sequential data processing. This model is widely used to generate documents on specific topics or styles.
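
As a small supplementary sketch (not part of the original text), GPT-2 can be tried out through the text-generation pipeline:

from transformers import pipeline

# Load a pre-trained GPT-2 text-generation pipeline
generator = pipeline('text-generation', model='gpt2')
result = generator("Deep learning is", max_length=30, num_return_sequences=1)
print(result[0]['generated_text'])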

3. How to Use with PyTorch

PyTorch is a very useful framework for building and training deep learning models. The Hugging Face Transformers library is designed to work with PyTorch, making it easy to load and utilize models.

3.1 Text Classification Example

In this section, I will show you an example of performing a simple text classification task using the BERT model. We will use the IMDB movie review dataset for text classification.

3.1.1 Preparing the Dataset

First, we will download and extract the IMDB dataset, then load it into a pandas DataFrame.

import glob
import pandas as pd

# The IMDB archive must be downloaded and extracted first, e.g.:
# wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
# tar -xvf aclImdb_v1.tar.gz

def load_split(split_dir, label):
    rows = []
    for file_name in glob.glob(f'{split_dir}/*.txt'):
        with open(file_name, encoding='utf-8') as f:
            rows.append({'text': f.read(), 'label': label})
    return rows

# Build a DataFrame with 'text' and integer 'label' columns (1 = positive, 0 = negative)
df = pd.DataFrame(load_split('aclImdb/train/pos', 1) + load_split('aclImdb/train/neg', 0))
df.head()

3.1.2 Loading and Splitting the Dataset

After loading the dataset, we will split it into training and validation sets.

from sklearn.model_selection import train_test_split

# train/test split
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)

# Check the split data
print(f'Train data size: {len(train_data)}, Test data size: {len(test_data)}')

3.1.3 Setting Up Hugging Face Dataset

Now we will convert the data into Hugging Face Dataset objects using the datasets library.

from datasets import Dataset

# Convert to Hugging Face Dataset format
train_dataset = Dataset.from_pandas(train_data)
test_dataset = Dataset.from_pandas(test_data)

3.1.4 Loading the Model

We will load the pre-trained BERT model. These models can be called using the AutoModelForSequenceClassification class.

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

3.1.5 Data Preprocessing

We will convert the text data into a format that the BERT model can process.

def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length', max_length=512)

tokenized_train_data = train_dataset.map(preprocess_function, batched=True)
tokenized_test_data = test_dataset.map(preprocess_function, batched=True)

3.1.6 Training the Model

We will use the Trainer class to train the model.

from transformers import Trainer, TrainingArguments

# Set training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_data,
    eval_dataset=tokenized_test_data
)

# Train the model
trainer.train()

3.1.7 Evaluating the Model

We will evaluate the performance of the trained model.

metrics = trainer.evaluate()
print(metrics)
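
By default, trainer.evaluate() reports only the evaluation loss. To also report accuracy, a compute_metrics function can be passed when the Trainer is created (a supplementary sketch; the Trainer above would need to be constructed with this extra argument):

import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair provided by the Trainer
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {'accuracy': accuracy_score(labels, predictions)}

# e.g. trainer = Trainer(..., compute_metrics=compute_metrics)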

4. Conclusion

By utilizing the Hugging Face Transformers library, you can easily handle pre-trained models. Models like BERT can be used for various natural language processing tasks beyond text classification and play a significant role in research and industry. In this blog, we have explored the process from installation to training and evaluation of the model. I hope you have acquired useful knowledge that you can apply in practice.

Using Hugging Face Transformers, Tokenizing and Encoding

Deep learning and natural language processing have advanced rapidly in recent years, and the Hugging Face Transformers library has become one of the most popular tools in this space. In this course, we will explain the concepts of tokenizing and encoding with the Transformers library in depth and learn how to implement them in Python code.

1. What is a Transformer Model?

A transformer model is a deep learning model based on the attention mechanism, demonstrating high performance in language processing tasks. It was first introduced in the 2017 paper “Attention is All You Need.” This model effectively captures contextual information by considering all words in the input sequence simultaneously.

2. Hugging Face and Its Basics

Hugging Face is a platform that offers many pre-trained transformer models for free. This allows researchers and developers to easily perform various NLP tasks (e.g., question answering, text generation, sentiment analysis). By using Hugging Face’s transformers library, complex NLP tasks can be handled with ease.

3. Tokenizing

Tokenizing is the process of breaking down text into individual units (tokens). For example, splitting a sentence into words or breaking down words into subwords. This process is essential for transforming data into a form that models can understand.

3.1 Why is Tokenizing Important?

Transformer models require the input text to be converted into sequences of token IDs, typically padded or truncated to a common length. A well-designed tokenizer represents the input data more faithfully and can improve the model’s performance.

3.2 Using Hugging Face’s Tokenizer

The Hugging Face Transformer library includes several types of tokenizers. These are optimized for each model, meaning the tokenizers used for BERT, GPT-2, and T5 models differ.

Example: Tokenizing with BERT Model

from transformers import BertTokenizer

# Initialize BERT Tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Text Input
text = "Welcome to the NLP course using Hugging Face Transformers!"

# Tokenizing Text
tokens = tokenizer.tokenize(text)
print(tokens)

When you run the code above, you can see the input text split into individual tokens. However, this process must also convert the tokens into a format suitable for the model.

4. Encoding

After tokenizing, we need to convert tokens into numerical forms that the model can input. This process is called encoding, and it generally uses the index of each word to convert tokens into numbers.

Example: Encoding with BERT Model

# Text Encoding
encoded_input = tokenizer.encode(text, return_tensors='pt')
print(encoded_input)

Here, return_tensors='pt' means to return a PyTorch tensor. This is a form that can be directly input into deep learning models.
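
Note that tokenizer.encode returns only the token IDs. Calling the tokenizer object directly returns a dictionary that also contains the attention mask and token type IDs, which most models expect as input (a small supplementary example):

# Calling the tokenizer directly returns input_ids, token_type_ids, and attention_mask
encoded = tokenizer(text, return_tensors='pt')
print(encoded.keys())       # dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
print(encoded['input_ids'])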

5. Integrated Example: Data Preprocessing for Text Classification

Now, let’s integrate the tokenizing and encoding processes we’ve learned so far into a single example. Here, we will look at the process of preprocessing data for a simple text classification model.

5.1 Data Preparation

First, we need to prepare data for simple text classification. We will create a list that includes text and its respective labels.

texts = [
    "I really like Hugging Face.",
    "Deep learning is hard but interesting.",
    "AI technology is changing our lives.",
    "The transformer model is really powerful.",
    "This text is about cats."
]
labels = [1, 1, 1, 1, 0]  # 1: Positive, 0: Negative

5.2 Data Tokenizing and Encoding

Next, we will write the code to tokenize and encode the data. We will use a loop to perform tokenization and encoding for each text.

# Initialize an empty list to hold all processed data
encoded_texts = []

# Perform tokenization and encoding for each text
for text in texts:
    encoded_text = tokenizer.encode(text, return_tensors='pt')
    encoded_texts.append(encoded_text)

print(encoded_texts)
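
In practice, the tokenizer can also encode the whole list in a single call, padding the sentences to the same length and returning batched tensors (a supplementary sketch using the same texts list):

# Batch-encode every sentence at once, padding to the longest sentence in the list
batch_inputs = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
print(batch_inputs['input_ids'].shape)  # (number of sentences, length of the longest sentence)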

5.3 Input into the Model

Now we can input the encoded texts into the model to perform prediction tasks. For example, a text classification model can be used as follows.

from transformers import BertForSequenceClassification
import torch

# Load BERT model (note: the classification head added on top of bert-base-uncased
# is randomly initialized, so its predictions are not meaningful until the model is fine-tuned)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

# Switch model to evaluation mode
model.eval()

# Perform prediction for each encoded text
for encoded_text in encoded_texts:
    with torch.no_grad():
        outputs = model(encoded_text)
        predictions = torch.argmax(outputs.logits, dim=-1)
        print(f"Predicted Label: {predictions.item()}")

In the code above, we run a prediction for each encoded text to obtain a label. Keep in mind that because the classification head has not been fine-tuned on labeled data, these predictions are essentially random; after fine-tuning, the same code can be used to classify new texts.

6. Conclusion

In this course, we learned about the importance of tokenizing and encoding using Hugging Face’s Transformer library. Additionally, we implemented a simple text classification model using these concepts. Hugging Face provides a powerful API along with various pre-trained models, making it easier to perform NLP tasks. I hope you continue your learning in deep learning and natural language processing in the future!
