Using Hugging Face Transformers, Training a Pre-trained Model with the Trainer Class

Deep learning is being applied across many fields, and natural language processing (NLP) is one of its most rapidly developing areas. Hugging Face is well known as a platform that provides libraries for working with these deep learning models easily. In this course, we will explain in detail how to use pre-trained models together with the Trainer class from Hugging Face’s Transformers library, with practical Python code examples.

1. Introduction to Hugging Face Transformers

Hugging Face Transformers is a library that allows easy access to various natural language processing models based on the transformer architecture. This library provides various pre-trained models such as BERT, GPT-2, RoBERTa, and T5. This way, we can perform natural language processing tasks conveniently without the need for complex model implementations.
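
For example, the pipeline API (not used in the rest of this course, but a quick way to try things out) lets you run a pre-trained model in just a few lines:

from transformers import pipeline

# A ready-made sentiment-analysis pipeline downloads a suitable
# pre-trained model and tokenizer automatically.
classifier = pipeline("sentiment-analysis")
print(classifier("Hugging Face Transformers makes NLP easy!"))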

2. Overview of the Trainer Class

The Trainer class is a high-level API provided by the Hugging Face Transformers library that makes model training and evaluation easy. With the Trainer class, you can train a model without writing a custom training loop. To use it, you need to provide the model, the training and evaluation datasets, and the training arguments.

2.1. Installing Required Libraries

First, you need to install the libraries. You can run the following command to install the necessary libraries along with Transformers.

!pip install transformers datasets

2.2. Preparing to Use the Trainer Class

The preparations needed to use the Trainer class are as follows:

  • Loading the Model: Load the desired model from Hugging Face’s model hub.
  • Setting Up the Tokenizer: Set up the tokenizer that converts the input text into token IDs the model can process.
  • Preparing the Dataset: Prepare the dataset for training and evaluation purposes.
  • Setting Training Arguments: Set various arguments to be used during the training process.

3. Preparing the Dataset

We will use the IMDb movie review dataset to train a model that classifies positive and negative reviews. For this purpose, we will download the IMDb dataset using Hugging Face’s datasets library.

from datasets import load_dataset

dataset = load_dataset("imdb")
train_dataset = dataset["train"]
test_dataset = dataset["test"]
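
Before preprocessing, it is worth taking a quick look at the data. Here is a small inspection sketch (the sizes and label meanings in the comments reflect the standard IMDb split):

# Inspect the splits: each contains 25,000 labeled reviews
print(len(train_dataset), len(test_dataset))

# Look at one raw example; label 0 is negative, label 1 is positive
print(train_dataset[0]["text"][:200])
print(train_dataset[0]["label"])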

4. Setting Up the Model and Tokenizer

We will be using the BERT model and will load the ‘bert-base-uncased’ model provided by Hugging Face. At the same time, we need to set up the tokenizer appropriate for that model.

from transformers import BertTokenizer, BertForSequenceClassification

model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)
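
The default classification head already has two output labels, but it can help readability to make the binary setup explicit when loading the model. This is an optional variant of the call above:

# Optional: state the binary classification setup explicitly
model = BertForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label={0: "negative", 1: "positive"},
    label2id={"negative": 0, "positive": 1},
)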

5. Data Preprocessing

We need to preprocess the dataset to fit the model. We will tokenize the text data and, if necessary, add padding to adjust to a fixed length.

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512)

train_tokenized = train_dataset.map(preprocess_function, batched=True)
test_tokenized = test_dataset.map(preprocess_function, batched=True)

6. Setting Up the Trainer Class

Now we need to define the training arguments to set up the Trainer class. Training arguments define the hyperparameters of the training process.

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=test_tokenized,
)

7. Training the Model

Start model training. You can perform the training with the code below.

trainer.train()

8. Evaluating the Model

After training, you can evaluate the model’s performance. Let’s check the evaluation metrics to see how well the model works.

trainer.evaluate()
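
Note that, as configured above, evaluate() reports only the evaluation loss. To also report accuracy, you can pass a compute_metrics function when constructing the Trainer; here is a minimal sketch (assuming numpy is available):

import numpy as np

def compute_metrics(eval_pred):
    # The Trainer passes (logits, labels) for the evaluation set
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

# Pass it when building the Trainer, e.g.:
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_tokenized, eval_dataset=test_tokenized,
#                   compute_metrics=compute_metrics)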

9. Predicting with the Model

Now, you can use the trained model to make predictions on new data.

import torch

def predict(texts):
    # Tokenize the inputs and move them to the same device as the model
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)
    return predictions

sample_texts = ["I love this movie!", "This is the worst film ever."]
predictions = predict(sample_texts)
print(predictions)
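
The returned predictions are class indices. For the IMDb setup used here (0 = negative, 1 = positive), you can map them to readable labels:

# Map predicted class indices to human-readable labels
label_names = ["negative", "positive"]
for text, pred in zip(sample_texts, predictions):
    print(f"{text} -> {label_names[pred.item()]}")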

10. Conclusion

In this course, we learned how to train a pre-trained model using the Trainer class in the Hugging Face Transformers library. Hugging Face provides many pre-trained models that make a wide range of natural language processing tasks easier. We hope this example has shown how straightforward it is to train and evaluate a model, and we encourage you to keep exploring the possibilities of Hugging Face and deep learning.

Thank you!

Using Hugging Face Transformers, PEGASUS Automatic Summarization

Automatic summarization is one of the most important tasks in the field of Natural Language Processing (NLP). Summarizing long texts without human intervention to convey the essence of information is essential in many real-world applications. This course will explain how to perform automatic summarization using the PEGASUS model provided by Hugging Face.

1. What is PEGASUS?

PEGASUS is a deep learning model for automatic summarization developed by Google. This model is based on the Transformer architecture and has shown high performance on various text summarization tasks. PEGASUS excels in selecting and generating important sentences, which enables it to summarize long texts effectively.

1.1. Basic Principles of PEGASUS

PEGASUS has the ability to effectively summarize key information from input documents. The model selects important parts from the given document and generates a short summary based on it. Generally, the PEGASUS model summarizes in the following two steps:

  • Text Encoding: Encodes the input long text to extract meaning.
  • Summary Generation: Generates a short summary based on the encoded information.

2. Environment Setup

This course will use Python and the Transformers library from Hugging Face. Please follow the steps below to set up the environment:

2.1. Install Required Libraries

pip install transformers torch

You can install the Transformers library from Hugging Face and PyTorch using the command above. PyTorch is the fundamental library used for training and inference of deep learning models.

3. Loading the PEGASUS Model

You are now ready to load and use the PEGASUS model. Use the code below to load the model and tokenizer:

from transformers import PegasusForConditionalGeneration, PegasusTokenizer

# Load PEGASUS model and tokenizer
model_name = "google/pegasus-large"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

3.1. Defining the Document Summarization Function

Now let’s define a function that takes a document as input and generates a summary.

def summarize_text(text):
    # Tokenize the input text
    inputs = tokenizer(text, truncation=True, padding="longest", return_tensors="pt")
    
    # Generate summary
    summary_ids = model.generate(inputs["input_ids"], num_beams=4, min_length=30, max_length=200, early_stopping=True)
    
    # Convert summary ids to text
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

4. Summary Example

Now, let’s use the summarize_text function defined above to generate a summary of an actual document.

# Example text
document = """
On February 27, 2023, OpenAI announced the new artificial intelligence language model GPT-4. This model demonstrates superior performance compared to previous models and can perform various language processing tasks.
GPT-4 has been trained on a large-scale dataset and can be used in areas such as natural language generation, translation, question and answer, and more.
Additionally, GPT-4 can generate customized responses tailored to user needs, attracting significant interest from companies and researchers.
"""

# Generate summary
summary = summarize_text(document)
print("Original Document:")
print(document)
print("\nGenerated Summary:")
print(summary)

5. Result Analysis

Let’s analyze the generated summary. The quality of the summary depends on how well the key information from the original document is reflected. The PEGASUS model demonstrates strong summarization capabilities for long texts, but the result also depends on generation settings such as the beam width and the minimum and maximum length. Therefore, it is important to review the results and adjust these parameters as needed.

6. Parameter Tuning

To improve the quality of the model’s summarization, various hyperparameters can be adjusted. The main parameters include num_beams, min_length, and max_length. The meanings of these parameters are as follows:

  • num_beams: The number of beams used in beam search. A larger value considers more candidate summaries, but increases computational costs.
  • min_length: The minimum length of the generated summary. This value is important for ensuring the meaning of the generated summary.
  • max_length: The maximum length of the generated summary. This value helps adjust the summary to prevent it from being too long.
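
As an example, the sketch below (reusing the tokenizer, model, and document from the earlier sections) calls generate() directly with a different configuration so that the effect of each parameter can be compared:

# Encode the document again
inputs = tokenizer(document, truncation=True, padding="longest", return_tensors="pt")

# Wider beam search and a longer allowed summary
summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=8,        # more candidate summaries, higher compute cost
    min_length=50,      # force a somewhat longer summary
    max_length=150,     # cap the summary length
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))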

7. Conclusion

In this course, we learned how to perform automatic summarization using Hugging Face’s PEGASUS model. PEGASUS is a highly useful tool in the field of natural language processing, capable of effectively conveying large amounts of information. Future advancements in summarization models or methodologies are expected, making continuous attention and learning necessary.

Transformer Utilization Course on Hugging Face, Setting Up the PEGASUS Library and Loading Pre-trained Models

Recent innovations in the field of Natural Language Processing (NLP) have been made possible by advancements in deep learning models. In particular, the Transformers library developed by Hugging Face has become a symbol of this progress. In this course, we will cover in detail the library setup and how to load pretrained models necessary to perform text summarization tasks using the PEGASUS model.

1. What is the Hugging Face Transformers Library?

The Hugging Face Transformers library is a Python library that provides pretrained models for various NLP tasks. This library particularly offers a variety of models (BERT, GPT-2, RoBERTa, T5, etc.) based on the transformer architecture. PEGASUS is one of the models based on this transformer architecture, primarily designed for text summarization.

2. Introduction to the PEGASUS Model

PEGASUS (Pre-training with Extracted Gap-sentences for Abstractive Summarization) is a model developed by Google, optimized for extracting important information from natural language documents and summarizing it. The core idea of PEGASUS is to mask selected sentences in the input document and pre-train the model to regenerate those masked sentences. Through this process, the model learns to understand the overall context of the text and identify important information.

2.1. Advantages of the PEGASUS Model

  • Excellent text summarization performance
  • Can be trained with less data using pretrained models
  • Usable in various languages and domains

3. Environment Setup

To use the PEGASUS model, you first need to install the necessary libraries. This process primarily requires the installation of the transformers and torch libraries. Below is the installation method.

pip install transformers torch

3.1. Importing Necessary Libraries

Once the installation is complete, import the necessary libraries as follows.

import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

4. Loading the PEGASUS Model

Now it’s time to load the PEGASUS model and tokenizer. The PEGASUS model can be easily loaded from Hugging Face’s model hub.

model_name = "google/pegasus-xsum"

# Load tokenizer and model
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

4.1 Preparing the Text to Summarize

To use the model, you need to prepare the text to be summarized. The code below defines a sample text.

sample_text = "Natural language processing is a field of computer science that deals with the interaction between computers and human language. It researches how to understand and process human language."

4.2 Tokenizing Text and Summarization

After encoding the text with the tokenizer, pass the input IDs to the model to generate and decode the summary.

# Text encoding
inputs = tokenizer(sample_text, return_tensors="pt", max_length=512, truncation=True)

# Generate summary
summary_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=50, early_stopping=True)

# Decode summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summary:", summary)

5. Analyzing Summary Results

Let’s check the summary generated by the code above. A good summary should concisely convey the most important information from the original document. Potential issues include awkward sentence formation or missing information; adjusting the generation settings, and, for domain-specific text, fine-tuning the model on data of sufficient quantity and quality, can help address them.

6. Conclusion

Through this course, we learned how to perform text summarization using the PEGASUS model. The PEGASUS model is a powerful pretrained natural language processing model that has established itself as an effective summarization tool. We also confirmed that we can easily load and use the model through the Hugging Face Transformers library.

In future courses, we will cover how to fine-tune the PEGASUS model to adjust it for specific domains, and how to improve performance by adjusting various hyperparameters. The world of NLP is vast and offers a range of application possibilities. Continue to learn and research to develop even more effective models.

Hugging Face Transformers Tutorial, Mobile BERT Inference Using the Last Hidden Layer

In recent years, deep learning-based models have gained popularity and made significant advancements in the field of Natural Language Processing (NLP). Among them, Hugging Face’s transformer models are popular due to their ease of use and performance. In particular, Mobile BERT is a lightweight version of the BERT model, designed to be effectively used in mobile environments. In this course, we will introduce how to extract the output of the last hidden layer using the Mobile BERT model.

1. What is Mobile BERT?

Mobile BERT is a model released by Google as a lightweight version of BERT. BERT consists of a stack of Transformer encoder layers, and Mobile BERT restructures this encoder (for example, with bottleneck layers) so that it can run efficiently on a variety of mobile devices. Mobile BERT has roughly four times fewer parameters than BERT-base and applies several techniques to enhance computational efficiency.

2. Installing the Hugging Face Library

To use the Hugging Face transformer model, you first need to install the required libraries. You can use the command below to install the libraries.

pip install transformers torch

3. Loading the Mobile BERT Model

Once the model is installed, you can load the Mobile BERT model. Here is a basic code snippet.

from transformers import MobileBertTokenizer, MobileBertModel

# Load Mobile BERT model and tokenizer
tokenizer = MobileBertTokenizer.from_pretrained('google/mobilebert-uncased')
model = MobileBertModel.from_pretrained('google/mobilebert-uncased')

4. Preprocessing Input Data

The data input to the Mobile BERT model must be in text format and should be converted into the appropriate format through the tokenizer. Here’s how to preprocess the input sentence.

# Define input sentence
input_text = "Try using Hugging Face's transformer!"

# Tokenize the sentence and convert to indices
inputs = tokenizer(input_text, return_tensors='pt')

5. Inference via the Model

Once preprocessing is complete, the data can be input to the Mobile BERT model to obtain the output of the last hidden layer. The output can be computed using the model’s forward method.

import torch

with torch.no_grad():
    outputs = model(**inputs)

# The last hidden layer output is available as outputs.last_hidden_state (equivalently outputs[0])
last_hidden_states = outputs.last_hidden_state
print(last_hidden_states.shape)  # (batch size, sequence length, hidden size)

6. Interpreting Results

The output of the last hidden layer is returned as a 3-dimensional tensor. The first dimension is the batch size, the second is the sequence length (the number of tokens in the sentence), and the third is the hidden size. For example, with a batch size of 1, a sequence length of 10, and Mobile BERT’s hidden size of 512, the shape of the output will be (1, 10, 512).

7. Application Example: Extracting Embedding Vectors

The output of the last hidden layer can be used as contextual embedding vectors for each token. These vectors can be utilized for various NLP tasks.

# Extract the embedding vector of the first token ([CLS])
word_embedding = last_hidden_states[0, 0, :]  # shape: (hidden_size,)
print(word_embedding.shape)
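
Beyond a single token vector, one common approach (a sketch, not part of the original example) is to mean-pool the last hidden states over the sequence, ignoring padding, to obtain a single sentence embedding:

# Mean-pool over the sequence dimension, masking out padding tokens
attention_mask = inputs["attention_mask"].unsqueeze(-1)   # (1, seq_len, 1)
masked_states = last_hidden_states * attention_mask       # zero out padded positions
sentence_embedding = masked_states.sum(dim=1) / attention_mask.sum(dim=1)
print(sentence_embedding.shape)  # (1, hidden_size)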

8. Summary

In this post, we explored how to utilize the Mobile BERT model using the Hugging Face transformer library. We covered data preprocessing, inference processes, and how to obtain the sequence output of the last hidden layer. These methods can be employed in various NLP applications and are being used in many research and industrial fields.

Hugging Face Transformers Training Course, Mobile BERT Fill-in-the-Blank Quiz

In this tutorial, we will create a fill-in-the-blank quiz using the Hugging Face Transformers library with the Mobile BERT model. Mobile BERT is a lightweight version of the BERT model designed for effective use in mobile environments, and it is utilized for various NLP tasks such as text embedding, question answering, and text classification.

1. Prerequisites

The environments and libraries required to proceed with this tutorial are as follows:

  • Python 3.6 or higher
  • Transformers library
  • torch library
  • pandas (optional, for using dataframes)

2. Environment Setup

Install the required libraries with the command below:

!pip install transformers torch pandas

3. Introduction to Mobile BERT Model

Mobile BERT is a lightweight variant of BERT (Bidirectional Encoder Representations from Transformers) developed by Google. Mobile BERT follows the same overall architecture as BERT but incorporates several technical adjustments to reduce model size and increase execution speed. It is designed to support natural language processing tasks on mobile and edge devices.

4. Data Preparation

In this example, we will prepare text samples to construct fill-in-the-blank questions. The sample text will have specific words represented as blanks, and our goal is to find the most suitable word for those positions.

sample_text = "I love [MASK]. Machine learning is a type of [MASK]."

5. Loading the Mobile BERT Model

We will load the Mobile BERT model using Hugging Face’s Transformers library. Use the code below to import the model and tokenizer:


from transformers import MobileBertTokenizer, MobileBertForMaskedLM
import torch

# Load Mobile BERT tokenizer and model
tokenizer = MobileBertTokenizer.from_pretrained('google/mobilebert-uncased')
model = MobileBertForMaskedLM.from_pretrained('google/mobilebert-uncased')

6. Implementing the Fill-in-the-Blank Function

Now, we will implement a function that performs the fill-in-the-blank task. This function takes text as input, tokenizes a sentence containing [MASK], and returns the predicted results using the model.


def fill_mask(text):
    # Tokenize the text
    input_ids = tokenizer.encode(text, return_tensors='pt')

    # Model prediction
    with torch.no_grad():
        outputs = model(input_ids)

    # Predicted token ID for every position in the sequence
    predictions = outputs.logits.argmax(dim=-1)

    # Replace only the [MASK] positions with the predicted tokens
    filled_ids = input_ids.clone()
    mask_positions = input_ids == tokenizer.mask_token_id
    filled_ids[mask_positions] = predictions[mask_positions]

    # Restore text from the filled token IDs
    filled_text = tokenizer.decode(filled_ids[0], skip_special_tokens=True)
    return filled_text

7. Calling the Fill-in-the-Blank Function

Now let’s use the implemented function to perform the fill-in-the-blank task. Below is the code using a sample sentence with blanks.


# Sample text with blanks
sample_text = "I love [MASK]. Machine learning is a type of [MASK]."

# Call the fill-in-the-blank function
filled_text = fill_mask(sample_text)
print(filled_text)

8. Interpreting the Results

Let’s interpret the results predicted by the model. Mobile BERT is a pre-trained model that is good at understanding the context of natural language and choosing appropriate words, and this example shows how it fills in the blanks.
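
To see what else the model considered, you can also inspect the top candidate tokens at each [MASK] position (a small sketch reusing the tokenizer, model, and sample_text defined above):

# Show the top-5 candidate tokens for each [MASK] position
input_ids = tokenizer.encode(sample_text, return_tensors='pt')
with torch.no_grad():
    logits = model(input_ids).logits

mask_positions = (input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
for pos in mask_positions:
    top5_ids = logits[0, pos].topk(5).indices.tolist()
    print(f"Position {pos.item()}: {tokenizer.convert_ids_to_tokens(top5_ids)}")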

9. Practice: Fill-in-the-Blank for Multiple Sentences

Let’s practice filling in blanks for several sentences. Put multiple samples into a list and check the results using a loop.


# Multiple sentences with blanks
samples = [
    "I love [MASK].",
    "Machine learning is a type of [MASK].",
    "[MASK] is a very important concept."
]

# Fill in the blanks for each sample
for sample in samples:
    filled = fill_mask(sample)
    print(f"Original sentence: {sample} -> Filled sentence: {filled}")

10. Conclusion

In this tutorial, we addressed the NLP fill-in-the-blank problem utilizing Mobile BERT. By using Hugging Face’s Transformers library, complex natural language processing tasks can be performed easily. Mobile BERT operates efficiently in mobile environments, making it highly suitable for lightweight machine learning applications.
