Hugging Face Transformers Course, Installing the Mobile BERT Library and Loading Pre-trained Models

In the field of deep learning, natural language processing (NLP) plays a very important role. In particular, the BERT (Bidirectional Encoder Representations from Transformers) model is widely used in the field of NLP.
In this course, we will explain how to set up Hugging Face’s Transformers library for Mobile BERT and how to load the pre-trained model.
Mobile BERT is a lightweight BERT model, which has the advantage of being efficiently usable on mobile devices.

1. What is the Hugging Face Transformers Library?

The Hugging Face Transformers library is a Python library that helps you easily use various state-of-the-art NLP models.
Through this library, you can load pre-trained models such as BERT, GPT-2, and T5 and apply them to a wide range of NLP tasks. It also provides APIs for fine-tuning models on your own data.
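
As a quick illustration of how little code the library requires, the minimal sketch below uses the high-level pipeline API with its default English sentiment model; the exact model downloaded and the scores you see may vary between library versions.

from transformers import pipeline

# Load a default sentiment-analysis pipeline (downloads a small pre-trained model on first use)
classifier = pipeline("sentiment-analysis")

# Run inference on a sample sentence
result = classifier("Hugging Face makes NLP easy to use.")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]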

2. Understanding Mobile BERT

Mobile BERT is a lightweight BERT model developed by Google. Traditional BERT models are pre-trained on large-scale datasets and show strong performance, but
their large size poses constraints for use on mobile devices or embedded systems.
In contrast, Mobile BERT is designed to reduce size while maintaining as much performance as possible. Thanks to this characteristic, Mobile BERT is being utilized in various NLP tasks.

3. Environment Setup and Library Installation

To use Mobile BERT, you first need to install the Hugging Face Transformers library and other necessary libraries.
You can install the required libraries using pip with the following command:

pip install transformers torch

Once installation is completed with the command above, you will be ready to use Mobile BERT in your Python environment.

Note: PyTorch is installed together with the command above. If you are using a GPU that supports CUDA,
install the PyTorch build that matches your CUDA version by following the instructions on the official PyTorch website.
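
As a small sanity check, assuming PyTorch is already installed, the snippet below reports whether a CUDA GPU is visible and picks a device accordingly:

import torch

# Check whether a CUDA-capable GPU is visible to PyTorch
print(torch.cuda.is_available())

# Choose a device accordingly (falls back to CPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)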

4. Loading Pre-trained Models

Now, let’s load the Mobile BERT model. The Hugging Face Transformers library provides several classes to easily use pre-trained models.

4.1 Code Example

The following code loads Mobile BERT:

from transformers import MobileBertTokenizer, MobileBertForSequenceClassification
import torch

# Load Mobile BERT model and tokenizer
model_name = "google/mobilebert-uncased"
tokenizer = MobileBertTokenizer.from_pretrained(model_name)
model = MobileBertForSequenceClassification.from_pretrained(model_name)

# Sentence to test
input_text = "The Hugging Face transformer is very useful!"

# Tokenize the input sentence and convert to tensor
inputs = tokenizer(input_text, return_tensors="pt")

# Input the data into the model and predict
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class = torch.argmax(logits, dim=-1).item()
print(f"Predicted class: {predicted_class}")

4.2 Code Explanation

Examining each element of the code, we have:

  • from transformers import MobileBertTokenizer, MobileBertForSequenceClassification:
    Loads the Mobile BERT model and tokenizer.
  • model_name = "google/mobilebert-uncased": Sets the name of the pre-trained model to use.
  • tokenizer = MobileBertTokenizer.from_pretrained(model_name): Initializes the tokenizer for the model.
  • model = MobileBertForSequenceClassification.from_pretrained(model_name): Initializes the model.
    At this point, the model is suitable for sentence classification tasks.
  • inputs = tokenizer(input_text, return_tensors="pt"): Tokenizes the input sentence and converts it to a PyTorch tensor.
  • with torch.no_grad():: Disables gradient tracking during inference, which saves memory and computation.
  • logits = model(**inputs).logits: Retrieves the predictions made by the model.
  • predicted_class = torch.argmax(logits, dim=-1).item(): Selects the class with the highest probability among the predicted classes.

5. A Practical Example Using Mobile BERT

Let’s take a look at an example of performing sentence classification using the Mobile BERT model.
The approach is to classify whether the given sentence is positive or negative.

5.1 Preparing the Dataset

First, we need labeled data. For example, a movie review dataset split into positive and negative reviews can be used to train the model.
Let’s write code to load and preprocess a small sample.

import pandas as pd

# Load sample data (5 positive reviews, 5 negative reviews)
data = {
    "text": [
        "I really love this movie.", 
        "It's the best movie!", 
        "It was really moving.", 
        "A perfect masterpiece.", 
        "This movie touched my heart.",
        "This is a waste of time.", 
        "It's bad and boring.", 
        "I was really disappointed.", 
        "Never watch it.", 
        "This movie is the worst."
    ],
    "label": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
}

df = pd.DataFrame(data)
print(df.head())

5.2 Training the Model

Now we train the model on this data with a simple training loop.
We will not cover every detail of the training process here; instead, we fine-tune the pre-trained model on the small sample prepared above.

from torch.utils.data import DataLoader, Dataset

class CustomDataset(Dataset):
    def __init__(self, texts, labels, tokenizer):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(text, return_tensors='pt', padding='max_length', truncation=True, max_length=128)
        return {'input_ids': encoding['input_ids'].flatten(), 'attention_mask': encoding['attention_mask'].flatten(), 'label': torch.tensor(label, dtype=torch.long)}

dataset = CustomDataset(df['text'].values, df['label'].values, tokenizer)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

from torch.optim import AdamW

# Optimizer for fine-tuning
optimizer = AdamW(model.parameters(), lr=2e-5)

# Simple training loop
model.train()
for epoch in range(3):
    for batch in dataloader:
        optimizer.zero_grad()
        outputs = model(input_ids=batch['input_ids'], attention_mask=batch['attention_mask'], labels=batch['label'])
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        print(f"Epoch {epoch + 1}, Loss: {loss.item()}")

5.3 Predictions and Evaluation

After the model is trained, we can make predictions on new sentences. Let’s verify this with the following example:

test_text = "This movie is very good."
test_inputs = tokenizer(test_text, return_tensors="pt")

with torch.no_grad():
    test_logits = model(**test_inputs).logits

test_predicted_class = torch.argmax(test_logits, dim=-1).item()
print(f"Predicted class for the test sentence '{test_text}': {test_predicted_class}")

6. Conclusion

In this course, we explored how to set up the Hugging Face Transformers library,
load a pre-trained Mobile BERT model, and perform a sentence classification task. Mobile BERT is a lightweight model, making it useful in mobile or otherwise resource-constrained environments.
We encourage further research into its applicability to various NLP tasks.

If you found this course helpful, please share it with others! If you have any additional materials or questions, feel free to leave a comment.

Hugging Face Transformers Course, Mobile BERT vs. BERT Tokenizer

Introduction

One of the most notable technologies in the field of deep learning and natural language processing (NLP) in recent years is BERT (Bidirectional Encoder Representations from Transformers). BERT demonstrates exceptional performance in understanding context and is used for various NLP tasks. However, due to its large size and high computational cost, it is difficult to use in mobile environments. To solve these issues, Mobile BERT has emerged. In this course, we will compare the characteristics of BERT and Mobile BERT using Hugging Face’s Transformers library, and we will experiment with the Tokenizer of both models.

1. Introduction to BERT Model

BERT is a language representation model announced by Google in 2018, which learns pre-trained language representations to assist with various NLP tasks. BERT is based on the Transformer’s encoder structure and can understand context in a bidirectional manner. Common NLP tasks include sentiment analysis, question-answering systems, and sentence similarity calculation.

1.1 Features of BERT

  • Bidirectional Attention: Understands context in both directions.
  • Masked Language Modeling: Learns by masking certain words in the input sentence and predicting them (see the fill-mask sketch after this list).
  • Next Sentence Prediction: Predicts whether two sentences are in a consecutive relationship.
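
To see the masked language modeling objective from the list above in action, here is a minimal sketch using the fill-mask pipeline with bert-base-uncased; the exact candidate words and scores depend on the model and library version.

from transformers import pipeline

# Load a fill-mask pipeline backed by BERT
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the token hidden behind [MASK]
for prediction in fill_mask("Paris is the [MASK] of France."):
    print(prediction["token_str"], round(prediction["score"], 3))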

2. Introduction to Mobile BERT Model

Mobile BERT is a lightweight version of BERT designed for efficient use on mobile devices. Mobile BERT greatly reduces the number of parameters compared to BERT while maintaining performance. This allows for smooth execution of natural language processing tasks even on mobile devices.

2.1 Features of Mobile BERT

  • Small Model Size: Mobile BERT is a significantly smaller model compared to BERT (a parameter-count sketch follows this list).
  • High Processing Speed: Thanks to its lightweight structure, it operates quickly even in mobile environments.
  • Efficient Memory Usage: Optimized to achieve high performance with fewer resources.
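
To put the size claims from the list above in concrete terms, the sketch below loads both encoders with AutoModel and compares their parameter counts; note that this downloads both checkpoints, and the exact numbers may vary slightly between versions.

from transformers import AutoModel

# Load the two encoder backbones (weights are downloaded on first use)
bert = AutoModel.from_pretrained("bert-base-uncased")
mobile_bert = AutoModel.from_pretrained("google/mobilebert-uncased")

# Compare their parameter counts
print(f"BERT parameters:        {bert.num_parameters():,}")
print(f"Mobile BERT parameters: {mobile_bert.num_parameters():,}")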

3. Introduction to Hugging Face Transformers Library

Hugging Face’s Transformers is a Python library that makes a wide range of pre-trained NLP models easy to use. The library offers models such as BERT, Mobile BERT, and GPT-2, and it also provides the matching Tokenizers to simplify text preprocessing.

3.1 Installation Method

pip install transformers torch

4. Mobile BERT vs BERT Tokenizer Usage Example

Now let’s look at how to use the Tokenizer for BERT and Mobile BERT. The code below initializes the Tokenizer for both models and shows an example of tokenizing an input text.

from transformers import BertTokenizer, MobileBertTokenizer

# Initialize BERT Tokenizer
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Initialize Mobile BERT Tokenizer
mobile_bert_tokenizer = MobileBertTokenizer.from_pretrained("google/mobilebert-uncased")

# Input text
text = "Deep learning is a very interesting field."

# BERT Tokenization
bert_tokens = bert_tokenizer.tokenize(text)
print("BERT Tokens:", bert_tokens)

# Mobile BERT Tokenization
mobile_bert_tokens = mobile_bert_tokenizer.tokenize(text)
print("Mobile BERT Tokens:", mobile_bert_tokens)

4.1 Code Explanation

In the example above, we initialized two Tokenizers, BertTokenizer and MobileBertTokenizer, provided by the transformers library. We tokenized the input text using the tokenize method and printed the results. You can compare the tokenization results of BERT and Mobile BERT.

5. Comparative Analysis

Using the Tokenizers of BERT and Mobile BERT, we will compare the tokenization results of the two models and analyze the characteristics of each model. The input sentence used is “Deep learning is a very interesting field.”

# BERT Tokenization Results
BERT Tokens: ['deep', 'learning', 'is', 'a', 'very', 'interesting', 'field', '.']

# Mobile BERT Tokenization Results
Mobile BERT Tokens: ['deep', 'learning', 'is', 'a', 'very', 'interesting', 'field', '.']

5.1 Analysis

Both tokenizers are uncased WordPiece tokenizers, so for this simple sentence they lowercase the input and produce the same tokens. Differences only appear when a word is split into subwords differently, which can happen if the vocabularies of the two checkpoints diverge; for most everyday English sentences the outputs are effectively identical. The practical difference between the two models lies in the encoder itself: Mobile BERT trades a small amount of accuracy for a much smaller, faster network that suits mobile environments.
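
To check this directly, the short sketch below encodes the same sentence with both tokenizers, reusing bert_tokenizer, mobile_bert_tokenizer, and text from the example above, and compares the resulting input IDs (which include the special [CLS] and [SEP] tokens):

# Encode the same sentence with both tokenizers (adds [CLS] and [SEP])
bert_ids = bert_tokenizer(text)["input_ids"]
mobile_bert_ids = mobile_bert_tokenizer(text)["input_ids"]

print("BERT input IDs:       ", bert_ids)
print("Mobile BERT input IDs:", mobile_bert_ids)
print("Identical encodings?  ", bert_ids == mobile_bert_ids)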

6. Advanced Applications

Beyond tokenization and model loading, various advanced tasks utilizing BERT and Mobile BERT models can be performed through the Hugging Face library. For example, you can build sentiment analysis models or perform fine-tuning for specific tasks.

6.1 Model Fine-tuning

Model fine-tuning is the process of retraining a pre-trained model on a specific dataset. The code below shows a basic method for fine-tuning the BERT model.

from transformers import BertForSequenceClassification, Trainer, TrainingArguments
import torch
from torch.utils.data import Dataset, DataLoader

# Example dataset class
class CustomDataset(Dataset):
    def __init__(self, texts, labels, tokenizer):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        input_encoding = self.tokenizer(self.texts[idx], truncation=True, padding='max_length', max_length=512, return_tensors='pt')
        item = {key: val[0] for key, val in input_encoding.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

# Initialize model
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Create dataset and DataLoader
train_texts = ["This movie was great.", "This movie was not good."]
train_labels = [1, 0]  # Sentiment labels 1: positive, 0: negative
train_dataset = CustomDataset(train_texts, train_labels, bert_tokenizer)
train_loader = DataLoader(train_dataset, batch_size=2)

# Set TrainingArguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=2,
    logging_dir='./logs',
)

# Create Trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

# Train the model
trainer.train()

6.2 Code Explanation

The code defines the CustomDataset class to handle input data, loads the BERT model, and then begins training through the Trainer object. This method allows the BERT model to be tailored to specific tasks.
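
Once training finishes, the fine-tuned model can be used for prediction just like the pre-trained one. Here is a minimal sketch reusing the model and bert_tokenizer defined above; the predicted label will depend on how the tiny example dataset trained.

import torch

# Put the fine-tuned model into evaluation mode
model.eval()

# Classify a new sentence with the fine-tuned model
inputs = bert_tokenizer("The story was wonderful.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_label = torch.argmax(logits, dim=-1).item()
print(f"Predicted label: {predicted_label}")  # 1: positive, 0: negative (as labeled above)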

7. Conclusion

In this course, we compared the Tokenizers of BERT and Mobile BERT using Hugging Face’s Transformers library and explored the basic process of model training based on this. While BERT delivers outstanding performance, it demands high-end hardware, whereas Mobile BERT, as a lightweight model, enables natural language processing in mobile environments. We look forward to achieving results in the fields of deep learning and natural language processing through further practice and research.

References

  • Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
  • Sun, Z., Yu, H., Song, X., Liu, R., Yang, Y., & Zhou, D. (2020). MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices.

Hugging Face Transformers Course, M2M100 Korean Text Tokenization

With the rapid advancements in deep learning, the field of Natural Language Processing (NLP) is undergoing remarkable changes. In particular, the Hugging Face library, which provides various pre-trained models, has been gaining attention recently. Today, I will introduce how to tokenize Korean text using the M2M100 model.

1. Introduction to the M2M100 Model

M2M100 is a pre-trained transformer model for multilingual translation developed by Facebook AI. It supports more than 100 languages and translates directly between them. M2M100 shows strong translation performance at the sentence level and works well for Korean as well.

2. Installing the Hugging Face Library

To use the model, you first need to install the Hugging Face Transformers Library. This can be easily installed via pip.

pip install transformers

3. What is Tokenization?

Tokenization is the process of splitting an input sentence into individual units (tokens). Since natural language processing models cannot process text directly, the text needs to be converted into numbers that can be input into the model. This process is called ‘tokenization’.
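
To make the “text to numbers” step concrete, here is a small sketch that encodes a sentence into token IDs with the M2M100 tokenizer (the same tokenizer we load in the next section); the exact IDs depend on its SentencePiece vocabulary.

from transformers import M2M100Tokenizer

# Load the M2M100 tokenizer (also used in the next section)
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# The tokenizer turns the text into integer IDs that the model can consume
encoded = tokenizer("Hello! I am learning deep learning.")
print(encoded["input_ids"])

# The IDs can be mapped back to the corresponding tokens
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))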

4. Using the M2M100 Tokenizer

Now, let’s explore how to tokenize Korean text using the M2M100 model. Execute the code below to load the tokenizer and tokenize an example Korean sentence.

from transformers import M2M100Tokenizer

# Load M2M100 tokenizer
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# Input Korean text
text = "Hello! I am learning deep learning."

# Tokenize the text
tokens = tokenizer.tokenize(text)

# Print the tokenization result
print("Tokenization result:", tokens)

4.1 Code Explanation

The code above loads the tokenizer for the M2M100 model using the M2M100Tokenizer class. Next, it inputs a Korean sentence and calls the tokenizer.tokenize() method to tokenize the sentence.

5. Interpreting the Tokenization Output

The output of tokenization is a list of tokens in a form the model can work with. For instance, for the example sentence “Hello! I am learning deep learning.”, the tokenizer splits the text into subword units that are then mapped to the integer IDs the model expects.

5.1 Example Output

Expected output (the exact segmentation depends on the SentencePiece vocabulary):
Tokenization result: ['▁Hello', '!', '▁I', '▁am', '▁learning', '▁deep', '▁learning', '.']

From the output, you can see that the input sentence has been divided into subword tokens. Because M2M100 uses a SentencePiece tokenizer, tokens that begin a new word carry a leading '▁' marker, while frequent words usually remain single tokens.

6. Additional Options: Various Functions of the Tokenizer

The tokenizer provides additional functions beyond simple tokenization, such as padding, truncation, and adding the special tokens the model expects. Let’s explore some of these features.

6.1 Padding

To make the length of the input text to the model uniform, padding can be added. Refer to the code below.

# Prepare multiple sentences
texts = ["Hello! I am learning deep learning.", "This lesson utilizes Hugging Face."]

# Tokenize the sentences and add padding
encoded_inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

print("Padded input:", encoded_inputs)

6.2 Decoding

There is also a function to convert the tokenized result back to the original text. This allows you to easily verify the model’s output.

# Decoding the tokenized text
input_ids = encoded_inputs['input_ids'][0]
decoded_text = tokenizer.decode(input_ids)

print("Decoded text:", decoded_text)

7. Conclusion

In this tutorial, we introduced how to effectively tokenize Korean text using the M2M100 model from Hugging Face. The M2M100 model exhibits excellent translation capabilities across various languages and performs well even with low-resource languages like Korean. This enables efficient use of the model in natural language processing applications.

We hope you join in the advancements in deep learning and natural language processing. Thank you!

Using Hugging Face Transformers for M2M100 Chinese-English Automatic Translation

Recently, with the advancement of artificial intelligence, significant progress has been made in natural language processing. In particular,
Hugging Face’s Transformers library has established itself as a tool that makes a wide variety of language models easy to use. In this course, we will
explain in detail how to implement automatic translation between Chinese and English using the M2M100 model with Hugging Face.

1. Introduction to the M2M100 Model

M2M100 is a multilingual translation model that supports direct conversion between more than 100 languages. Unlike traditional translation systems
that pivot through an intermediate language such as English, it translates directly between any supported language pair.

2. Installation and Setup

To use the M2M100 model, you first need to install the Hugging Face Transformers library and related dependencies. You can install it using the
pip command as shown below.

pip install transformers torch

3. Loading the Model and Implementing the Translation Function

To use the model, you must first load the M2M100 model. The following code is an example of loading the model and tokenizer and implementing a simple function for translation.


from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Load model and tokenizer
model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

def translate(text, source_language, target_language):
    tokenizer.src_lang = source_language
    encoded_input = tokenizer(text, return_tensors="pt")
    generated_tokens = model.generate(**encoded_input, forced_bos_token_id=tokenizer.get_lang_id(target_language))
    return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]

3.1 Explanation of the Translation Function

The code above works as follows:

  1. tokenizer.src_lang: Sets the source language.
  2. tokenizer(): Tokenizes the input text.
  3. model.generate(): Performs translation based on the tokenized input.
  4. tokenizer.batch_decode(): Decodes the generated tokens and returns the translated text.

4. Translation Examples

Now, let’s test the translation functionality. The example below demonstrates translating a Chinese sentence into English.


# Sentence to be translated
text = "你好,世界!"  # Hello, World!
source_lang = "zh"  # Chinese
target_lang = "en"  # English

# Perform translation
translated_text = translate(text, source_lang, target_lang)
print(f"Translation result: {translated_text}")

4.1 Interpretation of the Results

When the above code is executed, the output should be an English sentence along the lines of “Hello, World!”. The M2M100 model handles even language
pairs with quite different sentence structures effectively.

5. Multilingual Translation Examples

One of the powerful features of the M2M100 model is its support for multiple languages. The example below performs translation between various languages
including Korean, French, and Spanish.


# Multilingual translation test
samples = [
    {"text": "여러 언어를 지원하는 모델입니다.", "source": "ko", "target": "en"},  # Korean to English
    {"text": "Bonjour le monde!", "source": "fr", "target": "ko"},  # French to Korean
    {"text": "¡Hola Mundo!", "source": "es", "target": "ja"},  # Spanish to Japanese
]

for sample in samples:
    translated = translate(sample["text"], sample["source"], sample["target"])
    print(f"{sample['text']} ({sample['source']}) -> {translated} ({sample['target']})")

5.1 Multilingual Translation Results

Running the code above will output translations between several languages. The important point is that the M2M100 model can translate various languages
directly without going through an intermediate language.

6. Performance Evaluation

To evaluate the quality of translations, the BLEU (Bilingual Evaluation Understudy) score can be used. The BLEU score quantitatively measures the
similarity between the generated translation and the reference translation. The following is the process to calculate the BLEU score.


from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Reference translation and system translation, split into whitespace tokens
reference = "Hello, world!".split()
candidate = translated_text.split()

# Calculate BLEU score (smoothing keeps very short sentences from scoring zero)
bleu_score = sentence_bleu([reference], candidate, smoothing_function=SmoothingFunction().method1)
print(f"BLEU score: {bleu_score:.4f}")

6.1 Interpretation of Performance Evaluation

A BLEU score close to 0 indicates poor translation, while a score close to 1 indicates high quality of translation.
Various examples and reference translations can be used to evaluate the translation performance across multiple languages.
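
When more than one sentence is evaluated, corpus-level BLEU aggregates the n-gram statistics over the whole test set. A minimal sketch with NLTK is shown below; the reference and candidate sentences here are hypothetical placeholders.

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Hypothetical references (one list of reference token lists per sentence) and candidates
references = [
    [["hello", ",", "world", "!"]],
    [["the", "weather", "is", "nice", "today", "."]],
]
candidates = [
    ["hello", ",", "world", "!"],
    ["the", "weather", "is", "good", "today", "."],
]

# Smoothing avoids zero scores when some n-gram orders have no matches
corpus_score = corpus_bleu(references, candidates, smoothing_function=SmoothingFunction().method1)
print(f"Corpus BLEU: {corpus_score:.4f}")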

7. Conclusion

The M2M100 model, available through Hugging Face, represents a major advance in multilingual translation.
In this course, we explored a basic example of automatic translation between Chinese and English using the M2M100 model. This model is capable of direct language conversion, allowing translations between various languages without an intermediate language.

In the future, try experimenting with more languages and complex sentences to further improve this model’s performance and find ways to leverage it. The Hugging Face Transformers library can be applied to various NLP tasks, so feel free to apply it to different projects.

The Hugging Face Transformers Practical Course, M2M100 Korean-English Automatic Translation

This course provides a detailed explanation of how to perform automatic translation between Korean and English using the M2M100 model utilizing Hugging Face’s Transformers library. M2M100 is a model that supports multilingual translation and enables translation between over 100 languages. This article will outline the overview of M2M100, installation methods, data preparation, model loading, and the prediction process step by step.

1. Overview of the M2M100 Model

The M2M100 (Many-to-Many) model is a multilingual machine translation model developed by Facebook AI Research that can translate among more than 100 languages. The key advantages of M2M100 include:

  • Multilingual Support: Capable of translating between various languages such as English, Korean, Chinese, French, and more.
  • Diverse Language Pairs: Supports direct translation between language pairs through a single pre-trained network.
  • Ease of Use: Can be easily implemented and utilized through Hugging Face’s Transformers library.

2. Environment Setup and Installation

Here is how to install the necessary libraries and packages to use the M2M100 model. Follow the steps below to set up the environment.

pip install transformers torch

This command installs Hugging Face’s Transformers library and PyTorch. We will use PyTorch as the default. Once the installation is complete, you are ready to use the M2M100 model with the following code.

3. Loading the M2M100 Model

To load the M2M100 model, write the following code snippets.


from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Load model and tokenizer
model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)
    

The code above loads the necessary model and tokenizer from Hugging Face’s hub.

4. Data Preparation

Prepare a sentence for translation. For example, consider translating an English sentence into Korean or a Korean sentence into English. The code below shows an example of preparing a sentence for translation.


# Sentence to translate
text_to_translate = "Deep learning is a field of artificial intelligence that enables computers to learn from data."
    

Now let’s translate this sentence using the M2M100 model.

5. Performing Translation

Translation is performed based on the input sentence provided to the model. After preparing the model’s input, the process involves tokenization and making predictions through the model.


import torch

# Set the source language to English so the tokenizer adds the right language token
tokenizer.src_lang = "en"

# Tokenization
tokenized_input = tokenizer(text_to_translate, return_tensors="pt", padding=True)

# Performing translation
with torch.no_grad():
    generated_ids = model.generate(
        tokenized_input['input_ids'],
        forced_bos_token_id=tokenizer.get_lang_id('ko')  # translate into Korean
    )

# Decoding
translated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(f"Translation Result: {translated_text}")
    

After executing the code above, the translated Korean sentence is printed.

6. Checking the Translation Result

The translation result is a Korean sentence generated by the model. The performance of the M2M100 model may vary depending on the sentence to be translated, but generally provides high-quality translations.

7. Translating Multiple Sentences

It is also easy to translate several sentences with the M2M100 model. You can prepare the sentences as shown below and translate them one by one in a loop; a batched variant is sketched after this example.


# Translating multiple sentences
sentences_to_translate = [
    "AI has established itself as one of the greatest technologies of the 21st century.",
    "The transformer architecture has brought innovation to natural language processing."
]

for sentence in sentences_to_translate:
    input_ids = tokenizer(sentence, return_tensors="pt", padding=True)['input_ids']
    with torch.no_grad():
        output_ids = model.generate(input_ids, forced_bos_token_id=tokenizer.get_lang_id('ko'))
    translated_sentence = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    print(f"{sentence} -> {translated_sentence}")
    

You can easily translate multiple sentences in this way.
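
Instead of looping sentence by sentence, the tokenizer can also pad a whole batch so that model.generate runs once for all sentences. A batched sketch, reusing the tokenizer, model, and sentences_to_translate defined above:

# Tokenize the whole batch at once, padding to the longest sentence
batch = tokenizer(sentences_to_translate, return_tensors="pt", padding=True)

with torch.no_grad():
    batch_output_ids = model.generate(
        **batch,
        forced_bos_token_id=tokenizer.get_lang_id('ko')  # translate into Korean
    )

# Decode every generated sequence in the batch
for source, output_ids in zip(sentences_to_translate, batch_output_ids):
    print(f"{source} -> {tokenizer.decode(output_ids, skip_special_tokens=True)}")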

8. Performance Evaluation

Metrics such as the BLEU (Bilingual Evaluation Understudy) score and the METEOR score are commonly used to evaluate automatic translation. They provide a quantitative assessment of how closely the model’s output matches reference translations.
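
As a concrete starting point, the hedged sketch below computes a sentence-level BLEU score with NLTK; both the reference and the candidate here are hypothetical placeholders that you would replace with real reference translations and model outputs.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical reference and candidate translations, tokenized into words
reference_tokens = "deep learning is a field of artificial intelligence".split()
candidate_tokens = "deep learning is a branch of artificial intelligence".split()

# Smoothing keeps short sentences from scoring zero when higher n-grams do not match
score = sentence_bleu([reference_tokens], candidate_tokens, smoothing_function=SmoothingFunction().method1)
print(f"BLEU score: {score:.4f}")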

9. Conclusion

In this course, we covered how to perform automatic translation using the M2M100 model with Hugging Face’s Transformers library. By using deep learning models efficiently, you can build a variety of natural language processing applications. We look forward to further advances through diverse deep learning models and techniques.
