Hugging Face Transformers Tutorial: M2M100 Library Installation and Loading Pre-trained Models

In this course, we will walk through installing Hugging Face's Transformers library and loading the pretrained M2M100 model. The M2M100 model is designed for multilingual translation and supports translation between a wide range of languages.

1. What is Hugging Face Transformers?

Hugging Face Transformers is one of the most popular libraries in the field of Natural Language Processing (NLP). It provides a wide range of pretrained models, such as BERT, GPT-2, T5, and M2M100, each designed for specific NLP tasks, so that developers can use them with minimal effort.

2. Introduction to the M2M100 Model

The M2M100 model, released by Facebook AI (Meta AI), is designed for multilingual translation and supports over 100 languages. Its key innovation is that it can translate directly between any pair of supported languages without pivoting through an intermediary language such as English, which can improve translation quality, especially for non-English language pairs.
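M2M100 identifies each language by a short (mostly ISO 639-1) code, which is what you later pass to the tokenizer. The mapping below is a small illustrative subset of the 100 supported codes, and the `lang_code` helper is written just for this tutorial; it is not part of the Transformers library:

```python
# A small subset of the language codes M2M100 uses (the full list of
# 100 codes is on the facebook/m2m100_418M model card).
M2M100_LANG_CODES = {
    "english": "en",
    "korean": "ko",
    "french": "fr",
    "german": "de",
    "spanish": "es",
    "japanese": "ja",
    "chinese": "zh",
}

def lang_code(name: str) -> str:
    """Look up the M2M100 language code for a language name."""
    try:
        return M2M100_LANG_CODES[name.lower()]
    except KeyError:
        raise ValueError(f"Unknown language: {name!r}")
```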

3. Installing M2M100

To use the M2M100 model, you must first install the Hugging Face Transformers library. The M2M100 tokenizer also depends on the sentencepiece package, and the model runs on PyTorch, so install all three with the following command.

pip install transformers sentencepiece torch

3.1 Verifying Installation

Once the installation is complete, run the Python code below to verify that it has been installed correctly.


import transformers
print(transformers.__version__)
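M2M100 support was added during the 4.x series of Transformers, so you may want to check the version programmatically. The sketch below is an illustrative helper for this tutorial; for robust version handling, `packaging.version.parse` is the standard choice:

```python
def parse_version(v: str) -> tuple:
    """Turn a dotted version string like '4.30.2' into a comparable
    tuple of integers, ignoring any trailing non-numeric suffix."""
    nums = []
    for part in v.split("."):
        digits = ""
        for ch in part:
            if ch.isdigit():
                digits += ch
            else:
                break  # stop at the first non-digit (e.g. 'dev0')
        if not digits:
            break
        nums.append(int(digits))
    return tuple(nums)

def meets_minimum(installed: str, minimum: str) -> bool:
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)
```

For example, `meets_minimum(transformers.__version__, "4.4.0")` checks against 4.4.0, which (to the best of my knowledge) is the release that introduced M2M100.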

4. Loading a Pretrained Model

With the library installed, loading a pretrained M2M100 model takes only a few lines. The code below loads the 418M-parameter checkpoint and its tokenizer, then translates an English sentence into Korean.


from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Loading the tokenizer and model
model_name = 'facebook/m2m100_418M'
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# Text to be translated
text = "Hello, in this article, we will learn about the multilingual translation model using Hugging Face Transformers."

# Setting the input language
tokenizer.src_lang = "en"
encoded_text = tokenizer(text, return_tensors="pt")

# Performing translation
translated_tokens = model.generate(**encoded_text, forced_bos_token_id=tokenizer.get_lang_id("ko"))
translated_text = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
print(translated_text)

4.1 Code Explanation

The above code consists of the following steps:

  1. Loading the tokenizer and model: The M2M100 model and its corresponding tokenizer are loaded.
  2. Defining the text to be translated: The English sentence that you want to translate is defined.
  3. Setting the input language: The properties of the tokenizer are set to specify the input language as English.
  4. Performing translation: The model performs the translation, and the result is decoded to output the final translated text.
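The four steps above can be collected into a small reusable helper. This is a sketch written for this tutorial, not a function in the Transformers API; it takes the tokenizer and model as arguments so it works with any loaded M2M100 checkpoint:

```python
def translate(text, src_lang, tgt_lang, tokenizer, model):
    """Translate text with M2M100: set the source language, encode,
    generate with the forced target-language token, and decode."""
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt")
    generated = model.generate(
        **encoded, forced_bos_token_id=tokenizer.get_lang_id(tgt_lang)
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

For example, `translate("Hello!", "en", "ko", tokenizer, model)` reproduces the steps above in a single call.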

5. Translation Between Different Languages

Now, let’s attempt to translate into another language. For example, we will translate from English to French.


# English sentence to be translated
text_en = "Hello, in this article, we will learn about the M2M100 model from Hugging Face Transformers."
tokenizer.src_lang = "en"
encoded_text_en = tokenizer(text_en, return_tensors="pt")

# Translating to French
translated_tokens_fr = model.generate(**encoded_text_en, forced_bos_token_id=tokenizer.get_lang_id("fr"))
translated_text_fr = tokenizer.batch_decode(translated_tokens_fr, skip_special_tokens=True)[0]
print(translated_text_fr)
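Because the source encoding does not depend on the target language, you can encode the input once and generate into several targets. The helper below is again a tutorial sketch, not a library function:

```python
def translate_many(text, src_lang, tgt_langs, tokenizer, model):
    """Translate one source sentence into several target languages,
    reusing a single encoding of the input."""
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt")
    results = {}
    for tgt in tgt_langs:
        generated = model.generate(
            **encoded, forced_bos_token_id=tokenizer.get_lang_id(tgt)
        )
        results[tgt] = tokenizer.batch_decode(
            generated, skip_special_tokens=True
        )[0]
    return results
```

For example, `translate_many(text_en, "en", ["fr", "de", "es"], tokenizer, model)` returns a dictionary mapping each target code to its translation.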

6. Conclusion

Through this course, we learned how to install the Hugging Face Transformers library and load the pretrained M2M100 model. This powerful multilingual translation model can deliver high-quality translation between a wide range of languages without an intermediary language. I encourage you to explore more NLP tasks with it in the future.