With the development of deep learning, Natural Language Processing (NLP) has undergone significant changes. In particular, Hugging Face’s transformers library has established itself as a powerful tool for NLP tasks. In this course, we will introduce the multilingual translation and tokenization process using the M2M100 model.
Overview of the M2M100 Model
The M2M100 (Many-to-Many multilingual translation) model supports direct translation between more than 100 languages. Earlier translation systems often used an indirect approach: text was first translated from the source language into a pivot language (typically English) and then into the target language. M2M100 overcomes this limitation by translating directly between language pairs, significantly improving translation efficiency across many combinations of languages.
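To see why direct translation matters at this scale, here is a quick back-of-the-envelope comparison (a sketch for illustration, not a figure from the M2M100 paper) of how many translation directions each approach covers:

```python
# Number of supported languages (M2M100 covers 100)
n = 100

# Direct many-to-many translation: every ordered language pair
direct_pairs = n * (n - 1)

# English-pivot approach: each non-English language pairs only with
# English, in both directions
pivot_pairs = 2 * (n - 1)

print(direct_pairs)  # 9900 translation directions served by one model
print(pivot_pairs)   # 198 directions trained directly in a pivot system
```

A single model serving 9,900 directions avoids the error accumulation that comes from chaining two translations through a pivot language.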
What is Tokenization?
Tokenization is the process of dividing input text into smaller units called tokens. The resulting tokens are collected into a list, and each unique token is assigned an index. Tokenization is an essential step in NLP and must be performed before text data can be fed into a model.
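As a conceptual illustration, the idea can be sketched with a toy whitespace tokenizer (real tokenizers such as M2M100's operate on subword units, not whitespace, but the token-to-index mapping works the same way):

```python
# Toy example: split text into tokens and assign each unique token an index.
text = "hello world hello NLP"
tokens = text.split()                      # ['hello', 'world', 'hello', 'NLP']

# Build a vocabulary mapping each unique token to an index
vocab = {}
for token in tokens:
    if token not in vocab:
        vocab[token] = len(vocab)

ids = [vocab[token] for token in tokens]   # [0, 1, 0, 2]
print(tokens)
print(ids)
```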
Environment Setup
Before proceeding with the course, you need to install the required libraries: transformers and torch. You can install them with the following command:
pip install transformers torch
Loading the Tokenizer
To load the tokenizer for the M2M100 model, we will use the M2M100Tokenizer class provided by the transformers library.
import torch
from transformers import M2M100Tokenizer
# Load the tokenizer for the M2M100 model
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
Tokenization Process
Now we are ready to tokenize the text. Below is an example of tokenizing the sentence "Hello, everyone!".
# Input text
text = "Hello, everyone!"
# Tokenizing the text
encoded_input = tokenizer(text, return_tensors="pt")
# Output tokens and indices
print("Tokenized tokens:", tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0]))
print("Token indices:", encoded_input['input_ids'])
Tokenization Result
The output generated after running the above code shows how the input text has been tokenized and indexed. You can check the actual values of the tokens using the convert_ids_to_tokens method.
Multilingual Translation
Using the tokenized data, we can perform multilingual translation. I will show you an example of translating Korean to English using the M2M100 model.
from transformers import M2M100ForConditionalGeneration
# Load the M2M100 model
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
# Korean source text; M2M100 needs the source language set on the tokenizer
tokenizer.src_lang = "ko"
text = "안녕하세요, 여러분!"
encoded_input = tokenizer(text, return_tensors="pt")
# Translation
translated_tokens = model.generate(**encoded_input, forced_bos_token_id=tokenizer.get_lang_id("en"))
# Translation result
translated_text = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)
print("Translation result:", translated_text[0])
Interpretation of the Translation Result
You can check if the Korean sentence has been accurately translated into English using the code above. The generate method produces the translated result based on the input token data.
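Conceptually, generate decodes the output one token at a time, repeatedly choosing a next token until an end-of-sequence token appears. The toy loop below (a hypothetical lookup-table "model", not the real M2M100 decoder) sketches that idea; the first decoder token fixes the target language, which is the role forced_bos_token_id plays above:

```python
# Toy greedy decoding loop. A real model predicts a probability
# distribution over the vocabulary at each step; here a dict plays
# that role deterministically.
next_token = {
    "<en>": "Hello",      # target-language token starts the sequence
    "Hello": ",",
    ",": "everyone",
    "everyone": "!",
    "!": "</s>",          # end-of-sequence token stops generation
}

def greedy_generate(start_token, max_length=10):
    output = []
    token = start_token
    for _ in range(max_length):
        token = next_token[token]
        if token == "</s>":
            break
        output.append(token)
    return output

print(greedy_generate("<en>"))  # ['Hello', ',', 'everyone', '!']
```

The real generate method additionally supports strategies such as beam search, but the stop-on-end-of-sequence structure is the same.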
Conclusion
In this course, we explored the multilingual tokenization and translation process using Hugging Face’s M2M100 model. The progress in the field of natural language processing will continue, and using such tools will enable better communication across various languages. I hope that interest and research in NLP and deep learning will continue in the future.