The Hugging Face Transformers Practical Course: M2M100 Korean-English Automatic Translation

This course explains in detail how to perform automatic translation between Korean and English with the M2M100 model using Hugging Face's Transformers library. M2M100 is a multilingual translation model that supports translation between more than 100 languages. This article walks through an overview of M2M100, installation, data preparation, model loading, and the prediction process step by step.

1. Overview of the M2M100 Model

The M2M100 (Many-to-Many) model is a multilingual machine translation model developed by Facebook AI Research that can translate among more than 100 languages. The key advantages of M2M100 include:

  • Multilingual Support: Capable of translating between various languages such as English, Korean, Chinese, French, and more.
  • Direct Translation: Translates directly between supported language pairs without pivoting through English, using a single pre-trained network.
  • Ease of Use: Can be easily implemented and utilized through Hugging Face’s Transformers library.

2. Environment Setup and Installation

Here is how to install the necessary libraries and packages to use the M2M100 model. Follow the steps below to set up the environment.

pip install transformers torch

This command installs Hugging Face's Transformers library and PyTorch, which this tutorial uses as the backend. Once the installation is complete, you are ready to use the M2M100 model.
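After installing, a quick sanity check confirms that both packages import correctly. This is a minimal sketch; the printed versions will depend on your environment:

```python
# Verify that Transformers and PyTorch are installed and importable
import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # False is fine for CPU-only use
```

If both versions print without errors, the environment is ready.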

3. Loading the M2M100 Model

To load the M2M100 model, write the following code.


from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Load model and tokenizer
model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)
    

The code above loads the necessary model and tokenizer from Hugging Face’s hub.

4. Data Preparation

Prepare a sentence for translation. For example, consider translating an English sentence into Korean or a Korean sentence into English. The code below shows an example of preparing a sentence for translation.


# Sentence to translate
text_to_translate = "Deep learning is a field of artificial intelligence that enables computers to learn from data."
    

Now let’s translate this sentence using the M2M100 model.

5. Performing Translation

Translation is performed on the input sentence provided to the model: the text is first tokenized, and the token ids are then passed to the model to generate the translation.


import torch

# Declare the source language before tokenizing (the input here is English)
tokenizer.src_lang = "en"

# Tokenization
tokenized_input = tokenizer(text_to_translate, return_tensors="pt", padding=True)

# Performing translation
with torch.no_grad():
    generated_ids = model.generate(
        **tokenized_input,
        forced_bos_token_id=tokenizer.get_lang_id("ko")  # force Korean as the output language
    )

# Decoding
translated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(f"Translation Result: {translated_text}")

After executing the code above, the translated Korean sentence is printed as the result.

6. Checking the Translation Result

The translation result is the Korean sentence generated by the model. The quality of the M2M100 model's output may vary depending on the input sentence, but it generally provides high-quality translations.

7. Translating Multiple Sentences

It is also easy to translate multiple sentences with the M2M100 model. Prepare the sentences as shown below and translate them in a loop.


# Translating multiple sentences
sentences_to_translate = [
    "AI has established itself as one of the greatest technologies of the 21st century.",
    "The transformer architecture has brought innovation to natural language processing."
]

tokenizer.src_lang = "en"  # source language for all sentences
for sentence in sentences_to_translate:
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        output_ids = model.generate(input_ids, forced_bos_token_id=tokenizer.get_lang_id("ko"))
    translated_sentence = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    print(f"{sentence} -> {translated_sentence}")

You can easily translate multiple sentences in this way.

8. Performance Evaluation

Metrics such as the BLEU score and the METEOR score are used to evaluate the performance of automatic translation. They allow a quantitative assessment of the model's translation quality and are common evaluation methods in the field of natural language processing.
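To make the idea concrete, here is a toy sentence-level BLEU sketch in pure Python: modified n-gram precision combined with a brevity penalty. It is illustrative only (real evaluations should use a library such as sacreBLEU, which handles tokenization and smoothing properly):

```python
import math
from collections import Counter

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Toy sentence-level BLEU: geometric mean of modified n-gram
    precisions times a brevity penalty. Illustrative only."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clip each candidate n-gram count by its count in the reference
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # crude smoothing of zero counts
    # Brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(round(bleu("the cat sat on the mat", "the cat sat on the mat"), 3))  # identical -> 1.0
```

A perfect match scores 1.0, and scores fall toward 0 as the candidate diverges from the reference.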

9. Conclusion

In this course, we covered how to perform automatic translation with the M2M100 model using Hugging Face's Transformers library. By using deep learning models efficiently, you can build a wide range of natural language processing applications. We look forward to further advances through diverse deep learning models and techniques.
