With recent advances in artificial intelligence, the field of natural language processing has made significant progress. In particular,
Hugging Face’s Transformers library has established itself as a tool that makes a wide range of language models easy to use. In this course, we will
explain in detail how to implement automatic translation between Chinese and English using the M2M100 model with Hugging Face.
1. Introduction to the M2M100 Model
M2M100 is a multilingual translation model that supports direct conversion between many language pairs. It supports over 100 languages and, unlike
traditional translation systems, can translate directly between any two of them without going through an intermediate (pivot) language such as English.
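M2M100 identifies each language by a short ISO 639-1 code. The sketch below collects the codes used later in this course into a small lookup table; this is only an illustrative subset (the model card for facebook/m2m100_418M lists the full set of 100+ codes).

```python
# Subset of the ISO 639-1 language codes used in this course.
# The facebook/m2m100_418M model card lists all supported codes.
LANG_CODES = {
    "zh": "Chinese",
    "en": "English",
    "ko": "Korean",
    "fr": "French",
    "es": "Spanish",
    "ja": "Japanese",
}

def is_supported(code: str) -> bool:
    """Check a language code against the subset above."""
    return code in LANG_CODES

print(is_supported("zh"))  # True
print(is_supported("xx"))  # False
```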
2. Installation and Setup
To use the M2M100 model, you first need to install the Hugging Face Transformers library and related dependencies. Note that the M2M100 tokenizer also requires the
sentencepiece
package. You can install everything with the
pip
command shown below.
pip install transformers torch sentencepiece
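If you want to confirm the installation succeeded before moving on, a small helper using only the standard library can report which packages are still missing. This is just an optional convenience sketch, not part of the Transformers API.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of package names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# The packages this course relies on
required = ["transformers", "torch", "sentencepiece"]
missing = missing_packages(required)
if missing:
    print("Please install:", ", ".join(missing))
else:
    print("All dependencies are available.")
```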
3. Loading the Model and Implementing the Translation Function
First, load the M2M100 model and its tokenizer. The following code loads both and implements a simple function for translation.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
# Load model and tokenizer
model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)
def translate(text, source_language, target_language):
    tokenizer.src_lang = source_language
    encoded_input = tokenizer(text, return_tensors="pt")
    generated_tokens = model.generate(
        **encoded_input,
        forced_bos_token_id=tokenizer.get_lang_id(target_language),
    )
    return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
3.1 Explanation of the Translation Function
The code above works as follows:
tokenizer.src_lang: sets the source language.
tokenizer(): tokenizes the input text.
model.generate(): performs translation based on the tokenized input.
tokenizer.batch_decode(): decodes the generated tokens and returns the translated text.
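The four steps above can be made concrete with a toy stand-in for the tokenizer and model. Everything below is a deliberately simplified illustration of the encode → generate → decode data flow; the dictionary-based "vocabulary" and echo-style "generation" are not the real M2M100 API, which operates on tensors.

```python
# Toy walk-through of the encode -> generate -> decode flow.
# All of this is a simplified stand-in, NOT the real M2M100 API.
vocab = {"hello": 1, "world": 2, "<en>": 100, "<zh>": 101}
inv_vocab = {v: k for k, v in vocab.items()}

def toy_encode(text, src_lang):
    # Like tokenizer(text) with tokenizer.src_lang set: map words to ids,
    # prefixed with a source-language token
    return [vocab[f"<{src_lang}>"]] + [vocab[w] for w in text.split()]

def toy_generate(token_ids, forced_bos_id):
    # Like model.generate(): a real model produces new ids; here we just
    # echo the input after the forced target-language token
    return [forced_bos_id] + token_ids[1:]

def toy_decode(token_ids):
    # Like batch_decode(skip_special_tokens=True): drop language tokens,
    # map ids back to words, and join
    return " ".join(inv_vocab[t] for t in token_ids if t < 100)

ids = toy_encode("hello world", "en")
out = toy_generate(ids, vocab["<zh>"])
print(toy_decode(out))  # hello world
```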
4. Translation Examples
Now, let’s test the translation functionality. The example below demonstrates translating a Chinese sentence into English.
# Sentence to be translated
text = "你好,世界!" # Hello, World!
source_lang = "zh" # Chinese
target_lang = "en" # English
# Perform translation
translated_text = translate(text, source_lang, target_lang)
print(f"Translation result: {translated_text}")
4.1 Interpretation of the Results
When the above code is executed, the output should be an English sentence along the lines of “Hello, World!” (the exact wording can vary slightly between model versions). The M2M100 model handles even languages
with relatively complex sentence structures effectively.
5. Multilingual Translation Examples
One of the powerful features of the M2M100 model is its support for multiple languages. The example below performs translation between various languages
including Korean, French, and Spanish.
# Multilingual translation test
samples = [
    {"text": "여러 언어를 지원하는 모델입니다.", "source": "ko", "target": "en"},  # Korean to English
    {"text": "Bonjour le monde!", "source": "fr", "target": "ko"},  # French to Korean
    {"text": "¡Hola Mundo!", "source": "es", "target": "ja"},  # Spanish to Japanese
]
for sample in samples:
    translated = translate(sample["text"], sample["source"], sample["target"])
    print(f"{sample['text']} ({sample['source']}) -> {translated} ({sample['target']})")
5.1 Multilingual Translation Results
Running the code above will output translations between several languages. The important point is that the M2M100 model can translate various languages
directly without going through an intermediate language.
6. Performance Evaluation
To evaluate the quality of translations, the BLEU (Bilingual Evaluation Understudy) score can be used. The BLEU score quantitatively measures the
similarity between the generated translation and the reference translation. The following is the process to calculate the BLEU score.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction  # requires: pip install nltk
# Reference translation and system translation (as token lists)
reference = ["Hello", "World"]
candidate = translated_text.split()
# Calculate BLEU score (smoothing avoids a zero score on short sentences,
# which lack the higher-order n-grams BLEU counts by default)
smoothing = SmoothingFunction().method1
bleu_score = sentence_bleu([reference], candidate, smoothing_function=smoothing)
print(f"BLEU score: {bleu_score:.4f}")
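To see what BLEU measures at its simplest level, the 1-gram (unigram) component can be computed by hand using only the standard library: the fraction of candidate words that also appear in the reference, with per-word counts clipped. This is an illustrative sketch, not a substitute for nltk’s implementation, which also combines higher-order n-grams and applies a brevity penalty.

```python
from collections import Counter

def unigram_precision(reference, candidate):
    """Clipped 1-gram precision: the fraction of candidate tokens that
    also appear in the reference, clipping each token's count."""
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    matched = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return matched / max(len(candidate), 1)

reference = ["Hello", "World"]
print(unigram_precision(reference, ["Hello", "World"]))  # 1.0
print(unigram_precision(reference, ["Hello", "there"]))  # 0.5
```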
6.1 Interpretation of Performance Evaluation
A BLEU score close to 0 indicates a poor translation, while a score close to 1 indicates a high-quality translation.
Evaluating with a variety of examples and reference translations gives a fuller picture of translation performance across languages.
7. Conclusion
Hugging Face’s M2M100 model represents a significant advance in the field of multilingual translation.
In this course, we explored a basic example of automatic translation between Chinese and English using the M2M100 model. Because the model converts between languages directly, it can translate between many language pairs without an intermediate language.
Going forward, try experimenting with more languages and more complex sentences to explore what the model can do. The Hugging Face Transformers library applies to a wide range of NLP tasks, so feel free to use it in other projects as well.