Hugging Face Transformers Tutorial: Preparing Chinese Text for M2M100 Translation

1. Introduction

Natural language processing (NLP) has recently drawn significant attention within artificial intelligence, and machine translation in particular has become a crucial enabler of global communication. This article explains how to prepare Chinese text and translate it with the M2M100 model using Hugging Face’s Transformers library. M2M100 is a multilingual translation model that supports translation between a wide range of languages and handles linguistically complex languages such as Chinese well.

2. Overview of the M2M100 Model

The M2M100 model is a multilingual machine translation model developed by Facebook AI Research (FAIR). It is designed to translate between more than 100 languages and is based on the Transformer architecture. What is remarkable is that M2M100 does not depend on English-centric language pairs: it translates directly between languages, so Chinese can be translated into another language without pivoting through English.

M2M100 consists of two main components: an encoder and a decoder. The encoder converts the input sentence into a sequence of numerical vector representations, and the decoder generates the output sentence from those vectors. Both are built from Transformer layers, and this encoder-decoder architecture is the backbone of modern machine translation systems.
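
Once the libraries from the next section are installed, you can inspect this structure directly. A minimal sketch (the layer counts printed are those of the 418M checkpoint used later in this tutorial):

    from transformers import M2M100ForConditionalGeneration

    # Load the smallest public M2M100 checkpoint and inspect its structure
    model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

    print(type(model.model.encoder).__name__)  # the encoder module
    print(type(model.model.decoder).__name__)  # the decoder module
    print(model.config.encoder_layers, model.config.decoder_layers)  # layers in each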

3. Installation and Setup

To proceed with this tutorial, Python and several essential libraries must be installed. We will use Hugging Face’s Transformers library and PyTorch. Here’s how to install them:

                
    pip install transformers torch

Enter the above command in the terminal to install the necessary libraries.
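
To confirm the installation succeeded, you can import both libraries from Python and print their versions:

    # Check that transformers and torch are importable and show their versions
    import transformers
    import torch

    print(transformers.__version__)
    print(torch.__version__)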

4. Preparing the Dataset

To run a translation model, appropriate input data is necessary. In this tutorial, we will prepare a handful of Chinese sentences. Here is how to create a small dataset of simple Chinese sentences.

                
    # List of Chinese sentences (English meanings shown in the comments)
    chinese_sentences = [
        "你好，世界！",  # Hello, world!
        "今天天气很好。",  # The weather is nice today.
        "我喜欢学习深度学习。",  # I like to study deep learning.
        "人工智能正在改变我们的生活。",  # Artificial intelligence is changing our lives.
        "你想吃什么？",  # What do you want to eat?
    ]

The code above defines five simple Chinese sentences in a Python list. Real projects typically require a much larger dataset.
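
For larger inputs, a common approach is to read sentences from a plain-text file, one per line. A minimal sketch, assuming a hypothetical file named chinese_sentences.txt:

    # Load sentences from a UTF-8 text file, one sentence per line
    # ("chinese_sentences.txt" is a hypothetical filename used for illustration)
    with open("chinese_sentences.txt", encoding="utf-8") as f:
        chinese_sentences = [line.strip() for line in f if line.strip()]

    print(f"Loaded {len(chinese_sentences)} sentences")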

5. Loading the Model and Translation

Now, let’s perform translation using the M2M100 model with the prepared dataset. The model can be easily loaded through Hugging Face’s Transformers library. Here is an example of translating Chinese sentences using the M2M100 model.

                
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    # Load the model and tokenizer
    model_name = "facebook/m2m100_418M"
    model = M2M100ForConditionalGeneration.from_pretrained(model_name)
    tokenizer = M2M100Tokenizer.from_pretrained(model_name)

    def translate(text, target_lang="en"):
        tokenizer.src_lang = "zh"  # Set the source language to Chinese
        encoded = tokenizer(text, return_tensors="pt")
        # Force the decoder to start generating in the target language
        generated_tokens = model.generate(
            **encoded, forced_bos_token_id=tokenizer.get_lang_id(target_lang)
        )
        return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

    # Perform translation
    for sentence in chinese_sentences:
        translated = translate(sentence)
        print(f"Original: {sentence} -> Translated: {translated[0]}")

In the code above, the model and tokenizer are loaded first, and then a function that translates each sentence is defined. The `translate` function sets the source language to Chinese (`zh`), and the `forced_bos_token_id` argument tells the model which target language to generate, so the function returns the translation in the desired target language.
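
If you are translating many sentences, running on a GPU and batching the inputs is noticeably faster. A minimal sketch, assuming a CUDA-capable GPU is available (it falls back to the CPU otherwise):

    import torch

    # Move the model to the GPU if one is available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)

    def translate_batch(texts, target_lang="en"):
        tokenizer.src_lang = "zh"
        # Tokenize all sentences at once, padding them to the same length
        encoded = tokenizer(texts, return_tensors="pt", padding=True).to(device)
        generated_tokens = model.generate(
            **encoded, forced_bos_token_id=tokenizer.get_lang_id(target_lang)
        )
        return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

    print(translate_batch(chinese_sentences))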

6. Checking the Output Results

The results of running the above code are as follows:

                
    Original: 你好，世界！ -> Translated: Hello, world!
    Original: 今天天气很好。 -> Translated: The weather is nice today.
    Original: 我喜欢学习深度学习。 -> Translated: I like to study deep learning.
    Original: 人工智能正在改变我们的生活。 -> Translated: Artificial intelligence is changing our lives.
    Original: 你想吃什么？ -> Translated: What do you want to eat?

As shown, each sentence has been translated into English. The exact wording may vary with the model version and decoding settings; this example demonstrates the end-to-end process of translating Chinese sentences into English using the M2M100 model.

7. Practice and Applications

Language translation can be applied in many fields, for example localizing multinational corporate websites, translating travel guides, and supporting multilingual customer service. In addition, because M2M100 supports many language pairs directly, translations do not have to pass through a pivot language, which often yields more natural results.

As an additional exercise, you can translate into other languages such as Korean or Japanese. To do this, simply pass the desired language code (e.g., `ko` for Korean) as the `target_lang` argument of the `translate` function. The code below translates into Korean.

                
    # Translating to Korean
    for sentence in chinese_sentences:
        translated = translate(sentence, target_lang="ko")
        print(f"Original: {sentence} -> Translated: {translated[0]}")
                
            

8. Conclusion

In this tutorial, we learned how to prepare and translate Chinese sentences using the M2M100 model with Hugging Face’s Transformers library. Translation technology continues to advance as new models and algorithms are researched, and deep learning models like M2M100 can greatly improve the efficiency of multilingual translation.

I hope this article deepens your understanding of deep learning and natural language processing, and I encourage you to try it out yourself. We will cover more deep learning techniques and practices in the next tutorial, so please stay tuned.

Author: [Your Name]

Contact: [Your Email]