Hugging Face Transformers Course, M2M100 Korean Text Tokenization

With the rapid advancements in deep learning, the field of Natural Language Processing (NLP) is undergoing remarkable changes. In particular, the Hugging Face library, which provides various pre-trained models, has been gaining attention recently. Today, I will introduce how to tokenize Korean text using the M2M100 model.

1. Introduction to the M2M100 Model

M2M100 is a pre-trained transformer model developed by Facebook AI for multilingual translation. The model covers 100 languages and can translate directly between any pair of them. M2M100 demonstrates strong translation performance at the sentence level and can be used effectively even for languages with relatively limited training resources, such as Korean.
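Although this tutorial focuses on tokenization, it helps to see where the tokenizer fits into the translation pipeline. Below is a minimal sketch that follows the standard Hugging Face usage for M2M100 (the Korean example sentence is my own; installing the required packages is covered in the next section).

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# Load the translation model and its tokenizer
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M", src_lang="ko")

# Korean source sentence ("Hello! I am learning deep learning.")
encoded = tokenizer("안녕하세요! 저는 딥러닝을 배우고 있습니다.", return_tensors="pt")

# Force the decoder to start with the English language token
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("en"))

# Print the English translation
print(tokenizer.batch_decode(generated, skip_special_tokens=True))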

2. Installing the Hugging Face Library

To use the model, you first need to install the Hugging Face Transformers library. The M2M100 tokenizer is built on SentencePiece, so the sentencepiece package should be installed as well. Both can be installed easily via pip.

pip install transformers sentencepiece

3. What is Tokenization?

Tokenization is the process of splitting an input sentence into individual units (tokens). Natural language processing models cannot process raw text directly, so the text is first split into tokens, and each token is then mapped to an integer ID that can be fed into the model. This splitting step is what we call 'tokenization'.

4. Using the M2M100 Tokenizer

Now, let’s explore how to tokenize Korean text using the M2M100 model. Execute the code below to load the tokenizer and tokenize an example Korean sentence.

from transformers import M2M100Tokenizer

# Load the M2M100 tokenizer and set the source language to Korean
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M", src_lang="ko")

# Input Korean text ("Hello! I am learning deep learning.")
text = "안녕하세요! 저는 딥러닝을 배우고 있습니다."

# Split the text into subword tokens
tokens = tokenizer.tokenize(text)

# Print the tokenization result
print("Tokenization result:", tokens)

4.1 Code Explanation

The code above loads the tokenizer for the M2M100 model with the M2M100Tokenizer class and sets the source language to Korean. It then passes a Korean sentence to the tokenizer.tokenize() method, which splits the sentence into subword tokens.
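The tokens returned by tokenize() are still strings. To obtain the integer IDs that the model actually consumes, the standard tokenizer methods can be used; here is a small sketch that continues from the code above.

# Map the subword tokens to their vocabulary IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs:", input_ids)

# Map the IDs back to tokens to check the round trip
print("Recovered tokens:", tokenizer.convert_ids_to_tokens(input_ids))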

5. Interpreting the Tokenization Output

The output of tokenization is a list of subword tokens in a format the model's vocabulary understands. For instance, the Korean sentence "안녕하세요! 저는 딥러닝을 배우고 있습니다." is split into SentencePiece subword pieces: frequent character sequences are kept together, while rarer ones are broken into smaller units, so the sentence can be fed to the model without losing information.

5.1 Example Output

Expected output: a list of subword tokens rather than whole words. Because M2M100 uses a SentencePiece vocabulary, each piece that begins a new word carries the '▁' prefix, and a single Korean word may be split into several pieces.

From the output, you can see that the sentence is not simply divided on spaces: the tokenizer segments it according to the subword vocabulary shared across all of the model's languages, and the '▁' marker shows where each original word started.
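Note that tokenize() only splits the text. When the tokenizer is called directly, as in the next section, it also adds the special tokens M2M100 expects: a source-language code token at the start of the sequence and an end-of-sentence token at the end. The short sketch below illustrates this; the exact IDs depend on the model's vocabulary.

# Calling the tokenizer directly returns model-ready IDs including special tokens
encoded = tokenizer(text, return_tensors="pt")
print("Input IDs:", encoded["input_ids"])

# The first ID should match the Korean language code token
print("Korean language token ID:", tokenizer.get_lang_id("ko"))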

6. Additional Options: Various Functions of the Tokenizer

The tokenizer provides additional functions beyond simple tokenization, including padding, truncation, and the handling of special tokens. Let's explore some of these features.
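For example, the special tokens that the tokenizer manages can be inspected directly through the standard tokenizer attributes; a quick sketch before we look at padding.

# Special tokens used by the M2M100 tokenizer
print("Pad token:", tokenizer.pad_token, tokenizer.pad_token_id)
print("End-of-sentence token:", tokenizer.eos_token, tokenizer.eos_token_id)
print("Unknown token:", tokenizer.unk_token, tokenizer.unk_token_id)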

6.1 Padding

When several sentences are processed as a batch, they must all have the same length before they can be stacked into a single tensor, so padding is added to the shorter ones. Refer to the code below.

# Prepare multiple Korean sentences
# ("Hello! I am learning deep learning." / "This lesson uses Hugging Face.")
texts = ["안녕하세요! 저는 딥러닝을 배우고 있습니다.", "이 강좌는 Hugging Face를 활용합니다."]

# Tokenize the sentences, pad them to a common length, and return PyTorch tensors
encoded_inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

print("Padded input:", encoded_inputs)

6.2 Decoding

The tokenizer can also convert token IDs back into text. This makes it easy to check what was encoded and to read the model's output later.

# Decode the first sentence's token IDs back to text
input_ids = encoded_inputs['input_ids'][0]

# skip_special_tokens=True drops the language code and end-of-sentence tokens
decoded_text = tokenizer.decode(input_ids, skip_special_tokens=True)

print("Decoded text:", decoded_text)

7. Conclusion

In this tutorial, we introduced how to effectively tokenize Korean text using the M2M100 model from Hugging Face. The M2M100 model exhibits excellent translation capabilities across various languages and performs well even with low-resource languages like Korean. This enables efficient use of the model in natural language processing applications.

8. Closing Remarks

We hope you will keep exploring the advances in deep learning and natural language processing. Thank you!