Deep Learning PyTorch Course: Korean Embedding

With the advancement of deep learning, many innovations have also been made in the field of Natural Language Processing (NLP). In particular, embedding, which is a vector representation of language, plays an important role in deep learning models. In this article, we will explain in detail how to implement Korean embedding using PyTorch.

1. What is Embedding?

Embedding is the process of converting words or sentences into dense numeric vectors that machine learning models can work with. Because words with similar meanings are mapped to nearby points in the vector space, embeddings capture similarity between words. For example, the embedding vectors for ‘king’ and ‘queen’ will be located close to each other.
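
As a minimal illustration (this snippet is an addition, with made-up indices and dimensions), the code below shows how PyTorch's nn.Embedding maps word indices to dense vectors and how cosine similarity can compare them; the vectors only become meaningful after training.

import torch
import torch.nn as nn
import torch.nn.functional as F

# A toy embedding table: 10 words, each represented by a 4-dimensional vector
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

# Hypothetical indices standing in for 'king' and 'queen'
king, queen = torch.tensor([1]), torch.tensor([2])

# Cosine similarity between the two vectors (close to 1.0 means similar)
print(F.cosine_similarity(embedding(king), embedding(queen)))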

2. Korean Natural Language Processing

Korean is an agglutinative language in which words are built from multiple morphemes, which makes natural language processing more complex than for languages like English. To address this, a Korean morphological analyzer is typically used. Representative tools include KoNLPy (which bundles analyzers such as Okt, Komoran, and Hannanum), MeCab-ko, and khaiii.
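
For example (the sample sentence below is an illustration, assuming KoNLPy is installed as shown in the next subsection), a plain whitespace split leaves particles attached to the noun, while a morphological analyzer separates them:

from konlpy.tag import Okt

okt = Okt()
sentence = "딥러닝을 공부합니다."  # "I study deep learning."

# Whitespace split keeps the particle attached to the noun (e.g., '딥러닝을')
print(sentence.split())

# Morphological analysis separates the noun '딥러닝' from the particle '을'
print(okt.morphs(sentence))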

2.1 Installing and Using KoNLPy

KoNLPy is a library that helps you easily perform Korean natural language processing. Note that KoNLPy requires a Java runtime (JDK), since several of its analyzers are implemented in Java. Below is how to install it; basic usage follows in the next subsection.

!pip install konlpy

2.2 Basic Usage Example

from konlpy.tag import Okt

okt = Okt()
text = "딥러닝은 인공지능의 한 분야입니다."  # "Deep learning is a field of artificial intelligence."
print(okt.morphs(text))   # Morphological analysis
print(okt.nouns(text))    # Noun extraction
print(okt.phrases(text))  # Phrase extraction
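
In addition to the methods above, Okt also provides part-of-speech tagging; this extra call is not part of the original example.

print(okt.pos(text))  # Part-of-speech tagging: (morpheme, tag) pairs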
    

3. Implementing Embedding with PyTorch

Now we are ready to process the Korean data, build a model, and train the embeddings.

3.1 Preparing the Dataset

We will prepare the text data. Here, we will use a simple list of Korean sentences.

sentences = [
    "안녕하세요",                               # "Hello"
    "딥러닝은 재미있습니다.",                     # "Deep learning is fun."
    "파이썬으로 머신러닝을 배울 수 있습니다.",       # "You can learn machine learning using Python."
    "인공지능은 우리의 미래입니다."                 # "Artificial intelligence is our future."
]
    

3.2 Text Preprocessing

We will use a morphological analyzer to extract words and prepare to create embeddings from them.

from collections import Counter
from konlpy.tag import Okt

# Morphological analysis: split each sentence into morphemes
def preprocess(sentences):
    okt = Okt()
    tokens = [okt.morphs(sentence) for sentence in sentences]
    return tokens

tokens = preprocess(sentences)

# Build the vocabulary, ordered by frequency
flat_list = [item for sublist in tokens for item in sublist]
word_counter = Counter(flat_list)
word_vocab = {word: i + 1 for i, (word, _) in enumerate(word_counter.most_common())}  # 0 is reserved for padding
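
As a quick check (this snippet is an addition), each tokenized sentence can now be mapped to its sequence of vocabulary indices; this is exactly what the Dataset in the next step will do.

# Convert every tokenized sentence into a list of vocabulary indices
indexed = [[word_vocab[word] for word in sentence] for sentence in tokens]
print(indexed)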
    

3.3 Configuring the PyTorch DataLoader

We will wrap the tokenized sentences in a PyTorch Dataset and batch them with a DataLoader. Because the sentences have different lengths, each batch is padded with the reserved index 0 so the tensors can be stacked.

import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

class CustomDataset(Dataset):
    def __init__(self, tokens, word_vocab):
        self.tokens = tokens
        self.word_vocab = word_vocab

    def __len__(self):
        return len(self.tokens)

    def __getitem__(self, idx):
        sentence = self.tokens[idx]
        return torch.tensor([self.word_vocab[word] for word in sentence], dtype=torch.long)

def collate_fn(batch):
    # Pad variable-length sentences with the reserved index 0 so they can be stacked into one tensor
    return pad_sequence(batch, batch_first=True, padding_value=0)

dataset = CustomDataset(tokens, word_vocab)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_fn)
    

3.4 Building the Embedding Model

Now we will build a model that contains an embedding layer, followed by a linear layer that maps each embedding back to vocabulary-sized logits. The logits are needed for the training objective in the next step.

import torch.nn as nn

class WordEmbeddingModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(WordEmbeddingModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        # Maps each embedding back to vocabulary-sized logits so we can train with CrossEntropyLoss
        self.output = nn.Linear(embedding_dim, vocab_size)

    def forward(self, input):
        embeds = self.embeddings(input)
        return self.output(embeds)

embedding_dim = 5
model = WordEmbeddingModel(vocab_size=len(word_vocab) + 1, embedding_dim=embedding_dim)
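
As a quick sanity check (this snippet is an addition), we can pass one padded batch through the model and confirm the output shape:

# One padded batch of shape (batch_size, seq_len) produces logits of shape (batch_size, seq_len, vocab_size)
batch = next(iter(dataloader))
print(batch.shape, model(batch).shape)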
    

3.5 Training the Embedding

To train the model, we will set up a loss function and an optimizer. As a simple illustration, each word is trained to reconstruct its own index from its embedding, so the input indices double as the labels; the padding index 0 is ignored in the loss.

# Ignore the padding index 0 when computing the loss
loss_function = nn.CrossEntropyLoss(ignore_index=0)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Training for just 5 epochs as a simple example
for epoch in range(5):
    for i, data in enumerate(dataloader):
        model.zero_grad()
        output = model(data)   # (batch_size, seq_len, vocab_size)
        label = data.view(-1)  # Each word is its own label (self-reconstruction example)
        loss = loss_function(output.view(-1, len(word_vocab) + 1), label)
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch + 1}, Loss: {loss.item()}")
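
After training, the learned vector for any word in the vocabulary can be read directly from the embedding layer; this lookup is an addition for illustration.

# Inspect the learned embedding vector of one vocabulary word
word = next(iter(word_vocab))
print(word, model.embeddings.weight[word_vocab[word]])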
    

3.6 Visualizing the Embedding Results

We can visualize the embedding results to intuitively understand the relationships between words. Here, we will use t-SNE to project the vectors to 2D. Because our vocabulary is small, the t-SNE perplexity must be kept below the number of words, and the padding row of the embedding matrix is excluded.

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def visualize_embeddings(model, word_vocab):
    # Skip row 0 of the embedding matrix, which belongs to the padding index
    embeddings = model.embeddings.weight.data.numpy()[1:]
    words = list(word_vocab.keys())

    # t-SNE requires perplexity < number of samples, so keep it small for this tiny vocabulary
    tsne = TSNE(n_components=2, perplexity=min(5, len(words) - 1), random_state=0)
    embeddings_2d = tsne.fit_transform(embeddings)

    # Note: displaying Korean labels requires a matplotlib font that supports Hangul
    plt.figure(figsize=(10, 10))
    for i, word in enumerate(words):
        plt.scatter(embeddings_2d[i, 0], embeddings_2d[i, 1])
        plt.annotate(word, (embeddings_2d[i, 0], embeddings_2d[i, 1]), fontsize=9)
    plt.show()

visualize_embeddings(model, word_vocab)
    

4. Conclusion

This article covered the process of implementing Korean embeddings using PyTorch. Embeddings play an important role in natural language processing, and Korean in particular calls for preprocessing tailored to its morphological characteristics. As a next step, we recommend experimenting with more complex models and larger datasets.

I hope this article helps improve your understanding of deep learning and natural language processing. If you have any questions, please leave a comment!