Deep Learning PyTorch Course: Count-Based and Prediction-Based Embedding

This article explores embedding, a core technique in deep learning, and provides a detailed explanation of count-based and prediction-based embedding methods. Example code implementing these techniques with the PyTorch library is also provided.

1. What is Embedding?

Embedding refers to the method of converting high-dimensional data into lower dimensions while preserving meaning. It is commonly used in natural language processing (NLP) and recommendation systems. For example, embedding techniques are used to represent words as vectors to calculate semantic similarity between words. Embeddings can take various forms, and this article will explain the two main methods: count-based embedding and prediction-based embedding.
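As a minimal illustration (with a hypothetical three-word vocabulary made up purely for this sketch), PyTorch's nn.Embedding layer is essentially a lookup table that maps integer word indices to dense, trainable vectors:

import torch
import torch.nn as nn

# Hypothetical toy vocabulary: each word is assigned an integer index
word2idx = {"cat": 0, "dog": 1, "car": 2}

# Embedding table with 3 rows (vocabulary size) and 5 columns (embedding dimension)
embedding = nn.Embedding(num_embeddings=3, embedding_dim=5)

# Look up the (randomly initialized) vector for "dog"
dog_vector = embedding(torch.tensor([word2idx["dog"]]))
print(dog_vector.shape)  # torch.Size([1, 5])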

2. Count-Based Embedding

Count-based embedding builds embeddings from how frequently items occur in the data. The most representative examples are TF-IDF vectorization and Bag of Words (BoW). These methods characterize documents by how often each word occurs in them.
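For instance, a Bag of Words representation can be obtained with scikit-learn's CountVectorizer. The following is a minimal sketch, using two made-up sentences in the same style as the examples later in this article:

from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
documents = [
    "This is the first document.",
    "This document is the second document."
]

# Bag of Words: each document becomes a vector of raw word counts
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

print("Word list:", vectorizer.get_feature_names_out())
print("BoW matrix:\n", bow_matrix.toarray())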

2.1. Explanation of TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate how important a word is to a document. TF indicates how frequently the word appears in that document, while IDF indicates how rare the word is across the entire document collection.
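In its basic form, the score for a term t in a document d is:

TF-IDF(t, d) = TF(t, d) × IDF(t),   where IDF(t) = log(N / DF(t))

Here N is the total number of documents and DF(t) is the number of documents that contain t. Note that scikit-learn's TfidfVectorizer, used below, applies a smoothed IDF and L2 normalization, so its exact values differ slightly from this basic formula.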

2.2. Implementing TF-IDF with PyTorch

Below is a simple example of TF-IDF calculation using scikit-learn's TfidfVectorizer, with the resulting matrix converted to a PyTorch tensor.


import torch
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text data
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this document is the third document.",
    "The document ends here."
]

# TF-IDF vectorization
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
tfidf_array = tfidf_matrix.toarray()

# Convert the dense TF-IDF matrix to a PyTorch tensor for use in later models
tfidf_tensor = torch.tensor(tfidf_array, dtype=torch.float32)

# Output results
print("Word list:", vectorizer.get_feature_names_out())
print("TF-IDF matrix:\n", tfidf_tensor)
        

The above code computes a TF-IDF weight for every word in each document and outputs the word list together with the resulting TF-IDF matrix (here as a PyTorch tensor).

3. Prediction-Based Embedding

Prediction-based embedding is a method of learning embeddings for words or items with neural network models; techniques such as Word2Vec and GloVe are representative. Because the embedding of a word is learned from its surrounding words, semantically similar words end up close together in the vector space.

3.1. Explanation of Word2Vec

Word2Vec is a representative prediction-based embedding technique that maps words to a vector space and provides two models: Continuous Bag of Words (CBOW) and Skip-Gram. The CBOW model uses the surrounding words of a given word to predict that word, while the Skip-Gram model predicts the surrounding words from a given word.
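To make the Skip-Gram setup concrete, the following small helper (assumed here purely for illustration; it is not part of the model code below) generates (center word, context word) training pairs from a sentence using a fixed window size:

def skipgram_pairs(sentence, window_size=2):
    # Generate (center, context) pairs for Skip-Gram training
    tokens = sentence.lower().split()
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("This is the first document"))
# [('this', 'is'), ('this', 'the'), ('is', 'this'), ('is', 'the'), ...]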

3.2. Implementing Word2Vec with PyTorch

Below is an example of implementing the Skip-Gram model using PyTorch.


import torch
import torch.nn as nn
import torch.optim as optim
from collections import Counter

# Define a function to prepare sample data
def prepare_data(documents):
    words = [word for doc in documents for word in doc.lower().split()]
    word_counts = Counter(words)
    vocabulary_size = len(word_counts)
    word2idx = {word: i for i, word in enumerate(word_counts.keys())}
    idx2word = {i: word for word, i in word2idx.items()}
    return word2idx, idx2word, vocabulary_size

# Define the Skip-Gram model
class SkipGramModel(nn.Module):
    def __init__(self, vocab_size, embed_size):
        super(SkipGramModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # Projects the center-word embedding to scores over the whole vocabulary
        self.output = nn.Linear(embed_size, vocab_size)

    def forward(self, center_word):
        embedded = self.embedding(center_word)
        return self.output(embedded)

# Settings and data preparation
documents = [
    "This is the first document",
    "This document is the second document",
    "And this document is the third document"
]
word2idx, idx2word, vocab_size = prepare_data(documents)

# Model setup and training
embed_size = 10
model = SkipGramModel(vocab_size, embed_size)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Example input: the center word 'first' should predict the context word 'document'
input_word = torch.tensor([word2idx['first']])
target_word = torch.tensor([word2idx['document']])

# Training process (1 epoch example)
for epoch in range(1):
    model.zero_grad()
    # Prediction: scores over the vocabulary for the given center word
    predictions = model(input_word)
    # Calculate loss against the context word
    loss = loss_function(predictions, target_word)
    loss.backward()
    optimizer.step()

# Output results
print("Embedding vector of the word 'first':\n", model.embedding.weight[word2idx['first']].detach().numpy())
        

The above code implements a minimal Skip-Gram model in PyTorch: the embedding layer maps a center word to a vector, a linear layer turns that vector into scores over the vocabulary, and the model is trained here on a single (center, context) pair for illustration. Finally, it prints the learned embedding vector for a specific word.
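As a follow-up, once the embeddings have been trained on many real (center, context) pairs rather than the single pair above, semantically related words can be compared with cosine similarity. A minimal sketch, assuming the model and word2idx from the example above are still in scope:

import torch.nn.functional as F

# Compare two learned word vectors with cosine similarity
vec_a = model.embedding.weight[word2idx['first']]
vec_b = model.embedding.weight[word2idx['second']]
similarity = F.cosine_similarity(vec_a.unsqueeze(0), vec_b.unsqueeze(0))
print("Cosine similarity:", similarity.item())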

4. Conclusion

In this article, we explored the concept of embedding along with count-based and prediction-based embedding techniques. Count-based methods like TF-IDF are based on the frequency of data occurrences, while prediction-based methods like Word2Vec learn the meanings of words through deep learning models. We learned the characteristics of each embedding technique and the process of applying them through practical examples.

In deep learning, understanding the characteristics of data and selecting embedding techniques based on that is crucial, as it can significantly enhance the performance of the model. In upcoming content, we plan to discuss how to expand these techniques to implement more complex models, so please stay tuned.

Thank you for reading this article!