Deep Learning for Natural Language Processing: Implementing Word2Vec with Negative Sampling (Skip-Gram with Negative Sampling, SGNS)

Natural Language Processing (NLP) is the technology that allows computers to understand and process human language. In recent years, the performance of NLP systems has improved significantly thanks to advances in deep learning. This article takes a detailed look at one deep-learning-based NLP technique: the Skip-Gram model of Word2Vec and negative sampling, the training method that makes it practical to implement.

1. Basics of Natural Language Processing

Natural language processing involves analyzing the characteristics of language and transforming words, sentences, and context into a form that computers can work with. Various techniques are used for this purpose; among them, representing the meaning of words as vectors is especially important.

2. Concept of Word2Vec

Word2Vec is an algorithm that converts words into vectors so that semantically similar words are mapped to similar vectors. This allows machines to capture the meaning of language more effectively. Word2Vec has two main models: the Continuous Bag of Words (CBOW) model and the Skip-Gram model.

2.1 Continuous Bag of Words (CBOW)

The CBOW model predicts the center word from its surrounding (context) words. For example, in the sentence “The cat sits on the mat”, the model predicts “sits” from the context words “The”, “cat”, “on”, “the”, “mat”.
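
As a rough illustration, a single CBOW training example for that sentence looks like this (the variable names are illustrative, not from any library):

# One CBOW training example for “The cat sits on the mat”
context_words = ["The", "cat", "on", "the", "mat"]  # input: the surrounding words
center_word = "sits"                                # output: the word to predict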

2.2 Skip-Gram Model

The Skip-Gram model is the reverse of CBOW: it predicts the surrounding context words from a given center word. This model is particularly effective at learning representations for rare words and captures semantic relationships between words well.
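
As a concrete sketch, the function below generates (center, context) training pairs from a tokenized sentence using a fixed window size; the function name and default window size are illustrative choices, not part of any library:

def generate_skipgram_pairs(tokens, window_size=2):
    """Return (center, context) word pairs for the Skip-Gram model."""
    pairs = []
    for i, center in enumerate(tokens):
        # Context words are the neighbors within window_size positions
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(generate_skipgram_pairs("the cat sits on the mat".split()))
# [('the', 'cat'), ('the', 'sits'), ('cat', 'the'), ('cat', 'sits'), ('cat', 'on'), ...]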

3. Negative Sampling

The Skip-Gram model of Word2Vec is expensive to train because, with a full softmax output, every update requires computing scores over the entire vocabulary. Negative sampling was introduced to reduce this cost: instead of normalizing over all words, the model only has to distinguish the observed context word from a small number of words (negative samples) drawn at random from the word distribution of the corpus.

3.1 Principle of Negative Sampling

The core idea of negative sampling is to train the model on a mix of positive samples (observed center–context pairs) and negative samples (randomly drawn words that did not appear in the context). The model learns to assign high scores to the positive pairs and low scores to the negative ones, which pulls words that appear in similar contexts toward similar vectors.
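
As a rough sketch of how negative samples can be drawn, the snippet below builds the unigram distribution raised to the 3/4 power used by Mikolov et al. (2013); the helper name is illustrative:

import collections
import numpy as np

def build_noise_distribution(tokens, power=0.75):
    """Unigram distribution raised to the 3/4 power, as in the original paper."""
    counts = collections.Counter(tokens)
    vocab = sorted(counts)
    freqs = np.array([counts[w] for w in vocab], dtype=np.float64) ** power
    return vocab, freqs / freqs.sum()

vocab, noise_probs = build_noise_distribution("the cat sits on the mat".split())
negatives = np.random.choice(vocab, size=5, p=noise_probs)  # 5 negative samples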

4. Implementing Skip-Gram with Negative Sampling (SGNS)

This section explains the overall structure and implementation method of SGNS, which combines the Skip-Gram model with negative sampling.

4.1 Data Preparation

To train the SGNS model, a natural language corpus is needed first. English text is commonly used, but any language or dataset can be used. The text is cleaned and tokenized, and each word is mapped to an integer index for use during model training.
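
A minimal sketch of this preprocessing step, assuming simple whitespace tokenization (the helper name build_vocab is illustrative):

def build_vocab(tokens):
    """Map each unique word to an integer index, and back."""
    word_to_idx = {}
    for token in tokens:
        if token not in word_to_idx:
            word_to_idx[token] = len(word_to_idx)
    idx_to_word = {i: w for w, i in word_to_idx.items()}
    return word_to_idx, idx_to_word

tokens = "The cat sits on the mat".lower().split()  # cleaned and tokenized corpus
word_to_idx, idx_to_word = build_vocab(tokens)
print(word_to_idx)  # {'the': 0, 'cat': 1, 'sits': 2, 'on': 3, 'mat': 4}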

4.2 Model Structure Design

The structure of the SGNS model is as follows:

  • Input Layer: the index (one-hot encoding) of the center word
  • Hidden Layer: the embedding matrix that maps each word to a dense vector
  • Output Layer: scores for candidate context words; with negative sampling, the full softmax over the vocabulary is replaced by sigmoid scores for the positive word and the sampled negative words (see the sketch after this list)
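
Note that the hidden layer has no nonlinearity: multiplying a one-hot vector by the embedding matrix simply selects one of its rows, so in practice the lookup is done by index. A minimal sketch of this forward pass (the matrix names match the implementation in section 4.5; the sizes are arbitrary):

import numpy as np

vocab_size, embedding_dim = 5, 3
W1 = np.random.rand(vocab_size, embedding_dim)   # input (center word) embeddings
W2 = np.random.rand(embedding_dim, vocab_size)   # output (context word) embeddings

center_idx = 2
one_hot = np.zeros(vocab_size)
one_hot[center_idx] = 1.0

hidden = one_hot @ W1                 # identical to the lookup W1[center_idx]
assert np.allclose(hidden, W1[center_idx])
scores = hidden @ W2                  # one raw score per vocabulary word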

4.3 Loss Function

The loss function of SGNS is a log loss over sigmoid scores: the model is rewarded for assigning a high score to the observed center–context pair and a low score to each sampled negative pair. Minimizing this loss yields the optimal embedding parameters.
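
Concretely, for a center word c, its observed context word o, and K negative samples drawn from a noise distribution, the objective of Mikolov et al. (2013) minimizes (writing v_c for the input embedding of c, u_w for the output embedding of w, and \sigma for the sigmoid function):

L = -\log \sigma(u_o^{\top} v_c) - \sum_{k=1}^{K} \log \sigma(-u_{w_k}^{\top} v_c)

Because σ(−x) = 1 − σ(x), the second term pushes the scores of the negative pairs toward zero, while the first term pushes the score of the positive pair toward one.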

4.4 Parameter Update

During training, SGNS only updates the parameters involved in each training example: the input embedding of the center word and the output embeddings of the positive context word and the sampled negative words. Because only a handful of vectors are touched per update, training is fast, and in practice this lightweight scheme improves both speed and the quality of the resulting embeddings.
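
Differentiating the loss above gives very simple per-pair gradients. Writing t_w = 1 for the positive context word and t_w = 0 for each negative sample:

\partial L / \partial u_w = (\sigma(u_w^{\top} v_c) - t_w) \, v_c
\partial L / \partial v_c = \sum_{w} (\sigma(u_w^{\top} v_c) - t_w) \, u_w

These are exactly the stochastic gradient descent updates applied in the implementation below.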

4.5 Final Implementation

Below is a simplified example of an SGNS implementation in Python:


import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SGNS:
    def __init__(self, vocab_size, embedding_dim, negative_samples, learning_rate=0.025):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.negative_samples = negative_samples
        self.learning_rate = learning_rate
        self.W1 = (np.random.rand(vocab_size, embedding_dim) - 0.5) / embedding_dim  # Input (center) word embeddings
        self.W2 = (np.random.rand(embedding_dim, vocab_size) - 0.5) / embedding_dim  # Output (context) word embeddings

    def train(self, center_word_idx, context_word_idx):
        v_c = self.W1[center_word_idx]  # vector of the center word
        # Negative samples are drawn uniformly here for simplicity; the original
        # paper draws them from a unigram^0.75 noise distribution instead.
        negatives = np.random.choice(self.vocab_size, self.negative_samples, replace=False)

        loss = 0.0
        grad_center = np.zeros(self.embedding_dim)
        # One positive pair (label 1) followed by the sampled negative pairs (label 0)
        for word_idx, label in [(context_word_idx, 1.0)] + [(w, 0.0) for w in negatives]:
            u_w = self.W2[:, word_idx]            # output vector of the candidate word
            score = sigmoid(np.dot(v_c, u_w))
            grad = score - label                  # gradient of the loss w.r.t. the dot product
            loss += -np.log(score) if label == 1.0 else -np.log(1.0 - score)
            grad_center += grad * u_w
            self.W2[:, word_idx] -= self.learning_rate * grad * v_c   # update output embedding
        self.W1[center_word_idx] -= self.learning_rate * grad_center  # update center embedding
        return loss
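
A minimal usage sketch, reusing the illustrative helpers from sections 2.2 and 4.1 (the corpus, embedding dimension, and epoch count are arbitrary):

corpus = "the cat sits on the mat the dog lies on the rug".split()
word_to_idx, idx_to_word = build_vocab(corpus)               # helper from section 4.1
pairs = generate_skipgram_pairs(corpus, window_size=2)       # helper from section 2.2

model = SGNS(vocab_size=len(word_to_idx), embedding_dim=50, negative_samples=5)
for epoch in range(10):
    for center, context in pairs:
        model.train(word_to_idx[center], word_to_idx[context])

cat_vector = model.W1[word_to_idx["cat"]]  # the learned vector for "cat"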

5. Results and Applications of SGNS

The word vectors produced by the SGNS model can be applied to a variety of natural language processing tasks; for example, they serve as useful input features for document classification, sentiment analysis, machine translation, and more.

By expressing the meanings of words well in a continuous vector space, machines can understand and process human language more easily.

6. Conclusion

This article has provided a detailed explanation of the Skip-Gram model of Word2Vec and of negative sampling, a key technique for training it efficiently, along with an implementation of SGNS and the data preparation it requires. The field of natural language processing continues to evolve, and it is hoped that these techniques will help build better language models.

7. References

  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26.
  • Goldberg, Y., & Levy, O. (2014). word2vec Explained: Intuition and Methodology. arXiv preprint arXiv:1402.3722.