Deep Learning for Natural Language Processing: Bahdanau Attention

Natural language processing (NLP) is a technology that enables computers to understand and generate human language, and it is one of the key fields of artificial intelligence. In recent years, deep learning has driven major advances in NLP, and the attention mechanism stands out as one of its most influential components. In this article, we will explain the Bahdanau Attention mechanism in depth and explore its principles and use cases.

1. Deep Learning in Natural Language Processing

Deep learning is a field of machine learning that utilizes artificial neural networks, allowing for the learning of complex patterns through a multilayered structure. In the field of natural language processing, deep learning is being used for various purposes such as:

  • Machine translation
  • Sentiment analysis
  • Text summarization
  • Question answering systems

1.1 Recurrent Neural Networks (RNN)

One of the models commonly used in natural language processing is the Recurrent Neural Network (RNN). RNNs have a structure that is suitable for processing sequential data (e.g., sentences), allowing them to remember previous information and reflect it in the current input. However, basic RNNs face the issue of vanishing gradients when dealing with long sequences, leading to a decline in performance.

1.2 Long Short-Term Memory Networks (LSTM)

To address this problem, Long Short-Term Memory (LSTM) networks were developed. LSTM uses a cell state and gates to retain information effectively and forget it when necessary. However, an LSTM-based encoder still compresses the entire input sequence into a single fixed-size representation and treats every position the same way, so a mechanism is needed that can focus on the specific parts of the input that matter for each output.

2. Introduction of the Attention Mechanism

The attention mechanism is a method that complements the general structure of RNNs and LSTMs, allowing for the processing of information by placing more weight on specific parts of the input data. Through this mechanism, the model can selectively emphasize important information, providing better performance and interpretability.

2.1 Basic Principle of the Attention Mechanism

The attention mechanism computes a weight for each element of the input sequence and lets these weights shape the final output. The weights are derived from the relationship between the current output step and every input element, so the model learns which parts of the input are most important in a given context.

2.2 Bahdanau Attention

Bahdanau Attention is an attention mechanism proposed by Bahdanau, Cho, and Bengio in their 2014 paper "Neural Machine Translation by Jointly Learning to Align and Translate" (published at ICLR 2015). It is primarily used in sequence-to-sequence models such as machine translation systems. Bahdanau Attention operates within an encoder-decoder structure and calculates its weights through the following process.

3. Structure of Bahdanau Attention

Bahdanau Attention is divided into two parts: the encoder and the decoder. The encoder processes the input sequence, and the decoder generates the output sequence. The essence of the attention mechanism is to combine each output of the encoder with the current state of the decoder to produce the desired output.

3.1 Encoder

The encoder accepts the input sequence and converts it into high-dimensional vectors. It processes the input word sequence using either RNN or LSTM and outputs the hidden state at each time step. This hidden state encapsulates the meaning of the sequence and serves as the basic information for the attention mechanism.

3.2 Calculation of Attention Weights

When generating outputs in the decoder, weights are calculated based on the similarity between the current state and all hidden states of the encoder. This process involves the following steps:

  1. Compute an alignment score e_ti between the current decoder hidden state h_t (the original paper uses the previous decoder state) and each encoder hidden state h_i. In Bahdanau attention this score is produced by a small feedforward network: e_ti = v^T tanh(W h_t + U h_i).
  2. Convert the scores into attention weights α_ti using the softmax function, so that they form a probability distribution over the encoder positions.

In general, the score can be computed with a simple dot product (as in Luong attention) or with a small neural network; Bahdanau attention uses the latter, which is why it is often called additive attention.

3.3 Generation of Context Vectors

After the weights are calculated, a weighted sum is performed by multiplying each hidden state of the encoder by its corresponding weight. As a result, a context vector ct for each time step is generated. This vector is used in combination with the current state of the decoder to generate the final output:

c_t = Σ_i α_ti h_i
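
To make the two steps above concrete, here is a minimal TensorFlow sketch of Bahdanau (additive) attention covering the scoring, softmax, and weighted-sum stages. The layer and variable names are illustrative and not taken from any particular library implementation.

import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention: score(h_t, h_i) = v^T tanh(W1 h_i + W2 h_t)."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects encoder hidden states h_i
        self.W2 = tf.keras.layers.Dense(units)  # projects the decoder state h_t
        self.v = tf.keras.layers.Dense(1)       # collapses each projection to a scalar score

    def call(self, decoder_state, encoder_outputs):
        # decoder_state: (batch, dec_units) -> (batch, 1, dec_units) for broadcasting
        decoder_state = tf.expand_dims(decoder_state, 1)
        # Unnormalized scores e_ti over the source positions: (batch, src_len, 1)
        scores = self.v(tf.nn.tanh(self.W1(encoder_outputs) + self.W2(decoder_state)))
        # Attention weights α_ti: softmax over the source axis
        weights = tf.nn.softmax(scores, axis=1)
        # Context vector c_t: weighted sum of the encoder hidden states
        context = tf.reduce_sum(weights * encoder_outputs, axis=1)
        return context, weights

# Toy usage: 5 source positions with 16-dimensional states
attention = BahdanauAttention(units=10)
context, weights = attention(tf.random.normal((1, 16)), tf.random.normal((1, 5, 16)))
print(context.shape, weights.shape)  # (1, 16) (1, 5, 1)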

3.4 Decoder

The context vector is passed to the decoder, which combines it with the previous output and its current hidden state to generate the next output. A softmax over the vocabulary is then used to predict the next word:

y_t = softmax(W [h_t; c_t])
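
A short, self-contained sketch of this output step follows. The toy dimensions and the name output_layer are our own; output_layer stands in for the matrix W above.

import tensorflow as tf

batch, dec_units, enc_units, vocab_size = 2, 8, 8, 100   # toy sizes for illustration
decoder_state  = tf.random.normal((batch, dec_units))    # h_t
context_vector = tf.random.normal((batch, enc_units))    # c_t from the attention step
output_layer   = tf.keras.layers.Dense(vocab_size)       # plays the role of W

combined = tf.concat([decoder_state, context_vector], axis=-1)  # [h_t; c_t]
y_t = tf.nn.softmax(output_layer(combined), axis=-1)            # distribution over the vocabulary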

4. Advantages and Disadvantages of Bahdanau Attention

Bahdanau Attention has several advantages compared to traditional RNN or LSTM models:

  • Emphasis on Important Information: Bahdanau Attention can concentrate weights on important parts of the input sequence, making meaning transfer more effective.
  • Parallel Computation of Weights: The attention weights over all encoder positions can be computed independently at each decoding step, which lends itself to parallelization (even though the underlying RNN encoder and decoder still run sequentially).
  • Interpretability: Visualizing attention weights makes it easier to explain how the model operates.

However, Bahdanau Attention also has some disadvantages:

  • Resource Consumption: Weights must be computed over every element of the input sequence at every decoding step, so the cost grows with sequence length and can become significant for long inputs or large datasets.
  • Limitations in Modeling Long-Term Dependencies: There may still be limitations in modeling comprehensive information in long sequences.

5. Use Cases of Bahdanau Attention

Bahdanau Attention is used in various natural language processing tasks. Let’s take a look at a few of them:

5.1 Machine Translation

In machine translation, Bahdanau Attention plays an essential role in accurately translating sentences from one language to another based on the context of the input sentence. For example, when translating an English sentence into French, it focuses more on specific words to create a natural sentence.

5.2 Sentiment Analysis

In sentiment analysis, it is possible to evaluate the overall sentiment based on the importance of specific words in a sentence. Bahdanau Attention can help capture the nuances of sentiment.

5.3 Text Summarization

In text summarization, the attention mechanism is utilized to select important sentences or words, allowing for information compression. This enables the transformation of lengthy documents into shorter, more concise forms.

6. Conclusion

Bahdanau Attention makes significant contributions to deep learning-based natural language processing. This mechanism helps models selectively emphasize information to produce more accurate and meaningful outputs, leading to improved performance in many natural language processing tasks. We anticipate further advancements in attention techniques and models through future research and development.

We hope this article has enhanced your understanding of Bahdanau Attention. A deep understanding of this technique is vital in leveraging modern natural language processing technologies.

Deep Learning Based Natural Language Processing, Attention Mechanism

Author: [Your Name]

Date: [Date]

1. Introduction

Natural language processing is a technology that allows computers to understand and process human language, and it has rapidly advanced in recent years with the development of deep learning. As the amount of text data has increased exponentially, various models have emerged to effectively process this data, among which the attention mechanism is particularly noteworthy.

This article explores the importance of deep learning and the attention mechanism in the field of natural language processing and introduces various application cases.

2. Basics of Deep Learning and Natural Language Processing

2.1 Overview of Deep Learning

Deep learning is a field of machine learning based on artificial neural networks that can automatically learn features from data. It transforms the input through successive layers of a neural network, producing increasingly abstract representations.

2.2 Reasons for the Need for Natural Language Processing

Human language is difficult for computers to understand because of its complexity and diversity. As the need for machines to understand and generate human language from large amounts of text data has grown, natural language processing has become an active area of research.

3. The Necessity of Attention Mechanism

3.1 Limitations of Traditional Sequence Models

Existing models such as RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks) are effective at processing sequential data, but their limited ‘memory’ causes information loss on long sequences. This leads to degraded performance in tasks like machine translation and summarization.

3.2 Emergence of Attention Mechanism

The attention mechanism was introduced to overcome these limitations, providing the ability to assign weights to each word in the input sequence. This allows the model to focus more on important information.

4. Working Principle of Attention Mechanism

4.1 Basic Concept

The attention mechanism is a process of ‘paying attention’ to each element of a given input sequence. The model judges the importance of each word in context and assigns weights accordingly; these weights then determine how much each word contributes when information is extracted from the input.

4.2 Scoring Mechanism

The attention mechanism begins by scoring each element of the input sequence against a query, such as the current decoder state (or, in self-attention, the other positions of the same sequence). The score reflects how important that element is relative to the others. One of the most common scoring functions is the dot product.
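
As a small illustration of dot-product scoring, here is a NumPy sketch with made-up names, not tied to any specific framework.

import numpy as np

def dot_product_attention(query, keys, values):
    """Score each source position by a dot product with the query, then
    softmax the scores into weights and take a weighted sum of the values."""
    scores = keys @ query                      # (src_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over source positions
    return weights @ values, weights

keys = values = np.random.randn(5, 8)   # 5 source positions, 8-dimensional states
query = np.random.randn(8)              # e.g., the current decoder state
context, weights = dot_product_attention(query, keys, values)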

5. Various Attention Techniques

5.1 Scoring-Based Attention

Scoring-based attention assigns a score to each word and distributes attention according to those scores, so the highest-scoring words receive the most weight. This approach is simple and effective and is used in many models, including the Transformer.

5.2 Self-Attention

In self-attention, each word attends to every word in the same input sequence, including itself. This makes it possible to capture relationships within the context, and it has become a core element of the Transformer architecture.
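
A minimal single-head self-attention sketch in NumPy follows; the projection matrices Wq, Wk, and Wv are random placeholders here, whereas a real Transformer learns them.

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Every position attends to every position of the same sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # scaled dot-product scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V

n, d = 4, 8                                            # 4 tokens, 8-dimensional embeddings
X = np.random.randn(n, d)
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)             # (4, 8)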

6. Transformer and Attention Mechanism

6.1 Overview of Transformer Model

Transformer is an innovative model that uses the attention mechanism to process sequential data. Unlike the structure of traditional RNNs or LSTMs, it processes sequences solely with the attention mechanism, gaining the advantages of parallel processing and significantly improving training speed.

6.2 Encoder-Decoder Structure

The Transformer consists of an encoder and decoder, with each being stacked in multiple layers. The encoder encodes the input sequence into a high-dimensional representation, and the decoder generates the final output based on this representation. The attention mechanism plays a crucial role in this process.

7. Application Cases of Attention Mechanism

7.1 Machine Translation

The attention mechanism shows excellent performance, particularly in machine translation. By paying attention to each word in the input language, it generates more natural and accurate translation results.

7.2 Natural Language Generation

The attention mechanism is also greatly utilized in text generation, summarization, and Q&A systems. It emphasizes relevant information based on user input to generate more meaningful results.

8. Conclusion

Deep learning and the attention mechanism have led to revolutionary changes in the field of natural language processing. Their combination has allowed machines to understand human language more deeply and broadened the possibilities in various application fields. It is expected that natural language processing technology will continue to evolve and be utilized in more areas in the future.

I hope this article has helped enhance your understanding of natural language processing and the attention mechanism. I encourage you to explore more information and cases to contribute to future research and development.

Deep Learning for Natural Language Processing: Encoder-Decoder using RNN

Natural Language Processing (NLP) is a field of artificial intelligence (AI) that enables computers to understand and process human language. Recent advancements in deep learning have significantly expanded the possibilities of NLP; in particular, Recurrent Neural Networks (RNNs) are well suited to language data, where temporal order must be taken into account. In this article, we will delve deeply into the basic concepts of natural language processing using deep learning and the encoder-decoder structure built on RNNs.

The Basics of Natural Language Processing

The goal of natural language processing is to transform human language into a form that computers can understand. This requires various techniques and algorithms. Representative NLP tasks include document classification, sentiment analysis, machine translation, and summarization. To perform these tasks, it is necessary to first process the data, extract the needed information, and then convert the results back into a form understandable by humans.

Deep Learning and Natural Language Processing

Although traditional NLP techniques were widely used in the past, the introduction of deep learning has brought rapid changes to this field. Deep learning possesses the ability to learn on its own using vast amounts of data, effectively handling the complex structures of human language. In particular, neural network-based models have the advantage of processing large amounts of information through interconnected nodes and recognizing various patterns.

RNN: Recurrent Neural Network

Recurrent Neural Networks (RNNs) are a type of neural network designed to process sequence data. Language is inherently sequential: earlier words influence the words that follow. RNNs maintain a hidden state that acts as memory, combining previous information with the current input to generate the next output.

Structure of RNN

A basic RNN has the following structure (a minimal recurrence step is sketched after the list):

  • Input Layer: Receives input data at the current time step.
  • Hidden Layer: Utilizes hidden state information from the previous time step to compute the new hidden state.
  • Output Layer: Produces the output for the current time step from the hidden state.
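
In code, one recurrence step of such a vanilla RNN can be sketched as follows (NumPy, with toy dimensions of our choosing):

import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """New hidden state: mix the current input with the previous hidden state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

x_t, h_prev = np.random.randn(4), np.zeros(8)                  # input and previous state
W_x, W_h, b = np.random.randn(8, 4), np.random.randn(8, 8), np.zeros(8)
h_t = rnn_step(x_t, h_prev, W_x, W_h, b)                       # new hidden state, shape (8,)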

Encoder-Decoder Structure

The encoder-decoder structure was primarily developed to solve sequence-to-sequence tasks, such as machine translation. This is useful in cases where the input and output sequences may have different lengths. The model is broadly divided into an encoder and a decoder.

Encoder

The encoder accepts the input sequence and compresses this information into a fixed-size vector (context vector). The hidden state output at the final stage of the encoder is used as the initial state of the decoder. In this process, RNN is used to process each word in the sequence.

Decoder

The decoder receives the context vector generated by the encoder and produces output at each time step. At this point, the decoder predicts the next output by taking the previous output as input.

Training the Encoder-Decoder

The encoder-decoder model is usually trained with a technique called teacher forcing: instead of feeding the decoder its own prediction from the previous step, the ground-truth target token is used as the next input. This stabilizes training and helps the model converge faster.
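
Concretely, teacher forcing usually amounts to shifting the target sequence by one position, as in this sketch. The array contents and variable names are hypothetical; we assume each row begins with a start token and ends with an end token.

import numpy as np

# Hypothetical integer-coded targets: <start> w1 w2 w3 <end> (0 = padding)
targets = np.array([[1, 5, 9, 3, 2],
                    [1, 7, 4, 2, 0]])

decoder_input_data  = targets[:, :-1]   # what the decoder is fed (ground truth)
decoder_target_data = targets[:, 1:]    # what the decoder must predict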

Attention Mechanism

An important aspect of the encoder-decoder structure is the Attention mechanism. The Attention mechanism allows the decoder to reference all hidden states generated by the encoder, assigning weights to each input word during output generation. This enables the model to better reflect important information, thereby improving performance.

Limitations of RNN and Their Solutions

While RNNs are powerful tools for processing sequence data, they also have some limitations. For instance, due to the vanishing gradient problem, they often struggle to learn long sequences. To address this, variants such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks were developed.

LSTM and GRU

LSTM is a variant of the RNN that uses memory cells to address the long-term dependency problem. It manages information through input, forget, and output gates, deciding what to remember and what to discard. GRU is a simplified alternative to LSTM that offers similar performance with less computation.
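
In Keras, both variants are drop-in replacements for a plain recurrent layer; for example:

import tensorflow as tf

lstm_layer = tf.keras.layers.LSTM(256, return_sequences=True)  # input/forget/output gates + cell state
gru_layer  = tf.keras.layers.GRU(256, return_sequences=True)   # update/reset gates, fewer parameters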

Practice: Implementing the Encoder-Decoder Model

Now it’s time to implement the RNN-based encoder-decoder model ourselves. We will use Python’s TensorFlow and Keras libraries for this purpose.

Prepare the Data

To train the model, an appropriate dataset must be prepared. For example, a simple English-French translation dataset can be used.

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

# Load Dataset
data_file = 'path/to/dataset.txt'
input_texts, target_texts = [], []

with open(data_file, 'r') as file:
    for line in file:
        input_text, target_text = line.strip().split('\t')
        input_texts.append(input_text)
        target_texts.append(target_text)

# Create Word Index
tokenizer = Tokenizer()
tokenizer.fit_on_texts(input_texts + target_texts)

input_sequences = tokenizer.texts_to_sequences(input_texts)
target_sequences = tokenizer.texts_to_sequences(target_texts)

max_input_length = max(len(seq) for seq in input_sequences)
max_target_length = max(len(seq) for seq in target_sequences)

input_sequences = pad_sequences(input_sequences, maxlen=max_input_length, padding='post')
target_sequences = pad_sequences(target_sequences, maxlen=max_target_length, padding='post')

# Split Dataset
X_train, X_test, y_train, y_test = train_test_split(input_sequences, target_sequences, test_size=0.2, random_state=42)

Build the Model

Now we need to define the encoder and decoder. We will build the encoder and decoder using Keras’s LSTM layer.

latent_dim = 256  # Latent Space Dimension

# Define Encoder
encoder_inputs = tf.keras.Input(shape=(None,))
encoder_embedding = tf.keras.layers.Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=latent_dim)(encoder_inputs)
encoder_lstm = tf.keras.layers.LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embedding)
encoder_states = [state_h, state_c]

# Define Decoder
decoder_inputs = tf.keras.Input(shape=(None,))
decoder_embedding = tf.keras.layers.Embedding(input_dim=len(tokenizer.word_index)+1, output_dim=latent_dim)(decoder_inputs)
decoder_lstm = tf.keras.layers.LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_embedding, initial_state=encoder_states)
decoder_dense = tf.keras.layers.Dense(len(tokenizer.word_index)+1, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Create Model
model = tf.keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)

Compile and Train the Model

After compiling the model, we can start training. The loss function is sparse categorical crossentropy (the targets are integer token indices), and the optimizer is Adam.

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# With sparse categorical crossentropy the targets stay as integer indices;
# they only need a trailing dimension of 1: (num_samples, max_target_length, 1)
y_train_reshaped = y_train.reshape(y_train.shape[0], y_train.shape[1], 1)
y_test_reshaped = y_test.reshape(y_test.shape[0], y_test.shape[1], 1)

# Train the Model (here the target sequence is also fed to the decoder as input;
# a version shifted by one position would implement strict teacher forcing)
model.fit([X_train, y_train], y_train_reshaped, batch_size=64, epochs=50,
          validation_data=([X_test, y_test], y_test_reshaped))

Making Predictions

Once the model is trained, we can make predictions for new input sequences.
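
Note that the prediction function below relies on separate inference models, encoder_model and decoder_model, which were not defined in the training code above. Assuming the layers built earlier are still in scope, a minimal sketch of how they could be constructed is:

import numpy as np

# Inference encoder: maps an input sequence to the final LSTM states.
encoder_model = tf.keras.Model(encoder_inputs, encoder_states)

# Inference decoder: runs one time step at a time, taking the previous states as inputs.
decoder_state_input_h = tf.keras.Input(shape=(latent_dim,))
decoder_state_input_c = tf.keras.Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]

decoder_outputs_inf, state_h_inf, state_c_inf = decoder_lstm(
    decoder_embedding, initial_state=decoder_states_inputs)
decoder_outputs_inf = decoder_dense(decoder_outputs_inf)

decoder_model = tf.keras.Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs_inf, state_h_inf, state_c_inf])

With these models in place, the decoding loop below runs one token per step.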

# Define Prediction Function
def decode_sequence(input_seq):
    # Encode the input sequence using the encoder.
    states_value = encoder_model.predict(input_seq)
    
    # Define the starting input for the decoder.
    target_seq = np.zeros((1, 1))
    target_seq[0, 0] = tokenizer.word_index['starttoken']  # Start token
    
    stop_condition = False
    decoded_sentence = ''
    
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)
        
        # Select the most probable word.
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = tokenizer.index_word[sampled_token_index]
        decoded_sentence += ' ' + sampled_char
        
        # Check for stopping condition
        if sampled_char == 'endtoken' or len(decoded_sentence.split()) > max_target_length:
            stop_condition = True
            
        # Define the next input sequence.
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index
        
        states_value = [h, c]
    
    return decoded_sentence

Conclusion

In this article, we have taken a detailed look at the basics and applications of the encoder-decoder structure using RNN. We hope you realize the possibilities of deep learning in natural language processing and encourage you to utilize this technology in various applications. In the future, we expect to see a variety of innovative NLP solutions emerge through these technologies.

Deep Learning for Natural Language Processing and BLEU Score (Bilingual Evaluation Understudy Score)

Natural Language Processing (NLP) is a field of computer science that deals with understanding and processing human language, and has achieved significant results in recent years thanks to advances in deep learning. In this article, we will cover the basic concepts of natural language processing using deep learning, as well as the performance evaluation metric in the field of machine translation known as BLEU Score.

1. Basics of Deep Learning

Deep learning is a method of analyzing data using artificial neural networks, extracting features through multiple layers of neurons, and using them to make predictions. Deep learning has the following key characteristics:

  • Non-linearity: Deep learning introduces non-linearity through activation functions, allowing it to learn complex patterns.
  • Automatic feature extraction: Unlike traditional machine learning models, deep learning automatically extracts features from data.
  • Scalability: It tends to demonstrate continuous performance improvement with large volumes of data.

1.1 Structure of Neural Networks

Neural networks are fundamentally composed of an input layer, hidden layers, and an output layer. Each layer consists of neurons called nodes, which are interconnected to transmit information. Each connection has a weight, which regulates the flow of data.

1.2 Types of Deep Learning Models

The most common models in deep learning include:

  • Convolutional Neural Networks (CNN): Primarily used for processing image data.
  • Recurrent Neural Networks (RNN): A model that is useful for processing temporal information and is suitable for natural language processing.
  • Transformer: A model widely used in the latest natural language processing, utilizing parallel processing and the attention mechanism.

2. Natural Language Processing (NLP)

Natural language processing is a technology that enables computers to understand and process the languages used by humans. This field is used in various applications, including text analysis, machine translation, sentiment analysis, and data mining. Key tasks in natural language processing include:

  • Tokenization: The process of splitting a sentence into words.
  • Part-of-Speech Tagging: The task of assigning parts of speech to each word.
  • Named Entity Recognition: A technique for identifying people, places, organizations, etc.
  • Sentiment Analysis: The process of analyzing the sentiment of text to classify it as positive or negative.
  • Machine Translation: The task of translating text from one language to another.

2.1 Trends in Machine Translation

Machine translation is one of the core application areas of natural language processing, achieving remarkable progress over the last few years. It has evolved from previous rule-based translation systems to statistical models and currently to deep learning-based models. In particular, the seq2seq (Sequence-to-Sequence) model and the Transformer model have brought significant innovations to machine translation.

3. BLEU Score

BLEU (Bilingual Evaluation Understudy) is a metric designed to evaluate the quality of machine translation, calculating scores by measuring the n-gram overlap between the translation results and the reference translation.

3.1 Definition of BLEU Score

BLEU Score is calculated as follows:

  • n-gram overlap: Calculates the n-gram overlap rate between the machine translation results and the reference translation.
  • Precision: Evaluates the quality of the generated translation using modified n-gram precision, where each n-gram count in the candidate is clipped to its count in the reference.
  • Brevity Penalty: A penalty is imposed if the length of the generated translation is too short compared to the length of the reference translation.

3.2 BLEU Score Calculation Formula

The BLEU score is calculated as follows:

BLEU = BP × exp( (1/N) × Σ_{n=1..N} log p_n )

Where:

  • BP: Brevity Penalty
  • p_n: Precision of n-grams
  • N: The maximum n-gram order considered (typically 4, i.e., 1-grams through 4-grams, each weighted 1/N)
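
The following is a minimal Python sketch of this calculation for a single sentence, with uniform weights 1/N and a simple brevity penalty. It is illustrative only; real toolkits such as sacreBLEU add smoothing and corpus-level aggregation.

import math
from collections import Counter

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU sketch: modified n-gram precisions + brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        max_ref = Counter()
        for ref in references:
            ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
            for ng, c in ref_ngrams.items():
                max_ref[ng] = max(max_ref[ng], c)
        clipped = sum(min(c, max_ref[ng]) for ng, c in cand_ngrams.items())
        precisions.append(clipped / max(sum(cand_ngrams.values()), 1))

    if min(precisions) == 0:                 # any empty n-gram overlap gives BLEU = 0
        return 0.0

    # Brevity penalty against the closest reference length
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / max(len(candidate), 1))

    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

candidate = "the cat is on the mat".split()
reference = "the cat is sitting on the mat".split()
print(round(bleu(candidate, [reference], max_n=2), 3))   # ≈ 0.757 (bigram-level BLEU for this toy pair)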

3.3 Advantages and Disadvantages of BLEU Score

Advantages of BLEU score:

  • Automation: It can be evaluated mechanically without human intervention.
  • Consistency: Provides consistent evaluation across multiple evaluators.
  • Fast calculation: Quickly generates scores through relatively simple calculations.

Disadvantages of BLEU score:

  • Local matching: It does not reflect context well, as it only looks at n-gram components.
  • Discrepancy with human evaluation: A high BLEU score does not necessarily mean that human evaluation is positive.

4. Conclusion

Natural language processing using deep learning has become a core element of information technology today, and the BLEU Score is an important tool for quantitatively assessing the performance of this technology. Future research needs to further enhance the quality of natural language processing and move toward a better understanding and use of human language.

As machine translation technology continues to evolve, evaluation metrics such as the BLEU Score must continue to improve along with it, which will further widen the scope of natural language processing applications. It is also worth considering how advances in deep learning and natural language processing will affect our everyday lives.

Deep Learning for Natural Language Processing: Sequence-to-Sequence (Seq2Seq)

Natural Language Processing (NLP) is a field that enables machines to understand and generate human language. In recent years, significant innovations have been made due to advancements in deep learning technologies. Among these, the Sequence-to-Sequence (Seq2Seq) model plays a crucial role in various NLP tasks such as translation, summarization, and dialogue generation.

Introduction

The Sequence-to-Sequence (Seq2Seq) model is an artificial neural network structured to transform a given input sequence (e.g., a sentence) into an output sequence (e.g., a translated text). This model can be divided into two main components: the Encoder and the Decoder. The Encoder processes the input sequence and encodes it into a high-dimensional vector, while the Decoder generates the output sequence based on this vector. This structure is suitable for problems where the lengths of the input and output can differ, such as machine translation.

1. Deep Learning and Natural Language Processing

Deep learning models possess the ability to automatically learn features from input data, making them powerful tools for understanding the complex patterns of natural language. Early NLP systems relied on rule-based approaches or statistical models; however, since the introduction of deep learning technologies, they have demonstrated more sophisticated and superior performance.

2. Structure of the Seq2Seq Model

2.1 Encoder

The Encoder processes the input text to generate a fixed-length vector representation. Typically, recurrent neural networks (RNN) or Long Short-Term Memory (LSTM) networks are used to handle sequence data. At each time step of the Encoder, the previous state and the current input are combined to update to a new state, and the final state from the last time step is passed to the Decoder.

2.2 Decoder

The Decoder generates the output sequence based on the vector received from the Encoder. This can also use RNN or LSTM, taking the previous output as input to produce the next output at each time step. The Decoder often employs a start token and an end token to indicate the beginning and end of the output sequence.

3. Training of the Seq2Seq Model

Training a Seq2Seq model generally uses a supervised learning approach. The model is trained through a process that minimizes a loss function based on prepared input sequences and target output sequences. The cross-entropy loss function is commonly used, which measures the difference between the output distribution generated by the model and the actual distribution of correct answers.
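
As a small illustration, the per-step loss can be computed with Keras's sparse categorical cross-entropy; the shapes and values below are toy placeholders.

import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
y_true = tf.constant([[3, 7, 2]])            # (batch, time) integer token ids
y_pred = tf.random.normal((1, 3, 10))        # (batch, time, vocab) unnormalized scores
print(float(loss_fn(y_true, y_pred)))        # loss averaged over the sequence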

3.1 Teacher Forcing

During the training process, most Seq2Seq models utilize the “Teacher Forcing” technique. In this method, the actual correct token is used as input at each time step of the Decoder, allowing the model to predict the next output. This helps the model to converge more quickly.

4. Variants of the Seq2Seq Model

4.1 Attention Mechanism

The basic Seq2Seq model compresses the entire input into a single fixed-length vector, so information is inevitably lost, especially for long inputs. The Attention Mechanism was introduced to address this: at every decoding step, the Decoder assigns weights to all hidden states of the Encoder and retrieves information according to its relevance. This allows the model to weigh the important parts of the input and generate more natural outputs.

4.2 Transformer Model

The Transformer model is built entirely on the Attention Mechanism and now plays a leading role in Seq2Seq learning. Both its Encoder and Decoder are composed of Multi-Head Attention and Feed-Forward Networks. By moving away from the sequential processing of RNNs, it can be parallelized across the sequence, which dramatically increases training speed.

5. Application Fields

5.1 Machine Translation

The area where the Sequence-to-Sequence model was first fully utilized is machine translation. Modern translation systems like Google Translate are based on Seq2Seq and Transformer models, providing high translation quality.

5.2 Dialogue Generation

Seq2Seq models are also used in conversational AI, such as chatbot systems. Generating appropriate responses to user inputs is an important challenge in NLP, and Seq2Seq models are highly effective in this process.

5.3 Document Summarization

Another significant application of natural language processing is document summarization. Extracting the key information from long documents and generating concise summaries makes information easier to digest and share. A Seq2Seq model can take a long document as input and produce summarized sentences as output.

6. Conclusion

Deep learning-based Sequence-to-Sequence models have brought significant innovations to the field of natural language processing. Through the development of the Encoder-Decoder structure and Attention Mechanism, we have achieved high performance in various tasks such as machine translation, dialogue generation, and document summarization. In the future, it is expected that Seq2Seq and its variants will continue to play an important role in increasingly advanced NLP systems.

References

  • Vaswani, A., et al. (2017). Attention Is All You Need. In Advances in Neural Information Processing Systems.
  • Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. In International Conference on Learning Representations.
  • Luong, M. T., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.