Deep Learning PyTorch Course, Transformer Attention

Deep learning has driven much of the recent progress in artificial intelligence (AI). Among deep learning models, the Transformer has shown outstanding performance in Natural Language Processing (NLP) and has attracted the attention of many researchers. This article explains the Transformer architecture and the attention mechanism in depth and implements them with the PyTorch framework, with practical code examples.

1. What is a Transformer?

The Transformer was proposed in 2017 by Vaswani et al., a team of researchers largely from Google, in the paper "Attention Is All You Need". It was designed to overcome the limitations of recurrent models such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs), which process a sequence one step at a time. The Transformer processes the entire input sequence at once, which makes parallelization easier and helps it learn longer-range dependencies.

1.1 Structure of the Transformer

The Transformer consists of two main components: the encoder and the decoder. The encoder takes in the input sequence, and the decoder generates the output sequence based on the encoder’s output. The key part here is the attention mechanism.
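
As a rough illustration of the encoder-decoder flow described above, the sketch below uses PyTorch's built-in nn.Transformer module. The sizes and the random input tensors are arbitrary placeholders chosen for this example, not values prescribed by the architecture.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small, arbitrary sizes chosen only for illustration
d_model, nhead = 32, 4
model = nn.Transformer(
    d_model=d_model,
    nhead=nhead,
    num_encoder_layers=2,
    num_decoder_layers=2,
    dim_feedforward=64,
    dropout=0.0,
    batch_first=True,  # tensors are (batch, seq_len, d_model)
)

src = torch.randn(2, 10, d_model)  # input sequence fed to the encoder
tgt = torch.randn(2, 7, d_model)   # output-side sequence fed to the decoder

# The encoder turns src into a "memory"; the decoder attends to that memory
memory = model.encoder(src)
out = model.decoder(tgt, memory)

print(memory.shape)  # torch.Size([2, 10, 32])
print(out.shape)     # torch.Size([2, 7, 32])
```

The encoder output keeps the length of the input sequence, while the decoder output follows the length of the sequence it is generating.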

2. Attention Mechanism

Attention is a mechanism that allows focusing on specific parts of the input sequence. In other words, each word (or input vector) computes weights based on its relationships with other words to extract information. Attention fundamentally consists of three elements: Query, Key, and Value.

2.1 Attention Score

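The attention score is calculated as the dot product between the query and the key. In scaled dot-product attention, the variant implemented in this article, the dot product is divided by the square root of the key dimension so that the scores do not grow too large as the dimension increases. The score indicates how much each word in the input sequence influences the current word.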
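
The score computation can be sketched with a few hand-picked vectors. The values below are toy numbers chosen so the result is easy to check by hand; the division by the square root of the key dimension mirrors the implementation later in this article.

```python
import math
import torch

# Two query vectors and three key vectors, each of dimension d_k = 4 (toy values)
query = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                      [0.0, 2.0, 0.0, 2.0]])
key = torch.tensor([[1.0, 0.0, 1.0, 0.0],
                    [0.0, 1.0, 0.0, 1.0],
                    [1.0, 1.0, 1.0, 1.0]])

d_k = key.size(-1)
# Row i, column j: how strongly query i matches key j
scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)
print(scores)
# tensor([[1., 0., 1.],
#         [0., 2., 2.]])
```

The first query matches the first and third keys equally and ignores the second, which is what the zero in the first row expresses.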

2.2 Softmax Function

To normalize the attention scores, the softmax function is used to compute the weights. This ensures that all weights fall between 0 and 1, and their sum equals 1.
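
A minimal sketch of this normalization step, using one row of toy scores (the numbers are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

# Toy attention scores for one query over three keys
scores = torch.tensor([[1.0, 0.0, 1.0]])

# Softmax over the last dimension turns scores into weights
weights = F.softmax(scores, dim=-1)
print(weights)              # approximately [[0.4223, 0.1554, 0.4223]]
print(weights.sum(dim=-1))  # approximately 1
```

Equal scores receive equal weights, and the lower score still receives a small non-zero weight.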

2.3 Attention Operation

Once the weights are determined, they are multiplied with the Values to generate the final attention output. The final output is the sum of the weighted Values.
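
The weighted sum can be checked by hand on a small case. The weights and value vectors below are toy numbers assumed for this example:

```python
import torch

# Attention weights for one query over three positions (the row sums to 1)
weights = torch.tensor([[0.5, 0.25, 0.25]])
# One value vector per position (toy values)
value = torch.tensor([[4.0, 0.0],
                      [0.0, 4.0],
                      [4.0, 4.0]])

# Each output row is the weighted sum of the value vectors
output = weights @ value
print(output)  # tensor([[3., 2.]])
```

Putting the three steps together gives the usual compact form: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.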

3. Implementing Transformer with PyTorch

Now, let’s implement the Transformer and attention mechanism using PyTorch. The code below is an example of a basic attention module.

3.1 Installing Required Libraries

!pip install torch

3.2 Implementing Attention Class


import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledDotProductAttention(nn.Module):
    """Computes softmax(Q K^T / sqrt(d_k)) V."""

    def __init__(self):
        super().__init__()

    def forward(self, query, key, value, mask=None):
        # Dot product between query and key, scaled by sqrt(d_k)
        scores = torch.matmul(query, key.transpose(-2, -1)) / (key.size(-1) ** 0.5)

        # Positions where mask == 0 get a large negative score,
        # so softmax assigns them a weight close to 0
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        # Normalize the scores into attention weights
        attn_weights = F.softmax(scores, dim=-1)

        # Attention output: weighted sum of the values
        output = torch.matmul(attn_weights, value)
        return output, attn_weights
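
To show what the mask argument does, the sketch below applies a causal (lower-triangular) mask. It restates the same computation as a plain function so the example runs on its own; the function name, the input sizes, and the random input are assumptions made for this illustration.

```python
import torch
import torch.nn.functional as F

def attention(query, key, value, mask=None):
    # Same computation as the module above, written as a plain function
    scores = query @ key.transpose(-2, -1) / (key.size(-1) ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)
    return weights @ value, weights

torch.manual_seed(0)
x = torch.randn(1, 4, 8)  # (batch, seq_len, d_k), random placeholder input

# Lower-triangular mask: position i may attend only to positions 0..i
causal_mask = torch.tril(torch.ones(4, 4))

output, weights = attention(x, x, x, mask=causal_mask)
print(output.shape)  # torch.Size([1, 4, 8])
print(weights[0])    # entries above the diagonal are effectively 0
```

A mask of this kind is what a decoder typically uses so that a position cannot look at later positions; a padding mask works the same way but zeroes out padded tokens instead.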
    

3.3 Implementing Transformer Encoder


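class TransformerEncoder(nn.Module):
    """Simplified encoder: single-head self-attention plus one linear layer per block."""

    def __init__(self, embed_size, heads, num_layers, drop_out):
        super().__init__()
        self.embed_size = embed_size
        self.heads = heads  # stored only; this simplified version uses a single head
        self.num_layers = num_layers
        self.drop_out = drop_out

        # The attention module has no parameters, so all layers can share it
        self.attention = ScaledDotProductAttention()
        # Each layer gets its own linear and normalization modules,
        # so that weights are not shared across layers
        self.linears = nn.ModuleList([nn.Linear(embed_size, embed_size) for _ in range(num_layers)])
        self.norms1 = nn.ModuleList([nn.LayerNorm(embed_size) for _ in range(num_layers)])
        self.norms2 = nn.ModuleList([nn.LayerNorm(embed_size) for _ in range(num_layers)])
        self.dropout = nn.Dropout(drop_out)

    def forward(self, x, mask=None):
        for linear, norm1, norm2 in zip(self.linears, self.norms1, self.norms2):
            # Self-attention sublayer with residual connection and normalization
            attention_output, _ = self.attention(x, x, x, mask)
            x = norm1(x + self.dropout(attention_output))
            # Position-wise linear sublayer with residual connection and normalization
            x = norm2(x + self.dropout(linear(x)))
        return x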
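
The encoder above accepts a heads argument but applies a single attention over the full embedding. As one possible extension, the sketch below splits the embedding across several heads. The class name MultiHeadSelfAttention and its projection layers are hypothetical additions for illustration, not part of the encoder above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Hypothetical helper: splits embed_size across `heads` attention heads."""

    def __init__(self, embed_size, heads):
        super().__init__()
        assert embed_size % heads == 0, "embed_size must be divisible by heads"
        self.heads = heads
        self.head_dim = embed_size // heads
        # Learned projections for queries, keys, values, and the merged output
        self.q_proj = nn.Linear(embed_size, embed_size)
        self.k_proj = nn.Linear(embed_size, embed_size)
        self.v_proj = nn.Linear(embed_size, embed_size)
        self.out_proj = nn.Linear(embed_size, embed_size)

    def _split(self, t):
        # (batch, seq_len, embed_size) -> (batch, heads, seq_len, head_dim)
        b, n, _ = t.shape
        return t.view(b, n, self.heads, self.head_dim).transpose(1, 2)

    def forward(self, x, mask=None):
        q = self._split(self.q_proj(x))
        k = self._split(self.k_proj(x))
        v = self._split(self.v_proj(x))
        # Scaled dot-product attention, computed independently per head
        scores = q @ k.transpose(-2, -1) / (self.head_dim ** 0.5)
        if mask is not None:
            # mask must broadcast to (batch, heads, seq_len, seq_len)
            scores = scores.masked_fill(mask == 0, -1e9)
        weights = F.softmax(scores, dim=-1)
        out = weights @ v  # (batch, heads, seq_len, head_dim)
        b, _, n, _ = out.shape
        # Merge the heads back into a single embedding dimension
        out = out.transpose(1, 2).reshape(b, n, self.heads * self.head_dim)
        return self.out_proj(out)

torch.manual_seed(0)
mha = MultiHeadSelfAttention(embed_size=256, heads=8)
x = torch.randn(2, 5, 256)  # (batch, seq_len, embed_size), random placeholder input
y = mha(x)
print(y.shape)  # torch.Size([2, 5, 256])
```

With embed_size = 256 and heads = 8, each head works on a 32-dimensional slice, so the total cost stays close to that of one full-width attention.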
    

4. Model Training and Evaluation

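With the Transformer encoder implemented, this section outlines how to train and evaluate the model. The snippets assume a data loader named train_loader that yields batches of sequence data; it is not defined in this article.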

4.1 Data Preparation

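To train the model, we first need to prepare the training data. Typically, sequence data such as text is used. The training loop in this article expects each batch to provide three entries: 'input', 'mask', and 'target'.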
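
Since no dataset is specified here, the sketch below builds a synthetic one purely to show the batch layout. The class ToySequenceDataset, its sizes, and its random contents are all hypothetical. The inputs are assumed to be already embedded, and the number of classes is set equal to the embedding size because the training loop in this article feeds the encoder output directly to the loss.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToySequenceDataset(Dataset):
    """Hypothetical dataset of random sequences, for illustration only."""

    def __init__(self, num_samples=32, seq_len=10, embed_size=256, num_classes=256):
        g = torch.Generator().manual_seed(0)
        # Pre-embedded inputs: (num_samples, seq_len, embed_size)
        self.inputs = torch.randn(num_samples, seq_len, embed_size, generator=g)
        # One class label per position
        self.targets = torch.randint(0, num_classes, (num_samples, seq_len), generator=g)
        # Mask of ones: every position may attend to every other position
        self.masks = torch.ones(num_samples, seq_len, seq_len)

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return {
            "input": self.inputs[idx],
            "mask": self.masks[idx],
            "target": self.targets[idx],
        }

# The default collate function stacks each dictionary entry into a batch tensor
train_loader = DataLoader(ToySequenceDataset(), batch_size=8, shuffle=True)
batch = next(iter(train_loader))
print(batch["input"].shape)   # torch.Size([8, 10, 256])
print(batch["mask"].shape)    # torch.Size([8, 10, 10])
print(batch["target"].shape)  # torch.Size([8, 10])
```

For real text, the input would instead hold token ids that pass through an embedding layer, and the mask would mark padded positions with zeros.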

4.2 Model Initialization


embed_size = 256  # Embedding dimension
heads = 8  # Number of attention heads
num_layers = 6  # Number of encoder layers
drop_out = 0.1  # Dropout rate

model = TransformerEncoder(embed_size, heads, num_layers, drop_out)
    

4.3 Setting Loss Function and Optimizer


optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
loss_fn = nn.CrossEntropyLoss()
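
The shapes that CrossEntropyLoss expects are a common source of errors with sequence models, so here is a small sketch with toy tensors (the sizes and labels are assumptions for this example):

```python
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()

# Toy logits for a batch of 2 sequences, 3 positions each, 5 classes
logits = torch.zeros(2, 3, 5)
targets = torch.tensor([[0, 1, 2],
                        [3, 4, 0]])

# CrossEntropyLoss expects (N, C) logits and (N,) class indices,
# so the batch and sequence dimensions are flattened together
loss = loss_fn(logits.view(-1, logits.size(-1)), targets.view(-1))
print(loss.item())  # about 1.6094, i.e. ln(5), since all-zero logits are a uniform guess
```

A freshly initialized classifier over C classes usually starts near ln(C), which is a handy sanity check for the first few training steps.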
    

4.4 Training Loop


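num_epochs = 10  # assumed value; adjust for your task

# Assumes train_loader yields dicts with 'input' (batch, seq_len, embed_size),
# 'mask', and 'target' (class indices of shape (batch, seq_len)).
# The encoder's last dimension is treated as class logits here; a complete
# model would normally add an output projection to the number of classes.
for epoch in range(num_epochs):
    model.train()
    total_loss = 0.0
    for batch in train_loader:
        optimizer.zero_grad()
        output = model(batch['input'], batch['mask'])
        # Flatten batch and sequence dimensions for CrossEntropyLoss
        loss = loss_fn(output.reshape(-1, output.size(-1)), batch['target'].reshape(-1))
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch: {epoch+1}, Loss: {total_loss/len(train_loader):.4f}")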
    

4.5 Evaluation and Testing

After training is completed, we evaluate the model to measure its performance. Generally, metrics such as accuracy, precision, and recall are used on test data.
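
As a minimal sketch of these metrics, the helper below computes accuracy, precision, and recall for binary labels. The function name binary_metrics and the prediction and target tensors are hypothetical; in practice the predictions would come from the model run under model.eval() and torch.no_grad().

```python
import torch

def binary_metrics(preds, targets):
    """Hypothetical helper: accuracy, precision, recall for 0/1 labels."""
    tp = ((preds == 1) & (targets == 1)).sum().item()  # true positives
    fp = ((preds == 1) & (targets == 0)).sum().item()  # false positives
    fn = ((preds == 0) & (targets == 1)).sum().item()  # false negatives
    accuracy = (preds == targets).float().mean().item()
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return accuracy, precision, recall

# Toy predictions and ground-truth labels
preds = torch.tensor([1, 0, 1, 1, 0, 0, 1, 0])
targets = torch.tensor([1, 0, 0, 1, 0, 1, 1, 0])

accuracy, precision, recall = binary_metrics(preds, targets)
print(accuracy, precision, recall)  # 0.75 0.75 0.75
```

For multi-class tasks, precision and recall are usually computed per class and then averaged.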

5. Conclusion

In this article, we explained the Transformer architecture and attention mechanism, and demonstrated how to implement them using PyTorch. The Transformer model is useful for building high-performance natural language processing models and is applied in various fields. Since the performance can vary significantly depending on the training data and model hyperparameters, it is important to find the optimal combination through various experiments.

The Transformer is currently making innovative contributions to NLP modeling and is expected to continue to evolve through various research outcomes. In the next article, we will cover the use cases of Transformer models in natural language processing. We appreciate your interest.

© 2023 Deep Learning Research Institute. All Rights Reserved.