Extracting BERT [CLS] Vectors from Moderna and Pfizer Covid-19 Vaccine Texts with Hugging Face Transformers

With the advancement of deep learning and natural language processing (NLP), many companies are exploring ways to analyze text data. Among the available models, BERT (Bidirectional Encoder Representations from Transformers) has become a widely used choice for producing context-aware representations of text. In this tutorial, we will cover how to extract the [CLS] vector from texts related to the Moderna and Pfizer Covid-19 vaccines using Hugging Face's Transformers library.

1. Introduction to the BERT Model

BERT is a pre-trained language model developed by Google. It encodes each word of a sentence in the context of the whole sentence and can be reused for a wide range of natural language processing tasks. Its key characteristics are as follows:

  • Bidirectional: BERT looks at the words to the left and to the right of each position at the same time, rather than reading the sentence in a single direction. This allows it to interpret each word in relation to all of its surrounding words.
  • Transformer: BERT is based on the Transformer architecture and learns the relationships between all words in a sentence through the self-attention mechanism.
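  • [CLS] Token: A special token called [CLS] is always added to the beginning of the input to the BERT model (and a [SEP] token marks the end of each sentence). The final hidden vector of [CLS] is commonly used as a summary of the whole input and plays an important role in classification tasks. Without fine-tuning, it should be treated as a rough sentence representation rather than a precise one.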
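To make the self-attention idea above more concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. It is an illustration only, not BERT's actual implementation: the weight matrices are random, the dimensions (8 and 4) are arbitrary, and the token list is a hypothetical example. The point is that every position, including [CLS], receives a weight for every other position in the sequence.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence.

    x: (seq_len, d_model) token vectors; w_q, w_k, w_v: (d_model, d_head).
    Returns the attended vectors and the attention weight matrix.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (seq_len, seq_len)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all positions
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens = ["[CLS]", "pfizer", "vaccine", "[SEP]"]      # hypothetical token sequence
x = rng.normal(size=(len(tokens), 8))                 # random stand-in embeddings
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(weights.shape)  # (4, 4): one row of attention weights per token
print(output.shape)   # (4, 4): one attended vector per token
```

Row 0 of `weights` shows how much the [CLS] position draws from each token, which is why its output vector can act as a summary of the sequence.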

2. Installing the Hugging Face Transformers Library

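The Hugging Face Transformers library provides pre-trained models and tokenizers for natural language processing tasks. Install it together with PyTorch and the packages used later for visualization (NumPy, Matplotlib, and scikit-learn):

pip install transformers torch numpy matplotlib scikit-learn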

3. Data Preparation

Next, we prepare the texts related to Moderna and Pfizer. We use three simple sentences as examples here; a real analysis would require a much larger collection of documents.

texts = [
    "The Moderna Covid-19 vaccine showed an efficacy of 94.1%.",
    "The efficacy of the Pfizer vaccine was reported to be 95%.",
    "Both the Moderna and Pfizer vaccines use mRNA technology."
]

4. Loading the BERT Model and Extracting Vectors

After loading the BERT model and tokenizer, we extract the [CLS] vector for each input sentence.


import torch
from transformers import BertTokenizer, BertModel

# Load the pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()  # make evaluation mode explicit (dropout disabled)

# Input text
texts = [
    "The Moderna Covid-19 vaccine showed an efficacy of 94.1%.",
    "The efficacy of the Pfizer vaccine was reported to be 95%.",
    "Both the Moderna and Pfizer vaccines use mRNA technology."
]

# Extract one [CLS] vector per sentence
cls_vectors = []
with torch.no_grad():  # gradients are not needed for feature extraction
    for text in texts:
        # The tokenizer adds [CLS] at position 0 and [SEP] at the end
        inputs = tokenizer(text, return_tensors='pt', truncation=True)
        outputs = model(**inputs)
        # last_hidden_state has shape (batch, seq_len, hidden_size)
        cls_vector = outputs.last_hidden_state[0, 0]  # [CLS] vector
        cls_vectors.append(cls_vector.numpy())
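If several sentences are tokenized together (for example with padding=True), the model returns a single tensor of shape (batch, seq_len, hidden_size), and the [CLS] vectors can be taken with one slice instead of a loop. The sketch below illustrates only the indexing: it uses a random NumPy array as a stand-in for outputs.last_hidden_state, and the sequence length of 16 is an arbitrary assumption (768 is the hidden size of bert-base-uncased).

```python
import numpy as np

# Stand-in for outputs.last_hidden_state from a padded batch of 3 sentences.
# Assumed shape: (batch_size, seq_len, hidden_size); values are random.
batch_size, seq_len, hidden_size = 3, 16, 768
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_size))

# Position 0 of every sequence holds the [CLS] token, so a single slice
# collects all sentence vectors at once.
cls_batch = last_hidden_state[:, 0, :]

# Equivalent to indexing [i][0] one sentence at a time, as in a loop.
cls_loop = np.stack([last_hidden_state[i][0] for i in range(batch_size)])

print(cls_batch.shape)                      # (3, 768)
print(np.array_equal(cls_batch, cls_loop))  # True
```

The same slice works on a PyTorch tensor, so with a real batched model output it would read outputs.last_hidden_state[:, 0, :].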

5. Result Analysis

Running the above code produces one [CLS] vector per sentence. For bert-base-uncased, each vector has 768 dimensions. These vectors place the sentences in a high-dimensional space and can serve as input features for downstream NLP tasks such as classification or clustering.

5.1. Example of Vector Visualization

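The extracted vectors can be visualized or clustered to analyze the similarity between sentences. Here we project them onto two dimensions with PCA; with only three sentences, the resulting plot is purely illustrative.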


import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Reduce vectors to 2 dimensions
pca = PCA(n_components=2)
reduced_vectors = pca.fit_transform(np.array(cls_vectors))

# Visualization
plt.figure(figsize=(10, 6))
for i, text in enumerate(texts):
    plt.scatter(reduced_vectors[i, 0], reduced_vectors[i, 1])
    plt.annotate(text, (reduced_vectors[i, 0], reduced_vectors[i, 1]))
plt.title('BERT [CLS] Vectors Visualization')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.grid()
plt.show()
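Besides plotting, the similarity between sentences can also be measured directly with cosine similarity. The following is a minimal sketch; the helper name cosine_similarity_matrix is introduced here for illustration, and the toy 3-dimensional vectors are stand-ins for the 768-dimensional [CLS] vectors.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine similarity of the rows of `vectors`."""
    vectors = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T

# Toy vectors standing in for the real [CLS] vectors.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],  # same direction as the first row
    [0.0, 3.0, 0.0],  # orthogonal to the first two rows
]
sim = cosine_similarity_matrix(toy_vectors)
print(np.round(sim, 2))
```

In the toy data, the first two rows have similarity 1 with each other and 0 with the third. For the sentences in this tutorial, the call would be cosine_similarity_matrix(np.array(cls_vectors)), which returns a 3 x 3 matrix with ones on the diagonal.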

6. Conclusion

In this tutorial, we covered how to extract [CLS] vectors from texts related to the Moderna and Pfizer vaccines using the BERT model from the Hugging Face Transformers library. These vectors provide a starting point for representing the meaning of text data numerically and using it in downstream NLP applications.

The same technique can be applied in many fields, such as the analysis of research papers and of public opinion. Later, we will address further applications of these vectors, such as classification problems and sentiment analysis.
