Hugging Face Transformers Practical Lecture: Moderna vs. Pfizer t-SNE Visualization

In recent years, advances in deep learning have driven many innovations in the biomedical field. In particular, the Hugging Face Transformers library, which provides pretrained models and supporting tools, has gained significant attention in natural language processing (NLP). This lecture shows how to embed vaccine-related text about Moderna and Pfizer with Hugging Face Transformers and visualize the embeddings with t-SNE.

1. Understanding Transformers

The transformer model was first introduced in the 2017 paper “Attention Is All You Need.” Unlike traditional RNN and LSTM models, which read a sequence one step at a time, transformers use a self-attention mechanism that lets every position of the input attend to every other position in a single step. Because this computation is not sequential, training parallelizes well on modern hardware, which is a key reason transformer models perform so well on large datasets.
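
To make the idea of self-attention concrete, here is a minimal single-head sketch in NumPy. It is only an illustration: the projection matrices are random rather than learned, and the function name and dimensions are assumptions chosen for this example, not part of any library.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention for a (seq_len, d_model) input."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])       # (seq_len, seq_len) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ v, weights

# Illustrative sizes and random (untrained) projections
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape)          # one output vector per input position
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```

All positions are handled by the same matrix products, which is what allows the whole sequence to be processed at once.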

2. Installing and Setting Up the Hugging Face Library

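To use the Hugging Face Transformers library, you must first install it. The examples in this lecture also import PyTorch, pandas, scikit-learn, and matplotlib, so install those as well:

pip install transformers datasets torch pandas scikit-learn matplotlib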

3. Data Preparation

This lecture uses vaccine-related text about Moderna and Pfizer for the t-SNE visualization. Collecting and preprocessing the data involves the following steps:

Collecting text data for each vaccine
Preprocessing the collected text data (converting to lowercase, removing punctuation, etc.)
Creating embeddings using the preprocessed text
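
The preprocessing step above can be sketched with the Python standard library as follows. The function name preprocess and the sample sentence are hypothetical, and the exact cleaning rules are an assumption; adjust them to your data.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(text: str) -> str:
    """Lowercase the text, strip punctuation, and collapse repeated whitespace."""
    text = text.lower()
    text = text.translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", text).strip()

sample = "Moderna's mRNA-1273 vaccine, approved in 2020!"  # hypothetical example text
print(preprocess(sample))  # → modernas mrna1273 vaccine approved in 2020
```

Note that removing punctuation also merges tokens such as "mRNA-1273" into "mrna1273". Since the tokenizer for an uncased model such as distilbert-base-uncased already lowercases its input, lighter cleaning may be sufficient when the text is passed to a pretrained model.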

3.1 Example of Data Collection

You can create a dataset by crawling articles about Moderna and Pfizer or by using prepared CSV files. Below is an example of loading datasets for Moderna and Pfizer. The later steps assume that each file has a 'text' column containing one document per row.

import pandas as pd

moderna_df = pd.read_csv('moderna.csv')
pfizer_df = pd.read_csv('pfizer.csv')

# Checking the data
print(moderna_df.head())
print(pfizer_df.head())

4. Creating Text Embeddings

To compare texts numerically, you need to convert each one into an embedding vector. You can use Hugging Face’s BERT or DistilBERT models for this. The code below uses DistilBERT and averages the token vectors of each text to obtain one embedding per text.

from transformers import DistilBertTokenizer, DistilBertModel
import torch

# Combining text data from Moderna and Pfizer (Moderna rows first, then Pfizer)
texts = list(moderna_df['text']) + list(pfizer_df['text'])

# Initializing the model and tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')

# Tokenizing the input data; texts longer than the model's limit are truncated
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# Averaging the token vectors of each text, using the attention mask
# so that padding tokens do not distort the embeddings of shorter texts
mask = inputs['attention_mask'].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
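
The code above tokenizes and encodes every text in a single call, which may exhaust memory on a larger corpus. A common workaround is to process the texts in fixed-size batches. The sketch below shows only the batching logic in plain Python; the helper name batched, the batch size, and the stand-in texts are illustrative assumptions.

```python
from typing import Iterator, List, Sequence

def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])

texts = [f"document {i}" for i in range(10)]  # hypothetical stand-in for the real texts

for batch in batched(texts, batch_size=4):
    # In the real pipeline, each batch would be tokenized and passed to the model here
    print(len(batch), batch[0])
```

In practice, the embeddings computed for each batch would be collected in a list and concatenated afterwards, for example with torch.cat, so that the row order still matches the order of the texts.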

5. t-SNE Visualization

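t-SNE (t-distributed Stochastic Neighbor Embedding) is a dimensionality reduction technique that places points that are neighbors in the high-dimensional space close together in a low-dimensional map, which makes it effective for visualizing high-dimensional data in two or three dimensions. The code below demonstrates how to visualize the distribution of the Moderna and Pfizer text embeddings using t-SNE.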

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Reducing dimensions to 2D using t-SNE
# (perplexity defaults to 30 and must be smaller than the number of texts)
tsne = TSNE(n_components=2, random_state=0)
tsne_results = tsne.fit_transform(np.asarray(embeddings))  # works for a CPU tensor or an array

# Visualizing the results; the first n_moderna rows are Moderna, the rest are Pfizer
n_moderna = len(moderna_df)
plt.figure(figsize=(10, 7))
plt.scatter(tsne_results[:n_moderna, 0], tsne_results[:n_moderna, 1], label='Moderna', alpha=0.5)
plt.scatter(tsne_results[n_moderna:, 0], tsne_results[n_moderna:, 1], label='Pfizer', alpha=0.5)
plt.title('t-SNE Visualization of Moderna vs Pfizer')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.legend()
plt.show()
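
For reference, the quantity that t-SNE minimizes can be written out as below. The notation is introduced here for illustration and does not appear elsewhere in this lecture: x_i are the high-dimensional embeddings, y_i the 2D points, N the number of texts, and sigma_i a per-point bandwidth chosen so that the perplexity of the distribution p_{j|i} matches the perplexity setting.

```latex
\begin{align*}
p_{j|i} &= \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N} \\
q_{ij} &= \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}} \\
C &= \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
\end{align*}
```

Because the cost mainly penalizes placing true neighbors far apart, the map is most trustworthy for local neighborhoods and less so for large-scale distances.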

6. Analyzing Results

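The t-SNE plot shows how the embedded texts about Moderna and Pfizer are distributed relative to each other. If the two groups form separate clusters, this suggests that the texts differ systematically, for example in vocabulary or topic; if they overlap, the model represents them similarly. Keep in mind that the plot reflects how the vaccines are written about, not properties of the vaccines themselves, and that t-SNE mainly preserves local neighborhoods, so cluster sizes and distances between clusters should be interpreted with caution. With those caveats, this kind of analysis can support exploratory research and inform marketing strategy.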
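
Because a 2D plot can exaggerate or hide structure, it may help to check the separation numerically in the original embedding space as well. The sketch below uses two simple, illustrative measures on synthetic data; the function names and the random stand-in groups are assumptions, not part of the lecture's dataset.

```python
import numpy as np

def centroid_cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between the mean vectors of two embedding groups."""
    ca, cb = a.mean(axis=0), b.mean(axis=0)
    return float(ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb)))

def nearest_centroid_accuracy(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of points that lie closer to their own group's centroid than to the other's."""
    ca, cb = a.mean(axis=0), b.mean(axis=0)
    a_ok = np.linalg.norm(a - ca, axis=1) < np.linalg.norm(a - cb, axis=1)
    b_ok = np.linalg.norm(b - cb, axis=1) < np.linalg.norm(b - ca, axis=1)
    return float((a_ok.sum() + b_ok.sum()) / (len(a) + len(b)))

# Synthetic stand-ins for the two groups of embeddings (hypothetical data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, size=(50, 16))
group_b = rng.normal(loc=3.0, size=(50, 16))

print(centroid_cosine_similarity(group_a, group_b))
print(nearest_centroid_accuracy(group_a, group_b))
```

With real data, the two groups would be the Moderna and Pfizer rows of the embedding matrix. The accuracy figure is optimistic because the centroids are computed from the same points being scored, so treat it as a rough indicator rather than a proper evaluation.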

Conclusion

Using Hugging Face’s transformer models makes it easy to create embeddings for complex text data, which can then be explored with visualization techniques such as t-SNE. The approach covered in this lecture applies to biomedical text more broadly, including sensitive topics such as vaccines. In the future, deeper analyses can be conducted using other models and techniques.