Using Hugging Face Transformers, Classification Report

In recent years, the field of Natural Language Processing (NLP) has made significant advancements. At the center of this progress are deep learning and Transformer models, and Hugging Face’s Transformers library in particular is widely used by researchers and developers. In this article, we will explore how to train and evaluate a text classification model using Hugging Face’s Transformers library.

1. Introduction to Hugging Face Transformers Library

Hugging Face’s Transformers library is an open-source library that helps users easily utilize various pre-trained transformer models and fine-tune them according to their own data. It includes various models such as BERT, GPT-2, and RoBERTa, and its API is intuitive and easy to use.
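As a quick illustration of how little code this takes, the sketch below loads a ready-made sentiment-analysis pipeline; note that the specific model the pipeline downloads by default can vary between library versions.

python
from transformers import pipeline

# Load a ready-made sentiment-analysis pipeline (a default pretrained model is downloaded)
classifier = pipeline('sentiment-analysis')

# Run inference on a sample sentence
print(classifier('This movie was absolutely wonderful!'))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]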

2. Definition of Text Classification Problem

Text classification is the task of assigning one or more class labels to a given piece of text. For example, it involves determining whether an email is spam or not, or classifying a movie review as positive or negative. In this course, we will build a model that classifies IMDB movie reviews as positive or negative using a simple example.

3. Data Loading and Basic Preprocessing

First, we will install the necessary libraries and load the IMDB dataset. The IMDB dataset includes movie reviews and their corresponding sentiment labels.

python
# Install necessary libraries
!pip install transformers torch datasets

# Import libraries
from datasets import load_dataset

# Load IMDB dataset
dataset = load_dataset('imdb')

print(dataset)

When the above code is executed, you will see that the IMDB dataset has been loaded and split into train and test sets. Each split includes movie reviews and their corresponding sentiment labels.

4. Data Preprocessing

To input data into the model, text tokenization and encoding are needed. We will process the text using the tokenizer provided by Hugging Face’s transformer models.

python
from transformers import AutoTokenizer

# Set model name
model_name = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Check sample data
sample_text = dataset['train'][0]['text']
encoded_input = tokenizer(sample_text, padding='max_length', truncation=True, return_tensors='pt')

print(encoded_input)

The above code tokenizes the first movie review using the DistilBERT model’s tokenizer and outputs the encoded tensor after applying padding and truncation to fit the maximum length.
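If you want to sanity-check the encoding, you can map the input IDs back to tokens; the short sketch below is optional and simply makes the tokenizer’s output easier to read.

python
# Optional: inspect the encoded output
tokens = tokenizer.convert_ids_to_tokens(encoded_input['input_ids'][0].tolist())
print(tokens[:10])                       # starts with the special [CLS] token
print(encoded_input['input_ids'].shape)  # (1, max_length) after padding/truncation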

5. Model Definition and Training

Now we will define the model and proceed with training. The Hugging Face Trainer API allows us to conduct the training process conveniently.

python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
import numpy as np

# Load model
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize the dataset so the Trainer receives model-ready inputs
def tokenize_function(batch):
    return tokenizer(batch['text'], padding='max_length', truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Accuracy metric used during evaluation
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    return {'accuracy': (preds == labels).mean()}

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

# Create Trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    compute_metrics=compute_metrics
)

# Train the model
trainer.train()

The above code tokenizes the IMDB dataset, defines a DistilBERT-based text classification model, and trains it using the Trainer API. During training, checkpoints are saved to the ./results folder.
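Checkpoints are written automatically during training, but if you also want to keep the final model and tokenizer in a directory of your choice (the path below is only an example), you can save them explicitly:

python
# Optionally persist the final model and tokenizer for later reuse
trainer.save_model('./results/final_model')
tokenizer.save_pretrained('./results/final_model')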

6. Model Evaluation

After training the model, we will evaluate its performance using the test dataset. We will use accuracy as the evaluation metric.

python
# Evaluate with test dataset
results = trainer.evaluate()

print(f"Accuracy: {results['eval_accuracy']:.2f}")

After model evaluation, the accuracy will be printed. This allows us to check the model’s performance.

7. Predictions and Classification Report

Now we can use the trained model to make predictions. The following code runs the model on the test dataset and prints a classification report.

python
from sklearn.metrics import classification_report
import numpy as np

# Prepare prediction data
predictions = trainer.predict(tokenized_dataset['test'])
preds = np.argmax(predictions.predictions, axis=1)

# Print classification report
report = classification_report(dataset['test']['label'], preds)
print(report)

The above code performs predictions on the test dataset and uses sklearn’s classification_report to output metrics such as Precision, Recall, and F1-Score. This report provides detailed information about the model’s performance.
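If you prefer readable class names over 0/1 in the report, classification_report accepts a target_names argument, and a confusion matrix is a common companion; the label names below assume the IMDB convention of 0 = negative and 1 = positive.

python
from sklearn.metrics import classification_report, confusion_matrix

# Report with readable class names (IMDB convention: 0 = negative, 1 = positive)
print(classification_report(dataset['test']['label'], preds, target_names=['negative', 'positive']))

# Confusion matrix: rows are true labels, columns are predicted labels
print(confusion_matrix(dataset['test']['label'], preds))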

8. Conclusion and Next Steps

In this course, we explored how to build a simple text classification model using Hugging Face’s Transformers library and evaluate it. To continuously improve the model’s performance, more diverse techniques can be applied during the data preprocessing stage, or hyperparameter tuning can be considered.

In the future, I plan to cover various natural language processing problems and conduct advanced courses utilizing the Hugging Face Transformers library, so I appreciate your interest. Thank you!

Using Hugging Face Transformers, Moderna Pfizer Covid-19 Vaccine BERT [CLS] Vector Extraction

With the advancement of deep learning and natural language processing (NLP), many companies are exploring various methods to analyze text data. Among these, BERT (Bidirectional Encoder Representations from Transformers) has established itself as an innovative model for deeply understanding the meaning of text data. In this course, we will cover how to extract the [CLS] vector from texts related to Moderna and Pfizer Covid-19 vaccines using Hugging Face’s Transformers library.

1. Introduction to the BERT Model

BERT is a pre-trained language model developed by Google that understands the context of a given sentence and can be utilized for various natural language processing tasks. The structure of BERT is as follows:

  • Bidirectional: BERT processes sentences in both directions to understand the context. This allows it to grasp the meaning of words in relation to surrounding words.
  • Transformer: BERT is based on the Transformer architecture and learns the relationships between all words in a sentence through the self-attention mechanism.
  • [CLS] Token: A special [CLS] token is always prepended to the input sequence fed to the BERT model. The hidden vector of this token summarizes the meaning of the whole sentence and plays an important role in classification tasks (a short sketch follows this list).
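A quick way to see the [CLS] token in practice is to encode a sentence and inspect the tokens BERT actually receives. This small sketch assumes the transformers library from the next section is already installed.

from transformers import BertTokenizer

# Encode a sentence and look at the tokens BERT actually sees
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
encoded = tokenizer("The Moderna vaccine uses mRNA technology.")
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
# The first token is always [CLS]; its hidden vector summarizes the whole sentence.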

2. Installing the Hugging Face Transformers Library

The Hugging Face Transformers library provides various models and tokenizers for natural language processing tasks. The installation proceeds as follows:

pip install transformers torch

3. Data Preparation

Now, we will prepare the documents related to Moderna and Pfizer. Here, we will use simple sentences as examples. In actual use, more data should be collected.

texts = [
    "The Moderna Covid-19 vaccine showed an efficacy of 94.1%.",
    "The efficacy of the Pfizer vaccine was reported to be 95%.",
    "Both the Moderna and Pfizer vaccines use mRNA technology."
]

4. Loading the BERT Model and Extracting Vectors

After loading the BERT model and tokenizer, we will introduce how to extract the [CLS] vector for the input sentences.


from transformers import BertTokenizer, BertModel
import torch

# Load BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Input text
texts = [
    "The Moderna Covid-19 vaccine showed an efficacy of 94.1%.",
    "The efficacy of the Pfizer vaccine was reported to be 95%.",
    "Both the Moderna and Pfizer vaccines use mRNA technology."
]

# Extract [CLS] vectors
cls_vectors = []
for text in texts:
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    with torch.no_grad():  # no gradients are needed for feature extraction
        outputs = model(**inputs)
    cls_vector = outputs.last_hidden_state[0][0]  # hidden state of the [CLS] token
    cls_vectors.append(cls_vector.numpy())

5. Result Analysis

By running the above code, the [CLS] vectors for each sentence will be extracted. These vectors represent the meaning of the sentences in a high-dimensional space and can be utilized in various subsequent NLP tasks.
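As a simple example of such a downstream use, the sketch below compares the sentences by the cosine similarity of their [CLS] vectors, reusing the cls_vectors list built above.

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Pairwise cosine similarity between the extracted [CLS] vectors
similarity = cosine_similarity(np.array(cls_vectors))
print(np.round(similarity, 3))  # entry (i, j) is the similarity between sentence i and sentence j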

5.1. Example of Vector Visualization

The extracted vectors can be visualized or clustered to analyze the similarity between sentences.


import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Reduce vectors to 2 dimensions
pca = PCA(n_components=2)
reduced_vectors = pca.fit_transform(np.array(cls_vectors))

# Visualization
plt.figure(figsize=(10, 6))
for i, text in enumerate(texts):
    plt.scatter(reduced_vectors[i, 0], reduced_vectors[i, 1])
    plt.annotate(text, (reduced_vectors[i, 0], reduced_vectors[i, 1]))
plt.title('BERT [CLS] Vectors Visualization')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.grid()
plt.show()

6. Conclusion

In this course, we covered the process of extracting [CLS] vectors from texts related to Moderna and Pfizer vaccines using the Hugging Face Transformers library with the BERT model. Through this, we have laid the foundation for understanding the meaning of text data and its application in various NLP applications.

These technologies can be applied in many fields, such as research papers and social opinion analysis, and will continue to advance in the future. Later, we will address more diverse application examples, such as classification problems and sentiment analysis using these vectors.

Hugging Face Transformer Practical Lecture, Moderna vs Pfizer t-SNE Visualization

In recent years, the biomedical field has seen many innovations driven by advances in deep learning. In particular, the Hugging Face Transformers library has gained significant attention in natural language processing (NLP), providing a wide range of models and tools. This lecture explains how to visualize vaccine-related data from Moderna and Pfizer with t-SNE using Hugging Face Transformers.

1. Understanding Transformers

The transformer model was first introduced in the 2017 paper “Attention is All You Need.” Unlike traditional RNN and LSTM models, transformers use a self-attention mechanism that allows them to process all positions of input data simultaneously. Thanks to this feature, transformer models demonstrate outstanding performance on large datasets.
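To make the self-attention idea concrete, the sketch below computes scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, on a toy input. It is only an illustration of the mechanism, not code taken from the Transformers library.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # similarity of every query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                                       # weighted sum of the values

# Toy example: 3 token positions with dimension 4
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)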

2. Installing and Setting Up the Hugging Face Library

To use the Hugging Face Transformer library, you must first install it. You can do this using the following command:

pip install transformers datasets

3. Data Preparation

This lecture uses vaccine-related data from Moderna and Pfizer to perform t-SNE visualization. The steps for collecting and preprocessing the data are as follows:

  • Collecting text data for each vaccine
  • Preprocessing the collected text data (converting to lowercase, removing punctuation, etc.)
  • Creating embeddings using the preprocessed text

3.1 Example of Data Collection

You can create a dataset by crawling articles about Moderna and Pfizer or by using pre-prepared CSV files. Below is an example of loading datasets for Moderna and Pfizer.

import pandas as pd

moderna_df = pd.read_csv('moderna.csv')
pfizer_df = pd.read_csv('pfizer.csv')

# Checking the data
print(moderna_df.head())
print(pfizer_df.head())

4. Creating Text Embeddings

To enable the model to understand text, you need to create embedding vectors. You can use Hugging Face’s ‘BERT’ or ‘DistilBERT’ models to embed the text. Refer to the code below to create embeddings.

from transformers import DistilBertTokenizer, DistilBertModel
import torch

# Combining text data from Moderna and Pfizer
texts = list(moderna_df['text']) + list(pfizer_df['text'])

# Initializing the model and tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained('distilbert-base-uncased')

# Tokenizing the input data
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

embeddings = outputs.last_hidden_state.mean(dim=1)  # Creating embedding by averaging

5. t-SNE Visualization

t-SNE (t-distributed Stochastic Neighbor Embedding) is a dimensionality-reduction technique that preserves the local neighborhood structure of high-dimensional points, making it effective for visualizing high-dimensional data in two or three dimensions. The code below demonstrates how to visualize the data distribution for Moderna and Pfizer using t-SNE.

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Reducing dimensions to 2D using t-SNE
tsne = TSNE(n_components=2, random_state=0)
tsne_results = tsne.fit_transform(embeddings.numpy())

# Visualizing the results
plt.figure(figsize=(10, 7))
plt.scatter(tsne_results[:len(moderna_df), 0], tsne_results[:len(moderna_df), 1], label='Moderna', alpha=0.5)
plt.scatter(tsne_results[len(moderna_df):, 0], tsne_results[len(moderna_df):, 1], label='Pfizer', alpha=0.5)
plt.title('t-SNE Visualization of Moderna vs Pfizer')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.legend()
plt.show()

6. Analyzing Results

By analyzing the results of t-SNE, you can understand the relationship between Moderna and Pfizer. By visually examining how the data points are distributed, you can also learn about the characteristics and differences of each vaccine. This analysis can contribute to scientific research and the formulation of marketing strategies.

Conclusion

Using Hugging Face’s transformer models makes it easy to create embeddings for complex text data, allowing for the analysis of data through various visualization techniques. The knowledge gained from this lecture will be greatly helpful in analyzing bio data, particularly on sensitive topics such as vaccine data. In the future, deeper analyses can be conducted using other models and techniques.

Hugging Face Transformer Utilization Course, Moderna COVID-19 Wikipedia Text Retrieval

With the advancements in deep learning and Natural Language Processing (NLP), the methods for processing and analyzing text data have diversified. In this post, I will explain in detail how to retrieve COVID-19 information related to Moderna from Wikipedia using the Hugging Face library. Hugging Face Transformers provide many pretrained models widely used in NLP tasks, allowing users to easily analyze text data.

1. What is Hugging Face?

Hugging Face is a platform that provides various pretrained models and tools to facilitate easy use of NLP models. In particular, the Transformers library includes various state-of-the-art transformer models, such as BERT, GPT-2, and T5, enabling users to perform natural language processing tasks more easily.

1.1 Key Features of Hugging Face

  • Provision of pretrained models: Pretrained models for various NLP tasks are available.
  • Easy utilization of models: Models can be used straightforwardly without writing complex code.
  • Large community: User-created models and datasets are shared, providing various options to choose from.

2. Installation and Environment Setup

To use the Hugging Face library, you need to set up a Python environment. You can install the required libraries with the command below.

pip install transformers wikipedia-api

3. Retrieving Information from Wikipedia

To retrieve information related to Moderna and COVID-19 from Wikipedia, we will use wikipedia-api. This library provides the ability to easily search Wikipedia pages and fetch their content.

3.1 Example of Retrieving Wikipedia Data

The code below is a simple example that searches for information about Moderna and prints the content.

import wikipediaapi

# Initialize Wikipedia API (recent versions of wikipedia-api require a user agent string;
# replace the example value below with your own descriptive user agent)
wiki_wiki = wikipediaapi.Wikipedia(user_agent='hf-course-example', language='en')

# Retrieve "Moderna" page
page = wiki_wiki.page("Moderna")

# Print page content
if page.exists():
    print("Title: ", page.title)
    print("Summary: ", page.summary[0:1000])  # Print first 1000 characters
else:
    print("Page does not exist.")

By running the above code, you can retrieve content from the Wikipedia page of Moderna. Now, let’s check for additional information related to COVID-19.

3.2 Retrieving COVID-19 Related Information

Similarly, the code to retrieve information about COVID-19 from Wikipedia is as follows.

# Retrieve "COVID-19" page
covid_page = wiki_wiki.page("COVID-19")

# Print page content
if covid_page.exists():
    print("Title: ", covid_page.title)
    print("Summary: ", covid_page.summary[0:1000])  # Print first 1000 characters
else:
    print("Page does not exist.")

4. Text Preprocessing

The text retrieved from Wikipedia must go through a preprocessing step before being inputted into the model. This process involves removing unnecessary characters or symbols and organizing the necessary information.

4.1 Preprocessing Steps

The code below shows how to remove unnecessary characters from the retrieved text and organize it in a list format.

import re

def preprocess_text(text):
    # Remove special characters
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)
    # Replace multiple spaces with a single space
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Example of preprocessing
processed_text_moderna = preprocess_text(page.summary)
processed_text_covid = preprocess_text(covid_page.summary)

print("Processed Moderna Text: ", processed_text_moderna)
print("Processed COVID-19 Text: ", processed_text_covid)

5. Analyzing Information with Hugging Face Transformers

To analyze the retrieved data, we can use Hugging Face Transformers. Here, we will look at how to input the preprocessed text into the BERT model to extract features.

5.1 Using BERT Model

Let’s use the Hugging Face BERT model to extract features from the preprocessed text. Please refer to the code below.

from transformers import BertTokenizer, BertModel
import torch

# Load BERT model and tokenizer
model_name = 'bert-base-multilingual-cased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# Tokenize text and convert to tensors
inputs = tokenizer(processed_text_moderna, return_tensors='pt', padding=True, truncation=True)

# Feed into model and extract features
with torch.no_grad():
    outputs = model(**inputs)

# Feature vectors: one hidden state per token
embeddings = outputs.last_hidden_state
print("Embedding Size: ", embeddings.shape)

6. Practice Example: Summarizing COVID-19 Related Documents

Now we will create a summary based on COVID-19 information. We can generate a summary using the GPT-2 model from the Hugging Face library.

6.1 Summarizing with GPT-2 Model

from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load GPT-2 model and tokenizer
gpt2_model = GPT2LMHeadModel.from_pretrained('gpt2')
gpt2_tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Input text for summarization
input_text = "COVID-19 is caused by SARS-CoV-2..."
input_ids = gpt2_tokenizer.encode(input_text, return_tensors='pt')

# Generate summary
summary_ids = gpt2_model.generate(input_ids, max_length=50, num_beams=5, early_stopping=True)
summary = gpt2_tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("Generated Summary: ", summary)

Conclusion

In this post, we explored how to retrieve information related to Moderna and COVID-19 from Wikipedia using Hugging Face Transformers, as well as the processes of preprocessing and analyzing the data. Hugging Face is a great tool for easily utilizing the latest natural language processing models, enabling more effective utilization of text data. In the future, we can further develop our data analysis skills through various NLP tasks.

Moreover, Hugging Face continuously adds new models and datasets in collaboration with the community, so ongoing learning and application are encouraged. I hope you will take on a variety of NLP tasks and achieve great results.

Using Hugging Face Transformers, Label Encoding

In this course, we will explain in detail label encoding, an important preprocessing step when building deep learning models. Label encoding is a technique mainly used in classification problems; it converts categorical data into numbers, which helps machine learning algorithms understand the input data.

The Necessity of Label Encoding

Most machine learning models accept numerical data as input. However, our data is often provided as categorical text. For instance, when there are two labels, cat and dog, we cannot feed these strings directly into the model; through label encoding, cat is converted to 0 and dog to 1.

Introduction to Hugging Face Transformers Library

Hugging Face provides libraries and a model hub that make it easy to use natural language processing (NLP) models and datasets. Among them, the Transformers library offers a variety of pre-trained models, allowing developers to easily build and fine-tune NLP models.

Python Code Example for Label Encoding

In this example, we will perform label encoding using the sklearn library’s LabelEncoder class.

python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Example data creation
data = {'Animal': ['cat', 'dog', 'dog', 'cat', 'rabbit']}
df = pd.DataFrame(data)

print("Original data:")
print(df)

# Initialize label encoder
label_encoder = LabelEncoder()

# Perform label encoding
df['Animal_Encoding'] = label_encoder.fit_transform(df['Animal'])

print("\nData after label encoding:")
print(df)
    

Code Explanation

1. First, we create a simple DataFrame using the pandas library.
2. Then, we initialize the LabelEncoder class and use the fit_transform method to convert the categorical data in the Animal column to numbers.
3. Finally, we add the encoded data as a new column and display it.

Label Encoding in Training and Test Data

When building a machine learning model, label encoding must be applied to both the training and test data. A crucial point to remember is to call the fit method on the training data only, and then call the transform method on both the training and test data, so that the same encoding is applied consistently.

python
# Create training and test data
# Note: every label that appears in the test data must also appear in the training data,
# otherwise transform() raises an error for unseen labels.
train_data = {'Animal': ['cat', 'dog', 'dog', 'cat', 'rabbit']}
test_data = {'Animal': ['cat', 'rabbit']}

train_df = pd.DataFrame(train_data)
test_df = pd.DataFrame(test_data)

# Fit on training data
label_encoder = LabelEncoder()
label_encoder.fit(train_df['Animal'])

# Encode training data
train_df['Animal_Encoding'] = label_encoder.transform(train_df['Animal'])

# Encode test data
test_df['Animal_Encoding'] = label_encoder.transform(test_df['Animal'])

print("Training data encoding result:")
print(train_df)

print("\nTest data encoding result:")
print(test_df)
    

Explanation for Understanding

The code above creates separate training and test DataFrames and fits the LabelEncoder on the training data only. The fitted encoder is then used to apply consistent label encoding to both the training and test data.
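The fitted encoder can also map numeric labels back to the original strings via inverse_transform, which is convenient when presenting model predictions.

python
# Decode the numeric codes back to the original string labels
decoded = label_encoder.inverse_transform(test_df['Animal_Encoding'])
print(decoded)  # e.g. ['cat' 'rabbit']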

Limitations and Cautions

While label encoding is simple and useful, it has limitations. For ordinal categories such as small, medium, and large, LabelEncoder assigns codes alphabetically, so the resulting numbers (large = 0, medium = 1, small = 2) do not reflect the natural order of the sizes. Conversely, for purely nominal categories, the numeric codes imply an ordering that does not actually exist. In such cases, one-hot encoding, or an explicitly ordered encoding, should be considered, as sketched below.
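The sketch below shows both alternatives: one-hot encoding with pandas for nominal categories, and an ordinal encoding with an explicitly specified order for ordered categories.

python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# One-hot encoding: no order is implied between the categories
animals = pd.DataFrame({'Animal': ['cat', 'dog', 'rabbit']})
print(pd.get_dummies(animals, columns=['Animal']))

# Ordinal encoding with an explicitly specified, meaningful order
sizes = pd.DataFrame({'Size': ['small', 'large', 'medium']})
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
sizes['Size_Encoding'] = encoder.fit_transform(sizes[['Size']]).ravel()
print(sizes)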

Conclusion

In this course, we learned about the importance of label encoding and how to implement it with scikit-learn's LabelEncoder. Such data preprocessing significantly affects the performance of deep learning and machine learning models, so it is essential to understand and apply it well.

Additional Resources

For more information, please refer to the official Hugging Face documentation.