Hugging Face Transformers Tutorial, Pfizer COVID-19 Wikipedia Text Retrieval

Fetching Pfizer COVID-19 Wikipedia Text

In this course, we will learn how to fetch COVID-19-related information about Pfizer from Wikipedia and process it with the Hugging Face Transformers library. The course is aimed at readers with a basic knowledge of natural language processing (NLP) and walks through how to use Hugging Face’s library comfortably from Python.

1. Environment Setup

First, we need to install the necessary libraries. Enter the code below to install transformers and wikipedia-api.

!pip install transformers wikipedia-api

2. Importing Libraries

Let’s import the necessary libraries: transformers provides easy access to natural language processing models, and wikipedia-api provides a convenient wrapper around the Wikipedia API.

import wikipediaapi
from transformers import pipeline

3. Fetching Information from Wikipedia

Now, let’s fetch COVID-19 and Pfizer-related information from Wikipedia. We will use wikipediaapi to get the information.

# Recent versions of wikipedia-api require a descriptive user agent string
wiki_wiki = wikipediaapi.Wikipedia(user_agent='hf-transformers-tutorial', language='en')
page = wiki_wiki.page("Pfizer–BioNTech COVID-19 vaccine")  # English Wikipedia article on the Pfizer vaccine

if page.exists():
    print(page.text[0:1000])  # Print the first 1000 characters
else:
    print("The page does not exist.") 

Code Explanation

The above code retrieves the “Pfizer–BioNTech COVID-19 vaccine” page from Wikipedia. If the page exists, it prints the first 1,000 characters, which lets us verify that we fetched the content we wanted.
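
Beyond the raw text, the page object exposes a few convenient attributes: title gives the resolved article title, and summary returns only the lead section, which is often enough for a quick check. A short illustration:

if page.exists():
    print("Resolved title:", page.title)   # the canonical article title
    print(page.summary[:500])              # lead section only, instead of the full text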

4. Summarizing the Text

Since the fetched data contains many long sentences, let’s summarize it using a natural language processing model. We will use the summarization model provided by the Hugging Face transformers library.

summarizer = pipeline("summarization")

summary = summarizer(page.text[:2000], max_length=130, min_length=30, do_sample=False)  # the default model accepts roughly 1,024 tokens, so only the opening of the article is passed

print("Summary:")
for s in summary:
    print(s['summary_text'])

Code Explanation

This code performs text summarization through the Hugging Face “summarization” pipeline. You can adjust the length of the summary by setting max_length and min_length.
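
Keep in mind that most summarization checkpoints accept only about 1,024 input tokens, so a full Wikipedia article cannot be summarized in a single call. Below is a minimal sketch that summarizes the article chunk by chunk; the 2,000-character chunk size and the facebook/bart-large-cnn checkpoint are illustrative choices rather than requirements.

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Split the first part of the article into fixed-size character chunks
chunks = [page.text[i:i + 2000] for i in range(0, min(len(page.text), 10000), 2000)]

partial_summaries = []
for chunk in chunks:
    result = summarizer(chunk, max_length=130, min_length=30, do_sample=False)
    partial_summaries.append(result[0]['summary_text'])

print("\n".join(partial_summaries))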

5. Conclusion

In this course, we learned how to fetch and summarize Pfizer’s COVID-19 information using Hugging Face Transformers and the Wikipedia API. We hope you have glimpsed the possibilities of natural language processing. These techniques can be applied in various fields and are useful tools for your projects.

6. Next Steps

Furthermore, try various natural language processing tasks such as sentiment analysis, question-answering systems, and document classification! We recommend exploring Hugging Face’s model hub to find and utilize models that suit you.

Thank you!

Hugging Face Transformers Practical Course, Training and Validation Dataset Split

The importance of Natural Language Processing (NLP) in Artificial Intelligence (AI) and Machine Learning is growing day by day, and the Hugging Face Transformers library sits at the center of this trend. The library makes it easy to work with a wide range of NLP models and, in particular, to apply pre-trained models with very little code. In this course, we will show you how to split a dataset into training and validation sets for use with the Hugging Face Transformers library.

1. Preparing the Dataset

The first step is to prepare the dataset to be used. Generally, a labeled dataset is required to solve NLP problems. In this example, we will use the IMDb Movie Reviews Dataset to train a model that classifies positive and negative reviews. This dataset is widely used and consists of the text of movie reviews and their corresponding labels (positive/negative).

1.1 Downloading the Dataset

from datasets import load_dataset

dataset = load_dataset("imdb")

You can download the IMDb dataset using the above code. The load_dataset function is one available in the Hugging Face datasets library, which allows you to easily download various public datasets.

1.2 Checking the Dataset Structure

print(dataset)

You can check the structure of the downloaded dataset. It is divided into training (train), test (test), and unlabeled (unsupervised) splits; note that it does not ship with a separate validation split, which is why we create one ourselves in the next step.

2. Splitting the Dataset

In general, it is important to split the data into several parts to train a model in machine learning. Typically, the training data and validation data are split, where the training data is used to train the model, and the validation data is used to evaluate its performance. In this case, we will extract a portion of the training data to use as validation data.

2.1 Splitting Training and Validation Data

from sklearn.model_selection import train_test_split

train_data = dataset['train']
train_texts = train_data['text']
train_labels = train_data['label']

train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_texts,
    train_labels,
    test_size=0.1,  # Using 10% as validation set
    random_state=42
)

The above code uses the train_test_split function to split the training data 90/10. Since test_size=0.1 is set, 10% of the original training data is held out as validation data, and the random_state parameter makes the split reproducible.
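
As an aside, the datasets library provides its own split helper, which keeps the result as Dataset objects rather than Python lists. A brief sketch, assuming the imdb dataset loaded above:

split = dataset['train'].train_test_split(test_size=0.1, seed=42)
train_split = split['train']   # 90% of the original training data
val_split = split['test']      # 10% held out for validation
print(len(train_split), len(val_split))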

2.2 Checking the Split Data

print("Number of training samples:", len(train_texts))
print("Number of validation samples:", len(val_texts))

You can now check the number of training and validation samples. This information helps to determine whether our data has been properly split.

3. Preparing the Hugging Face Transformer Model

After splitting the dataset, we need to prepare the model. Hugging Face’s Transformer library provides a variety of pre-trained models, allowing us to choose a model suitable for our needs.

3.1 Selecting a Pre-trained Model

from transformers import BertTokenizer, BertForSequenceClassification

model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

We prepare the BERT model using BertTokenizer and BertForSequenceClassification. This model is suitable for text classification tasks and uses the pre-trained version called “bert-base-uncased.”

3.2 Tokenizing the Data

train_encodings = tokenizer(train_texts, truncation=True, padding=True, return_tensors='pt')
val_encodings = tokenizer(val_texts, truncation=True, padding=True, return_tensors='pt')

We tokenize the training and validation data using the tokenizer. truncation=True handles inputs that exceed length limits, and padding=True ensures all inputs are of equal length.
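
Since IMDb reviews can be several thousand characters long, it can also help to cap the sequence length explicitly. The value 256 below is an illustrative trade-off between memory and accuracy (BERT accepts at most 512 tokens):

train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=256, return_tensors='pt')
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=256, return_tensors='pt')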

4. Training the Model

To train the model, we can manipulate the data in batches using PyTorch’s DataLoader. We will also set the optimizer and loss function to train the model.

4.1 Preparing the Data Loader

import torch
from torch.utils.data import DataLoader, Dataset

class IMDbDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = IMDbDataset(train_encodings, train_labels)
val_dataset = IMDbDataset(val_encodings, val_labels)

train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)

A new dataset class is defined inheriting from the Dataset class, and we use DataLoader for batch processing. A batch size of 16 is used.

4.2 Setting Up Model Training

from torch.optim import AdamW  # transformers.AdamW has been deprecated in favor of PyTorch's implementation

optimizer = AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):  # Total 3 epochs
    total_loss = 0
    for batch in train_loader:
        optimizer.zero_grad()
        outputs = model(**batch)
        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
    print(f"Epoch: {epoch + 1}, Loss: {total_loss / len(train_loader)}")

We train the model using the AdamW optimization algorithm. The total loss is calculated and output for each epoch. In this example, training is done for 3 epochs.
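
The loop above runs on the CPU by default. On a machine with a GPU, both the model and every batch must be moved to the same device; a minimal sketch of the same loop with device handling:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

model.train()
for epoch in range(3):
    total_loss = 0
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}  # move all tensors to the chosen device
        optimizer.zero_grad()
        outputs = model(**batch)
        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
    print(f"Epoch: {epoch + 1}, Loss: {total_loss / len(train_loader)}")

The same pattern applies to the evaluation loop in the next section.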

5. Evaluating the Model

After training the model, we need to evaluate its performance on the validation data. This will help us determine how well the model generalizes.

5.1 Defining the Model Evaluation Function

from sklearn.metrics import accuracy_score

def evaluate_model(model, val_loader):
    model.eval()
    all_labels = []
    all_preds = []

    with torch.no_grad():
        for batch in val_loader:
            outputs = model(**batch)
            preds = outputs.logits.argmax(dim=-1)
            all_labels.extend(batch['labels'].tolist())
            all_preds.extend(preds.tolist())

    accuracy = accuracy_score(all_labels, all_preds)
    return accuracy

accuracy = evaluate_model(model, val_loader)
print("Validation Accuracy:", accuracy)

We define the evaluate_model function to assess the model’s performance. The accuracy on the validation data is printed to gauge the model’s performance.

6. Conclusion

In this course, we learned how to handle the IMDb movie reviews dataset using Hugging Face’s Transformers library. We walked through the entire process of splitting the dataset, training the model, and evaluating its performance. Through this process, we hope you gained a fundamental understanding of the NLP workflow. These techniques can be applied to various language models, enabling you to achieve better results.

Using Hugging Face Transformers Tutorial, BART Tokenization Before Model Training

Recently, the fields of artificial intelligence and natural language processing (NLP) have made remarkable advancements.
In particular, the transformer architecture has brought innovative results to various NLP tasks.
In this article, we will take a closer look at the tokenization process, which plays a crucial role in data preprocessing, focusing on the
BART (Bidirectional and Auto-Regressive Transformers) model from the Hugging Face library.
BART combines the strengths of BERT (a bidirectional encoder) and GPT (an auto-regressive decoder),
and it is used for diverse tasks such as text summarization, translation, and question generation.

1. Understanding the BART Model

BART has an encoder-decoder structure and is trained as a denoising autoencoder: the input text is corrupted and the decoder learns to reconstruct it auto-regressively.
Thanks to this structure, BART copes well with noisy or varied text and performs strongly on NLP tasks such as
text generation, summarization, and translation.
The key features of BART can be summarized as follows:

  • Both an encoder and a decoder are present, allowing flexible use across a variety of tasks
  • The decoder generates words auto-regressively, conditioning on the previously generated context
  • Large-scale pre-training lets the model learn rich representations of text

2. The Necessity of Tokenization

The first step in training natural language processing models is to convert the data into an appropriate format.
A tokenizer splits the text into smaller units, or tokens, to help the model understand it.
This allows the model to better comprehend the relationship between sentences and words.
Tokenization is an essential process in BART and plays a significant role in preparing text data.

3. Installing the Hugging Face Library

Before starting the tokenization process, you need to install Hugging Face’s Transformers library.
You can easily install it with the command below.

pip install transformers

4. Using the BART Tokenizer

Now, let’s use the BART model’s tokenizer to tokenize some text.
Here, we will load BART’s pre-trained model and tokenize a text example,
printing out the tokens and their indices.

4.1. Python Code Example


from transformers import BartTokenizer

# Load the BART tokenizer
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')

# Test text
text = "Deep learning is a field of artificial intelligence."

# Tokenization
tokens = tokenizer.tokenize(text)
print("Tokenization result:", tokens)

# Checking the token indices
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print("Token IDs:", token_ids)
    

Running the above code will produce output similar to the following. Note that BART uses a byte-level BPE tokenizer (the same family as GPT-2 and RoBERTa), so word-initial tokens are prefixed with 'Ġ' rather than the '▁' marker used by SentencePiece tokenizers:


Tokenization result: ['Deep', 'Ġlearning', 'Ġis', 'Ġa', 'Ġfield', 'Ġof', 'Ġartificial', 'Ġintelligence', '.']
Token IDs: [a matching list of integer vocabulary indices, one per token]

5. The Decoding Process

After tokenization, you can restore the original sentence from the token IDs through the decoding process; the same step is used to turn the model’s generated output back into text.
The following code demonstrates how to decode the indices back into the original sentence.


# Decoding the token IDs to the original sentence
decoded_text = tokenizer.decode(token_ids)
print("Decoding result:", decoded_text)
    

This allows us to recover the original sentence.
This process demonstrates how to prepare input that the model can understand and restore it back to its original form.
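
One detail worth noting: tokenize() and convert_tokens_to_ids() do not add BART’s special tokens, whereas calling the tokenizer directly does, and the latter is the form the model actually expects as input. A short illustration:

# Calling the tokenizer directly adds the <s> and </s> special tokens
encoded = tokenizer(text)
print(encoded["input_ids"])                                              # begins with the <s> ID and ends with the </s> ID
print(tokenizer.decode(encoded["input_ids"]))                            # decoded text including the special tokens
print(tokenizer.decode(encoded["input_ids"], skip_special_tokens=True))  # the original sentence only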

6. Text Summarization Using BART

After tokenization, the next step is to use the BART model to summarize the input text.
Users can provide input text to the model and obtain summarized results.
Below is a simple example of text summarization using BART.


from transformers import BartForConditionalGeneration

# Load the BART model
model = BartForConditionalGeneration.from_pretrained('facebook/bart-base')

# Input text
input_text = "Artificial intelligence refers to tasks performed by machines that imitate human intelligence. This technology has made remarkable advancements in recent years."

# Tokenize the text and convert to indices
input_ids = tokenizer.encode(input_text, return_tensors='pt')

# Generate the summary
summary_ids = model.generate(input_ids, max_length=50, num_beams=5, early_stopping=True)

# Decode the summary result
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summary result:", summary)
    

Running the above code will generate a summary of the input text.
The BART model has the ability to understand the input sentences and transform them into a concise format.
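
Because facebook/bart-base is a general-purpose checkpoint that has not been fine-tuned for summarization, its summaries can be rough. A checkpoint fine-tuned on a summarization corpus, such as facebook/bart-large-cnn, usually produces better results; a sketch of the same code with that checkpoint:

from transformers import BartTokenizer, BartForConditionalGeneration

# Summarization-specific checkpoint (fine-tuned on CNN/DailyMail)
cnn_tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
cnn_model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

input_ids = cnn_tokenizer.encode(input_text, return_tensors='pt')
summary_ids = cnn_model.generate(input_ids, max_length=50, num_beams=5, early_stopping=True)
print("Summary result:", cnn_tokenizer.decode(summary_ids[0], skip_special_tokens=True))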

7. Conclusion

In this article, we covered the text tokenization process using the Hugging Face BART model and provided a simple summarization example.
Transformer models, including BART, exhibit excellent performance in effectively performing various tasks in natural language processing.
Recognizing the importance of tokenization in the data preprocessing process can enhance the learning efficiency of the model.
In the next article, we will discuss use cases of BART and additional application methods.

Thank you!

Training Course on Utilizing Hugging Face Transformers, BERT Ensemble Learning and Prediction Using Training Datasets

Progress in Natural Language Processing (NLP) within deep learning has been driven by a series of innovative models.
One of them is BERT (Bidirectional Encoder Representations from Transformers).
BERT is exceptionally strong at understanding context and achieves state-of-the-art performance on NLP tasks
such as text classification, question answering, and sentiment analysis. In this course, we will explore how to train
an ensemble of BERT models using Hugging Face’s Transformers library and how to combine their predictions.

1. Understanding the BERT Model

BERT is a pre-trained language model based on the Transformer encoder.
Unlike left-to-right language models, it encodes text bidirectionally, so it can use context on both sides of each word.
The BERT model is pre-trained with two main tasks: Masked Language Modeling and Next Sentence Prediction.

1.1 Masked Language Model

In the masked language model, some words in the input sentence are masked, and
the model is trained to predict the masked words.
This helps to understand the meaning of words based on context.
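
As a quick illustration of this objective, the fill-mask pipeline lets a pre-trained BERT fill in a masked word; the example sentence below is arbitrary:

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The movie was absolutely [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))  # top predictions with their probabilities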

1.2 Next Sentence Prediction

In this task, two sentences are input to determine if they are consecutive sentences or not.
This helps to understand the relationship between sentences.

2. Introduction to Hugging Face Transformers

Hugging Face’s Transformers library is a framework that enables easy access to various NLP models worldwide.
This library offers various utilities for model loading, data processing, training, and prediction.
In particular, it has an interface that makes it easy to use BERT and other Transformer-based models.

3. Data Preparation

In this example, we will use the IMDB movie review dataset to build a model that predicts the sentiment of movie reviews (positive/negative).
We will utilize a publicly available dataset.
First, let’s examine the process of downloading and preprocessing the dataset.

3.1 Downloading and Preprocessing the Dataset

import os
import pandas as pd
from sklearn.model_selection import train_test_split

# Download and extract the IMDB dataset
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
!wget {url} -O aclImdb_v1.tar.gz
!tar -xvf aclImdb_v1.tar.gz

# The archive contains one text file per review (aclImdb/train/pos, aclImdb/train/neg, ...),
# not CSV files, so we build DataFrames by reading the files directly.
def load_split(split_dir):
    rows = []
    for label_name, label in [("pos", 1), ("neg", 0)]:
        folder = os.path.join(split_dir, label_name)
        for fname in os.listdir(folder):
            with open(os.path.join(folder, fname), encoding="utf-8") as f:
                rows.append({"review": f.read(), "label": label})
    return pd.DataFrame(rows)

train_data = load_split("aclImdb/train")
test_data = load_split("aclImdb/test")

# Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(train_data['review'], train_data['label'], 
                                                    test_size=0.2, random_state=42)

4. Loading and Training the BERT Model

Now we are ready to load and train the BERT model.
The Hugging Face Transformers library allows us to easily use the BERT model.
First, we will load the model and tokenizer, and then transform the dataset into the BERT input format.

4.1 Loading the Model and Tokenizer

from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

4.2 Tokenizing the Dataset

# Convert dataset to BERT input format
def tokenize_data(texts):
    return tokenizer(texts.tolist(), padding=True, truncation=True, return_tensors='pt')

train_encodings = tokenize_data(X_train)
test_encodings = tokenize_data(X_test)

5. Model Ensemble Learning

Model ensemble is a method of combining multiple models to achieve better performance.
We will train multiple models based on BERT and combine their predictions to derive the final result.
Below is the code to implement model ensemble.

5.1 Defining Training and Prediction Functions

def train_and_evaluate(model, train_encodings, labels):
    # Model training logic (simplified: the whole training set is fed as a single batch,
    # which is only feasible for small subsets; see the mini-batch sketch below)
    inputs = {'input_ids': train_encodings['input_ids'],
              'attention_mask': train_encodings['attention_mask'],
              'labels': torch.tensor(labels.tolist())}
    
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    model.train()
    
    for epoch in range(3):  # Training for several epochs
        outputs = model(**inputs)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        print(f'Epoch: {epoch + 1}, Loss: {loss.item()}')

def predict(model, test_encodings):
    model.eval()
    with torch.no_grad():
        outputs = model(**test_encodings)  # again a single large pass; batch this in practice
        logits = outputs.logits
    return logits.argmax(dim=1)
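
Feeding the entire training set as one batch will exhaust memory on anything but a small subset. Below is a minimal mini-batch variant of the training function; the batch size of 16 and the manual slicing are illustrative choices (a DataLoader, as in the previous course, works just as well):

def train_in_batches(model, encodings, labels, batch_size=16, epochs=3):
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    label_tensor = torch.tensor(labels.tolist())
    n = label_tensor.size(0)
    model.train()
    for epoch in range(epochs):
        total_loss, num_batches = 0.0, 0
        for start in range(0, n, batch_size):
            end = start + batch_size
            outputs = model(input_ids=encodings['input_ids'][start:end],
                            attention_mask=encodings['attention_mask'][start:end],
                            labels=label_tensor[start:end])
            outputs.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            total_loss += outputs.loss.item()
            num_batches += 1
        print(f'Epoch: {epoch + 1}, Loss: {total_loss / num_batches}')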

5.2 Running the Model Ensemble

# List of models to ensemble
models = [BertForSequenceClassification.from_pretrained('bert-base-uncased') for _ in range(5)]
predictions = []

for model in models:
    train_and_evaluate(model, train_encodings, y_train)
    preds = predict(model, test_encodings)
    predictions.append(preds)

# Ensemble the prediction results (averaging the 0/1 votes and rounding amounts to a majority vote across the five models)
final_preds = torch.stack(predictions).float().mean(dim=0).round().long()  # .float() is required because mean() does not support integer tensors

6. Result Analysis and Evaluation

We will evaluate the model’s performance based on the final prediction results.
Let’s calculate accuracy and visualize the confusion matrix to analyze the model’s prediction performance.

6.1 Performance Evaluation

from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

# Performance evaluation
accuracy = accuracy_score(y_test, final_preds)
print(f'Accuracy: {accuracy * 100:.2f}%')

# Display confusion matrix
cm = confusion_matrix(y_test, final_preds)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()

7. Conclusion

In this course, we explored how to train an ensemble of BERT models using Hugging Face’s Transformers library.
We confirmed that BERT is a powerful model and that ensemble techniques can further enhance the model’s predictive performance.
We encourage you to utilize BERT in various NLP tasks and take the next steps forward.


Using Hugging Face Transformers Course, BERT Ensemble Learning and Prediction Beyond the Training Dataset

In this article, we will discuss how to perform ensemble learning using the BERT model provided by the Hugging Face Transformers library, and how this can improve prediction performance. Ensemble learning is a technique that aims to achieve better performance by combining the prediction results of several models. This tutorial will detail the process of implementing an ensemble by combining various BERT models.

1. Basics of Ensemble Learning

Ensemble learning is a method that combines multiple models to create the final prediction result. This approach leverages the strengths of each model to enhance the overall model performance. Common ensemble methods include the following techniques:

  • Bagging: Independently trains multiple models and improves performance by averaging the final prediction results.
  • Boosting: Increases the weights of the data that previous models mispredicted to train the next model.
  • Stacking: Uses the predictions of various models as new features to train a meta model for the final prediction.

2. Introduction to BERT Model

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model based on Transformers that demonstrates excellent performance across various natural language processing (NLP) tasks. Two features of BERT are:

  • Bidirectionality: BERT learns context from both directions to understand word meanings more accurately.
  • Pre-training: It is pre-trained on vast amounts of data, making it capable of handling various tasks with just fine-tuning.

3. Preparing Data

The dataset prepared for ensemble learning should ideally address a simple natural language processing problem. For example, we will classify sentiment (positive/negative) using movie review data.

First, install the Hugging Face library and necessary packages:

!pip install transformers datasets torch scikit-learn

Loading the Dataset

Next, we will load the dataset using Hugging Face’s datasets library:

from datasets import load_dataset

dataset = load_dataset('imdb')
train_data = dataset['train']
test_data = dataset['test']

4. Model Setup

In this example, we will create two variant models based on the BERT model. This will allow us to achieve an ensemble effect. First, let’s write a function to load the BERT model:

from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments

# Load BERT model and tokenizer
def load_model_and_tokenizer(model_name='bert-base-uncased'):
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertForSequenceClassification.from_pretrained(model_name)
    return tokenizer, model

5. Data Preprocessing

Let’s explain the process of preprocessing the text data for use as model input:

import torch

# Instantiate the tokenizer before using it (the helper above only defines how to load it)
tokenizer, _ = load_model_and_tokenizer()

def preprocess_data(dataset, tokenizer, max_len=128):
    inputs = tokenizer(dataset['text'], padding=True, truncation=True, max_length=max_len, return_tensors='pt')
    inputs['labels'] = torch.tensor(dataset['label'])
    return inputs

train_inputs = preprocess_data(train_data, tokenizer)
test_inputs = preprocess_data(test_data, tokenizer)
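
Note that preprocess_data returns a plain dictionary of tensors, while the Trainer used in the next step expects a dataset that yields one example per index. A minimal wrapper is sketched below under that assumption; DictDataset is a helper name introduced here, not part of the library:

from torch.utils.data import Dataset

class DictDataset(Dataset):
    """Wraps a dict of equally sized tensors so each index yields one example."""
    def __init__(self, encoded):
        self.encoded = encoded

    def __getitem__(self, idx):
        return {key: val[idx] for key, val in self.encoded.items()}

    def __len__(self):
        return len(self.encoded['labels'])

# Only the training inputs need wrapping; the test inputs are used directly for batched inference later.
train_inputs = DictDataset(train_inputs)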

6. Model Training

It is now time to train the model. We will use the previously loaded model and preprocessed data for training:

def train_model(model, train_inputs):
    training_args = TrainingArguments(
        output_dir='./results',
        num_train_epochs=3,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=64,
        # per-epoch evaluation is omitted because no eval_dataset is passed to the Trainer below
        logging_dir='./logs',
    )
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_inputs,
    )
    trainer.train()

# Train the model
model1 = load_model_and_tokenizer()[1]
train_model(model1, train_inputs)

7. Training the Ensemble Model

Now we add a second model to perform ensemble learning. Here we simply load a different pre-trained checkpoint (bert-large-uncased); alternatively, the same architecture could be reused with a different random seed or different hyperparameters:

# bert-large-uncased shares the same WordPiece vocabulary as bert-base-uncased, so the tokenized inputs can be reused
model2 = load_model_and_tokenizer(model_name='bert-large-uncased')[1]
train_model(model2, train_inputs)

8. Ensemble Prediction

We ensemble the model prediction results to generate the final output. We average the predictions from the two models to obtain the final prediction:

import numpy as np

def ensemble_predict(models, inputs):
    # Note: this runs the entire test set through each model in a single forward pass,
    # which assumes it fits into memory; for the full IMDB test set, iterate in batches instead.
    preds = []
    for model in models:
        model.eval()
        with torch.no_grad():
            outputs = model(**inputs)
            preds.append(outputs.logits.cpu().numpy())
    
    # Soft voting: average the logits of all models
    ensemble_preds = np.mean(preds, axis=0)
    return ensemble_preds

models = [model1, model2]
predictions = ensemble_predict(models, test_inputs)

9. Performance Evaluation

Now we evaluate the performance of the ensemble model. Metrics such as accuracy or F1 score can be used:

from sklearn.metrics import accuracy_score, f1_score

# Retrieve the ground truth labels
labels = test_data['label']

# Calculate metrics based on ensemble predictions and labels
predicted_labels = np.argmax(predictions, axis=1)

accuracy = accuracy_score(labels, predicted_labels)
f1 = f1_score(labels, predicted_labels)

print(f'Accuracy: {accuracy:.4f}, F1 Score: {f1:.4f}')

10. Conclusion and Future Work

Through this tutorial, we learned about ensemble learning methods using the BERT model. We explored how combining the predictions of multiple models can improve performance. Future work may include:

  • Ensemble using more models
  • Improving preprocessing and data augmentation
  • Optimizing performance through hyperparameter tuning

Ensemble learning continues to be a promising method in the field of deep learning, achieving higher accuracy by mixing various models. As mentioned earlier, various experiments can be conducted to enhance performance using multiple BERT models.
