Hugging Face Transformers Course: Preprocessing with Regular Expressions

With recent advances in artificial intelligence and machine learning, deep learning is being applied across many fields. In Natural Language Processing (NLP) in particular, the Hugging Face Transformers library has made it easy to work with a wide range of models. In this course, we explain data preprocessing techniques based on regular expressions in detail, alongside a document classification example built with Hugging Face Transformers.

1. What is Hugging Face Transformers?

Hugging Face Transformers is a Python library that provides a wide range of deep learning models commonly used in Natural Language Processing (NLP). It includes many state-of-the-art models such as BERT, GPT-2, and T5, exposed through a simple, consistent API so that users can access and apply them easily. Because it fits naturally into the Python data science ecosystem, it is widely used by data scientists and researchers.
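
To get a feel for how accessible the library is, here is a minimal sketch using the high-level pipeline API (the default model it downloads may vary by library version):

from transformers import pipeline

# Create a sentiment-analysis pipeline; a pretrained model is downloaded on first use
classifier = pipeline('sentiment-analysis')

result = classifier('Hugging Face Transformers makes NLP easy!')
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]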

2. The Importance of Regular Expressions and Preprocessing

Regular expressions are a very useful tool for finding or transforming specific patterns in strings. By using regular expressions to remove unnecessary characters and perform pattern matching before feeding data into a model, you can improve the quality of the data. Preprocessing directly affects model performance, so it deserves careful attention.
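
As a small standalone illustration (the sample string is made up for this example and is not part of the course dataset), re.sub can strip URLs and collapse runs of whitespace:

import re

raw = "Check this out!!!   https://example.com   Great product :)"
# Remove URLs
no_urls = re.sub(r'https?://\S+', '', raw)
# Collapse runs of whitespace into single spaces
clean = re.sub(r'\s+', ' ', no_urls).strip()
print(clean)  # Check this out!!! Great product :)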

3. Environment Setup

First, we will install Hugging Face Transformers and the other libraries used in this course. Note that re (the regular expressions module) is part of the Python standard library and does not need to be installed. Run the command below to install the libraries:

pip install transformers torch pandas scikit-learn matplotlib seaborn

4. Preparing the Data

In this example, we will use a small toy dataset for sentiment analysis. The data consists of sentences expressing positive and negative sentiment.

import pandas as pd

data = {
    "text": [
        "This product is really good!",
        "Not great. I was very disappointed.",
        "It's not a bad product.",
        "I hope for a refund.",
        "It really exceeded my expectations!",
    ],
    "label": [1, 0, 1, 0, 1]  # 1: positive, 0: negative
}

df = pd.DataFrame(data)
print(df)

5. Data Preprocessing Using Regular Expressions

Next, we will perform data preprocessing using regular expressions. For example, we will remove special characters and numbers and convert all characters to lowercase. Keep in mind that aggressive cleaning can discard useful signal (in sentiment analysis, punctuation such as '!' often carries emotion), so tailor the rules to your task.

import re

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Keep only lowercase Latin letters, Korean characters (가-힣), and whitespace,
    # which removes special characters and digits
    text = re.sub(r'[^a-z가-힣\s]', '', text)
    return text

df['cleaned_text'] = df['text'].apply(preprocess_text)
print(df[['text', 'cleaned_text']])
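
As a quick check of the function on one of the sentences:

print(preprocess_text("This product is really good!"))
# -> this product is really good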

6. Training the Model Using Hugging Face Transformers

After preprocessing is complete, we will fine-tune a transformer model for sentiment analysis. Below is example code using the BERT model. (With only five sentences, this dataset is far too small for meaningful training; it is used purely to illustrate the workflow.)

from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import torch
from sklearn.model_selection import train_test_split

# Split the data (with this toy five-sentence dataset, the test split holds a single example)
X_train, X_test, y_train, y_test = train_test_split(df['cleaned_text'], df['label'], test_size=0.2, random_state=42)

# Load BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Tokenize the data; we return plain Python lists here so the Dataset class
# below can convert individual examples to tensors itself
train_encodings = tokenizer(X_train.tolist(), padding=True, truncation=True)
test_encodings = tokenizer(X_test.tolist(), padding=True, truncation=True)

# Define a PyTorch dataset that wraps the tokenized encodings and labels
class TextDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # Convert the idx-th example's token ids and attention mask to tensors
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Prepare the dataset
train_dataset = TextDataset(train_encodings, y_train.tolist())
test_dataset = TextDataset(test_encodings, y_test.tolist())

# Set training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=0,  # a typical value like 500 would keep the LR near zero here, since this toy run has only a few steps
    weight_decay=0.01,
    logging_dir='./logs',
)

# Define the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

# Train the model
trainer.train()
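
Once training finishes, you will often want to persist the fine-tuned model and tokenizer for later reuse. A minimal sketch, assuming an arbitrary output directory name:

# Save the fine-tuned weights and tokenizer (the directory name is just an example)
trainer.save_model('./sentiment-bert')
tokenizer.save_pretrained('./sentiment-bert')

# They can be reloaded later with:
# model = BertForSequenceClassification.from_pretrained('./sentiment-bert')
# tokenizer = BertTokenizer.from_pretrained('./sentiment-bert')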

7. Model Evaluation

After model training is complete, you can evaluate the model's performance. Below we calculate accuracy and visualize the confusion matrix. (With the single test example from our toy split, the numbers are purely illustrative.)

from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Perform predictions
predictions = trainer.predict(test_dataset)
preds = predictions.predictions.argmax(-1)

# Calculate accuracy
accuracy = accuracy_score(y_test, preds)
print(f'Accuracy: {accuracy:.2f}')

# Visualize the confusion matrix
cm = confusion_matrix(y_test, preds, labels=[0, 1])  # pin the label order so the matrix is always 2x2
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix')
plt.show()
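
Finally, to apply the fine-tuned model to new text, preprocess and tokenize it the same way as the training data and take the argmax over the output logits. A minimal sketch (the example sentence is made up for illustration):

# Classify a new sentence with the fine-tuned model
model.eval()
inputs = tokenizer(preprocess_text("I absolutely love this product!"), return_tensors='pt')
inputs = {k: v.to(model.device) for k, v in inputs.items()}  # match the model's device
with torch.no_grad():
    logits = model(**inputs).logits
pred = logits.argmax(-1).item()
print('positive' if pred == 1 else 'negative')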

8. Conclusion

In this course, we showed how to build a basic sentiment analysis model with the Hugging Face Transformers library, and how cleaning data with regular-expression preprocessing supports better results from transformer models. From here, it is worth continuing with projects that exercise a wider range of natural language processing techniques.

Thank you!