With recent advances in artificial intelligence and machine learning, deep learning is now applied in a wide range of fields. In Natural Language Processing (NLP) in particular, the Hugging Face Transformers library has made it easy to work with a variety of models. In this course, we explain data preprocessing with regular expressions in detail, alongside a document classification example built with Hugging Face Transformers.
1. What is Hugging Face Transformers?
Hugging Face Transformers is a Python library that provides the deep learning models most commonly used in Natural Language Processing (NLP). It includes many of the latest models, such as BERT, GPT-2, and T5, and is designed so that users can access and apply them easily, which has made it a staple tool for data scientists and researchers.
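To get a feel for how little code the library requires, here is a minimal sketch using its pipeline API. Note that it downloads a default pretrained English sentiment model on first run, so the exact model and score may vary by version:

from transformers import pipeline

# Downloads and caches a default sentiment model on first use
classifier = pipeline('sentiment-analysis')
print(classifier('This product is really good!'))
# e.g. [{'label': 'POSITIVE', 'score': 0.9998...}]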
2. The Importance of Regular Expressions and Preprocessing
Regular expressions are a very useful tool for finding or transforming specific patterns in strings. Removing unnecessary characters and matching patterns with regular expressions before feeding data into a model improves data quality, and because preprocessing directly affects model performance, it deserves careful attention.
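As a quick illustration (a toy example, separate from the preprocessing we build later), Python's built-in re module can both find and remove patterns:

import re

text = 'Order #123: Great!!! Visit https://example.com'

# Find all runs of digits
print(re.findall(r'\d+', text))  # ['123']

# Remove URLs first, then any remaining punctuation
text = re.sub(r'https?://\S+', '', text)
text = re.sub(r'[^\w\s]', '', text)
print(text.strip())  # Order 123 Great Visit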
3. Environment Setup
First, install Hugging Face Transformers along with the other libraries used in this course. Note that re (regular expressions) is part of Python's standard library, so it does not need to be installed separately:
pip install transformers torch pandas scikit-learn matplotlib seaborn
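If you want to confirm the installation succeeded, a quick version check (version numbers will differ by environment) is:

import transformers
import torch

print(transformers.__version__)
print(torch.__version__)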
4. Preparing the Data
In this example, we will use a simple dataset for sentiment analysis. The data consists of sentences that represent positive and negative sentiments.
import pandas as pd

data = {
    "text": [
        "This product is really good!",
        "Not great. I was very disappointed.",
        "It's not a bad product.",
        "I hope for a refund.",
        "It really exceeded my expectations!",
    ],
    "label": [1, 0, 1, 0, 1],  # 1: positive, 0: negative
}

df = pd.DataFrame(data)
print(df)
5. Data Preprocessing Using Regular Expressions
Next, we preprocess the data using regular expressions. Specifically, we remove special characters and digits and convert all text to lowercase.
import re

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Keep only lowercase letters, Hangul, and whitespace
    text = re.sub(r'[^a-z가-힣\s]', '', text)
    return text
df['cleaned_text'] = df['text'].apply(preprocess_text)
print(df[['text', 'cleaned_text']])
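Note the design of the character class [^a-z가-힣\s]: it keeps lowercase Latin letters, Korean Hangul syllables (가-힣), and whitespace, and deletes everything else, including digits and punctuation. Because lowercasing happens first, uppercase letters survive as their lowercase forms; for English-only data, the 가-힣 range can simply be dropped.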
6. Training the Model Using Hugging Face Transformers
Once preprocessing is complete, we train a sentiment analysis model using a transformer. Below is example code that fine-tunes the BERT model for sequence classification.
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import torch
from sklearn.model_selection import train_test_split
# Split the data
X_train, X_test, y_train, y_test = train_test_split(df['cleaned_text'], df['label'], test_size=0.2, random_state=42)
# Load BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
# Tokenize the data
train_encodings = tokenizer(X_train.tolist(), padding=True, truncation=True, return_tensors='pt')
test_encodings = tokenizer(X_test.tolist(), padding=True, truncation=True, return_tensors='pt')
# Define PyTorch dataset class
class TextDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # The encodings are already PyTorch tensors (return_tensors='pt'),
        # so index them directly rather than re-wrapping with torch.tensor
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)
# Prepare the dataset
train_dataset = TextDataset(train_encodings, y_train.tolist())
test_dataset = TextDataset(test_encodings, y_test.tolist())
# Set training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)
# Define the trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
# Train the model
trainer.train()
7. Model Evaluation
After training is complete, you can evaluate the model. Here we calculate the accuracy and visualize the confusion matrix to analyze how well it performs.
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
# Perform predictions
predictions = trainer.predict(test_dataset)
preds = predictions.predictions.argmax(-1)
# Calculate accuracy
accuracy = accuracy_score(y_test, preds)
print(f'Accuracy: {accuracy:.2f}')
# Visualize the confusion matrix
cm = confusion_matrix(y_test, preds)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix')
plt.show()
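Keep in mind that with only five sentences, the 80/20 split leaves a single held-out example, so the accuracy and confusion matrix here are purely illustrative; meaningful evaluation requires a substantially larger dataset.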
8. Conclusion
In this course, we explained how to build a basic sentiment analysis model with the Hugging Face Transformers library, and we saw how improving data quality through regular expression preprocessing supports good performance from transformer models. We encourage you to keep building projects that draw on the many available natural language processing techniques.
Thank you!