Natural Language Processing Using Deep Learning: Classifying Naver Movie Reviews with KoBERT
In recent years, the rapid advancement of artificial intelligence (AI) has driven significant progress in natural language processing (NLP). Deep learning-based models in particular have shown excellent performance in language understanding and generation. In this article, we will walk through classifying Naver movie reviews using KoBERT, a Korean-optimized model based on BERT (Bidirectional Encoder Representations from Transformers).
1. Project Overview
The goal of this project is to classify Naver movie reviews as positive or negative based on the review text. Along the way, you will learn the basic concepts of natural language processing and how to use the KoBERT model, while gaining hands-on experience with data preprocessing and model training.
2. Introduction to KoBERT
KoBERT is a variant of Google's BERT that has been pretrained on a large Korean corpus. BERT is pretrained with two objectives. The first is the 'Masked Language Model' (MLM): some tokens in a sentence are randomly masked, and the model learns to predict them. The second is 'Next Sentence Prediction' (NSP): given two sentences, the model determines whether the second actually follows the first. Fine-tuning such a pretrained model, a form of transfer learning, has proven effective across many natural language processing tasks.
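To make the MLM objective concrete, here is a minimal sketch using the Hugging Face fill-mask pipeline. It assumes the 'kykim/bert-kor-base' checkpoint used later in this article ships with a masked-LM head; any Korean BERT MLM checkpoint would work the same way.

from transformers import pipeline
# Illustrative sketch of the Masked Language Model objective (assumed setup):
# the model predicts the token hidden behind [MASK].
fill_mask = pipeline('fill-mask', model='kykim/bert-kor-base')
print(fill_mask('이 영화 정말 [MASK]'))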
3. Data Preparation
In this project, we will use Naver movie review data. This dataset consists of user reviews of movies along with corresponding positive or negative labels for those reviews. The data is provided in CSV format, and we will prepare the dataset after installing the necessary libraries.
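Assuming a standard Python environment, something like the following pip command covers every library used below:

pip install pandas scikit-learn torch transformers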
import pandas as pd
# Load the dataset
df = pd.read_csv('naver_movie_reviews.csv')
df.head()
Each row of the dataset contains a movie review and its corresponding sentiment label (the review and label columns). This data needs some preprocessing before it can be analyzed.
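Before moving on, it is worth checking the label distribution and looking for missing values. A quick, illustrative check (column names as assumed above):

# Inspect class balance and missing values
print(df['label'].value_counts())
print(df.isnull().sum())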
4. Data Preprocessing
Data preprocessing is a crucial step in machine learning. To convert the review texts into a form the model can consume, the following tasks are performed (a minimal cleaning sketch follows the list):
- Removing Stop Words: Eliminate common words that carry little meaning.
- Tokenization: Split sentences into tokens.
- Normalization: Map variant forms with the same meaning to a single standard form.
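As a minimal, illustrative example of the cleaning step (not part of the original pipeline), the snippet below strips non-Korean, non-alphanumeric characters and collapses whitespace; a real project would plug in a Korean morphological analyzer and a stop-word list here.

import re

# Illustrative cleaning function (assumed, not from the original code):
# keep Korean syllables, Latin letters, digits, and spaces, then collapse whitespace.
def clean_text(text):
    text = re.sub(r'[^가-힣a-zA-Z0-9\s]', ' ', str(text))
    return re.sub(r'\s+', ' ', text).strip()

df['review'] = df['review'].apply(clean_text)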
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer
# Load the tokenizer for the Korean BERT checkpoint used throughout this article
tokenizer = BertTokenizer.from_pretrained('kykim/bert-kor-base')
# Separate review texts and labels
sentences = df['review'].values
labels = df['label'].values
# Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(sentences, labels, test_size=0.1, random_state=42)
5. Define Dataset Class
To train the KoBERT model using PyTorch, we define a dataset class. This class serves to transform the input data into a format that the model can process.
import torch
from torch.utils.data import Dataset

class NaverMovieDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        # Tokenize a single review, truncating or padding to a fixed length
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        # squeeze(0) drops the batch dimension added by return_tensors='pt'
        return {
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'labels': torch.tensor(label, dtype=torch.long)
        }
6. Define Class for Model Training, Evaluation, and Prediction
To keep the code organized, we define a single class that handles training, evaluation, and prediction.
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import BertForSequenceClassification
from sklearn.metrics import classification_report

class KoBERTSentimentClassifier:
    def __init__(self, model_name='kykim/bert-kor-base', num_labels=2, learning_rate=1e-5):
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model = BertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels).to(self.device)
        self.optimizer = AdamW(self.model.parameters(), lr=learning_rate)

    def train(self, train_dataset, batch_size=16, epochs=3):
        train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
        self.model.train()
        for epoch in range(epochs):
            for batch in train_dataloader:
                self.optimizer.zero_grad()
                input_ids = batch['input_ids'].to(self.device)
                attention_mask = batch['attention_mask'].to(self.device)
                labels = batch['labels'].to(self.device)
                # Passing labels makes the model return the cross-entropy loss
                outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
                loss = outputs.loss
                loss.backward()
                self.optimizer.step()
            # Note: this prints only the loss of the last batch in each epoch
            print(f"Epoch: {epoch + 1}, Loss: {loss.item():.4f}")

    def evaluate(self, test_dataset, batch_size=16):
        test_dataloader = DataLoader(test_dataset, batch_size=batch_size)
        self.model.eval()
        predictions, true_labels = [], []
        with torch.no_grad():
            for batch in test_dataloader:
                input_ids = batch['input_ids'].to(self.device)
                attention_mask = batch['attention_mask'].to(self.device)
                labels = batch['labels'].to(self.device)
                outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
                logits = outputs.logits
                predictions.extend(torch.argmax(logits, dim=1).cpu().numpy())
                true_labels.extend(labels.cpu().numpy())
        print(classification_report(true_labels, predictions))

    def predict(self, texts, tokenizer, max_length=128):
        self.model.eval()
        inputs = tokenizer(
            texts,
            truncation=True,
            padding='max_length',
            max_length=max_length,
            return_tensors='pt'
        ).to(self.device)
        with torch.no_grad():
            outputs = self.model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'])
            predictions = torch.argmax(outputs.logits, dim=1)
        return predictions.cpu().numpy()
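Putting it all together, a typical run looks like the following sketch (the example review string is illustrative):

classifier = KoBERTSentimentClassifier()
classifier.train(train_dataset)
classifier.evaluate(test_dataset)

# Predict the sentiment of a new review
# (assuming the convention 0 = negative, 1 = positive)
print(classifier.predict(['이 영화 정말 재미있어요!'], tokenizer))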
7. Conclusion
In this article, we explored the process of classifying Naver movie reviews using KoBERT. I hope that learning how to process text data with a deep learning-based NLP model has given you a solid grasp of the fundamentals of natural language processing, as well as a foundation on which to build further NLP projects.