Deep Learning for Natural Language Processing: Classifying Naver Movie Reviews using KoBERT

In recent years, with the rapid advancement of artificial intelligence (AI) technologies, significant progress has been made in the field of natural language processing (NLP). Deep learning-based models in particular have shown excellent performance in language understanding and generation. In this article, we will walk through classifying Naver movie reviews with KoBERT, a Korean-language model based on BERT (Bidirectional Encoder Representations from Transformers).

1. Project Overview

The goal of this project is to classify user reviews of Naver movies as positive or negative. Along the way, readers can pick up the basic concepts of natural language processing and learn how to use the KoBERT model, while gaining hands-on experience with data preprocessing and model training.

2. Introduction to KoBERT

KoBERT is a version of Google's BERT pre-trained specifically for the Korean language. BERT is pre-trained with two objectives. The first is the Masked Language Model (MLM), in which some words in a sentence are randomly masked and the model learns to predict them. The second is Next Sentence Prediction (NSP), in which the model determines whether the second of two given sentences actually follows the first. This pre-train-then-fine-tune transfer learning approach has proven effective across many natural language processing tasks.
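To get a feel for the masked language modeling objective, here is a minimal sketch using the Hugging Face fill-mask pipeline. It assumes the kykim/bert-kor-base checkpoint used later in this article ships a masked-language-modeling head and uses [MASK] as its mask token.

from transformers import pipeline

# Fill-mask pipeline with the Korean BERT checkpoint used later in this article
fill_mask = pipeline('fill-mask', model='kykim/bert-kor-base')

# Ask the model to fill in the blank: "이 영화 정말 [MASK]." ("This movie is really [MASK].")
for candidate in fill_mask('이 영화 정말 [MASK].')[:3]:
    print(candidate['token_str'], candidate['score'])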

3. Data Preparation

In this project, we will use Naver movie review data. The dataset consists of user reviews of movies along with a positive or negative label for each review. The data is provided in CSV format; after installing the necessary libraries (pandas, torch, transformers, and scikit-learn), we load it as follows.

import pandas as pd

# Load the dataset
df = pd.read_csv('naver_movie_reviews.csv')
df.head()

Each row of the dataset contains a movie review and its corresponding sentiment label, stored in the review and label columns used below. Before analyzing this data, we need to apply some preprocessing.
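
Before preprocessing, it is worth a quick sanity check on the data; a small sketch, assuming the review and label column names above:

# Drop rows with missing reviews and check the class balance
df = df.dropna(subset=['review'])
print(df['label'].value_counts())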

4. Data Preprocessing

Data preprocessing is a crucial step in machine learning. To convert review texts into a format suitable for the model, the following tasks are performed:

  • Removing Stop Words: Eliminate common words that carry little meaning.
  • Tokenization: Split sentences into tokens the model can consume.
  • Normalization: Standardize words with similar meanings.

In practice, when fine-tuning a BERT-style model, the pretrained subword tokenizer takes care of tokenization, and explicit stop-word removal is often unnecessary. The code below loads the tokenizer and splits the data into training and test sets.

from sklearn.model_selection import train_test_split
from transformers import BertTokenizer

# Load KoBERT tokenizer
tokenizer = BertTokenizer.from_pretrained('kykim/bert-kor-base')

# Separate review texts and labels
sentences = df['review'].values
labels = df['label'].values

# Split into training and testing data
X_train, X_test, y_train, y_test = train_test_split(sentences, labels, test_size=0.1, random_state=42)
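
To see what the tokenizer produces, we can tokenize a single example (the sentence here is purely illustrative):

# Inspect the subword tokens for one sample review
print(tokenizer.tokenize('이 영화 정말 재미있어요'))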

5. Define Dataset Class

To train the KoBERT model using PyTorch, we define a dataset class. This class serves to transform the input data into a format that the model can process.

import torch
from torch.utils.data import Dataset

class NaverMovieDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'labels': torch.tensor(label, dtype=torch.long)
        }
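
With the class defined, the training and test splits from the previous section can be wrapped as datasets:

# Wrap the splits from Section 4 into PyTorch datasets
train_dataset = NaverMovieDataset(X_train, y_train, tokenizer)
test_dataset = NaverMovieDataset(X_test, y_test, tokenizer)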

6. Define Class for Model Training, Evaluation, and Prediction

To keep the code organized, we define a single class that handles training, evaluation, and prediction.

import torch
# Note: AdamW is imported from torch.optim; the transformers re-export is deprecated
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import BertForSequenceClassification
from sklearn.metrics import classification_report

class KoBERTSentimentClassifier:
    def __init__(self, model_name='kykim/bert-kor-base', num_labels=2, learning_rate=1e-5):
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model = BertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels).to(self.device)
        self.optimizer = AdamW(self.model.parameters(), lr=learning_rate)

    def train(self, train_dataset, batch_size=16, epochs=3):
        train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
        self.model.train()
        for epoch in range(epochs):
            total_loss = 0.0
            for batch in train_dataloader:
                self.optimizer.zero_grad()
                input_ids = batch['input_ids'].to(self.device)
                attention_mask = batch['attention_mask'].to(self.device)
                labels = batch['labels'].to(self.device)
                outputs = self.model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
                loss = outputs.loss
                loss.backward()
                self.optimizer.step()
                total_loss += loss.item()
            # Report the average loss once per epoch rather than after every batch
            print(f"Epoch: {epoch + 1}, Average loss: {total_loss / len(train_dataloader):.4f}")

    def evaluate(self, test_dataset, batch_size=16):
        test_dataloader = DataLoader(test_dataset, batch_size=batch_size)
        self.model.eval()
        predictions, true_labels = [], []
        with torch.no_grad():
            for batch in test_dataloader:
                input_ids = batch['input_ids'].to(self.device)
                attention_mask = batch['attention_mask'].to(self.device)
                labels = batch['labels'].to(self.device)
                outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
                logits = outputs.logits
                predictions.extend(torch.argmax(logits, dim=1).cpu().numpy())
                true_labels.extend(labels.cpu().numpy())
        print(classification_report(true_labels, predictions))

    def predict(self, texts, tokenizer, max_length=128):
        self.model.eval()
        inputs = tokenizer(
            texts,
            truncation=True,
            padding='max_length',
            max_length=max_length,
            return_tensors='pt'
        ).to(self.device)
        with torch.no_grad():
            outputs = self.model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'])
            predictions = torch.argmax(outputs.logits, dim=1)
        return predictions.cpu().numpy()
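
Putting everything together, a full run might look like the following (the sample reviews are illustrative, and label 1 is assumed to mean positive as in the dataset):

# Train, evaluate, and run inference on unseen reviews
classifier = KoBERTSentimentClassifier()
classifier.train(train_dataset, batch_size=16, epochs=3)
classifier.evaluate(test_dataset)

new_reviews = ['배우들의 연기가 정말 훌륭했어요', '시간이 아까운 영화였습니다']
print(classifier.predict(new_reviews, tokenizer))  # e.g., array([1, 0])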

7. Conclusion

In this article, we explored the process of classifying Naver movie reviews using KoBERT. I hope that working through data preprocessing, dataset construction, and fine-tuning a deep learning-based language model has offered a good opportunity to become familiar with the fundamentals of natural language processing. With this foundation in place, you can now move on to a variety of NLP projects of your own.