Hugging Face Transformers Tutorial: BERT Ensemble Training and Prediction on a Training Dataset

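Many innovative models have driven progress in natural language processing (NLP), a major area of deep learning.
One of them is BERT (Bidirectional Encoder Representations from Transformers).
BERT is very effective at capturing context and has delivered strong results on a range of NLP tasks,
including text classification, question answering, and sentiment analysis. In this tutorial, we use the Hugging Face Transformers library
to train an ensemble of BERT models and walk through how the ensemble makes predictions.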

1. Understanding the BERT Model

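BERT is a pre-trained language model built on the Transformer architecture.
Rather than reading text in a single direction (left to right or right to left), it encodes each token using the context on both sides, which helps it capture meaning in context.
BERT is pre-trained on two main tasks: the masked language model (MLM) and next sentence prediction (NSP).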

1.1 Masked Language Model

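In the masked language model task, some of the words in the input sentence are masked,
and the model is trained to predict the masked words.
This teaches the model what a word means given its surrounding context.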
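
As a rough illustration of the masking step, the sketch below hides a random subset of tokens and records the originals as prediction targets. The function name `mask_tokens`, the whitespace tokenization, and the masking rates are assumptions made for this example. Actual BERT pre-training operates on WordPiece tokens, selects about 15% of them, and sometimes substitutes a random token or keeps the original instead of inserting `[MASK]`.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, targets); targets maps position -> original token."""
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}
    for i, token in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = token      # the model must recover this token
            masked[i] = MASK_TOKEN  # the model only sees the placeholder
    return masked, targets

tokens = "the movie was surprisingly good and the ending was moving".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

Training then amounts to scoring the model's guesses at the positions stored in `targets`, using the unmasked tokens on both sides as context.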

1.2 Next Sentence Prediction

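In this task, the model receives two sentences and decides whether the second one actually follows the first in the original text.
This teaches the model to understand relationships between sentences.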
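
The way training pairs for this task are built can be sketched as follows. This is a simplified illustration: the function name `make_nsp_pairs` is hypothetical, and negatives are drawn from the same sentence list here, whereas BERT pre-training samples the negative sentence from a different document.

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, is_next) examples from an ordered sentence list."""
    if len(sentences) < 3:
        raise ValueError("need at least 3 sentences to sample a negative example")
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the sentence that really comes next
            pairs.append((sentences[i], sentences[i + 1], 1))
        else:
            # Negative example: any sentence other than the current and the true next one
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), 0))
    return pairs

sentences = [
    "I watched the film last night.",
    "The first act was slow.",
    "The ending made up for it.",
    "I would watch it again.",
]
for pair in make_nsp_pairs(sentences, seed=2):
    print(pair)
```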

2. Introducing Hugging Face Transformers

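The Hugging Face Transformers library is a framework that makes a wide range of NLP models easy to use.
It provides utilities for loading models, processing data, training, and running predictions.
In particular, it offers a simple interface to BERT and other Transformer-based models.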

3. Preparing the Data

In this example, we use the IMDB movie review dataset to build a model that predicts whether a review is positive or negative.
The dataset is publicly available.
Let's start by downloading and preprocessing it.

3.1 Downloading and Preprocessing the Dataset

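from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split

# Download the IMDB dataset (the "!" lines are notebook shell commands)
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"
!wget {url} -O aclImdb_v1.tar.gz
!tar -xzf aclImdb_v1.tar.gz

# The archive contains no CSV files: each review is a separate .txt file under
# aclImdb/{train,test}/{pos,neg}/, so build the DataFrames from those folders.
def load_split(split_dir):
    rows = [{'review': path.read_text(encoding='utf-8'), 'label': label}
            for label, name in ((0, 'neg'), (1, 'pos'))
            for path in sorted(Path(split_dir, name).glob('*.txt'))]
    return pd.DataFrame(rows)

# Load the dataset
train_data = load_split("aclImdb/train")
test_data = load_split("aclImdb/test")

# Hold out 20% of the training data for evaluation
X_train, X_test, y_train, y_test = train_test_split(train_data['review'], train_data['label'],
                                                    test_size=0.2, random_state=42)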

4. Loading and Training the BERT Model

We are now ready to load the BERT model and train it.
The Hugging Face Transformers library makes BERT straightforward to use.
We first load the model and tokenizer, then convert the dataset into BERT's input format.

4.1 Loading the Model and Tokenizer

from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load the BERT tokenizer and a classification model with two labels (negative/positive)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

4.2 Tokenizing the Dataset

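# Convert the dataset into BERT's input format
def tokenize_data(texts):
    # BERT accepts at most 512 tokens, so longer reviews are truncated;
    # shorter ones are padded to the longest sequence in the batch
    return tokenizer(texts.tolist(), padding=True, truncation=True, max_length=512, return_tensors='pt')

train_encodings = tokenize_data(X_train)
test_encodings = tokenize_data(X_test)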
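
To make the `padding` and `truncation` options more concrete, here is a minimal plain-Python sketch of what they do to a batch of token ids. The helper name `pad_and_truncate` and the token ids are placeholders invented for this example, and the real tokenizer differs in details (for instance, it keeps the closing `[SEP]` token when truncating).

```python
PAD_ID = 0  # assumed padding id for this sketch

def pad_and_truncate(batch_ids, max_length):
    """Pad or truncate each id sequence to a shared length and build attention masks."""
    target_len = min(max(len(ids) for ids in batch_ids), max_length)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        kept = ids[:target_len]                 # truncation
        pad = target_len - len(kept)            # number of padding positions
        input_ids.append(kept + [PAD_ID] * pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(kept) + [0] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_and_truncate([[101, 7, 102], [101, 7, 8, 9, 10, 102]], max_length=5)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The attention mask is what lets the model process sequences of different lengths in one tensor without treating padding as content.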

5. Ensemble Training

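A model ensemble combines several models to achieve better performance than any single one.
We fine-tune several BERT-based models and combine their predictions to produce the final result.
The sections below implement the ensemble step by step.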
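
The combining step can be illustrated in isolation with plain Python lists before any BERT code is involved. This is a minimal sketch of majority (hard) voting; the function name `majority_vote` and the example predictions are made up for illustration.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per example by majority vote."""
    num_examples = len(predictions_per_model[0])
    final = []
    for i in range(num_examples):
        votes = Counter(model_preds[i] for model_preds in predictions_per_model)
        final.append(votes.most_common(1)[0][0])
    return final

preds = [
    [1, 0, 1, 1],  # model A
    [1, 1, 0, 1],  # model B
    [0, 0, 1, 1],  # model C
]
print(majority_vote(preds))  # [1, 0, 1, 1]
```

Using an odd number of models avoids ties in binary classification, which is one reason an ensemble of three or five models is a convenient choice.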

5.1 Defining the Training and Prediction Functions

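def train_and_evaluate(model, train_encodings, labels, batch_size=16, epochs=3):
    # Fine-tune the model with mini-batches; a single forward pass over the
    # entire training set would not fit in memory.
    label_tensor = torch.tensor(labels.tolist())
    num_examples = label_tensor.size(0)
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    model.train()

    for epoch in range(epochs):  # train for several epochs
        order = torch.randperm(num_examples)  # reshuffle the data every epoch
        total_loss = 0.0
        for start in range(0, num_examples, batch_size):
            idx = order[start:start + batch_size]
            outputs = model(input_ids=train_encodings['input_ids'][idx],
                            attention_mask=train_encodings['attention_mask'][idx],
                            labels=label_tensor[idx])
            loss = outputs.loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item() * idx.size(0)
        print(f'Epoch: {epoch}, Loss: {total_loss / num_examples:.4f}')

def predict(model, test_encodings, batch_size=32):
    # Return the predicted class index for every example, computed batch by batch
    model.eval()
    preds = []
    with torch.no_grad():
        for start in range(0, test_encodings['input_ids'].size(0), batch_size):
            end = start + batch_size
            logits = model(input_ids=test_encodings['input_ids'][start:end],
                           attention_mask=test_encodings['attention_mask'][start:end]).logits
            preds.append(logits.argmax(dim=1))
    return torch.cat(preds)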

5.2 Running the Ensemble

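# Models to ensemble: each copy gets a freshly initialized classification head,
# so the five models end up with different fine-tuned weights
models = [BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2) for _ in range(5)]
predictions = []

for model in models:
    train_and_evaluate(model, train_encodings, y_train)
    preds = predict(model, test_encodings)
    predictions.append(preds)

# Ensemble by majority vote: average the 0/1 predictions and round.
# The stacked tensor is integer-typed, so cast to float before calling mean().
final_preds = torch.stack(predictions).float().mean(dim=0).round().long()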
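
The line above combines hard 0/1 predictions. A common alternative, which this tutorial's code does not use, is soft voting: average each model's class probabilities and pick the class with the highest mean. The sketch below shows the idea with plain Python lists standing in for logits; the names `softmax` and `soft_vote` and the numbers are assumptions for illustration only.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities (shifted for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_vote(logits_per_model):
    """logits_per_model[m][i] holds model m's logits for example i."""
    num_models = len(logits_per_model)
    num_examples = len(logits_per_model[0])
    final = []
    for i in range(num_examples):
        probs = [softmax(model_logits[i]) for model_logits in logits_per_model]
        num_classes = len(probs[0])
        avg = [sum(p[c] for p in probs) / num_models for c in range(num_classes)]
        final.append(max(range(num_classes), key=avg.__getitem__))
    return final

# Two models lean slightly toward class 0, one is very confident in class 1
logits = [
    [[0.1, 0.0]],  # model A
    [[0.1, 0.0]],  # model B
    [[0.0, 4.0]],  # model C
]
print(soft_vote(logits))  # [1]
```

In this example a majority vote would choose class 0, while soft voting chooses class 1 because it takes each model's confidence into account. Which behavior is preferable depends on how well calibrated the individual models are.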

6. Analyzing and Evaluating the Results

We now evaluate the ensemble using its final predictions.
We compute the accuracy and plot a confusion matrix to analyze how well the model predicts.
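
For readers unfamiliar with these two metrics, the following plain-Python sketch shows how they relate for binary labels. The function names and the toy labels are made up for this example; the row/column layout (rows are true labels, columns are predicted labels) follows the convention scikit-learn uses.

```python
def confusion_counts(y_true, y_pred):
    """Return a 2x2 matrix [[TN, FP], [FN, TP]] for binary labels 0/1."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1  # row = true label, column = predicted label
    return matrix

def accuracy(matrix):
    """Fraction of examples on the diagonal (correct predictions)."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
cm = confusion_counts(y_true, y_pred)
print(cm)            # [[2, 1], [1, 2]]
print(accuracy(cm))  # 0.6666666666666666
```

Accuracy summarizes the diagonal of the matrix in one number, while the off-diagonal cells show whether the errors are mostly false positives or false negatives.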

6.1 Evaluating Performance

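from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Convert the prediction tensor to a NumPy array for scikit-learn
y_pred = final_preds.cpu().numpy()

# Evaluate performance
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')

# Plot the confusion matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['negative', 'positive'])
disp.plot()
plt.show()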

7. Conclusion

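In this tutorial, we used the Hugging Face Transformers library to train an ensemble of BERT models.
BERT is a strong model on its own, and ensembling several fine-tuned copies is one way to try to improve predictive performance further; whether it helps on a given task should be checked by comparing the ensemble against a single model on held-out data.
As a next step, try applying BERT to other NLP tasks.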