Deep learning models learn from data, so properly preparing the input data is critical. In fields such as Natural Language Processing (NLP) in particular, preprocessing and tokenization are essential steps for handling text. In this course, we cover the concepts and practice of data preprocessing and tokenization using PyTorch.
1. Importance of Data Preprocessing
Data preprocessing is the process of transforming raw data into a form suitable for model training. It is important for the following reasons:
- Noise Reduction: Raw data often contains unnecessary information. Preprocessing removes this information to improve model performance.
- Consistency Maintenance: Converting various formats of data into a consistent format makes it easier for the model to understand the data.
- Speed Improvement: Reducing the amount of unnecessary data can speed up the training process.
2. Preprocessing Steps
Data preprocessing typically includes the following steps:
- Text Cleaning: Converting to lowercase, removing punctuation, handling stop words, etc.
- Normalization: Unifying words with the same meaning (e.g., “rich”, “wealthy” → “rich”)
- Tokenization: Splitting sentences into words or subword units
2.1 Text Cleaning
Text cleaning is the process of reducing noise and achieving a consistent format. These tasks can be performed using Python’s regular expression library.
import re

def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation: keep only lowercase letters, digits, and whitespace
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return text

sample_text = "Hello! Welcome to the world of deep learning. #DeepLearning #Python"
cleaned_text = clean_text(sample_text)
print(cleaned_text)  # "hello welcome to the world of deep learning deeplearning python"
2.2 Normalization
Normalization is the process of unifying semantically similar words. For example, words such as ‘good’, ‘nice’, and ‘fine’ can be unified to ‘goodness’. This transformation can be done using predefined rules.
def normalize_text(text):
    # Map semantically similar words to a single canonical form
    normalization_map = {
        'good': 'goodness',
        'nice': 'goodness',
        'fine': 'goodness',
    }
    words = text.split()
    normalized_words = [normalization_map.get(word, word) for word in words]
    return ' '.join(normalized_words)

# Clean first, so trailing punctuation ("good.") does not defeat the dictionary lookup
normalized_text = normalize_text(clean_text("This movie is very good. Really nice."))
print(normalized_text)  # "this movie is very goodness really goodness"
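Hand-written maps like the one above do not scale to a real vocabulary. In practice, this step is usually handled by stemming or lemmatization; below is a minimal sketch using NLTK's PorterStemmer (not part of the original example, and it assumes the nltk package is installed):

# Stemming reduces inflected word forms to a common stem (using NLTK is an assumption here)
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = "running runs easily fairly".split()
print([stemmer.stem(w) for w in words])  # ['run', 'run', 'easili', 'fairli']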
3. Tokenization
Tokenization is the process of splitting text into words or subword units, and it is typically the first step in an NLP pipeline. Common approaches include word-based and subword-based tokenization.
3.1 Word-based Tokenization
This is the most basic form of tokenization: it splits a sentence on whitespace. It can be implemented with Python's built-in string methods.
def word_tokenize(text):
    return text.split()

tokens = word_tokenize(normalized_text)
print(tokens)  # ['this', 'movie', 'is', 'very', 'goodness', 'really', 'goodness']
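Whitespace splitting works only after punctuation has been stripped; on raw text it leaves tokens like "good." intact. Tokenizers in libraries such as NLTK separate punctuation themselves. A brief sketch, assuming nltk is installed and its tokenizer data has been downloaded (again, not part of the original example):

import nltk
nltk.download('punkt')  # depending on the NLTK version, 'punkt' or 'punkt_tab' data is required
from nltk.tokenize import word_tokenize as nltk_tokenize

print(nltk_tokenize("This movie is very good. Really nice."))
# ['This', 'movie', 'is', 'very', 'good', '.', 'Really', 'nice', '.']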
3.2 Subword-based Tokenization
Subword tokenization is widely used by modern models such as BERT. It breaks words into smaller units, which mitigates the rare-word problem. The SentencePiece library can be used for this in Python.
!pip install sentencepiece

import sentencepiece as spm

# Train a subword model; corpus.txt is a plain-text training corpus, one sentence per line
spm.SentencePieceTrainer.Train('--input=corpus.txt --model_prefix=m --vocab_size=5000')

# Load the trained model and tokenize
sp = spm.SentencePieceProcessor()
sp.load('m.model')

text = "Hello, I am learning deep learning."
subword_tokens = sp.encode(text, out_type=str)
print(subword_tokens)
# e.g. ['▁Hello', ',', '▁I', '▁am', '▁learning', '▁deep', '▁learning', '.']
# (the exact pieces depend on the training corpus)
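A model ultimately consumes integer ids rather than string pieces. The same processor can produce them and decode them back to text; a short follow-up using the model loaded above:

# Encode to integer ids instead of string pieces, then round-trip back to text
ids = sp.encode(text, out_type=int)
print(ids)             # the actual ids depend on the trained model
print(sp.decode(ids))  # "Hello, I am learning deep learning."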
4. Preparing Datasets and Utilizing PyTorch (DataLoader)
The cleaned and tokenized data above can be wrapped in a PyTorch Dataset, which makes batched processing during model training straightforward.
import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        return self.texts[idx], self.labels[idx]

texts = ["This movie is goodness", "This movie is bad"]
labels = [1, 0]  # Positive: 1, Negative: 0

dataset = TextDataset(texts, labels)
data_loader = DataLoader(dataset, batch_size=2, shuffle=True)

for batch in data_loader:
    print(batch)  # [['This movie is goodness', 'This movie is bad'], tensor([1, 0])] (order varies with shuffle)
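A batch of raw strings cannot be fed to a model directly; each text must become a tensor of token ids, padded to a common length. The sketch below shows one way to do this with a custom collate function (the vocab and collate_batch names are illustrative, not from the original), reusing word_tokenize and texts from earlier:

from torch.nn.utils.rnn import pad_sequence

# Build a toy vocabulary from the training texts; index 0 is reserved for padding
vocab = {'<pad>': 0}
for t in texts:
    for word in word_tokenize(t):
        vocab.setdefault(word, len(vocab))

def collate_batch(batch):
    # Convert each text to a tensor of token ids, then pad to the longest in the batch
    token_ids = [torch.tensor([vocab[w] for w in word_tokenize(t)]) for t, _ in batch]
    label_tensor = torch.tensor([label for _, label in batch])
    return pad_sequence(token_ids, batch_first=True, padding_value=0), label_tensor

loader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_batch)
for token_batch, label_batch in loader:
    print(token_batch.shape, label_batch)  # torch.Size([2, 4]) and a label tensor; both texts have 4 tokens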
5. Conclusion
In this course, we explored text data preprocessing and tokenization with PyTorch. Because preprocessing and tokenization directly affect the performance of deep learning models, they are essential foundations to master. Building on this, future lessons will cover building and training actual models.