Deep Learning PyTorch Course: Preprocessing and Tokenization

Deep learning models learn from data, so properly preparing the input data is critical. In fields such as Natural Language Processing (NLP) in particular, preprocessing and tokenization are essential steps for handling text data. In this course, we cover the concepts and practice of data preprocessing and tokenization using PyTorch.

1. Importance of Data Preprocessing

Data preprocessing is the process of collecting raw data and converting it to be suitable for model training. This is important for the following reasons:

  • Noise Reduction: Raw data often contains unnecessary information. Preprocessing removes this information to improve model performance.
  • Consistency Maintenance: Converting various formats of data into a consistent format makes it easier for the model to understand the data.
  • Speed Improvement: Reducing the amount of unnecessary data can speed up the training process.

2. Preprocessing Steps

Data preprocessing typically includes the following steps:

  • Text Cleaning: Converting to lowercase, removing punctuation, handling stop words, etc.
  • Normalization: Unifying words with the same meaning (e.g., “rich”, “wealthy” → “rich”)
  • Tokenization: Splitting sentences into words or subword units

2.1 Text Cleaning

Text cleaning is the process of reducing noise and achieving a consistent format. These tasks can be performed using Python’s regular expression library.

import re

def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Keep only lowercase letters, digits, and whitespace (removes punctuation and symbols)
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return text

sample_text = "Hello! Welcome to the world of deep learning. #DeepLearning #Python"
cleaned_text = clean_text(sample_text)
print(cleaned_text)  # "hello welcome to the world of deep learning deeplearning python"
    
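The preprocessing list in section 2 also mentions stop-word handling, which clean_text above does not perform. Below is a minimal sketch of how it could look; the stop-word set is a small hand-written example rather than a standard list (real projects often use a curated list, for instance the one shipped with NLTK).

# Example stop-word set; hand-written for illustration, not exhaustive
STOP_WORDS = {'the', 'of', 'to', 'a', 'an', 'is', 'and'}

def remove_stop_words(text):
    # Keep only the words that are not in the (example) stop-word set
    return ' '.join(word for word in text.split() if word not in STOP_WORDS)

print(remove_stop_words(cleaned_text))
# "hello welcome world deep learning deeplearning python"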

2.2 Normalization

Normalization is the process of unifying semantically similar words. For example, words such as ‘good’, ‘nice’, and ‘fine’ can all be mapped to a single representative token such as ‘goodness’. A simple way to do this is with a predefined substitution table.

def normalize_text(text):
    normalization_map = {
        'good': 'goodness',
        'nice': 'goodness',
        'fine': 'goodness',
    }
    words = text.split()
    # Replace each word with its canonical form if it appears in the table
    normalized_words = [normalization_map.get(word, word) for word in words]
    return ' '.join(normalized_words)

# Clean the text first so trailing punctuation does not block dictionary lookups
normalized_text = normalize_text(clean_text("This movie is very good. Really nice."))
print(normalized_text)  # "this movie is very goodness really goodness"
    

3. Tokenization

Tokenization is the process of splitting text into word or subword units, and it is typically the first step in an NLP pipeline. There are various approaches, such as word-level tokenization and subword-level tokenization.

3.1 Word-based Tokenization

This is the most basic form of tokenization, which splits sentences based on spaces. It can be easily implemented using Python’s built-in functions.

def word_tokenize(text):
    return text.split()

tokens = word_tokenize(normalized_text)
print(tokens)  # ['this', 'movie', 'is', 'very', 'goodness', 'really', 'goodness']
    

3.2 Subword-based Tokenization

Subword tokenization is a method widely used in modern models such as BERT. It breaks words into smaller units to mitigate the problem of rare words. The SentencePiece library in Python can be used for this.

!pip install sentencepiece

import sentencepiece as spm

# Train subword model
spm.SentencePieceTrainer.Train('--input=corpus.txt --model_prefix=m --vocab_size=5000')

# Load model and tokenize
sp = spm.SentencePieceProcessor()
sp.load('m.model')

text = "Hello, I am learning deep learning."
subword_tokens = sp.encode(text, out_type=str)
print(subword_tokens)  # e.g. ['▁Hello', ',', '▁I', '▁am', '▁learning', '▁deep', '▁learning', '.'] (actual pieces depend on the trained model)
    
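Note that the training call above assumes a plain-text file corpus.txt with one sentence per line. For a quick experiment, such a file can be written from a small in-memory list as sketched below; the sentences are placeholder examples, and with a corpus this tiny the vocab_size of 5000 would have to be lowered, since SentencePiece cannot build a vocabulary larger than the corpus supports.

# Write a tiny placeholder corpus (one sentence per line) for experimentation only;
# real subword training needs a much larger text file.
sample_corpus = [
    "Hello, I am learning deep learning.",
    "Deep learning models learn from data.",
    "Preprocessing and tokenization prepare text for the model.",
]
with open('corpus.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(sample_corpus))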

4. Preparing Datasets and Utilizing PyTorch (DataLoader)

The cleaned and tokenized data above can be transformed into a dataset for PyTorch. This facilitates batch processing during deep learning model training.

import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    # A minimal dataset that pairs each raw text with its label
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __len__(self):
        # Number of samples in the dataset
        return len(self.texts)

    def __getitem__(self, idx):
        # Return one (text, label) pair
        return self.texts[idx], self.labels[idx]

texts = ["This movie is goodness", "This movie is bad"]
labels = [1, 0]  # Positive: 1, Negative: 0
dataset = TextDataset(texts, labels)

data_loader = DataLoader(dataset, batch_size=2, shuffle=True)

for batch in data_loader:
    print(batch)  # [['This movie is goodness', 'This movie is bad'], tensor([1, 0])] (order may vary with shuffle=True)
    
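The batches above still contain raw strings, which a model cannot consume directly. A common next step is to map tokens to integer ids with a vocabulary and to do the conversion and padding inside a collate_fn. The sketch below is illustrative: the vocab dictionary, the <pad>/<unk> entries, and collate_batch are names introduced here, not part of PyTorch itself, and it reuses word_tokenize from section 3.1.

# Build a toy vocabulary from the training texts; ids 0 and 1 are reserved
# for padding and unknown tokens.
vocab = {'<pad>': 0, '<unk>': 1}
for text in texts:
    for token in word_tokenize(text):
        vocab.setdefault(token, len(vocab))

def collate_batch(batch):
    # Convert each text to token ids and pad to the longest sequence in the batch
    token_ids = [[vocab.get(tok, vocab['<unk>']) for tok in word_tokenize(text)] for text, _ in batch]
    max_len = max(len(ids) for ids in token_ids)
    padded = [ids + [vocab['<pad>']] * (max_len - len(ids)) for ids in token_ids]
    labels_tensor = torch.tensor([label for _, label in batch])
    return torch.tensor(padded), labels_tensor

data_loader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collate_batch)

for inputs, targets in data_loader:
    print(inputs)   # tensor of token ids with shape (batch_size, max_len)
    print(targets)  # tensor([1, 0]) (order may vary with shuffle=True)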

5. Conclusion

In this course, we explored text data preprocessing and tokenization using PyTorch. Since data preprocessing and tokenization directly impact the performance of deep learning models, they are essential foundational knowledge to master. Based on this, we will cover actual model building and training processes in future lessons.
