Deep Learning PyTorch Course, Preprocessing, Stemming

Deep learning builds predictive models by learning from large amounts of data. Because model performance depends heavily on the quality and quantity of that data, preprocessing is a critical step. This course covers the preprocessing of text data for deep learning, with a focus on stemming, a technique frequently used in natural language processing. The steps are implemented with practical example code in Python and the PyTorch library.

1. Data Preprocessing

Data preprocessing refines raw data into a form a model can learn from, which can improve training performance. Preprocessing text data typically consists of the following steps:

Data collection: Methods for collecting actual data (crawling, API, etc.).
Data cleansing: Removing unnecessary characters, standardizing case, handling duplicate data.
Tokenization: Splitting text into words or sentences.
Stemming and Lemmatization: Reducing inflected words to a stem or to their dictionary base form.
Indexing: Converting text data into numerical format.
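
The tokenization and indexing steps listed above have no example of their own in this course, so here is a minimal sketch in plain Python. The function names (tokenize, build_vocab, encode) and the special tokens <pad> and <unk> are assumptions chosen for illustration, not part of any library.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (ASCII letters and digits only)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocab(token_lists, specials=("<pad>", "<unk>")):
    """Map each token to an integer id; special tokens get the lowest ids."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    vocab = {tok: i for i, tok in enumerate(specials)}
    # Most frequent tokens first; ties are broken alphabetically for reproducibility.
    for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Replace each token with its id, falling back to <unk> for unseen words."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in tokens]

corpus = ["Deep learning is fun.", "Learning PyTorch is fun too!"]
token_lists = [tokenize(s) for s in corpus]
vocab = build_vocab(token_lists)
encoded = [encode(t, vocab) for t in token_lists]
print(token_lists)
print(encoded)  # [[5, 4, 3, 2], [4, 6, 3, 2, 7]]
```

Real projects usually rely on a dedicated tokenizer, but the idea is the same: every token ends up as an integer that can later index an embedding table.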

1.1 Data Collection

Data collection is the first step in natural language processing (NLP), and data can be collected through various methods. For example, news articles can be obtained through web scraping or data can be collected via public APIs.

1.2 Data Cleansing

Data cleansing removes noise from raw data to produce clean data. Typical actions in this step include removing HTML tags, stripping unnecessary symbols, standardizing case, and deciding how to handle numbers.

Python Example: Data Cleansing


import re

def clean_text(text):
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Remove special characters: keep only ASCII letters, digits, Hangul, and whitespace
    text = re.sub(r'[^a-zA-Z0-9가-힣\s]', '', text)
    # Standardize case
    text = text.lower()
    return text

sample_text = "Hello, this is a deep learning course!! Starting data cleansing."
cleaned_text = clean_text(sample_text)
print(cleaned_text)  # hello this is a deep learning course starting data cleansing

2. Stemming and Lemmatization

Stemming and lemmatization are the two standard ways to normalize word forms in natural language processing. Stemming applies heuristic rules that strip affixes (in English, mostly suffixes) to reduce a word to a stem, which need not be a dictionary word. In contrast, lemmatization maps a word to its dictionary base form (lemma), using its part of speech and, in some tools, the surrounding context.

2.1 Stemming

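Stemming shortens words to a common stem so that related forms are treated as the same token. The stem is not always a valid word, and the original meaning is only approximately preserved. In Python, it can be implemented easily with libraries such as NLTK.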

Python Example: Stemming


from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["running", "runner", "ran", "easily", "fairly"]
stems = [stemmer.stem(word) for word in words]
print(stems)  # ['run', 'runner', 'ran', 'easili', 'fairli']
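
To make the idea of rule-based suffix stripping concrete, here is a toy stemmer in plain Python. It is a deliberately simplified sketch: the rule list and the function name toy_stem are assumptions made for illustration, and it is not the Porter algorithm that NLTK implements.

```python
def toy_stem(word):
    """Strip one common English suffix; a toy rule set, not the Porter algorithm."""
    # Rules are tried in order, so longer suffixes come first.
    rules = [("ies", "i"), ("ing", ""), ("ly", ""), ("ed", ""), ("s", "")]
    for suffix, replacement in rules:
        # Require at least three remaining characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: len(word) - len(suffix)] + replacement
            # Collapse a doubled final consonant: "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

for w in ["running", "jumped", "quickly", "studies", "ran"]:
    print(w, "->", toy_stem(w))
```

Note that "studies" becomes "studi", which is not a dictionary word, and the irregular form "ran" is left untouched. Both effects are typical of stemmers and are the main motivation for lemmatization.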

2.2 Lemmatization

Lemmatization maps a word to its dictionary base form (lemma) based on its part of speech. Unlike a stem, a lemma is always a valid word, and irregular forms such as "ran" can be resolved to "run".

Python Example: Lemmatization


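import nltk
from nltk.stem import WordNetLemmatizer

# WordNetLemmatizer needs the WordNet corpus; download it once before first use
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

words = ["running", "runner", "ran", "easily", "fairly"]
# pos='v' treats every word as a verb; words WordNet does not know as verbs are returned unchanged
lemmas = [lemmatizer.lemmatize(word, pos='v') for word in words]
print(lemmas)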
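
Why the part of speech matters can be seen with a tiny dictionary-based lemmatizer. This is only a sketch: TOY_LEXICON and toy_lemmatize are hypothetical names, and the handful of entries stands in for a real lexical database such as WordNet.

```python
# Hypothetical mini-lexicon: (surface form, part of speech) -> lemma.
TOY_LEXICON = {
    ("ran", "v"): "run",
    ("running", "v"): "run",
    ("running", "n"): "running",
    ("better", "a"): "good",
    ("better", "v"): "better",
    ("mice", "n"): "mouse",
}

def toy_lemmatize(word, pos="n"):
    """Return the lexicon lemma for (word, pos), or the lowercased word if unknown."""
    key = (word.lower(), pos)
    return TOY_LEXICON.get(key, word.lower())

print(toy_lemmatize("running", pos="v"))  # run
print(toy_lemmatize("running", pos="n"))  # running
print(toy_lemmatize("better", pos="a"))   # good
print(toy_lemmatize("Mice"))              # mouse
print(toy_lemmatize("fairly", pos="v"))   # fairly (not in the lexicon)
```

The same surface form "running" yields different lemmas as a verb and as a noun, which is why passing a single fixed part of speech for every word is only a rough approximation.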

3. Applying Preprocessing in PyTorch

PyTorch is a deep learning framework that represents data as tensors. Preprocessing can be wrapped in a PyTorch Dataset so that a DataLoader serves cleaned, batched samples for model training.

Python Example: Data Preprocessing in PyTorch


import torch
from torch.utils.data import Dataset, DataLoader

class TextDataset(Dataset):
    def __init__(self, texts):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, index):
        text = self.texts[index]
        # Apply cleansing with clean_text from section 1.2;
        # stemming or lemmatization could be added at this point as well
        cleaned_text = clean_text(text)
        return cleaned_text

# Sample data
texts = [
    "I am feeling very good today.",
    "Deep learning is truly an interesting topic."
]

dataset = TextDataset(texts)
dataloader = DataLoader(dataset, batch_size=2)

# Samples are still strings here, so each batch is a list of cleaned strings, not a tensor
for data in dataloader:
    print(data)
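
The dataset above still yields strings, while a model needs tensors. One possible way to close that gap is sketched below, assuming whitespace tokenization and a vocabulary built on the fly. The names EncodedTextDataset and pad_batch are hypothetical; pad_sequence and the collate_fn argument of DataLoader are existing PyTorch APIs.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

class EncodedTextDataset(Dataset):
    """Holds whitespace-tokenized texts encoded as 1-D LongTensors of token ids."""

    def __init__(self, texts):
        tokenized = [t.lower().split() for t in texts]
        # Id 0 is reserved for padding, id 1 for unknown tokens.
        self.vocab = {"<pad>": 0, "<unk>": 1}
        for tokens in tokenized:
            for tok in tokens:
                self.vocab.setdefault(tok, len(self.vocab))
        self.samples = [
            torch.tensor([self.vocab[tok] for tok in tokens], dtype=torch.long)
            for tokens in tokenized
        ]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        return self.samples[index]

def pad_batch(batch):
    """Pad variable-length sequences in a batch to the same length with id 0."""
    return pad_sequence(batch, batch_first=True, padding_value=0)

texts = ["deep learning is fun", "pytorch is fun"]
dataset = EncodedTextDataset(texts)
loader = DataLoader(dataset, batch_size=2, collate_fn=pad_batch)
batch = next(iter(loader))
print(batch.shape)  # torch.Size([2, 4])
print(batch)
```

Padding in a custom collate function keeps each batch only as long as its longest sentence, and reserving id 0 for padding lets an embedding layer ignore those positions through its padding_idx argument.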

4. Conclusion

Data preprocessing is essential for improving the performance of deep learning models. Careful preprocessing improves data quality, and stemming and lemmatization are important normalization techniques in natural language processing. We encourage you to apply the methods introduced in this course to real data and use the results to train your own deep learning models.