The advancement of deep learning has gone hand in hand with progress in data analysis and processing techniques. Among these, clustering is a highly useful method for discovering hidden patterns in data by grouping similar samples together. In this article, we explore clustering in depth, from the basics to advanced techniques, with hands-on implementations in Python and PyTorch.
1. Basics of Clustering
Clustering is a technique that divides a given dataset into several clusters based on similarity. Data points within a cluster should be highly similar to one another, while points in different clusters should be clearly distinct. Representative clustering algorithms include K-Means, Hierarchical Clustering, and DBSCAN.
1.1 K-Means Clustering
K-Means Clustering is one of the most widely used clustering methods, aiming to divide the data into K clusters. This method is performed through the following steps:
- Set the number of clusters K.
- Randomly select K initial cluster centers (centroids).
- Assign each data point to the nearest cluster center.
- Update the center of each cluster to the average of the current data points.
- Repeat the assignment and update steps until the cluster centers no longer change (or a maximum number of iterations is reached).
2. Implementing K-Means Clustering with PyTorch
Now, let's implement K-Means Clustering. Below, we first build the algorithm from scratch with NumPy to make each step explicit; a PyTorch version of the same algorithm is sketched after the class definition.
2.1 Generating Data
import numpy as np
import matplotlib.pyplot as plt
# Generate data
np.random.seed(0)
n_samples = 500
random_data = np.random.rand(n_samples, 2)
plt.scatter(random_data[:, 0], random_data[:, 1], s=10)
plt.title("Randomly Generated Data")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()
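One caveat: uniformly random points have no inherent cluster structure, so K-Means will simply partition the square into regions. If you would like visibly separated groups to experiment with, a minimal alternative (my own suggestion, not part of the original pipeline) is to sample from a few Gaussian blobs:

# Optional: data with visible cluster structure (three Gaussian blobs)
centers = np.array([[0.2, 0.2], [0.8, 0.3], [0.5, 0.8]])
blob_data = np.vstack([c + 0.07 * np.random.randn(n_samples // 3, 2) for c in centers])

Everything below continues to use random_data; substitute blob_data if you generated it.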
2.2 Implementing K-Means Algorithm
class KMeans:
    def __init__(self, n_clusters=3, max_iters=100):
        self.n_clusters = n_clusters
        self.max_iters = max_iters

    def fit(self, data):
        # Randomly select initial centroids from the data points
        self.centroids = data[np.random.choice(data.shape[0], self.n_clusters, replace=False)]
        for i in range(self.max_iters):
            # Assignment step: label each point with its nearest centroid
            distances = np.linalg.norm(data[:, np.newaxis] - self.centroids, axis=2)
            self.labels = np.argmin(distances, axis=1)
            # Update step: move each centroid to the mean of its points;
            # keep the old centroid if a cluster has become empty
            new_centroids = np.array([
                data[self.labels == j].mean(axis=0) if np.any(self.labels == j)
                else self.centroids[j]
                for j in range(self.n_clusters)
            ])
            # Stop once the centroids have (numerically) converged
            if np.allclose(self.centroids, new_centroids):
                break
            self.centroids = new_centroids

    def predict(self, data):
        # Assign new points to the nearest learned centroid
        distances = np.linalg.norm(data[:, np.newaxis] - self.centroids, axis=2)
        return np.argmin(distances, axis=1)
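Since this article is framed around PyTorch, here is a minimal tensor-based sketch of the same algorithm. It assumes the data fits in memory and uses torch.cdist for the pairwise distances; the function name kmeans_torch is illustrative, not a standard PyTorch API.

import torch

def kmeans_torch(data, n_clusters=3, max_iters=100):
    # data: (N, D) float tensor
    # Randomly select initial centroids from the data points
    idx = torch.randperm(data.shape[0])[:n_clusters]
    centroids = data[idx].clone()
    for _ in range(max_iters):
        # Assignment step: (N, K) pairwise Euclidean distances
        distances = torch.cdist(data, centroids)
        labels = distances.argmin(dim=1)
        # Update step: mean of the points assigned to each cluster,
        # keeping the old centroid if a cluster is empty
        new_centroids = torch.stack([
            data[labels == j].mean(dim=0) if (labels == j).any()
            else centroids[j]
            for j in range(n_clusters)
        ])
        if torch.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Usage, converting the NumPy data from above:
data_t = torch.from_numpy(random_data).float()
centroids_t, labels_t = kmeans_torch(data_t, n_clusters=3)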
2.3 Training the Model
# Train K-Means Clustering model
kmeans = KMeans(n_clusters=3)
kmeans.fit(random_data)
# Visualize clusters
plt.scatter(random_data[:, 0], random_data[:, 1], c=kmeans.labels, s=10)
plt.scatter(kmeans.centroids[:, 0], kmeans.centroids[:, 1], c='red', s=100, marker='X')
plt.title("K-Means Clustering Result")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()
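Because the input here is uniformly random, the three clusters found by K-Means simply carve the unit square into regions of roughly equal size; with data that has real structure (such as the Gaussian blobs suggested in Section 2.1), the separation would be much more visible.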
3. Evaluating Clusters
Evaluating clustering results is very important. While there are many evaluation metrics, two commonly used ones are:
- Silhouette Score: Measures both the cohesion within clusters and the separation between clusters. Values range from -1 to 1, and the closer to 1, the better.
- Inertia (within-cluster sum of squares): The sum of squared Euclidean distances from each point to its assigned cluster center; lower values indicate tighter clusters. A short computation is shown after Section 3.1.
3.1 Calculating Silhouette Score
from sklearn.metrics import silhouette_score
# Calculate Silhouette Score
score = silhouette_score(random_data, kmeans.labels)
print(f"Silhouette Score: {score:.2f}")
4. Advanced Clustering Techniques
In addition to basic K-Means clustering, various advanced clustering techniques have been developed. Here, we will look at some of them.
4.1 DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based algorithm that defines clusters as dense regions of points. Points in low-density regions are treated as noise, which makes the method robust to outliers, and it works well even when clusters are not spherical. Its two key parameters are the neighborhood radius (eps) and the minimum number of points required to form a dense region (min_samples).
4.2 Hierarchical Clustering
Hierarchical clustering builds clusters in a hierarchical structure by iteratively merging (agglomerative) or splitting (divisive) clusters based on the similarity between them. The result can be drawn as a dendrogram, a tree diagram from which the number of clusters can be chosen visually; a dendrogram example follows below.
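As a minimal sketch of agglomerative clustering (this example uses SciPy's hierarchy utilities, which the rest of the article does not otherwise depend on), a dendrogram for our data can be drawn like this:

from scipy.cluster.hierarchy import linkage, dendrogram
# Agglomerative clustering with Ward linkage
Z = linkage(random_data, method='ward')
plt.figure(figsize=(10, 4))
dendrogram(Z, truncate_mode='level', p=5)  # show only the top merge levels
plt.title("Hierarchical Clustering Dendrogram")
plt.show()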
4.3 Implementing DBSCAN in Python
from sklearn.cluster import DBSCAN
# Train DBSCAN model
dbscan = DBSCAN(eps=0.3, min_samples=5)  # eps: neighborhood radius, min_samples: density threshold
dbscan_labels = dbscan.fit_predict(random_data)
# Visualize DBSCAN results
plt.scatter(random_data[:, 0], random_data[:, 1], c=dbscan_labels, s=10)
plt.title("DBSCAN Clustering Result")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()
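Note that DBSCAN labels noise points with -1, so some points in the plot may not belong to any cluster. The result is also quite sensitive to eps and min_samples, so it is worth trying a few values on your own data.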
5. Conclusion
In this article, we covered the implementation and evaluation of K-Means Clustering, as well as advanced techniques such as DBSCAN and hierarchical clustering. Clustering is an important tool for data analysis in many fields, and it can reveal insights into the structure and patterns of data. We recommend applying these techniques to your own real data.
We hope you continue to gain deeper insights through ongoing study of deep learning and machine learning. Thank you.