K-Nearest Neighbors (KNN) is one of the simplest and most intuitive algorithms in machine learning:
to make a prediction for a given data point, it finds the K nearest neighbors of that point and bases the prediction on their labels.
KNN is primarily used for classification problems but can also be applied to regression problems.
1. Basic Principle of KNN
The basic idea of the KNN algorithm is as follows. To classify a given sample,
the K data points closest to that sample are selected,
and the label of the new sample is determined from the labels of those K points.
For example, if K is 3, the labels of the 3 nearest neighbors to the given sample are checked,
and the most common label among them is assigned to the sample.
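As a toy illustration of that majority vote (the labels here are invented for the example), the last step can be written in a few lines of Python:

from collections import Counter

# Labels of the 3 nearest neighbors to a new sample
neighbor_labels = ['cat', 'dog', 'cat']

# Majority vote: the most common label wins
prediction = Counter(neighbor_labels).most_common(1)[0][0]
print(prediction)  # cat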
1.1 Distance Measurement Methods
To find neighbors in KNN, the distance between two data points must be measured.
Commonly used distance measurement methods are as follows:
- Euclidean Distance: the straight-line distance between two points; for (x1, y1) and (x2, y2) it is sqrt((x1 - x2)^2 + (y1 - y2)^2).
- Manhattan Distance: the sum of the absolute coordinate differences, |x1 - x2| + |y1 - y2|.
- Minkowski Distance: a generalization of both, (|x1 - x2|^p + |y1 - y2|^p)^(1/p); p = 2 gives the Euclidean distance and p = 1 the Manhattan distance.
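All three metrics are available in PyTorch via the p argument of torch.dist (and torch.cdist for batches of points); a minimal sketch:

import torch

a = torch.tensor([1.0, 2.0])
b = torch.tensor([4.0, 6.0])

print(torch.dist(a, b, p=2))  # Euclidean: sqrt(3^2 + 4^2) = 5.0
print(torch.dist(a, b, p=1))  # Manhattan: 3 + 4 = 7.0
print(torch.dist(a, b, p=3))  # Minkowski with p = 3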
2. Advantages and Disadvantages of KNN
2.1 Advantages
- It is simple and intuitive to implement.
- There is no separate training phase (lazy learning); the training data is simply stored, so the model is ready to predict immediately.
- It performs well even on non-linear data.
2.2 Disadvantages
- Prediction is slow on large datasets, since distances to every training point must be computed.
- The choice of K value significantly affects the results.
- Performance may degrade in high-dimensional data (curse of dimensionality).
3. Implementing KNN with PyTorch
In this section, we will learn how to implement KNN using PyTorch.
We will install the necessary libraries and prepare the required dataset.
3.1 Installing Necessary Libraries
pip install torch numpy scikit-learn
3.2 Preparing the Dataset
We will implement KNN on the breast cancer dataset provided by scikit-learn.
import numpy as np
import torch
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load the dataset
data = load_breast_cancer()
X = data.data
y = data.target
# Split the dataset (training/testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Standardize the features (KNN is distance-based, so feature scales matter)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
3.3 Implementing the KNN Algorithm
Now we will implement the KNN algorithm as a class,
using PyTorch tensor operations for the distance computation and the majority vote.
class KNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # KNN has no training phase; we simply store the data as tensors
        self.X_train = torch.as_tensor(X, dtype=torch.float32)
        self.y_train = torch.as_tensor(y)

    def predict(self, X):
        X = torch.as_tensor(X, dtype=torch.float32)
        # Pairwise Euclidean distances between each test point and all training points
        distances = torch.cdist(X, self.X_train)
        # Indices of the k nearest training points for each test point
        neighbors = distances.topk(self.k, largest=False).indices
        # Majority vote over the k neighbors' labels
        return torch.mode(self.y_train[neighbors], dim=1).values.numpy()
3.4 Model Training and Prediction
Let's fit the model on the training data and evaluate its accuracy on the test set.
# Create KNN model
knn = KNN(k=3)
# Fit the model with the training data
knn.fit(X_train, y_train)
# Predict using the testing data
predictions = knn.predict(X_test)
# Calculate accuracy
accuracy = np.mean(predictions == y_test)
print(f'Model Accuracy: {accuracy * 100:.2f}%')
4. Improving KNN
Let's look at a few ways to improve KNN's performance:
adjusting the K value, changing the distance metric,
and reducing the dimensionality of the data.
4.1 Adjusting K Value
The K value significantly impacts the performance of the KNN algorithm.
Setting K too low makes the model sensitive to noise and prone to overfitting,
while setting it too high over-smooths the decision boundary and leads to underfitting.
Therefore, it's essential to find the optimal K value, typically via cross-validation, as in the sketch below.
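As a minimal sketch, scikit-learn's KNeighborsClassifier and cross_val_score can be used to compare candidate K values; this reuses X_train and y_train from section 3.2, and the candidate list is an arbitrary choice for illustration:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Odd values of K avoid ties in the binary majority vote
for k in [1, 3, 5, 7, 9, 11]:
    model = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f'k={k}: mean CV accuracy = {scores.mean():.4f}')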
4.2 Changing the Distance Metric
In addition to the Euclidean distance, the Manhattan or Minkowski distance can be used.
It is important to choose the most suitable metric for the data through experimentation, as the sketch below shows.
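Since torch.cdist already accepts a Minkowski order p, one way to experiment is to subclass the KNN class from section 3.3 and expose that parameter (MinkowskiKNN is just an illustrative name):

class MinkowskiKNN(KNN):
    def __init__(self, k=3, p=2.0):
        super().__init__(k=k)
        self.p = p  # p=2: Euclidean, p=1: Manhattan

    def predict(self, X):
        X = torch.as_tensor(X, dtype=torch.float32)
        # Same procedure as before, but with a configurable Minkowski order
        distances = torch.cdist(X, self.X_train, p=self.p)
        neighbors = distances.topk(self.k, largest=False).indices
        return torch.mode(self.y_train[neighbors], dim=1).values.numpy()

For example, MinkowskiKNN(k=3, p=1.0) runs the same pipeline with the Manhattan distance.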
4.3 Dimensionality Reduction
When the dimensionality is high, the data is hard to visualize, distance computations become expensive,
and distances between points become less informative (the curse of dimensionality).
Dimensionality reduction techniques like PCA (Principal Component Analysis) can therefore improve KNN's performance.
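A minimal sketch using scikit-learn's PCA, keeping enough components to explain 95% of the variance (the 0.95 threshold is an arbitrary choice for illustration):

from sklearn.decomposition import PCA

# A float n_components keeps the smallest number of components
# whose cumulative explained variance reaches that fraction
pca = PCA(n_components=0.95)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

knn_pca = KNN(k=3)
knn_pca.fit(X_train_pca, y_train)
pca_predictions = knn_pca.predict(X_test_pca)
print(f'Accuracy after PCA: {np.mean(pca_predictions == y_test) * 100:.2f}%')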
5. The Relationship between KNN and Deep Learning
The KNN algorithm can be used alongside deep learning.
For instance, the embeddings produced by a deep learning model can be used as the feature space in which KNN searches,
often yielding a more effective classifier than KNN on the raw features.
Conversely, statistics about a sample's nearest neighbors can be extracted and fed to a deep learning model as additional features.
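As a sketch of the first idea, assuming the data from section 3: a small encoder network (untrained here, purely for illustration; in practice it would first be trained, e.g. with a classification loss) maps the 30 breast cancer features to an 8-dimensional embedding, and KNN then searches in that embedding space:

import torch.nn as nn

# Hypothetical encoder; a real pipeline would train it before use
encoder = nn.Sequential(
    nn.Linear(30, 16),
    nn.ReLU(),
    nn.Linear(16, 8),
)

with torch.no_grad():
    train_emb = encoder(torch.as_tensor(X_train, dtype=torch.float32))
    test_emb = encoder(torch.as_tensor(X_test, dtype=torch.float32))

# Run KNN in the embedding space instead of the raw feature space
knn_emb = KNN(k=3)
knn_emb.fit(train_emb, y_train)
emb_predictions = knn_emb.predict(test_emb)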
6. Conclusion
K-Nearest Neighbors (KNN) is a fundamental algorithm in machine learning,
and it is straightforward to understand and implement.
However, it is crucial to understand the algorithm’s drawbacks, especially the performance issues on large datasets and high-dimensional data,
and to know how to improve them.
I hope this article has helped you build a basic understanding of KNN and provided you with the opportunity to implement KNN in practice through PyTorch.