K-Nearest Neighbors (KNN)
A simple algorithm that classifies data points based on the majority vote of their K closest neighbors.
What is KNN?
KNN is an instance-based algorithm used for both classification and regression. It makes predictions by finding the K training examples closest to a new data point and taking a majority vote (classification) or average (regression).
KNN has no training phase. It stores all training data and does all the work at prediction time. This makes it a "lazy learner."
How It Works
Algorithm Steps
- Choose K (number of nearest neighbors)
- Calculate distance from the new point to all training points (usually Euclidean distance)
- Find the K closest training points
- Classification: Take the majority class among K neighbors
- Regression: Take the average value of K neighbors
Euclidean Distance = sqrt( (x1-x2)^2 + (y1-y2)^2 + ... )
Example: If K=3 and the 3 closest neighbors are [Red, Red, Blue]
--> Classify as Red (majority vote)
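The steps above can be sketched from scratch in a few lines. `knn_predict` is a hypothetical helper name, and the three toy points are chosen to mirror the [Red, Red, Blue] example:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Step 4: majority class among those neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Three training points mirroring the [Red, Red, Blue] example
X_train = np.array([[1.0, 1.0], [1.5, 1.2], [3.0, 3.0]])
y_train = ["Red", "Red", "Blue"]
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # → Red
```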
Why Feature Scaling Matters
KNN relies on distance calculations. If one feature has a range of 0-100 and another 0-1, the larger feature dominates the distance. Always scale features before using KNN.
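A quick sketch of the effect, using made-up numbers on the ranges mentioned above (0-100 vs 0-1):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature ranges over 0-100, the other over 0-1
X = np.array([[10.0, 0.2],
              [90.0, 0.9],
              [12.0, 0.9]])

# Unscaled: the 0-100 feature supplies almost all of the distance,
# so the 0-1 feature is effectively ignored
d_unscaled = np.sqrt(((X[0] - X[1]) ** 2).sum())
print(f"Unscaled distance: {d_unscaled:.2f}")  # ~80, dominated by the first feature

# Scaled: both features contribute comparably to the distance
X_scaled = StandardScaler().fit_transform(X)
d_scaled = np.sqrt(((X_scaled[0] - X_scaled[1]) ** 2).sum())
print(f"Scaled distance: {d_scaled:.2f}")
```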
Code Implementation
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
# Dataset
data = {
"Age": [22, 25, 28, 30, 32, 35, 40, 45, 50, 55],
"Income_LPA": [16, 8, 10, 9, 15, 12, 22, 26, 10, 35],
"Buys_House": [1, 0, 0, 0, 0, 1, 1, 1, 0, 1]
}
df = pd.DataFrame(data)
# Features and target
X = df[["Age", "Income_LPA"]]
y = df["Buys_House"]
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Feature scaling (critical for KNN)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train KNN model
knn = KNeighborsClassifier(n_neighbors=3, weights="distance")  # odd K avoids tie votes
knn.fit(X_train_scaled, y_train)
# Predict and evaluate
y_pred = knn.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, zero_division=0)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
print("Classification Report:\n", report)
# Predict for a new person (use a DataFrame so column names match the fitted scaler)
new_data = scaler.transform(pd.DataFrame([[21, 2]], columns=["Age", "Income_LPA"]))
prediction = knn.predict(new_data)
print("Can buy house" if prediction[0] == 1 else "Cannot buy house")
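For regression, the same recipe applies with KNeighborsRegressor: the prediction is the average target value of the K nearest neighbors rather than a vote. A minimal sketch, with prices invented purely for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Illustrative data: predict house price (in lakhs) from age and income
X = np.array([[22, 16], [25, 8], [28, 10], [30, 9], [35, 12],
              [40, 22], [45, 26], [50, 10], [55, 35]])
y = np.array([60, 30, 35, 32, 45, 80, 95, 38, 120])

# Scaling matters for KNN regression just as it does for classification
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X_scaled, y)

# Prediction is the mean target value of the 3 nearest neighbors
pred = reg.predict(scaler.transform([[33, 11]]))
print(f"Predicted price: {pred[0]:.1f}")
```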
Choosing K
- Small K (e.g., 1-3): More sensitive to noise, risk of overfitting
- Large K: Smoother decision boundary, risk of underfitting
- Rule of thumb: Start with K = sqrt(n) where n is the number of training samples
- Use odd K for binary classification to avoid tie votes (or set weights="distance" so ties are broken by proximity)
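In practice, a common way to pick K is to compare cross-validation scores across candidate values. A sketch on a synthetic dataset (the data and K range here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data for illustration
X, y = make_classification(n_samples=200, n_features=4, random_state=42)

# Evaluate odd K values with 5-fold cross-validation;
# the pipeline scales inside each fold to avoid data leakage
for k in [1, 3, 5, 7, 9, 11]:
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5)
    print(f"K={k:2d}: mean accuracy = {scores.mean():.3f}")
```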
Key Parameters
| Parameter | Description |
| --- | --- |
| n_neighbors | Number of neighbors (K). Default is 5 |
| weights | "uniform" (all neighbors count equally) or "distance" (closer neighbors have more influence) |
| metric | Distance metric: "euclidean", "manhattan", "minkowski" |
When to Use KNN
| Good For | Not Ideal For |
| --- | --- |
| Small to medium datasets | Large datasets (slow prediction) |
| Non-linear decision boundaries | High-dimensional data (curse of dimensionality) |
| Quick baseline model | When training speed matters |
| Multi-class classification | When features are not scaled |
KNN becomes very slow with large datasets because it calculates distance to every training point for each prediction. For large datasets, consider using KD-trees or Ball trees (set algorithm="kd_tree").
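A sketch of the KD-tree option on a larger synthetic dataset (the data is random and purely illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# 10,000 random 3-D points; class = whether the coordinates sum to a positive value
rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 3))
y = (X.sum(axis=1) > 0).astype(int)

# A KD-tree partitions the training points so each query searches only a
# fraction of them instead of computing all 10,000 distances
knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
knn.fit(X, y)
print(knn.predict([[0.5, 0.5, 0.5]]))
```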
Classification Regression Supervised Instance-based Lazy Learner