ML Playground / KNN View Notebook

K-Nearest Neighbors (KNN)

A simple algorithm that classifies data points based on the majority vote of their K closest neighbors.

What is KNN?

KNN is an instance-based algorithm used for both classification and regression. It makes predictions by finding the K training examples closest to a new data point and taking a majority vote (classification) or average (regression).

KNN has no training phase. It stores all training data and does all the work at prediction time. This makes it a "lazy learner."

How It Works

Algorithm Steps
  1. Choose K (number of nearest neighbors)
  2. Calculate distance from the new point to all training points (usually Euclidean distance)
  3. Find the K closest training points
  4. Classification: Take the majority class among K neighbors
  5. Regression: Take the average value of K neighbors
Euclidean Distance = sqrt( (x1-x2)^2 + (y1-y2)^2 + ... )

Example: If K=3 and the 3 closest neighbors are [Red, Red, Blue], classify as Red (majority vote).
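The steps above can be sketched from scratch in plain NumPy. The function and data here are made up for illustration; they are not part of the library-based example later in this notebook.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from x_new to every training point
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Step 4: majority class among those k neighbors
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# The 3 closest neighbors of [1.0, 1.1] are [Red, Blue, Red] -> Red
X_train = np.array([[1.0, 1.0], [1.2, 0.9], [1.1, 1.3], [5.0, 5.0]])
y_train = ["Red", "Red", "Blue", "Blue"]
print(knn_predict(X_train, y_train, np.array([1.0, 1.1]), k=3))  # Red
```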

Why Feature Scaling Matters

KNN relies on distance calculations. If one feature has a range of 0-100 and another 0-1, the larger feature dominates the distance. Always scale features before using KNN.
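A quick numeric illustration of this effect, using made-up values and scikit-learn's StandardScaler: before scaling, the 0-100 feature dominates the distance; after scaling, the 0-1 feature's gap is visible again.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different ranges: one 0-100, one 0-1
X = np.array([[20.0, 0.9],
              [80.0, 0.1],
              [25.0, 0.1]])

# Unscaled: the distance between rows 0 and 2 is dominated by the
# first feature's gap of 5; the 0.8 gap in the second feature barely registers
d_unscaled = np.linalg.norm(X[0] - X[2])

# Scaled: both features contribute on comparable terms
X_scaled = StandardScaler().fit_transform(X)
d_scaled = np.linalg.norm(X_scaled[0] - X_scaled[2])

print(round(d_unscaled, 2), round(d_scaled, 2))
```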

Code Implementation

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Dataset
data = {
    "Age": [22, 25, 28, 30, 32, 35, 40, 45, 50, 55],
    "Income_LPA": [16, 8, 10, 9, 15, 12, 22, 26, 10, 35],
    "Buys_House": [1, 0, 0, 0, 0, 1, 1, 1, 0, 1]
}
df = pd.DataFrame(data)

# Features and target
X = df[["Age", "Income_LPA"]]
y = df["Buys_House"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling (critical for KNN)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train KNN model
knn = KNeighborsClassifier(n_neighbors=2, weights="distance")
knn.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred = knn.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, zero_division=0)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
print("Classification Report:\n", report)

# Predict for a new person (use a DataFrame so feature names match what the scaler was fit on)
new_data = scaler.transform(pd.DataFrame([[21, 2]], columns=["Age", "Income_LPA"]))
prediction = knn.predict(new_data)
print("Can buy house" if prediction[0] == 1 else "Cannot buy house")

Choosing K
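A common way to choose K is to score several candidate values with cross-validation and keep the one with the best mean accuracy. The sketch below uses a synthetic dataset and an illustrative K range (both are assumptions, not part of the example above); odd K values avoid ties in binary voting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Synthetic 2-feature binary classification problem
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

# Score odd K values from 1 to 15 with 5-fold cross-validation
scores = {}
for k in range(1, 16, 2):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("Best K:", best_k, "accuracy:", round(scores[best_k], 3))
```

Scaling inside the pipeline keeps the scaler from seeing the validation folds during fitting.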

Key Parameters

Parameter    | Description
n_neighbors  | Number of neighbors (K). Default is 5.
weights      | "uniform" (all neighbors count equally) or "distance" (closer neighbors have more influence).
metric       | Distance metric: "euclidean", "manhattan", "minkowski".
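All three parameters are set in the KNeighborsClassifier constructor. A minimal sketch overriding the defaults (the specific values are just for illustration):

```python
from sklearn.neighbors import KNeighborsClassifier

# Distance-weighted voting with Manhattan distance instead of the defaults
knn = KNeighborsClassifier(n_neighbors=5, weights="distance", metric="manhattan")
print(knn.get_params()["weights"], knn.get_params()["metric"])
```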

When to Use KNN

Good For                        | Not Ideal For
Small to medium datasets        | Large datasets (slow prediction)
Non-linear decision boundaries  | High-dimensional data (curse of dimensionality)
Quick baseline model            | When prediction speed matters
Multi-class classification      | When features are not scaled

KNN becomes very slow with large datasets because it calculates distance to every training point for each prediction. For large datasets, consider using KD-trees or Ball trees (set algorithm="kd_tree").
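A minimal sketch of forcing a KD-tree index on a synthetic dataset (note that scikit-learn's default algorithm="auto" will often choose a tree-based structure on its own when the data suits it):

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Larger synthetic dataset where brute-force distance scans start to hurt
X, y = make_classification(n_samples=5000, n_features=5, random_state=0)

# Build a KD-tree index instead of scanning every training point per query
knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
knn.fit(X, y)
print(knn.predict(X[:3]))
```

KD-trees help most in low dimensions; in high dimensions their advantage over brute force fades.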

Tags: Classification, Regression, Supervised, Instance-based, Lazy Learner