Data Preprocessing
Cleaning and transforming raw data so ML algorithms can learn effectively.
What is Data Preprocessing?
Real-world data is messy: missing values, inconsistent formats, different scales. Preprocessing cleans and organizes data so models can learn patterns correctly and efficiently.
Garbage in, garbage out. The quality of your preprocessing directly determines the quality of your model.
Why It Matters
- Most ML algorithms cannot handle blank or NaN values
- Features with large values can overpower features with small values
- Text categories must be converted to numbers
- Clean data = faster training + better accuracy
1. Handling Missing Data
Missing values break most ML algorithms. You have two options: remove or fill.
Remove Rows
Drop rows with missing values. Use this only when few rows are affected and the dropped data is not critical.
Fill (Imputation)
Mean: Average of the column (numeric).
Median: Middle value (good with outliers).
Mode: Most frequent value (categorical).
import pandas as pd
# Load dataset
data = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
# Check for missing values
data.isnull().sum()
# Fill missing Age with mean
mean_age = data['Age'].mean()
data['Age'] = data['Age'].fillna(mean_age)
# Drop columns with too many missing values
data.drop(columns=['Cabin'], inplace=True)
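Median and mode fills follow the same pattern as the mean fill above. A minimal sketch on a toy DataFrame (the column names mirror Titanic but the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "Fare": [7.25, 71.28, None, 8.05],   # numeric, with an outlier
    "Embarked": ["S", "C", None, "S"],   # categorical
})

# Median: robust to the 71.28 outlier (mean would be pulled upward)
df["Fare"] = df["Fare"].fillna(df["Fare"].median())

# Mode: most frequent category ("S" appears twice)
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

print(df.isnull().sum().sum())  # 0 — no missing values remain
```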
2. Feature Selection
Not all columns are useful. Drop irrelevant features to reduce noise and speed up training.
# Drop irrelevant columns
columns_to_drop = ['PassengerId', 'Pclass', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Embarked']
data.drop(columns=columns_to_drop, inplace=True)
# Also drop Name (not useful for prediction)
data.drop(columns=['Name'], inplace=True)
print(data.columns)
# Output: Index(['Survived', 'Sex', 'Age'], dtype='object')
3. Encoding Categorical Data
ML algorithms only understand numbers. Convert text categories to numeric values.
Label Encoding
Assign a number to each category. Use when categories have an inherent order, or for binary columns.
# Label encode binary column
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})
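For categories with a genuine order, you can make the ordering explicit instead of relying on whatever numbers a mapper assigns. A sketch with a hypothetical Size column (not part of the Titanic data):

```python
import pandas as pd

df = pd.DataFrame({"Size": ["small", "large", "medium", "small"]})

# Declare the order explicitly: small < medium < large
size_order = ["small", "medium", "large"]
df["Size"] = pd.Categorical(df["Size"], categories=size_order, ordered=True).codes

print(df["Size"].tolist())  # [0, 2, 1, 0]
```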
One-Hot Encoding
Create a new binary column for each category. Use when categories have no inherent order.
df = pd.DataFrame({'Color': ['red', 'green', 'blue']})
# One-hot encode
df = pd.get_dummies(df, columns=['Color'], dtype=int)
# Result: Color_blue, Color_green, Color_red (each 0 or 1;
# without dtype=int, recent pandas versions return True/False)
Label encoding can mislead models into thinking categories have order (e.g., red=0 < blue=2). Use one-hot encoding for nominal categories with no natural order.
4. Feature Scaling / Normalization
Features on different scales can cause algorithms (especially distance-based ones like KNN, SVM) to give disproportionate weight to large-valued features.
Min-Max Scaling
Scales values to [0, 1]. Formula: (x - min) / (max - min)
Standardization (Z-score)
Centers data to mean=0, std=1. Formula: (x - mean) / std
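Both formulas can be applied by hand, but scikit-learn provides ready-made scalers. A sketch on a toy Age column (assumes scikit-learn is installed):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = np.array([[10.0], [20.0], [30.0], [40.0]])

# Min-max: (x - min) / (max - min), maps values into [0, 1]
minmax = MinMaxScaler().fit_transform(ages)
print(minmax.ravel())  # [0.     0.3333 0.6667 1.    ]

# Z-score: (x - mean) / std, centers to mean 0 and std 1
standard = StandardScaler().fit_transform(ages)
print(standard.ravel().round(2))  # [-1.34 -0.45  0.45  1.34]
```

In practice, fit the scaler on the training set only and reuse it to transform the test set, so test statistics never leak into training.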
5. Feature Engineering
Create new features from existing ones to give the model more useful information.
# Create age groups from a continuous Age column
def age_group(age):
    if age < 10:
        return "Child"
    elif age <= 18:
        return "Teen"
    elif age < 40:
        return "Young"
    else:
        return "Senior"
data["AgeGroup"] = data["Age"].apply(age_group)
Preprocessing Pipeline Summary
Steps in Order
- Load data and inspect with .head(), .info(), .isnull().sum()
- Handle missing values — fill or drop
- Drop irrelevant columns
- Encode categorical features — label or one-hot encoding
- Scale/normalize numerical features
- Engineer new features if helpful
- Split into train/test sets
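The steps above can be strung together end to end. A compact sketch on a toy stand-in for the cleaned Titanic frame (Survived, Sex, Age), assuming pandas and scikit-learn:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned Titanic columns
data = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0, 27.0, None],
})

data["Age"] = data["Age"].fillna(data["Age"].mean())      # handle missing values
data["Sex"] = data["Sex"].map({"male": 0, "female": 1})   # encode categorical

X = data[["Sex", "Age"]]
y = data["Survived"]

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
print(X_train.shape, X_test.shape)  # (4, 2) (2, 2)
```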