Data Preprocessing
Cleaning and transforming raw data so ML algorithms can learn effectively.
What is Data Preprocessing?
Real-world data is messy: missing values, inconsistent formats, different scales. Preprocessing cleans and organizes data so models can learn patterns correctly and efficiently.
Garbage in, garbage out. The quality of your preprocessing directly determines the quality of your model.
Why It Matters
- Most ML algorithms cannot handle blank or NaN values
- Features with large values can overpower features with small values
- Text categories must be converted to numbers
- Clean data = faster training + better accuracy
1. Handling Missing Data
Missing values break most ML algorithms. You have two options: remove or fill.
Remove Rows
Drop rows with missing values. Use this only when few rows are affected and the dropped data is not critical.
Fill (Imputation)
Mean: Average of the column (numeric).
Median: Middle value (good with outliers).
Mode: Most frequent value (categorical).
import pandas as pd
# Load dataset
data = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
# Check for missing values
data.isnull().sum()
# Fill missing Age with mean
mean_age = data['Age'].mean()
data['Age'] = data['Age'].fillna(mean_age)
# Drop columns with too many missing values
data.drop(columns=['Cabin'], inplace=True)
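Median and mode fills follow the same pattern as the mean fill above. A minimal sketch on a toy DataFrame (the column names mirror Titanic but the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "Fare": [7.25, 71.28, None, 8.05],   # numeric, with an outlier
    "Embarked": ["S", "C", None, "S"],   # categorical
})

# Median: robust to the 71.28 outlier (mean would be pulled upward)
df["Fare"] = df["Fare"].fillna(df["Fare"].median())

# Mode: most frequent category ("S" appears twice)
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

print(df.isnull().sum().sum())  # 0 — no missing values remain
```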
2. Feature Selection
Not all columns are useful. Drop irrelevant features to reduce noise and speed up training.
# Drop irrelevant columns
columns_to_drop = ['PassengerId', 'Pclass', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Embarked']
data.drop(columns=columns_to_drop, inplace=True)
# Also drop Name (not useful for prediction)
data.drop(columns=['Name'], inplace=True)
print(data.columns)
# Output: Index(['Survived', 'Sex', 'Age'], dtype='object')
3. Encoding Categorical Data
ML algorithms only understand numbers. Convert text categories to numeric values.
Label Encoding
Assign a number to each category. Use when categories have an inherent order, or for binary columns.
# Label encode binary column
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})
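For categories with a genuine order, you can make the ordering explicit instead of relying on whatever numbers a mapper assigns. A sketch with a hypothetical Size column (not part of the Titanic data):

```python
import pandas as pd

df = pd.DataFrame({"Size": ["small", "large", "medium", "small"]})

# Declare the order explicitly: small < medium < large
size_order = ["small", "medium", "large"]
df["Size"] = pd.Categorical(df["Size"], categories=size_order, ordered=True).codes

print(df["Size"].tolist())  # [0, 2, 1, 0]
```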
One-Hot Encoding
Create a new binary column for each category. Use when categories have no inherent order.
df = pd.DataFrame({'Color': ['red', 'green', 'blue']})
# One-hot encode
df = pd.get_dummies(df, columns=['Color'], dtype=int)
# Result: Color_blue, Color_green, Color_red (each 0 or 1;
# without dtype=int, recent pandas versions return True/False)
Label encoding can mislead models into thinking categories have order (e.g., red=0 < blue=2). Use one-hot encoding for nominal categories with no natural order.
4. Feature Scaling / Normalization
Features on different scales can cause algorithms (especially distance-based ones like KNN, SVM) to give disproportionate weight to large-valued features.
Min-Max Scaling
Scales values to [0, 1]. Formula: (x - min) / (max - min)
Standardization (Z-score)
Centers data to mean=0, std=1. Formula: (x - mean) / std
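Both formulas can be applied by hand, but scikit-learn provides ready-made scalers. A sketch on a toy Age column (assumes scikit-learn is installed):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = np.array([[10.0], [20.0], [30.0], [40.0]])

# Min-max: (x - min) / (max - min), maps values into [0, 1]
minmax = MinMaxScaler().fit_transform(ages)
print(minmax.ravel())  # [0.     0.3333 0.6667 1.    ]

# Z-score: (x - mean) / std, centers to mean 0 and std 1
standard = StandardScaler().fit_transform(ages)
print(standard.ravel().round(2))  # [-1.34 -0.45  0.45  1.34]
```

In practice, fit the scaler on the training set only and reuse it to transform the test set, so test statistics never leak into training.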
5. Feature Engineering
Create new features from existing ones to give the model more useful information.
# Create age groups from a continuous Age column
def age_group(age):
    if age < 10:
        return "Child"
    elif age <= 18:
        return "Teen"
    elif age < 40:
        return "Young"
    else:
        return "Senior"
data["AgeGroup"] = data["Age"].apply(age_group)
Preprocessing Pipeline Summary
Steps in Order
- Load data and inspect with .head(), .info(), .isnull().sum()
- Handle missing values — fill or drop
- Drop irrelevant columns
- Encode categorical features — label or one-hot encoding
- Scale/normalize numerical features
- Engineer new features if helpful
- Split into train/test sets
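The steps above can be strung together end to end. A compact sketch on a toy stand-in for the cleaned Titanic frame (Survived, Sex, Age), assuming pandas and scikit-learn:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned Titanic columns
data = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0, 27.0, None],
})

data["Age"] = data["Age"].fillna(data["Age"].mean())      # handle missing values
data["Sex"] = data["Sex"].map({"male": 0, "female": 1})   # encode categorical

X = data[["Sex", "Age"]]
y = data["Survived"]

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
print(X_train.shape, X_test.shape)  # (4, 2) (2, 2)
```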