ML Playground

A hands-on journey through Machine Learning, Deep Learning, NLP, and Prompt Engineering. 30 interactive notebooks you can read right here as web pages.

Author: Gokul

30 Notebooks · 7 Categories · 20+ Algorithms · 44 Total Topics

Learning Path

Intro → Preprocessing → Supervised → Unsupervised → Reinforcement → Deep Learning → NLP & Prompts
🎯

Supervised Learning

Learn from labeled data. Map inputs to known outputs — predicting prices, classifying spam, and more.

Linear Regression
Read

Fit a straight line to predict house prices. y = β₀ + β₁x. Train-test split, MSE, R² evaluation, regression line visualization.

Key Concepts
Simple Linear Regression Fit a line in 2D space for single-feature continuous prediction
Multiple Linear Regression Fit a hyperplane with multiple features
Ordinary Least Squares Minimize squared differences between actual and predicted
Gradient Descent Iteratively adjust coefficients to minimize error
MSE & R² Evaluate prediction error and variance explained
House Prices (10 samples) · sklearn
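The fit-evaluate loop above can be sketched in a few lines of sklearn. The house sizes and prices below are illustrative toy values, not the notebook's dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Toy data: size in sqft -> price in $1000s (illustrative values)
X = np.array([[600], [800], [1000], [1200], [1400], [1600], [1800], [2000]])
y = np.array([150, 190, 240, 280, 330, 360, 410, 450])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LinearRegression().fit(X_train, y_train)  # OLS fit: y = b0 + b1*x
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)  # average squared prediction error
r2 = r2_score(y_test, y_pred)             # variance explained
```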
Logistic Regression
Read

Classification using the sigmoid function. Binary & multi-class with one-vs-rest. Decision boundaries and log-odds transform.

Key Concepts
Sigmoid Function Transform linear output to [0,1] probability for classification
Log Loss Optimization Minimize cross-entropy loss for optimal classification
Feature Scaling Essential to ensure convergence and unbiased coefficients
Probability Thresholding Convert probabilities to class labels using 0.5 threshold
Decision Boundary Line separating classes in feature space
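A minimal sketch of the sigmoid → probability → threshold pipeline, using made-up hours-studied data (not the notebook's dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Toy binary data: hours studied -> pass (1) / fail (0)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [8.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

X_scaled = StandardScaler().fit_transform(X)   # scaling helps convergence
clf = LogisticRegression().fit(X_scaled, y)

proba = clf.predict_proba(X_scaled)[:, 1]      # sigmoid output in [0, 1]
labels = (proba >= 0.5).astype(int)            # threshold probabilities at 0.5
```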
Decision Tree
Read

Tree-shaped model splitting by questions. Gini impurity, Entropy, MSE criteria. Tree depth control with visual plot_tree diagram.

Key Concepts
Classification Tree Interpretable decisions through recursive feature-based splits
Regression Tree Predict continuous values using tree structure with leaf values
Gini Impurity Measure split quality by misclassification probability
Entropy / Information Gain Measure disorder to identify optimal splits
Max Depth Regulation Limit tree depth to prevent overfitting
Age & Income (7 samples) · sklearn
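The Gini criterion and depth control look like this in sklearn; the age/income rows here are illustrative stand-ins for the notebook's 7-sample dataset:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy age/income data (illustrative); label 1 = buys the product
X = [[25, 30], [30, 45], [35, 60], [40, 80], [45, 50], [50, 90], [55, 110]]
y = [0, 0, 0, 1, 0, 1, 1]

# max_depth caps recursive splitting to reduce overfitting; gini is the default criterion
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0).fit(X, y)
pred = tree.predict([[48, 85]])   # classify a new person
```

`plot_tree(tree)` from `sklearn.tree` then renders the same diagram the notebook shows.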
K-Nearest Neighbors (KNN)
Read

Classify by majority vote of K closest neighbors. Euclidean distance, why feature scaling is critical. Interactive new predictions.

Key Concepts
K-Nearest Neighbors Classify based on majority vote of K closest training points
Euclidean Distance Calculate distances to identify nearest neighbors
Feature Scaling Critical for distance-based algorithms to prevent feature dominance
K Selection Start with K=sqrt(n), use odd values for binary to prevent ties
Weighted KNN Give closer neighbors more influence in predictions
n_neighbors=2 · sklearn
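A short sketch showing why scaling matters: without it, the income column would dominate the Euclidean distance. Data is illustrative, not the notebook's:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Toy age/income data; label 1 = senior segment
X = [[25, 30000], [28, 35000], [30, 40000], [45, 80000], [50, 90000], [52, 95000]]
y = [0, 0, 0, 1, 1, 1]

scaler = StandardScaler().fit(X)  # put age and income on comparable scales
knn = KNeighborsClassifier(n_neighbors=3).fit(scaler.transform(X), y)  # odd K avoids ties

pred = knn.predict(scaler.transform([[27, 33000]]))  # majority vote of 3 nearest points
```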
Random Forest
Read

Ensemble of decision trees with bagging. A basic regression example plus an extended version with categorical encoding. Feature importance. R² = 0.96.

Key Concepts
Bagging Train trees on random data subsets with replacement for diversity
Random Feature Selection Consider random features per split to reduce tree correlation
Classification Majority voting across ensemble for robust predictions
Regression Average predictions across ensemble for robust continuous values
Feature Importance Rank features by contribution to reduce variance
R²=0.96 · sklearn
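Bagging plus averaged predictions and feature importances, sketched on synthetic data (the target function and seed are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3 * X[:, 0] + 5 * np.sin(X[:, 1]) + rng.normal(0, 0.5, 200)  # non-linear target

# 100 trees, each trained on a bootstrap sample; regression output is the average
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_  # normalized, sums to 1.0
```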
Support Vector Machine (SVM / SVR)
Read

Maximize margin between classes. Kernel trick (RBF), support vectors, ε-insensitive tube for regression. Apartment rent prediction.

Key Concepts
Margin Maximization Find optimal hyperplane that maximizes gap between classes
Support Vectors Closest data points to boundary that define the hyperplane
Linear Kernel Use for linearly separable data (fastest option)
RBF Kernel Map to infinite dimensions for non-linear separation
SVR (Regression) Fit hyperplane within epsilon-tube for regression tolerance
Apartments (7 samples) · sklearn
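An RBF-kernel classifier with scaling in a pipeline — a sketch with two made-up point clouds, not the notebook's apartment data:

```python
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Two toy classes; the RBF kernel can also handle non-linear boundaries
X = [[1, 1], [1, 2], [2, 1], [2, 2], [7, 7], [7, 8], [8, 7], [8, 8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)).fit(X, y)
n_support = clf.named_steps["svc"].support_vectors_.shape[0]  # points defining the margin
pred = clf.predict([[7.5, 7.5]])
```

Swapping `SVC` for `SVR(epsilon=...)` gives the ε-insensitive regression variant.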
Gradient Boosting
Read

End-to-end SMS spam detection. NLTK tokenization, stopword removal, TF-IDF vectorization, GradientBoostingClassifier. 97% accuracy.

Key Concepts
Sequential Boosting Train trees one after another, each correcting previous errors
Residual Learning New trees predict errors of the combined previous predictions
Learning Rate Shrinkage Scale each tree's contribution to prevent overfitting
Weak Base Learners Use shallow trees to benefit from the boosting effect
Early Stopping Stop training when validation performance plateaus
SMS Spam (5,572 samples) · 97% Accuracy
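The TF-IDF → boosted-trees pipeline in miniature. These six hand-written messages stand in for the 5,572-sample SMS corpus, so the numbers prove nothing — the structure is the point:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier

texts = [
    "win a free prize now", "claim your free cash reward", "urgent winner call now",
    "are we still meeting for lunch", "see you at the office tomorrow", "thanks for the notes",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(texts)

# Sequential trees: each new shallow tree corrects the residual errors of the ensemble
gb = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1, max_depth=2, random_state=0)
gb.fit(X.toarray(), labels)
pred = gb.predict(tfidf.transform(["free prize winner"]).toarray())
```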
XGBoost
Read

Extreme Gradient Boosting with L1 (Lasso) vs L2 (Ridge) regularization. Visualizes how regularization affects weights. Parallelization.

Key Concepts
XGBoost Optimized gradient boosting with regularization and parallelization
L1 Regularization (Lasso) Penalize absolute weights for automatic feature elimination
L2 Regularization (Ridge) Penalize squared weights to smoothly shrink without zeroing
Missing Value Handling Learns optimal direction for missing data automatically
Sparse Data Efficiency Efficiently handles sparse matrices like TF-IDF
House Prices · xgboost
Hyperparameter Tuning
Read

Three approaches compared: GridSearchCV (exhaustive), RandomizedSearchCV (sampling), Bayesian Optimization with Optuna (intelligent).

Key Concepts
Grid Search Test every hyperparameter combination (slow, but guaranteed to find the best option within the grid)
Random Search Sample random combinations efficiently (faster, comparable results)
Bayesian Optimization Intelligently search using probability to focus on promising regions
Cross-Validation Use multiple folds within tuning for robust estimates
Decision Tree params · sklearn, optuna
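The exhaustive variant in a few lines — a GridSearchCV sketch over the same kind of decision-tree grid, here on the built-in iris dataset rather than the notebook's data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"max_depth": [2, 3, 4], "min_samples_split": [2, 5]}

# Every combination (3 x 2 = 6 fits x 5 folds), scored with cross-validation
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
best = search.best_params_   # the winning combination
```

`RandomizedSearchCV` takes the same grid plus `n_iter`; Optuna replaces the grid with a sampled objective function.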
ML Evaluation Metrics
Read

Classification: Accuracy, Precision, Recall, F1, Confusion Matrix, ROC-AUC. Regression: MAE, MSE, RMSE, R². When to use which.

When to Use Which
Accuracy Correct prediction percentage (only reliable on balanced datasets)
Precision Optimize when false positives are costly (e.g. spam filter)
Recall Optimize when false negatives are costly (e.g. cancer detection)
F1 Score Harmonic mean balancing precision and recall for imbalanced data
ROC-AUC Performance across all thresholds (1.0=perfect, 0.5=random)
MAE / MSE / RMSE Regression error metrics in original units
R² Fraction of variance explained by the model (0=baseline, 1=perfect)
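The classification metrics above, computed on a hand-checkable toy example (3 TP, 3 TN, 1 FP, 1 FN):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)    # 6/8 correct = 0.75
prec = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/4
rec = recall_score(y_true, y_pred)      # TP / (TP + FN) = 3/4
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
cm = confusion_matrix(y_true, y_pred)   # rows = actual, cols = predicted
```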
Naive Bayes
Read

Probabilistic classifier based on Bayes' theorem. Gaussian, Multinomial, Bernoulli variants. Fast and effective for text, spam detection, sentiment. Laplace smoothing explained.

Key Concepts
Bayes' Theorem Calculate posterior probability from likelihood, prior, and evidence
Gaussian NB Use for continuous features following normal distributions
Multinomial NB Use for discrete count features like word frequencies in text
Bernoulli NB Use for binary feature presence/absence
Laplace Smoothing Add pseudocounts to prevent zero probabilities for unseen features
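Multinomial NB on word counts, with `alpha=1.0` as the Laplace smoothing term. The four messages are illustrative, not the notebook's data:

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

texts = ["free prize money", "win money now", "project meeting today", "lunch with the team"]
labels = [1, 1, 0, 0]  # 1 = spam

vec = CountVectorizer()
X = vec.fit_transform(texts)  # word counts: the input Multinomial NB expects

# alpha=1.0 adds one pseudocount per word, so unseen words never get zero probability
nb = MultinomialNB(alpha=1.0).fit(X, labels)
pred = nb.predict(vec.transform(["win free money"]))
```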
Cross-Validation
Read

K-Fold, Stratified K-Fold, Leave-One-Out, Time Series Split. Get reliable performance estimates instead of one lucky train-test split.

Key Concepts
K-Fold CV Split data into K folds, train K times with rotating test sets
Stratified K-Fold Maintain class distribution in each fold (essential for imbalanced data)
Leave-One-Out (LOO) Use each single sample as a test set (thorough but computationally expensive)
Time Series Split Expanding train window for temporal data (preserve causality)
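Stratified K-Fold in practice — five scores instead of one lucky split. Iris stands in for whatever dataset you are evaluating:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5 folds, each preserving iris's 50/50/50 class balance
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
mean_score = scores.mean()   # report mean ± std, not a single number
```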
Feature Selection
Read

Filter, wrapper, and embedded methods. SelectKBest, RFE, tree importance, Lasso L1. Pick the best features and drop the noise.

Key Concepts
Filter Methods (SelectKBest) Score features independently using statistical tests (fast)
Mutual Information Measure information dependency between features and target
RFE (Recursive Elimination) Iteratively remove least important features from model
Tree-Based Importance Extract importance scores from tree models for ranking
L1 Lasso Automatically drives unimportant weights to zero
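A filter-method sketch: SelectKBest scores each feature independently with an ANOVA F-test and keeps the top k. Iris is used here as a stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)   # 150 samples, 4 features

# Score every feature against the target, keep the 2 strongest
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)
kept = selector.get_support()       # boolean mask of selected features
```

For iris this keeps the two petal measurements, which dominate the F-scores.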
ARIMA & Prophet
Read

Classical time series forecasting. ARIMA(p,d,q), stationarity testing, auto_arima. Facebook Prophet with holidays, multiple seasonality, component plots.

Key Concepts
AR (AutoRegressive) Use past values to forecast future values
I (Integrated) Difference data to remove trends and achieve stationarity
MA (Moving Average) Incorporate past forecast errors for prediction
SARIMA Extend ARIMA with seasonal components (P,D,Q,s)
Auto ARIMA Automatically determine optimal (p,d,q) parameters
Facebook Prophet User-friendly forecasting with automatic trend and seasonality
🔭

Unsupervised Learning

Discover hidden patterns in unlabeled data. Clustering, dimensionality reduction, and association rules.

K-Means Clustering
Read

Centroid-based partitioning. k-means++ init, Elbow method for K, Silhouette score. Iterative assignment & update until convergence.

Key Concepts
K-Means Clustering Partition data into K clusters by minimizing within-cluster variance
Elbow Method Determine optimal K by plotting inertia vs number of clusters
Silhouette Score Evaluate cluster quality from -1 to 1
k-means++ Smarter centroid initialization for better starting positions
300 samples, 3 clusters · sklearn
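The assign-update loop, k-means++ init, and both quality measures in one sketch (synthetic blobs, matching the card's 300-sample setup):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# init="k-means++" is the sklearn default; n_init reruns from different seeds
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

score = silhouette_score(X, km.labels_)  # closer to 1 = tight, well-separated clusters
inertia = km.inertia_                    # within-cluster sum of squares (Elbow plots use this)
```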
Hierarchical Clustering
Read

Agglomerative (bottom-up) & Divisive (top-down). Linkage methods: Single, Complete, Average, Ward. Dendrogram tree diagrams.

Key Concepts
Agglomerative Bottom-up approach starting from individual points and merging
Divisive Top-down approach starting from one cluster and splitting
Ward's Linkage Minimizes total within-cluster variance (most popular)
Dendrogram Tree visualization showing hierarchical cluster structure
Age & Income (10 samples) · scipy, sklearn
DBSCAN
Read

Density-based clustering. No need to specify K. Finds arbitrary shapes, marks outliers as noise (-1). ε and MinPts parameters.

Key Concepts
Core Points Points with at least MinPts neighbors within eps radius
Border Points Within eps of core points but with fewer than MinPts neighbors
Epsilon (eps) Radius parameter defining neighborhood size around each point
Noise Detection Automatically marks outliers that don't belong to any cluster
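Two dense blobs plus one planted outlier show eps/MinPts and the -1 noise label. The blob centers and outlier position are illustrative choices:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=[[0, 0], [5, 5]], cluster_std=0.4, random_state=7)
X = np.vstack([X, [[20.0, 20.0]]])   # one far-away outlier

# eps = neighborhood radius, min_samples = MinPts
db = DBSCAN(eps=0.8, min_samples=5).fit(X)
labels = db.labels_                  # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

Note that K was never specified; the two clusters fall out of the density structure.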
Gaussian Mixture Models (GMM)
Read

Soft probabilistic clustering via EM algorithm. Each point gets a probability per cluster. Gaussian distributions with μ, Σ, π.

Key Concepts
Soft Clustering Assigns probability of belonging to each cluster (not hard labels)
EM Algorithm Iteratively estimates Gaussian parameters to maximize likelihood
Covariance Types Full, tied, diag, or spherical options for different cluster shapes
Component Weights Learned proportions of data belonging to each Gaussian
300 samples, 3 clusters · sklearn
Mean Shift Clustering
Read

Density peak-seeking. Each point shifts towards the mean of nearby points. Auto-discovers cluster count. Bandwidth parameter.

Key Concepts
Density Peak Seeking Iteratively shift points toward regions of highest density
Bandwidth Parameter Controls window size: small = many clusters, large = fewer
Auto Cluster Detection No need to pre-specify number of clusters
Arbitrary Shapes Can find non-spherical cluster shapes
300 samples, 3 clusters · sklearn
Principal Component Analysis (PCA)
Read

Dimensionality reduction via maximum variance projection. MNIST 64D → 2D. Eigenvalues, eigenvectors, linear transformation.

Key Concepts
PCA Find axes capturing maximum variance for dimensionality reduction
Explained Variance Ratio Shows how much information each component captures
Eigenvalues & Eigenvectors Eigenvectors are principal components, eigenvalues show importance
Feature Scaling Essential preprocessing to standardize data before PCA
MNIST (1,797 samples, 64→2D) · sklearn
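The 64D → 2D projection on sklearn's built-in digits dataset (the same 1,797-sample set the card describes), with scaling first:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)       # 1,797 samples, 64 pixel features
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)        # project onto the top-2 variance axes
explained = pca.explained_variance_ratio_ # information captured per component
```

Scatter-plotting `X_2d` colored by `y` reproduces the notebook's 2D digit map.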
Association Rule Mining
Read

Market basket analysis. Apriori, Eclat, FP-Growth compared. Support, Confidence, Lift metrics. Real-world product bundling.

Key Concepts
Apriori Generate frequent itemsets level-by-level using pruning
FP-Growth Compressed tree-based method, much faster than Apriori
Support Frequency of itemset in all transactions
Confidence How often a rule is correct (antecedent → consequent)
Lift Strength of association compared to random chance (>1 is positive)
4 transactions · mlxtend
Isolation Forest
Read

Anomaly detection by random partitioning. Anomalies are isolated with fewer splits. Anomaly scores, contamination parameter. Fraud and intrusion detection.

Key Concepts
Isolation Forest Detect anomalies by isolating outliers with random partitions
Anomaly Score Normalized path length showing how easily a point is isolated
Random Partitioning Uses feature selection and split values to isolate points
Contamination Expected proportion of outliers in dataset
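A sketch with 200 normal points and two planted anomalies; `contamination` tells the model roughly what fraction to flag. Seed and outlier positions are illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_normal = rng.normal(0, 1, size=(200, 2))
X_outliers = np.array([[6.0, 6.0], [-7.0, 5.0]])   # far from the cloud
X = np.vstack([X_normal, X_outliers])

# contamination = expected proportion of anomalies in the data
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
pred = iso.predict(X)           # -1 = anomaly, 1 = normal
scores = iso.score_samples(X)   # lower = isolated in fewer random splits = more anomalous
```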
Recommendation Systems
Read

Collaborative filtering (user-user, item-item), content-based filtering with TF-IDF, matrix factorization. Build Netflix/Amazon-style recommendations.

Key Concepts
Collaborative Filtering "Users who liked what you liked also liked this"
Content-Based Filtering "Because you liked X, here's another similar X"
User-User Similarity Find similar users and recommend their liked items
Item-Item Similarity Find items similar to what user already liked
Matrix Factorization Decompose sparse user-item matrix into latent factors
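User-user similarity from first principles: cosine similarity over a tiny made-up rating matrix (rows = users, columns = items, 0 = unrated):

```python
import numpy as np

R = np.array([
    [5, 4, 0, 1],   # user 0
    [4, 5, 1, 0],   # user 1: similar taste to user 0
    [1, 0, 5, 4],   # user 2
    [0, 1, 4, 5],   # user 3
], dtype=float)

def cosine_sim(a, b):
    # angle-based similarity, insensitive to rating magnitude
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# User-user collaborative filtering: find the user most similar to user 0,
# then recommend items that user rated highly
sims = np.array([cosine_sim(R[0], R[u]) for u in range(1, 4)])
most_similar = 1 + int(np.argmax(sims))
```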
🎮

Reinforcement Learning

An agent learns by trial and error, receiving rewards for good actions and penalties for bad ones.

Introduction to Reinforcement Learning
Read

Core vocabulary: Agent, Environment, State, Action, Reward, Policy, Episode. Model-free vs model-based. Real-world examples.

Key Concepts
Agent-Environment Loop Agent observes state, takes action, receives reward, updates policy
Policy Strategy mapping states to actions; goal is to find the optimal one
Reward Signal Feedback indicating quality of actions; guides learning
Exploration vs Exploitation Tradeoff between trying new actions and using best known
Model-Free vs Model-Based Learning from experience vs planning with environment model
Q-Learning
Read

Model-free RL with Q-table. 4×4 grid world navigation. Bellman equation, epsilon-greedy exploration. 500 episodes to optimal path.

Key Concepts
Q-Table Stores expected cumulative reward for each state-action pair
Bellman Equation Update rule for Q-values incorporating rewards and future values
Epsilon-Greedy Balance exploration (random) vs exploitation (best known)
Off-Policy Learning Learns optimal policy regardless of current exploration behavior
4×4 Grid · 500 Episodes
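The Q-table, Bellman update, and epsilon-greedy loop in pure numpy. To keep it short, this sketch uses a 1-D corridor (states 0–4, goal at 4) instead of the notebook's 4×4 grid; the hyperparameters are illustrative:

```python
import numpy as np

n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions)) # expected return per state-action pair
alpha, gamma, eps = 0.5, 0.9, 0.2
rng = np.random.default_rng(0)

for _ in range(500):                # episodes
    s = 0
    while s != 4:
        # epsilon-greedy: explore with prob eps, else take the best known action
        a = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == 4 else 0.0
        # Bellman update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

best_action_at_start = int(np.argmax(Q[0]))   # learned policy: go right
```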
Deep Q-Network (DQN)
Read

Q-Learning + neural networks for large state spaces. Experience replay, target networks, epsilon-greedy. The approach that mastered Atari games.

Key Concepts
Deep Q-Network Combines Q-Learning with neural networks for large state spaces
Experience Replay Store and randomly sample past transitions for stable training
Target Network Separate network updated infrequently for stability
Neural Approximation Replaces Q-table with network predicting Q-values
🧠

Deep Learning

Neural networks with multiple layers. From basic perceptrons to CNNs for images and LSTMs for sequences.

Deep Learning Overview
Read

Neural network fundamentals, activation functions (ReLU, Sigmoid, Tanh, Softmax), backpropagation, common architectures.

Key Concepts
ANN Multi-layer networks with input, hidden, output layers
CNN Spatial pattern detection using filters/kernels for images
RNN Process sequences with memory via hidden states
Transformers Self-attention based architecture, parallelizable across sequence
Backpropagation Algorithm for computing gradients and updating weights
Neural Network Basics (ANN)
Read

MNIST digit classification. 784 → 128 (ReLU) → 64 (ReLU) → 10 (Softmax). Adam optimizer. 97.78% accuracy in 10 epochs.

Key Concepts
Feedforward Network Data flows one direction through layers with no recurrence
ReLU Activation max(0, x) introduces non-linearity in hidden layers
Softmax Activation Converts outputs to probabilities summing to 1 for multi-class
Adam Optimizer Adaptive learning rate combining momentum with adaptive estimates
MNIST (60K train) · 97.78% Accuracy
CNN Image Classification
Read

CIFAR-10 color images. Conv2D → MaxPool → Conv2D → MaxPool → Conv2D → Dense. Feature extraction, translation invariance. 71.86% accuracy.

Key Concepts
Conv2D Layers Filter slides across image computing dot products at each position
MaxPooling Takes maximum value in window, reducing spatial dimensions
Feature Maps Output of convolutional layers representing learned features
Hierarchical Features Early layers learn edges, deeper layers learn objects
CIFAR-10 (50K, 32×32 RGB) · 71.86% Accuracy
RNN Sequence Modeling
Read

Next-word prediction. Embedding → SimpleRNN(50) → Dense. N-gram sequences, hidden states, vanishing gradient problem.

Key Concepts
Simple RNN Processes sequences step-by-step with hidden state carrying info
Recurrent Connection Previous hidden state feeds as input to next step
Embedding Layer Converts word indices to dense vector representations
Vanishing Gradient Information loss in long sequences (solved by LSTM)
Text Prediction · 200 Epochs
LSTM (Long Short-Term Memory)
Read

Gated architecture solving vanishing gradients. Input, Forget, Output gates. Cell state as long-term memory. Text generation.

Key Concepts
Forget Gate Decides what past information to discard from cell state
Input Gate Decides what new information to add to cell state
Output Gate Decides what cell state information to output
Cell State "Conveyor belt" carrying information through sequence unchanged
Text Generation · TensorFlow/Keras
LSTM for Time Series
Read

Sales forecasting with stacked LSTMs. LSTM(50) → LSTM(25) → Dense(1). Sliding window lookback=5, MinMaxScaler.

Key Concepts
Stacked LSTM Multiple LSTM layers with return_sequences for deeper temporal learning
Sliding Window Uses lookback window of past values to predict next
MinMaxScaler Scales values to [0,1] range for neural network training
MSE Loss Mean Squared Error appropriate for regression tasks
100 sales points · 50 Epochs
Multi-Layer Perceptron (MLP)
Read

MLPRegressor for house prices. Hidden (64, 32) with ReLU, Adam. Why StandardScaler is essential for NNs. R² = 0.97.

Key Concepts
MLP Fully connected layers for supervised learning on tabular data
Hidden Layers Intermediate layers learning non-linear feature transformations
Feature Scaling StandardScaler normalizes features, essential for neural networks
Linear Output No activation for regression; softmax/sigmoid for classification
10 houses · R²=0.97
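The (64, 32) architecture with ReLU and Adam, sketched on synthetic data rather than the notebook's 10 houses (10 samples are too few for a reproducible demo):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = 2 * np.sin(X[:, 0]) + rng.normal(0, 0.1, 300)   # non-linear target

X_scaled = StandardScaler().fit_transform(X)        # scaling is essential for NN training
mlp = MLPRegressor(hidden_layer_sizes=(64, 32), activation="relu",
                   solver="adam", max_iter=2000, random_state=0)
mlp.fit(X_scaled, y)
r2 = mlp.score(X_scaled, y)   # R² on the training data
```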
GANs (Generative Adversarial Networks)
Read

Generator vs Discriminator competing to produce realistic data. DCGAN, WGAN, CycleGAN, StyleGAN variants. Training challenges: mode collapse, instability.

Key Concepts
Generator Creates synthetic data from random noise trying to fool discriminator
Discriminator Classifies real vs fake data, evaluating generator quality
Minimax Game Generator minimizes, discriminator maximizes loss function
Mode Collapse Generator learns limited diversity of outputs (training problem)
Training Balance Networks must stay balanced; discriminator can't be too strong/weak
Autoencoders
Read

Encoder-decoder networks learning compressed representations. Vanilla, Denoising, Variational (VAE), Convolutional variants. Anomaly detection, image denoising.

Key Concepts
Encoder Compresses input into lower-dimensional bottleneck representation
Decoder Reconstructs input from bottleneck latent representation
Bottleneck Layer Forced compression learning most important features
Denoising Autoencoder Learns to reconstruct clean data from corrupted input
Reconstruction Loss Difference between input and output drives learning
Transfer Learning
Read

Reuse pre-trained models (ResNet, VGG, MobileNet, EfficientNet). Feature extraction vs fine-tuning strategies. When to freeze, when to unfreeze.

Key Concepts
Feature Extraction Freeze pre-trained layers, add custom head for new task
Fine-Tuning Unfreeze and retrain some/all layers with low learning rate
Pre-trained Models Leverage knowledge from ImageNet or other large datasets
Early Layers Learn universal features (edges, textures) shared across tasks
Later Layers Learn task-specific features needing adaptation
Transformer Architecture
Read

Self-attention, multi-head attention, positional encoding. The architecture behind GPT, BERT, and all modern LLMs. Encoder-decoder explained step by step.

Key Concepts
Self-Attention Each position attends to all other positions computing relevance
Multi-Head Attention Multiple attention heads learning different relationships
Query, Key, Value Three learned projections for attention computation
Positional Encoding Adds position information since processing is fully parallel
Encoder-Decoder Encoder processes input, decoder generates output
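The Q/K/V mechanics above reduce to a few matrix products. A single-head, scaled dot-product attention sketch in numpy, with random weights standing in for learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model = 4, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))              # token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv                     # learned projections
scores = Q @ K.T / np.sqrt(d_model)                  # relevance of every position to every other
weights = softmax(scores, axis=-1)                   # each row is a distribution over positions
out = weights @ V                                    # weighted mix of value vectors
```

Multi-head attention runs several such blocks in parallel with smaller per-head dimensions and concatenates the outputs.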
Regularization Techniques
Read

Prevent overfitting. L1/L2 regularization, Dropout, Batch Normalization, Early Stopping, Data Augmentation, Weight Decay. When to use each.

When to Use Each
L1 (Lasso) Adds sum of absolute weights; drives some to zero for sparsity
L2 (Ridge) Adds sum of squared weights; shrinks all proportionally
Dropout Randomly deactivates neurons during training, forcing redundancy
Batch Normalization Normalizes layer outputs to zero mean / unit variance
Early Stopping Stop training when validation loss increases
Data Augmentation Create synthetic variations of training data
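L1 vs L2 side by side: with only 2 of 10 features informative, Lasso zeroes the rest while Ridge merely shrinks them. The synthetic data and alpha values are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, 100)   # only features 0 and 1 matter

lasso = Lasso(alpha=0.1).fit(X, y)   # L1: drives irrelevant weights to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks all weights, none to zero

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
```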
🚀

Advanced Topics

NLP and Prompt Engineering — where classical ML meets modern AI.

NLP Fundamentals
Read

Full pipeline: tokenization, stopwords, stemming vs lemmatization, BoW, TF-IDF. Word embeddings: Word2Vec, GloVe, FastText, BERT.

Key Concepts
Tokenization Split text into discrete units (words, characters, subwords)
Bag of Words Count word frequencies ignoring order and context
TF-IDF Weight terms by importance in document vs corpus
Word2Vec Learn dense embeddings capturing semantic relationships
BERT Embeddings Contextual embeddings where same word gets different vectors per context
NLTK, sklearn, gensim · Text → Vectors
Prompt Engineering
Read

The most comprehensive notebook. Zero/few-shot, Chain-of-Thought, Tree of Thoughts, ReAct, Reflexion. APIs for GPT-4, Claude, Gemini. Debugging & domain-specific prompts.

Key Techniques
Zero-Shot Task performance with instructions only, no examples needed
Few-Shot Provide 2-5 examples showing expected pattern and format
Chain-of-Thought Instruct model to reason step-by-step before answering
Self-Consistency Ask same question multiple times, majority vote on answer
Tree of Thoughts Explore multiple reasoning branches, evaluating and pruning the most promising paths
16+ sections · Most comprehensive
BERT / GPT Fine-Tuning
Read

HuggingFace Transformers for sentiment classification, NER, QA. Pre-trained BERT fine-tuning with Trainer API. Pipeline for instant inference.

Key Concepts
BERT Fine-Tuning Adapt bidirectional transformer for classification, NER, QA
GPT Fine-Tuning Adapt autoregressive transformer for generation tasks
Low Learning Rate Use 2e-5 to 5e-5 to preserve pre-trained knowledge
HuggingFace Load 100K+ pre-trained models and tokenizers
Classification Heads Remove pre-training head, add custom layers for your task
Model Deployment
Read

Notebook to production. Save models (joblib, ONNX), build APIs (FastAPI, Flask), Dockerize, deploy to cloud. Input validation and monitoring.

Key Concepts
Model Serialization Save trained model with joblib, pickle, or ONNX
FastAPI / Flask Create REST endpoints wrapping model for predictions
Docker Package model and dependencies for reproducible deployment
Cloud Deployment Heroku, AWS Lambda, EC2, GCP for different needs
Monitoring Track prediction latency, error rates, data/model drift
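The core deployment round-trip is serialize → reload → verify identical predictions. A joblib sketch (iris and logistic regression stand in for whatever model you ship):

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Dump to disk, reload, and check the restored model predicts identically
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, path)
restored = joblib.load(path)
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
```

In production the `restored` object is what a FastAPI/Flask endpoint loads at startup and calls inside the request handler.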