A subfield of machine learning that uses multi-layer neural networks to learn complex patterns from raw data.
Deep Learning uses artificial neural networks with multiple layers (hence "deep") to learn complex patterns from data. Unlike classical ML, deep learning automatically learns features from raw data -- pixels, text, or audio -- without manual feature engineering.
Core idea: loosely inspired by how the brain processes signals through layers of neurons, deep learning models stack layers of artificial neurons, each transforming the output of the previous layer.
| Algorithm | Full Form | Used For | Key Idea |
|---|---|---|---|
| ANN | Artificial Neural Network | Tabular data, basic problems | Input -> hidden -> output layers |
| DNN | Deep Neural Network | Any complex task | ANN with many hidden layers |
| MLP | Multilayer Perceptron | Classification & regression | Fully connected layers, no memory |
| CNN | Convolutional Neural Network | Image & video | Filters/kernels detect spatial patterns |
| RNN | Recurrent Neural Network | Time-series & sequences | Has memory; passes info through time |
| LSTM | Long Short-Term Memory | Long sequences | Improved RNN with memory gates |
| Transformers | -- | NLP, vision, LLMs | Self-attention; parallel processing |
| GANs | Generative Adversarial Networks | Image generation | Generator vs Discriminator |
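The layered "input -> hidden -> output" structure shared by these architectures can be sketched as a tiny forward pass. The weights, biases, and input below are made-up toy numbers, not values from any trained model:

```python
def relu(xs):
    # ReLU activation applied element-wise
    return [max(0.0, v) for v in xs]

def dense(x, weights, biases):
    # Fully connected layer: weighted sum w . x + b for each output neuron
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
            for w, b in zip(weights, biases)]

# Toy MLP: 2 inputs -> 3 hidden neurons (ReLU) -> 1 output
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.7, -0.5, 0.2]]
b2 = [0.05]

x = [1.0, 2.0]
hidden = relu(dense(x, W1, b1))
output = dense(hidden, W2, b2)
```

A real framework would also learn `W1`, `b1`, `W2`, `b2` via backpropagation; this only shows the forward computation.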
After computing the weighted sum in a neuron, the activation function decides whether the neuron should "fire". Without it, stacking layers is pointless: the whole network collapses into a single linear function.
Sigmoid: outputs values between 0 and 1. Used in the output layer for binary classification. Limitation: vanishing gradients in deep networks.
ReLU: f(x) = max(0, x). The default for hidden layers; fast and simple. Limitation: "dead" neurons when a neuron's input is always negative.
Leaky ReLU: f(x) = x if x > 0, else 0.01x. Allows a small negative output to fix the dead-neuron problem.
Softmax: converts raw outputs (logits) into probabilities that sum to 1. Used in the output layer for multi-class classification.
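All four activations are simple enough to implement directly; a minimal pure-Python sketch:

```python
import math

def sigmoid(x):
    # Squashes any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    # Zero for negative inputs, identity for positive
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small negative slope avoids "dead" neurons
    return x if x > 0 else alpha * x

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

Note that softmax operates on a whole vector of logits, while the other three are applied element-wise.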
A loss function measures how far the model's prediction is from the actual value. Smaller loss means better performance. The optimizer uses the loss to adjust weights via backpropagation.
For regression: MSE (Mean Squared Error) and MAE (Mean Absolute Error).
For classification: Binary Cross-Entropy (2 classes) and Categorical Cross-Entropy (multi-class).
An optimizer updates weights to minimize the loss function. It uses gradients computed via backpropagation.
| Optimizer | How It Works | When to Use |
|---|---|---|
| Gradient Descent | Updates after full dataset pass | Small datasets |
| SGD | Updates after each sample | Large datasets, noisy but fast |
| Mini-Batch GD | Updates after a batch (e.g., 32) | Most common in practice |
| AdaGrad | Adapts learning rate per parameter | Sparse data (text) |
| RMSProp | Fixes AdaGrad's shrinking LR | RNNs, non-stationary problems |
| Adam | Momentum + RMSProp combined | Default choice for most tasks |
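All of the optimizers above share the same basic update rule, w <- w - lr * dL/dw; they differ only in how the gradient and learning rate are shaped. The rule can be shown on a toy one-parameter loss (the quadratic below is an illustrative example, not a real training objective):

```python
# Minimize L(w) = (w - 3)^2, whose gradient is dL/dw = 2 * (w - 3)
def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0     # initial weight
lr = 0.1    # learning rate
for _ in range(100):
    w -= lr * grad(w)  # gradient descent update: w <- w - lr * dL/dw
# w converges toward the minimum at w = 3
```

In a real network the same update is applied to every weight at once, with gradients computed by backpropagation rather than by hand.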
Adam (Adaptive Moment Estimation) is the most widely used optimizer. It combines the benefits of momentum and adaptive learning rates. Use it as your default.
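A sketch of the standard Adam update rule applied to the same kind of toy quadratic loss. The hyperparameter defaults (beta1=0.9, beta2=0.999, eps=1e-8) are the usual published ones; `adam_minimize` is a hypothetical helper name, and the learning rate and step count here are tuned for the toy problem, not for real training:

```python
import math

def adam_minimize(grad, w=0.0, lr=0.01, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=2000):
    m, v = 0.0, 0.0  # first (momentum) and second (RMS) moment estimates
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # momentum: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # adaptive term: running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the
        v_hat = v / (1 - beta2 ** t)           # zero-initialized moments
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Minimize L(w) = (w - 3)^2
w = adam_minimize(lambda w: 2.0 * (w - 3.0))
```

The momentum term smooths noisy gradients; the adaptive term gives each parameter its own effective learning rate.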
Learning rate is the single most important hyperparameter. Start with 0.001 (Adam default). If training loss oscillates, reduce it. If training is too slow, increase it.
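The effect of the learning rate can be seen on a toy quadratic loss: too small and training crawls, well-chosen and it converges, too large and it diverges (the lr values below are illustrative, chosen for this toy problem):

```python
def run_gd(lr, steps=50, w=10.0):
    # Minimize L(w) = w^2, whose gradient is 2w
    for _ in range(steps):
        w -= lr * 2.0 * w
    return abs(w)  # distance from the minimum at w = 0

small = run_gd(0.001)  # too small: barely moves in 50 steps
good  = run_gd(0.1)    # converges close to the minimum
large = run_gd(1.1)    # too large: overshoots and |w| grows each step
```

The divergent case is the extreme form of the oscillation mentioned above: each update overshoots the minimum by more than the previous error.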
| Good For | Not Ideal For |
|---|---|
| Images, video, audio, text | Small tabular datasets |
| Large datasets (thousands+ samples) | When interpretability is critical |
| Complex non-linear relationships | Low-compute environments |
| End-to-end feature learning | When classical ML works well enough |
ANN, CNN, RNN, LSTM, Transformers, Adam, Backpropagation, ReLU