A subfield of machine learning that uses multi-layer neural networks to learn complex patterns from raw data.
Deep Learning uses artificial neural networks with multiple layers (hence "deep") to learn complex patterns from data. Unlike classical ML, deep learning automatically learns features from raw data -- pixels, text, or audio -- without manual feature engineering.
Core idea: loosely inspired by how the brain processes signals through layers of neurons, deep learning models stack layers of artificial neurons, each transforming the output of the previous layer.
| Algorithm | Full Form | Used For | Key Idea |
|---|---|---|---|
| ANN | Artificial Neural Network | Tabular data, basic problems | Input -> hidden -> output layers |
| DNN | Deep Neural Network | Any complex task | ANN with many hidden layers |
| MLP | Multilayer Perceptron | Classification & regression | Fully connected layers, no memory |
| CNN | Convolutional Neural Network | Image & video | Filters/kernels detect spatial patterns |
| RNN | Recurrent Neural Network | Time-series & sequences | Has memory; passes info through time |
| LSTM | Long Short-Term Memory | Long sequences | Improved RNN with memory gates |
| Transformers | -- | NLP, vision, LLMs | Self-attention; parallel processing |
| GANs | Generative Adversarial Networks | Image generation | Generator vs Discriminator |
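The layered "input -> hidden -> output" structure shared by these architectures can be sketched as a tiny forward pass. The weights, biases, and input below are made-up toy numbers, not values from any trained model:

```python
def relu(xs):
    # ReLU activation applied element-wise
    return [max(0.0, v) for v in xs]

def dense(x, weights, biases):
    # Fully connected layer: weighted sum w . x + b for each output neuron
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
            for w, b in zip(weights, biases)]

# Toy MLP: 2 inputs -> 3 hidden neurons (ReLU) -> 1 output
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
b1 = [0.0, 0.1, -0.1]
W2 = [[0.7, -0.5, 0.2]]
b2 = [0.05]

x = [1.0, 2.0]
hidden = relu(dense(x, W1, b1))
output = dense(hidden, W2, b2)
```

A real framework would also learn `W1`, `b1`, `W2`, `b2` via backpropagation; this only shows the forward computation.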
After computing the weighted sum in a neuron, the activation function decides whether the neuron should "fire". Without it, stacking layers is pointless: the whole network collapses into a single linear function.
Sigmoid: outputs values between 0 and 1. Used in the output layer for binary classification. Limitation: vanishing gradients in deep networks.
ReLU: f(x) = max(0, x). The default for hidden layers; fast and simple. Limitation: "dead" neurons when a neuron's input is always negative.
Leaky ReLU: f(x) = x if x > 0, else 0.01x. Allows a small negative output to fix the dead-neuron problem.
Softmax: converts raw outputs (logits) into probabilities that sum to 1. Used in the output layer for multi-class classification.
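All four activations are simple enough to implement directly; a minimal pure-Python sketch:

```python
import math

def sigmoid(x):
    # Squashes any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    # Zero for negative inputs, identity for positive
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Small negative slope avoids "dead" neurons
    return x if x > 0 else alpha * x

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

Note that softmax operates on a whole vector of logits, while the other three are applied element-wise.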
A loss function measures how far the model's prediction is from the actual value. Smaller loss means better performance. The optimizer uses the loss to adjust weights via backpropagation.
For regression: MSE (Mean Squared Error) and MAE (Mean Absolute Error).
For classification: Binary Cross-Entropy (2 classes) and Categorical Cross-Entropy (multi-class).
An optimizer updates weights to minimize the loss function. It uses gradients computed via backpropagation.
| Optimizer | How It Works | When to Use |
|---|---|---|
| Gradient Descent | Updates after full dataset pass | Small datasets |
| SGD | Updates after each sample | Large datasets, noisy but fast |
| Mini-Batch GD | Updates after a batch (e.g., 32) | Most common in practice |
| AdaGrad | Adapts learning rate per parameter | Sparse data (text) |
| RMSProp | Fixes AdaGrad's shrinking LR | RNNs, non-stationary problems |
| Adam | Momentum + RMSProp combined | Default choice for most tasks |
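All of the optimizers above share the same basic update rule, w <- w - lr * dL/dw; they differ only in how the gradient and learning rate are shaped. The rule can be shown on a toy one-parameter loss (the quadratic below is an illustrative example, not a real training objective):

```python
# Minimize L(w) = (w - 3)^2, whose gradient is dL/dw = 2 * (w - 3)
def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0     # initial weight
lr = 0.1    # learning rate
for _ in range(100):
    w -= lr * grad(w)  # gradient descent update: w <- w - lr * dL/dw
# w converges toward the minimum at w = 3
```

In a real network the same update is applied to every weight at once, with gradients computed by backpropagation rather than by hand.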
Adam (Adaptive Moment Estimation) is the most widely used optimizer. It combines the benefits of momentum and adaptive learning rates. Use it as your default.
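A sketch of the standard Adam update rule applied to the same kind of toy quadratic loss. The hyperparameter defaults (beta1=0.9, beta2=0.999, eps=1e-8) are the usual published ones; `adam_minimize` is a hypothetical helper name, and the learning rate and step count here are tuned for the toy problem, not for real training:

```python
import math

def adam_minimize(grad, w=0.0, lr=0.01, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=2000):
    m, v = 0.0, 0.0  # first (momentum) and second (RMS) moment estimates
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # momentum: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g    # adaptive term: running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the
        v_hat = v / (1 - beta2 ** t)           # zero-initialized moments
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Minimize L(w) = (w - 3)^2
w = adam_minimize(lambda w: 2.0 * (w - 3.0))
```

The momentum term smooths noisy gradients; the adaptive term gives each parameter its own effective learning rate.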
Learning rate is the single most important hyperparameter. Start with 0.001 (Adam default). If training loss oscillates, reduce it. If training is too slow, increase it.
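The effect of the learning rate can be seen on a toy quadratic loss: too small and training crawls, well-chosen and it converges, too large and it diverges (the lr values below are illustrative, chosen for this toy problem):

```python
def run_gd(lr, steps=50, w=10.0):
    # Minimize L(w) = w^2, whose gradient is 2w
    for _ in range(steps):
        w -= lr * 2.0 * w
    return abs(w)  # distance from the minimum at w = 0

small = run_gd(0.001)  # too small: barely moves in 50 steps
good  = run_gd(0.1)    # converges close to the minimum
large = run_gd(1.1)    # too large: overshoots and |w| grows each step
```

The divergent case is the extreme form of the oscillation mentioned above: each update overshoots the minimum by more than the previous error.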
| Good For | Not Ideal For |
|---|---|
| Images, video, audio, text | Small tabular datasets |
| Large datasets (thousands+ samples) | When interpretability is critical |
| Complex non-linear relationships | Low-compute environments |
| End-to-end feature learning | When classical ML works well enough |
ANN, CNN, RNN, LSTM, Transformers, Adam, Backpropagation, ReLU