Naive Bayes Classifier
A fast, probabilistic classifier based on Bayes' theorem with the "naive" assumption of feature independence.
What is Naive Bayes?
Naive Bayes is a family of probabilistic classifiers that applies Bayes' theorem with a strong (naive) assumption: all features are independent of each other given the class label. Despite this simplification, it works surprisingly well in practice, especially for text classification.
Naive Bayes is one of the fastest classifiers: it needs very little training data, and both training and prediction scale linearly with the number of samples and features.
Bayes' Theorem
The foundation of the algorithm:
P(Class | Features) = P(Features | Class) * P(Class) / P(Features)
Where:
P(Class | Features) = Posterior probability (what we want)
P(Features | Class) = Likelihood (how likely these features appear in this class)
P(Class) = Prior probability (how common is this class)
P(Features) = Evidence (normalizing constant)
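As a quick numeric illustration of the theorem (all probabilities below are invented for the example):

```python
# Invented statistics for a one-word "spam filter", for illustration only
p_spam = 0.3             # prior P(Spam)
p_word_given_spam = 0.6  # likelihood P("free" | Spam)
p_word_given_ham = 0.05  # likelihood P("free" | Not Spam)

# Evidence P("free") via the law of total probability
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior P(Spam | "free") from Bayes' theorem
posterior = p_word_given_spam * p_spam / p_word
print(f"P(Spam | 'free') = {posterior:.3f}")  # -> 0.837
```

Even though only 30% of emails are spam, one occurrence of "free" pushes the posterior above 80%.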
The "Naive" Assumption
The algorithm assumes every feature is independent of every other feature. For example, in spam detection, the presence of the word "free" is considered independent of the presence of "money". This is rarely true in reality, but the algorithm still performs well because:
- The ranking of probabilities matters more than their exact values
- Dependencies between features often cancel each other out
- The simplification makes the math tractable and fast
Types of Naive Bayes
Gaussian NB
Features are continuous and follow a normal distribution. Used for numeric data like height, weight, temperature.
Multinomial NB
Features are discrete counts (word frequencies). The go-to choice for text classification with bag-of-words or TF-IDF.
Bernoulli NB
Features are binary (0 or 1). Used when you care about presence/absence of a feature, not its frequency.
Complement NB
A variant of Multinomial NB that handles imbalanced datasets better: it estimates each class's feature weights from the complement of that class, i.e. from all training samples that do not belong to it.
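The variants are easiest to tell apart side by side. A minimal sketch with an invented binary word-presence matrix (BernoulliNB models presence/absence; ComplementNB targets imbalance):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, ComplementNB

# Invented data: rows = documents, columns = words (1 = word present)
X = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]])
y = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

bnb = BernoulliNB().fit(X, y)   # binary features: presence/absence
cnb = ComplementNB().fit(X, y)  # weights from each class's complement

print(bnb.predict([[1, 1, 0]]), cnb.predict([[0, 0, 1]]))  # [1] [0]
```

On real word counts (not 0/1), MultinomialNB or ComplementNB is usually the better fit; BernoulliNB explicitly penalizes the absence of features, which the multinomial variants do not.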
How It Works (Step by Step)
Algorithm Steps
- Calculate prior probabilities — P(Class) for each class from training data
- Calculate likelihoods — P(Feature | Class) for each feature-class combination
- For a new sample — multiply the prior by all feature likelihoods for each class
- Pick the class with the highest posterior probability
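The four steps above can be sketched from scratch for a multinomial (word-count) model; the helper names and toy data are invented for illustration:

```python
import math
from collections import defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Steps 1-2: class priors and Laplace-smoothed word likelihoods."""
    classes = set(labels)
    priors = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: defaultdict(int) for c in classes}
    vocab = set()
    for doc, c in zip(docs, labels):
        for word in doc.split():
            counts[c][word] += 1
            vocab.add(word)
    totals = {c: sum(counts[c].values()) for c in classes}

    def likelihood(word, c):  # P(word | class) with add-alpha smoothing
        return (counts[c][word] + alpha) / (totals[c] + alpha * len(vocab))

    return priors, likelihood

def predict_nb(doc, priors, likelihood):
    """Steps 3-4: combine prior and likelihoods (in log space), pick the max."""
    scores = {c: math.log(p) + sum(math.log(likelihood(w, c)) for w in doc.split())
              for c, p in priors.items()}
    return max(scores, key=scores.get)

docs = ["free money now", "meeting at noon", "win free prize", "project deadline"]
labels = [1, 0, 1, 0]
priors, likelihood = train_nb(docs, labels)
print(predict_nb("free prize", priors, likelihood))  # -> 1 (spam)
```

Real implementations also work in log space, as above: multiplying many small probabilities underflows floating point, while summing their logs does not.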
Example: Email Spam Detection
Given an email with words ["free", "winner", "click"], classify it as spam or not spam:
P(Spam | "free","winner","click") ∝ P(Spam) × P("free"|Spam) × P("winner"|Spam) × P("click"|Spam)
P(Not Spam | "free","winner","click") ∝ P(Not Spam) × P("free"|Not Spam) × P("winner"|Not Spam) × P("click"|Not Spam)
Compare the two → pick the higher one
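Plugging in hypothetical numbers (the priors and per-word likelihoods below are invented for illustration), the comparison looks like this:

```python
import math

# Invented priors and per-word likelihoods, for illustration only
priors = {"spam": 0.4, "not_spam": 0.6}
likelihoods = {
    "spam":     {"free": 0.30, "winner": 0.20, "click": 0.25},
    "not_spam": {"free": 0.02, "winner": 0.01, "click": 0.05},
}

words = ["free", "winner", "click"]
scores = {c: priors[c] * math.prod(likelihoods[c][w] for w in words)
          for c in priors}
print(scores)                         # spam ~ 0.006, not_spam ~ 6e-06
print(max(scores, key=scores.get))    # -> spam
```

Note the scores are unnormalized (the ∝ in the formulas above): dividing both by P("free","winner","click") would give true probabilities, but the winner is the same either way.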
Code Implementation
```python
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# --- Example 1: Gaussian NB for numeric data ---
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print(f"Gaussian NB Accuracy: {accuracy_score(y_test, y_pred):.2f}")

# --- Example 2: Multinomial NB for text ---
emails = ["free money now", "meeting at 3pm", "win a prize free",
          "project deadline tomorrow", "claim your reward", "lunch plans today"]
labels = [1, 0, 1, 0, 1, 0]  # 1=spam, 0=not spam

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(emails)

mnb = MultinomialNB()
mnb.fit(X, labels)

new_email = vectorizer.transform(["free prize winner"])
print(f"Prediction: {'Spam' if mnb.predict(new_email)[0] else 'Not Spam'}")
print(f"Probabilities: {mnb.predict_proba(new_email)}")
```
Laplace Smoothing
What if a word never appeared in the training data for a certain class? Its probability would be 0, making the entire product 0. Laplace smoothing adds a small count (alpha) to every feature to prevent zero probabilities:
P(word | class) = (count(word, class) + alpha) / (total_words_in_class + alpha * vocabulary_size)
Default alpha = 1 (add-one smoothing)
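A minimal sketch of the formula above, with made-up counts:

```python
# Made-up counts: suppose "prize" never appeared in the not-spam class
count_word_class = 0        # count(word, class)
total_words_in_class = 500
vocabulary_size = 1000
alpha = 1.0

unsmoothed = count_word_class / total_words_in_class
smoothed = (count_word_class + alpha) / (total_words_in_class + alpha * vocabulary_size)
print(unsmoothed)  # 0.0 -- would zero out the whole product
print(smoothed)    # ~0.00067 -- small but nonzero
```

The smoothed estimate keeps an unseen word rare without letting a single missing count veto the entire class.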
When to Use Naive Bayes
| Good For | Not Ideal For |
| --- | --- |
| Text classification (spam, sentiment, topic) | Features with strong dependencies |
| Small training datasets | Numeric data with complex relationships |
| Real-time predictions (very fast) | When you need precise probability estimates |
| Multi-class classification | Regression tasks |
| Baseline model for any classification task | Image classification (CNNs are better) |
Despite the independence assumption being "wrong", Naive Bayes often outperforms more complex models on text data, especially with limited training samples.
Key Parameters
- alpha (smoothing) — Default 1.0. Smaller values mean less smoothing; alpha = 0 disables smoothing entirely and reintroduces the zero-probability problem described above
- fit_prior — Whether to learn class priors from data (True) or use uniform priors (False)
- var_smoothing (GaussianNB) — Portion of largest variance added to all variances for stability
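A short sketch showing these parameters in use (the toy counts below are invented):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Invented word-count data: two features, two classes
X = np.array([[2, 1], [3, 0], [0, 4], [1, 5]])
y = [0, 0, 1, 1]

# alpha: lighter smoothing; fit_prior=False: assume uniform class priors
mnb = MultinomialNB(alpha=0.5, fit_prior=False).fit(X, y)

# var_smoothing: portion of the largest feature variance added to all
# variances for numerical stability
gnb = GaussianNB(var_smoothing=1e-8).fit(X.astype(float), y)

print(mnb.predict([[3, 1]]), gnb.predict([[3.0, 1.0]]))  # [0] [0]
```

Tuning alpha (e.g. over a log-spaced grid) is usually the single most effective knob for the text-oriented variants.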