Naive Bayes Classifier
A fast, probabilistic classifier based on Bayes' theorem with the "naive" assumption of feature independence.
What is Naive Bayes?
Naive Bayes is a family of probabilistic classifiers that applies Bayes' theorem with a strong (naive) assumption: all features are independent of each other given the class label. Despite this simplification, it works surprisingly well in practice, especially for text classification.
Naive Bayes is one of the fastest classifiers: it needs very little training data, and both training and prediction scale linearly with the number of samples and features.
Bayes' Theorem
The foundation of the algorithm:
P(Class | Features) = P(Features | Class) * P(Class) / P(Features)
Where:
P(Class | Features) = Posterior probability (what we want)
P(Features | Class) = Likelihood (how likely these features appear in this class)
P(Class) = Prior probability (how common is this class)
P(Features) = Evidence (normalizing constant)
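As a quick numeric illustration of the theorem (all probabilities below are invented for the example):

```python
# Invented statistics for a one-word "spam filter", for illustration only
p_spam = 0.3             # prior P(Spam)
p_word_given_spam = 0.6  # likelihood P("free" | Spam)
p_word_given_ham = 0.05  # likelihood P("free" | Not Spam)

# Evidence P("free") via the law of total probability
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior P(Spam | "free") from Bayes' theorem
posterior = p_word_given_spam * p_spam / p_word
print(f"P(Spam | 'free') = {posterior:.3f}")  # -> 0.837
```

Even though only 30% of emails are spam, one occurrence of "free" pushes the posterior above 80%.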
The "Naive" Assumption
The algorithm assumes every feature is independent of every other feature. For example, in spam detection, the presence of the word "free" is considered independent of the presence of "money". This is rarely true in reality, but the algorithm still performs well because:
- The ranking of probabilities matters more than their exact values
- Dependencies between features often cancel each other out
- The simplification makes the math tractable and fast
Types of Naive Bayes
Gaussian NB
Features are continuous and follow a normal distribution. Used for numeric data like height, weight, temperature.
Multinomial NB
Features are discrete counts (word frequencies). The go-to choice for text classification with bag-of-words or TF-IDF.
Bernoulli NB
Features are binary (0 or 1). Used when you care about presence/absence of a feature, not its frequency.
Complement NB
A variant of Multinomial NB that handles imbalanced datasets better: it estimates each class's feature weights from the complement of that class, i.e. from all training samples that do not belong to it.
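The variants are easiest to tell apart side by side. A minimal sketch with an invented binary word-presence matrix (BernoulliNB models presence/absence; ComplementNB targets imbalance):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, ComplementNB

# Invented data: rows = documents, columns = words (1 = word present)
X = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]])
y = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

bnb = BernoulliNB().fit(X, y)   # binary features: presence/absence
cnb = ComplementNB().fit(X, y)  # weights from each class's complement

print(bnb.predict([[1, 1, 0]]), cnb.predict([[0, 0, 1]]))  # [1] [0]
```

On real word counts (not 0/1), MultinomialNB or ComplementNB is usually the better fit; BernoulliNB explicitly penalizes the absence of features, which the multinomial variants do not.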
How It Works (Step by Step)
Algorithm Steps
- Calculate prior probabilities — P(Class) for each class from training data
- Calculate likelihoods — P(Feature | Class) for each feature-class combination
- For a new sample — multiply the prior by all feature likelihoods for each class
- Pick the class with the highest posterior probability
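The four steps above can be sketched from scratch for a multinomial (word-count) model; the helper names and toy data are invented for illustration:

```python
import math
from collections import defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Steps 1-2: class priors and Laplace-smoothed word likelihoods."""
    classes = set(labels)
    priors = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: defaultdict(int) for c in classes}
    vocab = set()
    for doc, c in zip(docs, labels):
        for word in doc.split():
            counts[c][word] += 1
            vocab.add(word)
    totals = {c: sum(counts[c].values()) for c in classes}

    def likelihood(word, c):  # P(word | class) with add-alpha smoothing
        return (counts[c][word] + alpha) / (totals[c] + alpha * len(vocab))

    return priors, likelihood

def predict_nb(doc, priors, likelihood):
    """Steps 3-4: combine prior and likelihoods (in log space), pick the max."""
    scores = {c: math.log(p) + sum(math.log(likelihood(w, c)) for w in doc.split())
              for c, p in priors.items()}
    return max(scores, key=scores.get)

docs = ["free money now", "meeting at noon", "win free prize", "project deadline"]
labels = [1, 0, 1, 0]
priors, likelihood = train_nb(docs, labels)
print(predict_nb("free prize", priors, likelihood))  # -> 1 (spam)
```

Real implementations also work in log space, as above: multiplying many small probabilities underflows floating point, while summing their logs does not.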
Example: Email Spam Detection
Given an email with words ["free", "winner", "click"], classify it as spam or not spam:
P(Spam | "free","winner","click") ∝ P(Spam) × P("free"|Spam) × P("winner"|Spam) × P("click"|Spam)
P(Not Spam | "free","winner","click") ∝ P(Not Spam) × P("free"|Not Spam) × P("winner"|Not Spam) × P("click"|Not Spam)
Compare the two → pick the higher one
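Plugging in hypothetical numbers (the priors and per-word likelihoods below are invented for illustration), the comparison looks like this:

```python
import math

# Invented priors and per-word likelihoods, for illustration only
priors = {"spam": 0.4, "not_spam": 0.6}
likelihoods = {
    "spam":     {"free": 0.30, "winner": 0.20, "click": 0.25},
    "not_spam": {"free": 0.02, "winner": 0.01, "click": 0.05},
}

words = ["free", "winner", "click"]
scores = {c: priors[c] * math.prod(likelihoods[c][w] for w in words)
          for c in priors}
print(scores)                         # spam ~ 0.006, not_spam ~ 6e-06
print(max(scores, key=scores.get))    # -> spam
```

Note the scores are unnormalized (the ∝ in the formulas above): dividing both by P("free","winner","click") would give true probabilities, but the winner is the same either way.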
Code Implementation
```python
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# --- Example 1: Gaussian NB for numeric data ---
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print(f"Gaussian NB Accuracy: {accuracy_score(y_test, y_pred):.2f}")

# --- Example 2: Multinomial NB for text ---
emails = ["free money now", "meeting at 3pm", "win a prize free",
          "project deadline tomorrow", "claim your reward", "lunch plans today"]
labels = [1, 0, 1, 0, 1, 0]  # 1=spam, 0=not spam

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(emails)

mnb = MultinomialNB()
mnb.fit(X, labels)

new_email = vectorizer.transform(["free prize winner"])
print(f"Prediction: {'Spam' if mnb.predict(new_email)[0] else 'Not Spam'}")
print(f"Probabilities: {mnb.predict_proba(new_email)}")
```
Laplace Smoothing
What if a word never appeared in the training data for a certain class? Its probability would be 0, making the entire product 0. Laplace smoothing adds a small count (alpha) to every feature to prevent zero probabilities:
P(word | class) = (count(word, class) + alpha) / (total_words_in_class + alpha * vocabulary_size)
Default alpha = 1 (add-one smoothing)
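A minimal sketch of the formula above, with made-up counts:

```python
# Made-up counts: suppose "prize" never appeared in the not-spam class
count_word_class = 0        # count(word, class)
total_words_in_class = 500
vocabulary_size = 1000
alpha = 1.0

unsmoothed = count_word_class / total_words_in_class
smoothed = (count_word_class + alpha) / (total_words_in_class + alpha * vocabulary_size)
print(unsmoothed)  # 0.0 -- would zero out the whole product
print(smoothed)    # ~0.00067 -- small but nonzero
```

The smoothed estimate keeps an unseen word rare without letting a single missing count veto the entire class.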
When to Use Naive Bayes
| Good For | Not Ideal For |
| --- | --- |
| Text classification (spam, sentiment, topic) | Features with strong dependencies |
| Small training datasets | Numeric data with complex relationships |
| Real-time predictions (very fast) | When you need precise probability estimates |
| Multi-class classification | Regression tasks |
| Baseline model for any classification task | Image classification (CNNs are better) |
Despite the independence assumption being "wrong", Naive Bayes often outperforms more complex models on text data, especially with limited training samples.
Key Parameters
- alpha (smoothing) — Default 1.0. Smaller values mean less smoothing; alpha = 0 disables smoothing entirely and reintroduces the zero-probability problem described above
- fit_prior — Whether to learn class priors from data (True) or use uniform priors (False)
- var_smoothing (GaussianNB) — Portion of largest variance added to all variances for stability
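A short sketch showing these parameters in use (the toy counts below are invented):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Invented word-count data: two features, two classes
X = np.array([[2, 1], [3, 0], [0, 4], [1, 5]])
y = [0, 0, 1, 1]

# alpha: lighter smoothing; fit_prior=False: assume uniform class priors
mnb = MultinomialNB(alpha=0.5, fit_prior=False).fit(X, y)

# var_smoothing: portion of the largest feature variance added to all
# variances for numerical stability
gnb = GaussianNB(var_smoothing=1e-8).fit(X.astype(float), y)

print(mnb.predict([[3, 1]]), gnb.predict([[3.0, 1.0]]))  # [0] [0]
```

Tuning alpha (e.g. over a log-spaced grid) is usually the single most effective knob for the text-oriented variants.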