
ML Evaluation Metrics

Measuring how well your model performs using the right metric for the right task.

What are Evaluation Metrics?

Evaluation metrics measure how well a model's predictions match actual values. Different tasks require different metrics. Using the wrong metric can give a misleading picture of model quality.

Always evaluate on unseen test data. Evaluating on training data gives over-optimistic results because the model has already seen that data.
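
As a concrete starting point, here is a minimal sketch of holding out a test set with scikit-learn's train_test_split. The synthetic data and LinearRegression estimator are placeholders for your own dataset and model; they produce the kind of y_test and y_pred arrays that the regression snippets below assume.

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Placeholder data and model -- swap in your own dataset and estimator
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)  # fit on training data only
y_pred = model.predict(X_test)                    # evaluate on the held-out test set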

Regression Metrics

For models that predict continuous numeric values (prices, temperatures, etc.).

Mean Absolute Error (MAE)

Average of absolute differences between actual and predicted values. Easy to interpret -- it is in the same unit as the target.

MAE = (1/n) * sum(|actual_i - predicted_i|)

Lower MAE = better. An MAE of 5 means predictions are off by 5 units on average.

from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_test, y_pred)
print(f"MAE: {mae}")

Mean Squared Error (MSE)

Average of squared differences. Penalizes large errors more heavily than MAE because of the squaring.

MSE = (1/n) * sum((actual_i - predicted_i)^2)

Lower MSE = better. More sensitive to outliers than MAE.

from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_pred)
print(f"MSE: {mse}")

Root Mean Squared Error (RMSE)

Square root of MSE. Gives the error in the same units as the target, while still penalizing large errors.

RMSE = sqrt(MSE)

Lower RMSE = better. Combines the interpretability of MAE with the outlier sensitivity of MSE.

import numpy as np

rmse = np.sqrt(mse)
print(f"RMSE: {rmse}")

R-squared (R2 Score)

Measures how much of the variance in the target the model explains. Typically ranges from 0 to 1, but can be negative when the model performs worse than simply predicting the mean.

R2 = 1 - (SS_res / SS_tot)

Where:
SS_res = sum((actual - predicted)^2)   # residual sum of squares
SS_tot = sum((actual - mean)^2)        # total sum of squares

R2 = 1.0 --> perfect fit
R2 = 0.0 --> model is as good as predicting the mean
R2 < 0   --> model is worse than predicting the mean

from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
print(f"R2 Score: {r2}")
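
To connect the formula to the code, here is a short sketch computing R2 by hand with NumPy, assuming y_test and y_pred are NumPy arrays as in the snippets above; the result should match r2_score.

import numpy as np

ss_res = np.sum((y_test - y_pred) ** 2)           # residual sum of squares
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot
print(f"Manual R2: {r2_manual}")  # should match r2_score(y_test, y_pred)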

Classification Metrics

For models that predict categories (spam/not spam, disease/healthy, etc.).

Accuracy

Percentage of correct predictions. Simple but misleading on imbalanced datasets.

Accuracy = correct_predictions / total_predictions
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

If 95% of emails are not spam, a model that always predicts "not spam" has 95% accuracy but catches zero spam. Always check precision and recall too.
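
A small illustration of that pitfall with made-up labels: 95 negatives, 5 positives, and a model that always predicts the majority class.

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels: 95 "not spam" (0) and 5 "spam" (1)
y_true = np.array([0] * 95 + [1] * 5)
y_always_negative = np.zeros(100, dtype=int)  # model that always predicts "not spam"

print(accuracy_score(y_true, y_always_negative))                    # 0.95 -- looks impressive
print(precision_score(y_true, y_always_negative, zero_division=0))  # 0.0  -- no spam flagged at all
print(recall_score(y_true, y_always_negative, zero_division=0))     # 0.0  -- zero spam caught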

Confusion Matrix

For binary classification, a 2x2 table showing all four possible prediction outcomes:

                       Predicted Positive     Predicted Negative
Actually Positive      True Positive (TP)     False Negative (FN)
Actually Negative      False Positive (FP)    True Negative (TN)
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
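
For a binary problem, the four counts can also be read directly off the matrix; a short sketch assuming the same confusion_matrix import and y_test / y_pred as above.

# For binary labels, ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")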

Precision

Of all positive predictions, how many were actually positive? Important when false positives are costly.

Precision = TP / (TP + FP)

Example: Of 100 emails flagged as spam, 90 actually were spam --> Precision = 0.90

from sklearn.metrics import precision_score

precision = precision_score(y_test, y_pred)
print(f"Precision: {precision}")

Recall (Sensitivity)

Of all actual positives, how many did we correctly identify? Important when false negatives are costly.

Recall = TP / (TP + FN)

Example: Of 100 actual spam emails, we caught 80 --> Recall = 0.80

from sklearn.metrics import recall_score

recall = recall_score(y_test, y_pred)
print(f"Recall: {recall}")
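
Precision and recall trade off against each other through the classification threshold. The sketch below is illustrative only: the synthetic data and logistic regression model are assumptions, not part of the examples above.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data and a simple classifier -- illustrative assumptions only
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

# Raising the threshold typically raises precision and lowers recall
for threshold in (0.3, 0.5, 0.7):
    preds = (probs >= threshold).astype(int)
    p = precision_score(y_te, preds, zero_division=0)
    r = recall_score(y_te, preds, zero_division=0)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")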

F1 Score

Harmonic mean of precision and recall. Balances both metrics. Use when you need a single number that accounts for both false positives and false negatives.

F1 = 2 * (Precision * Recall) / (Precision + Recall)

F1 = 1.0 --> perfect precision and recall
F1 = 0.0 --> either precision or recall is 0

from sklearn.metrics import f1_score

f1 = f1_score(y_test, y_pred)
print(f"F1 Score: {f1}")

ROC-AUC Score

Measures the model's ability to distinguish between classes across all thresholds. AUC = 1 means perfect, AUC = 0.5 means random guessing.

from sklearn.metrics import roc_auc_score

# ROC-AUC needs predicted probabilities, not hard class labels
y_pred_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class
auc = roc_auc_score(y_test, y_pred_prob)
print(f"ROC-AUC Score: {auc}")

Which Metric to Use?

Scenario                        Best Metric              Why
Regression (predict a number)   RMSE or R2               RMSE penalizes large errors; R2 shows explanatory power
Balanced classification         Accuracy or F1           Accuracy works when classes are balanced
Imbalanced classification       F1, Precision, Recall    Accuracy is misleading with imbalanced classes
Spam detection                  Precision                Avoid flagging real emails as spam (minimize FP)
Medical diagnosis               Recall                   Avoid missing actual sick patients (minimize FN)
Ranking / probability models    ROC-AUC                  Measures performance across all thresholds
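
If you want all the classification numbers at once, scikit-learn's classification_report prints per-class precision, recall, and F1 in a single call, assuming the same y_test and y_pred as in the classification examples above.

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))  # per-class precision, recall, F1, and support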
