ML Evaluation Metrics
Measuring how well your model performs using the right metric for the right task.
What are Evaluation Metrics?
Evaluation metrics measure how well a model's predictions match actual values. Different tasks require different metrics. Using the wrong metric can give a misleading picture of model quality.
Always evaluate on unseen test data. Evaluating on training data gives over-optimistic results because the model has already seen that data.
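A minimal sketch of holding out a test set with scikit-learn's train_test_split. The toy arrays here are hypothetical; any feature matrix X and target vector y work the same way:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)   # 20 samples, 1 feature (toy data)
y = np.arange(20)                  # matching targets

# Hold out 25% of the data that the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(len(X_train), len(X_test))   # 15 5
```

Fit on X_train/y_train only; every metric below should be computed on y_test against predictions for X_test.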
Regression Metrics
For models that predict continuous numeric values (prices, temperatures, etc.).
Mean Absolute Error (MAE)
Average of absolute differences between actual and predicted values. Easy to interpret: it is in the same units as the target.
MAE = (1/n) * sum(|actual_i - predicted_i|)
Lower MAE = better. MAE of 5 means predictions are off by 5 units on average.
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
print(f"MAE: {mae}")
Mean Squared Error (MSE)
Average of squared differences. Penalizes large errors more heavily than MAE because of the squaring.
MSE = (1/n) * sum((actual_i - predicted_i)^2)
Lower MSE = better. More sensitive to outliers than MAE.
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
print(f"MSE: {mse}")
Root Mean Squared Error (RMSE)
Square root of MSE. Gives the error in the same units as the target, while still penalizing large errors.
RMSE = sqrt(MSE)
Lower RMSE = better. Combines interpretability of MAE with outlier sensitivity of MSE.
import numpy as np
rmse = np.sqrt(mse)
print(f"RMSE: {rmse}")
R-squared (R2 Score)
Measures how much of the variance in the target the model explains. Typically between 0 and 1, but it can be negative when the model is worse than simply predicting the mean.
R2 = 1 - (SS_res / SS_tot)
Where:
SS_res = sum((actual - predicted)^2) # residual sum of squares
SS_tot = sum((actual - mean)^2) # total sum of squares
R2 = 1.0 --> perfect fit
R2 = 0.0 --> model is as good as predicting the mean
R2 < 0 --> model is worse than predicting the mean
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print(f"R2 Score: {r2}")
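To see why R2 = 0 means "as good as predicting the mean", compare a constant mean prediction against a made-up set of true values:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])          # toy targets, mean is 6.0
mean_pred = np.full_like(y_true, y_true.mean())  # always predict the mean

# SS_res equals SS_tot for the mean baseline, so R2 = 1 - 1 = 0.
print(r2_score(y_true, mean_pred))  # 0.0
```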
Classification Metrics
For models that predict categories (spam/not spam, disease/healthy, etc.).
Accuracy
Percentage of correct predictions. Simple but misleading on imbalanced datasets.
Accuracy = correct_predictions / total_predictions
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
If 95% of emails are not spam, a model that always predicts "not spam" has 95% accuracy but catches zero spam. Always check precision and recall too.
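The spam example above can be reproduced in a few lines with hypothetical labels (0 = not spam, 1 = spam), showing high accuracy alongside zero recall:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Imbalanced toy labels: 95 legitimate emails, 5 spam.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)   # a "model" that always predicts not-spam

print(accuracy_score(y_true, y_pred))  # 0.95 -- looks great
print(recall_score(y_true, y_pred))    # 0.0  -- catches zero spam
```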
Confusion Matrix
For binary classification, a 2x2 table showing all four types of predictions:
|                   | Predicted Positive  | Predicted Negative  |
| Actually Positive | True Positive (TP)  | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN)  |
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
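The four counts can also be read directly off the matrix. Note that for labels [0, 1], scikit-learn orders the matrix [[TN, FP], [FN, TP]], so ravel() unpacks in that order. A small hypothetical example:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy binary labels (1 = positive class).
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# ravel() flattens [[TN, FP], [FN, TP]] into four counts.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)  # 3 1 1 3
```

These counts are exactly what the precision and recall formulas below are built from.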
Precision
Of all positive predictions, how many were actually positive? Important when false positives are costly.
Precision = TP / (TP + FP)
Example: Of 100 emails flagged as spam, 90 actually were spam --> Precision = 0.90
from sklearn.metrics import precision_score
precision = precision_score(y_test, y_pred)
print(f"Precision: {precision}")
Recall (Sensitivity)
Of all actual positives, how many did we correctly identify? Important when false negatives are costly.
Recall = TP / (TP + FN)
Example: Of 100 actual spam emails, we caught 80 --> Recall = 0.80
from sklearn.metrics import recall_score
recall = recall_score(y_test, y_pred)
print(f"Recall: {recall}")
F1 Score
Harmonic mean of precision and recall. Balances both metrics. Use when you need a single number that accounts for both false positives and false negatives.
F1 = 2 * (Precision * Recall) / (Precision + Recall)
F1 = 1.0 --> perfect precision and recall
F1 = 0.0 --> either precision or recall is 0
from sklearn.metrics import f1_score
f1 = f1_score(y_test, y_pred)
print(f"F1 Score: {f1}")
ROC-AUC Score
Measures the model's ability to distinguish between classes across all thresholds. AUC = 1 means perfect, AUC = 0.5 means random guessing.
from sklearn.metrics import roc_auc_score
# ROC-AUC needs predicted probabilities, not hard class labels
y_pred_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class
auc = roc_auc_score(y_test, y_pred_prob)
print(f"ROC-AUC Score: {auc}")
Which Metric to Use?
| Scenario | Best Metric | Why |
| Regression (predict a number) | RMSE or R2 | RMSE penalizes large errors; R2 shows explanatory power |
| Balanced classification | Accuracy or F1 | Accuracy works when classes are balanced |
| Imbalanced classification | F1, Precision, Recall | Accuracy is misleading with imbalanced classes |
| Spam detection | Precision | Avoid flagging real emails as spam (minimize FP) |
| Medical diagnosis | Recall | Avoid missing actual sick patients (minimize FN) |
| Ranking / probability models | ROC-AUC | Measures performance across all thresholds |
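When you want precision, recall, and F1 for every class in one place, scikit-learn's classification_report prints them all at once. A sketch with the same hypothetical labels used above:

```python
import numpy as np
from sklearn.metrics import classification_report

# Toy binary labels (1 = positive class).
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# One table with per-class precision, recall, F1, and support.
print(classification_report(y_true, y_pred))
```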