Classification metrics
Why accuracy alone misleads on imbalanced data — and how precision, recall, F1, and the confusion matrix give a complete picture.
- Explain why accuracy is misleading on imbalanced class distributions
- Define precision, recall, and F1 and state when each is most important
- Read a confusion matrix and identify false positives and false negatives
A model that predicts "no" for every input achieves 99% accuracy on a dataset where 99% of examples are negative. That model is useless — it has learned nothing — but accuracy cannot tell you that. Classification on imbalanced data requires a richer vocabulary of metrics.
The limits of accuracy
Accuracy = correct predictions / total predictions. When classes are balanced (roughly equal counts), accuracy is a sensible summary. When one class dominates, it becomes a trap: a naive majority-class predictor scores well on accuracy without doing any useful work.
The medical diagnosis case makes this concrete. Suppose 1% of patients have a rare condition. A model that always predicts "healthy" achieves 99% accuracy. But it misses every single sick patient. That failure is invisible to accuracy.
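The scenario above can be checked with a few lines of arithmetic. This is a minimal sketch with made-up data: the cohort size of 1,000 and the variable names are assumptions chosen for illustration, not part of any real dataset.

```python
# Hypothetical cohort: 1,000 patients, 10 of whom (1%) have the condition.
y_true = [1] * 10 + [0] * 990      # 1 = sick, 0 = healthy
y_pred = [0] * len(y_true)         # the model always predicts "healthy"

# Accuracy = correct predictions / total predictions.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Count how many of the sick patients the model actually flagged.
sick_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(f"accuracy:    {accuracy:.2f}")                      # 0.99
print(f"sick caught: {sick_caught} of {sum(y_true)}")      # 0 of 10
```

The accuracy figure looks excellent, while the second line shows that not one sick patient was found.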
Precision and recall
The antidote is to decompose the error:
Precision = TP / (TP + FP)
Of every prediction labelled positive, how many were actually positive? Precision is the relevant metric when the cost of a false positive is high — sending a marketing email to an uninterested customer, flagging a legitimate transaction as fraud.
Recall (also called sensitivity or true positive rate) = TP / (TP + FN)
Of every actual positive, how many did the model catch? Recall is the relevant metric when the cost of a false negative is high — missing a diagnosis, failing to detect a security intrusion.
The two metrics trade off against each other. A model that predicts "positive" for everything has perfect recall (it catches every real positive) but, when positives are rare, terrible precision (almost all of those predictions are wrong). A model that predicts "positive" only when it is extremely confident has high precision but low recall.
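The two extremes described above can be reproduced on a toy example. The sketch below is illustrative only: the helper name `precision_recall`, the data, and the convention of returning 0.0 when a denominator is zero are all assumptions made for this example.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Assumed convention: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Illustrative data: 5 real positives among 100 examples.
y_true = [1] * 5 + [0] * 95

# Extreme 1: predict "positive" for everything.
always_positive = [1] * 100
# Extreme 2: predict "positive" only for the two cases the model is surest of.
very_cautious = [1, 1, 0, 0, 0] + [0] * 95

p_all, r_all = precision_recall(y_true, always_positive)
p_cautious, r_cautious = precision_recall(y_true, very_cautious)

print(f"always positive: precision={p_all:.2f}, recall={r_all:.2f}")            # 0.05, 1.00
print(f"very cautious:   precision={p_cautious:.2f}, recall={r_cautious:.2f}")  # 1.00, 0.40
```

Neither predictor is good: each maximises one metric by sacrificing the other.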
F1 score
F1 is the harmonic mean of precision and recall:
F1 = 2 * (precision * recall) / (precision + recall)
It is 1.0 when both are perfect; it pulls toward the lower of the two. The harmonic mean (rather than arithmetic mean) penalises extreme imbalances between the two: a model with precision=1.0 and recall=0.1 gets F1=0.18, not 0.55.
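The gap between the harmonic and arithmetic means is easy to verify numerically. In this short sketch the function name `harmonic_f1` is a hypothetical helper, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def harmonic_f1(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention for the undefined 0/0 case
    return 2 * (precision * recall) / (precision + recall)

precision, recall = 1.0, 0.1
f1 = harmonic_f1(precision, recall)
arithmetic = (precision + recall) / 2

print(f"F1 (harmonic mean): {f1:.2f}")          # 0.18
print(f"arithmetic mean:    {arithmetic:.2f}")  # 0.55
```

The harmonic mean stays close to the weaker of the two numbers, which is why a model cannot earn a high F1 by excelling at only one of them.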
The confusion matrix
The confusion matrix lays out all four outcomes at once:
| | Predicted Negative | Predicted Positive |
|---|---|---|
| Actually Negative | True Negative (TN) | False Positive (FP) |
| Actually Positive | False Negative (FN) | True Positive (TP) |
The diagonal holds the correct predictions; the off-diagonal entries are errors. Reading the matrix shows not just how many errors the model makes but in which direction it errs, which matters enormously for downstream decisions.
In sklearn, confusion_matrix(y_true, y_pred) on binary labels returns a 2×2 array with rows for actual classes and columns for predicted classes:
cm[0,0] is TN, cm[0,1] is FP, cm[1,0] is FN, and cm[1,1] is TP.
For multi-class problems the matrix extends naturally to n×n.
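To make the layout concrete, here is a minimal pure-Python sketch that builds a matrix with the same orientation (rows are actual classes, columns are predicted classes). It is not sklearn's implementation: the function name `confusion_matrix_counts` is hypothetical, and it assumes labels are integers from 0 to n_classes - 1.

```python
def confusion_matrix_counts(y_true, y_pred, n_classes=2):
    """Build an n x n count matrix: rows = actual class, columns = predicted class."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        cm[actual][predicted] += 1
    return cm

# Binary case with made-up labels (1 = positive, 0 = negative).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]
cm = confusion_matrix_counts(y_true, y_pred)
tn, fp = cm[0]   # row 0: actually negative
fn, tp = cm[1]   # row 1: actually positive
print(cm)                          # [[5, 1], [1, 3]]
print(f"FP={fp}, FN={fn}")         # FP=1, FN=1

# The same function handles three classes, giving a 3 x 3 matrix.
y_true3 = [0, 1, 2, 2, 1, 0]
y_pred3 = [0, 2, 2, 2, 1, 1]
cm3 = confusion_matrix_counts(y_true3, y_pred3, n_classes=3)
print(cm3)                         # [[1, 1, 0], [0, 1, 1], [0, 0, 2]]
```

In the multi-class matrix, each off-diagonal cell says which class was mistaken for which, for example cm3[1][2] counts class-1 examples predicted as class 2.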
Choosing the right metric
No single metric is always right. The choice depends on the cost structure of the problem:
- High cost of false positives: optimise precision.
- High cost of false negatives: optimise recall.
- Symmetric costs and balanced classes: F1 and accuracy are both reasonable.
- Imbalanced classes with asymmetric costs: compute accuracy, precision, recall, and F1 and report them together; any single summary statistic hides too much.
Where to go next
The next lesson moves to regression metrics — MAE, RMSE, and R² — which measure the magnitude of continuous prediction errors rather than categorical correctness.