Classification metrics

Why accuracy alone misleads on imbalanced data — and how precision, recall, F1, and the confusion matrix give a complete picture.

A model that predicts "no" for every input achieves 99% accuracy on a dataset where 99% of examples are negative. That model is useless — it has learned nothing — but accuracy cannot tell you that. Classification on imbalanced data requires a richer vocabulary of metrics.

The limits of accuracy

Accuracy = correct predictions / total predictions. When classes are balanced (roughly equal counts), accuracy is a sensible summary. When one class dominates, it becomes a trap: a naive majority-class predictor scores well on accuracy without doing any useful work.

The medical diagnosis case makes this concrete. Suppose 1% of patients have a rare condition. A model that always predicts "healthy" achieves 99% accuracy. But it misses every single sick patient. That failure is invisible to accuracy.
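The diagnosis scenario can be reproduced in a few lines. This is a minimal sketch on a made-up cohort of 1,000 patients; the labels and the variable names (y_true, y_pred) are illustrative assumptions, not data from a real study.

```python
# Hypothetical cohort: 1,000 patients, 10 of whom (1%) have the condition.
y_true = [1] * 10 + [0] * 990   # 1 = sick, 0 = healthy
y_pred = [0] * len(y_true)      # the model always predicts "healthy"

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# How many of the sick patients did the model actually flag?
sick_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
sick_total = sum(y_true)

print(f"accuracy: {accuracy:.2%}")
print(f"sick patients caught: {sick_caught} of {sick_total}")
```

The accuracy figure looks excellent while the count of sick patients caught is zero, which is exactly the failure that accuracy hides.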

Precision and recall

The antidote is to decompose the error by type. Writing TP for true positives, FP for false positives, and FN for false negatives:

Precision = TP / (TP + FP)

Of every prediction labelled positive, how many were actually positive? Precision is the relevant metric when the cost of a false positive is high — sending a marketing email to an uninterested customer, flagging a legitimate transaction as fraud.

Recall (also called sensitivity or true positive rate) = TP / (TP + FN)

Of every actual positive, how many did the model catch? Recall is the relevant metric when the cost of a false negative is high — missing a diagnosis, failing to detect a security intrusion.
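Both definitions reduce to counting three kinds of outcome. The sketch below assumes binary labels with 1 as the positive class; the helper name precision_recall and the choice to return 0.0 when a denominator is zero are assumptions made for illustration.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
    # Assumed convention: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels: 3 actual positives, 2 positive predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f}  recall={recall:.2f}")
```

Here one of the two positive predictions is right (precision 0.5) and one of the three actual positives is caught (recall about 0.33).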

The two metrics trade off against each other. A model that predicts "positive" for everything has perfect recall (it catches every real positive), but its precision equals the share of positives in the data, so on an imbalanced dataset almost all of those predictions are wrong. A model that predicts "positive" only when it is extremely confident has high precision but low recall.
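One way to see the trade-off is to take a model that outputs a score per example and vary the cut-off above which it predicts "positive". The scores and labels below are invented for illustration, and precision_recall_at is a hypothetical helper, not a library function.

```python
def precision_recall_at(threshold, y_true, scores):
    """Predict positive when score >= threshold, then return (precision, recall)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    # Assumed convention: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up model scores, sorted for readability; 1 = positive class.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.95]

results = {t: precision_recall_at(t, y_true, scores) for t in (0.0, 0.5, 0.9)}
for t, (p, r) in results.items():
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, raising the threshold from 0.0 to 0.9 lifts precision from 0.375 to 1.0 while recall falls from 1.0 to about 0.33.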

F1 score

F1 is the harmonic mean of precision and recall:

F1 = 2 * (precision * recall) / (precision + recall)

It is 1.0 when both are perfect, and it is pulled toward the lower of the two. The harmonic mean (rather than the arithmetic mean) penalises extreme imbalances between them: a model with precision=1.0 and recall=0.1 gets F1≈0.18, not the 0.55 an arithmetic mean would give.
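The arithmetic can be checked directly. A small sketch, where the function name f1_score and the choice to return 0.0 when precision and recall are both zero are assumptions made for this example:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention; F1 is undefined when both are zero
    return 2 * precision * recall / (precision + recall)


precision, recall = 1.0, 0.1
arithmetic = (precision + recall) / 2
harmonic = f1_score(precision, recall)

print(f"arithmetic mean: {arithmetic:.2f}")
print(f"F1 (harmonic mean): {harmonic:.2f}")
```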

The confusion matrix

The confusion matrix lays out all four outcomes at once:

	Predicted Negative	Predicted Positive
Actually Negative	True Negative (TN)	False Positive (FP)
Actually Positive	False Negative (FN)	True Positive (TP)

The diagonal holds the correct predictions; the off-diagonal entries are errors. Reading the matrix shows not just how many errors the model makes but in which direction it errs, which matters enormously for downstream decisions.

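In sklearn, confusion_matrix(y_true, y_pred) puts true labels on the rows and predicted labels on the columns, ordered by sorted label value. With labels 0 (negative) and 1 (positive) it returns a 2×2 array where cm[0,0] is TN, cm[0,1] is FP, cm[1,0] is FN, and cm[1,1] is TP. For multi-class problems the matrix extends naturally to n×n.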
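A short sketch of that layout, assuming scikit-learn is installed and the labels are 0 (negative) and 1 (positive). The toy labels are invented; passing labels=[0, 1] simply pins the row and column order explicitly.

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 0 = negative, 1 = positive.
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

# Rows are true labels, columns are predicted labels, in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# Flattening row by row yields TN, FP, FN, TP for the binary case.
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

Unpacking with ravel() keeps the four counts named, which is less error-prone than indexing cm[i, j] by hand.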

Choosing the right metric

No single metric is always right. The choice depends on the cost structure of the problem:

High cost of false positives: optimise precision.
High cost of false negatives: optimise recall.
Symmetric costs and balanced classes: F1 and accuracy are both reasonable.
Imbalanced classes with asymmetric costs: compute all four metrics (accuracy, precision, recall, F1) and report them together; any single summary statistic hides too much.
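Reporting the metrics side by side is cheap once the four confusion-matrix counts are known. The sketch below uses a hypothetical helper, summarize, and invented counts for a rare-positive problem; returning 0.0 when a denominator is zero is an assumed convention.

```python
def summarize(tn, fp, fn, tp):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    # Assumed convention: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented counts: 1,000 examples, only 10 of them actually positive.
report = summarize(tn=980, fp=10, fn=8, tp=2)
for name, value in report.items():
    print(f"{name:>9}: {value:.3f}")
```

With these counts accuracy is 0.982 while precision, recall, and F1 all sit at or below 0.2, a gap that only shows up when the numbers are read together.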

Where to go next

The next lesson moves to regression metrics — MAE, RMSE, and R² — which measure the magnitude of continuous prediction errors rather than categorical correctness.
