Lab: compare classification models

Fit three classifiers on the same dataset, measure accuracy and confusion matrices, and identify which generalises best.

The model-selection heuristics from the previous lesson become concrete when you run all three on the same classification problem. The goal here is not just to get numbers — it is to develop the habit of asking why the numbers differ and what that tells you about the data and the models.

The dataset

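We use a synthetic binary classification problem: 300 samples, 2 informative features, modest class overlap. It is small enough that all three models run instantly, but large enough that a 75/25 train/test split (225 training samples, 75 test samples) gives usable estimates. With only 75 test samples, one extra error moves accuracy by about 1.3 percentage points, so treat small gaps with caution.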

Python — editable, runs in your browser
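
A minimal sketch of the accuracy comparison that the next paragraph refers to, assuming the same make_classification call, split, and model settings that Checkpoint 2 uses below. The accuracies dictionary is a name introduced here for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Assumption: same synthetic data and split as the later checkpoints.
X, y = make_classification(
    n_samples=300, n_features=2, n_informative=2,
    n_redundant=0, n_clusters_per_class=1, random_state=7
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=7
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=500, random_state=7),
    "DecisionTree(d=4)": DecisionTreeClassifier(max_depth=4, random_state=7),
    "KNeighbors(k=7)": KNeighborsClassifier(n_neighbors=7),
}

# Fit each model on the training split and score it on the held-out split.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name:<22} test accuracy = {accuracies[name]:.3f}")
```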

Accuracy is a starting point, but it does not tell you where each model makes mistakes. A model might be correct 85% of the time but wrong on exactly the cases that matter most. The confusion matrix shows the breakdown.

Checkpoint 2 — confusion matrices

Python — editable, runs in your browser

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

X, y = make_classification(
  n_samples=300, n_features=2, n_informative=2,
  n_redundant=0, n_clusters_per_class=1, random_state=7
)

X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.25, random_state=7
)

models = {
  "LogisticRegression":  LogisticRegression(max_iter=500, random_state=7),
  "DecisionTree(d=4)":   DecisionTreeClassifier(max_depth=4, random_state=7),
  "KNeighbors(k=7)":     KNeighborsClassifier(n_neighbors=7),
}

for name, model in models.items():
  model.fit(X_train, y_train)
  preds = model.predict(X_test)
  # For labels 0/1 the matrix is laid out as [[TN, FP], [FN, TP]]
  cm = confusion_matrix(y_test, preds)
  print(f"--- {name} ---")
  print(f"  TN={cm[0,0]}  FP={cm[0,1]}")
  print(f"  FN={cm[1,0]}  TP={cm[1,1]}")
  print()

Reading the confusion matrix: each row is the true class; each column is the predicted class. The diagonal is correct predictions (true negatives and true positives). Off-diagonal entries are errors — false positives (predicted positive when actually negative) and false negatives (missed positives).
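
To make the row/column layout concrete, here is a small sketch on hand-made labels (illustrative values, not taken from the lab dataset). It relies on the binary-case convention that flattening the matrix in row-major order yields TN, FP, FN, TP.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
tn, fp, fn, tp = cm.ravel()            # row-major flattening of the 2x2 matrix

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # TN=3 FP=1 FN=2 TP=4
print(f"correct = {cm.trace()} of {cm.sum()}")  # diagonal = correct predictions
```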

Two models with identical accuracy can have very different confusion matrices. One might favour false positives; another might favour false negatives. Which is worse depends entirely on the problem domain: a medical screening test, for example, usually tolerates false positives (which trigger a follow-up test) far more readily than missed diagnoses.
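
A hedged illustration of that point, using two hypothetical sets of predictions (pred_a and pred_b are made up for this sketch, not produced by the lab's models). Both reach the same accuracy, but one makes only false-positive errors and the other only false-negative errors.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# 10 true negatives followed by 10 true positives.
y_true = np.array([0] * 10 + [1] * 10)

# Hypothetical model A: 4 false positives, no false negatives.
pred_a = np.array([1] * 4 + [0] * 6 + [1] * 10)
# Hypothetical model B: no false positives, 4 false negatives.
pred_b = np.array([0] * 10 + [0] * 4 + [1] * 6)

for label, pred in [("A", pred_a), ("B", pred_b)]:
    tn, fp, fn, tp = confusion_matrix(y_true, pred).ravel()
    acc = accuracy_score(y_true, pred)
    print(f"model {label}: accuracy={acc:.2f}  FP={fp}  FN={fn}")
```

Both lines report an accuracy of 0.80, yet model B misses four of the ten positives while model A misses none.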

Checkpoint 3 — full metric sweep

Python — editable, runs in your browser

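from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Same data and split as Checkpoint 2
X, y = make_classification(
  n_samples=300, n_features=2, n_informative=2,
  n_redundant=0, n_clusters_per_class=1, random_state=7
)

X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.25, random_state=7
)

# Same three models as Checkpoint 2, redefined so this block runs on its own
models = {
  "LogisticRegression":  LogisticRegression(max_iter=500, random_state=7),
  "DecisionTree(d=4)":   DecisionTreeClassifier(max_depth=4, random_state=7),
  "KNeighbors(k=7)":     KNeighborsClassifier(n_neighbors=7),
}

header = f"{'Model':<25} {'Acc':>6} {'Prec':>6} {'Rec':>6} {'F1':>6}"
print(header)
print("-" * len(header))

# One row per model: accuracy, precision, recall and F1 on the test split
for name, model in models.items():
  model.fit(X_train, y_train)
  p = model.predict(X_test)
  print(
    f"{name:<25} "
    f"{accuracy_score(y_test, p):>6.3f} "
    f"{precision_score(y_test, p):>6.3f} "
    f"{recall_score(y_test, p):>6.3f} "
    f"{f1_score(y_test, p):>6.3f}"
  )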
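
The precision, recall, and F1 columns are all functions of the confusion-matrix counts. The sketch below computes them by hand from illustrative counts (TP=4, FP=1, FN=2, chosen for this example rather than taken from the lab data) and checks the results against scikit-learn.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Illustrative labels that produce TP=4, FP=1, FN=2, TN=3.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])
tp, fp, fn = 4, 1, 2

precision = tp / (tp + fp)  # of the predicted positives, the share that was right
recall = tp / (tp + fn)     # of the actual positives, the share that was found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

# The hand-computed values should agree with scikit-learn's.
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
assert np.isclose(f1, f1_score(y_true, y_pred))

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Note that none of the three formulas uses the true-negative count, which is one reason they can diverge from accuracy.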

Interpretation exercise

After running the three blocks above, write your interpretation (mentally, or in a notebook). A complete interpretation addresses:

Which model has the highest accuracy, and is the gap meaningful?
Do precision and recall differ substantially across models? What would that mean for a use case where false negatives are costly?
Does the confusion matrix pattern match what the accuracy numbers suggest?
Based on the bias-variance framework: is any model likely overfitting at these settings? How would you test that hypothesis?
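
One possible way to probe the overfitting question is to compare training accuracy with held-out accuracy: a large gap suggests the model is fitting noise. The sketch below does this for the decision tree at a few depths, assuming the same data and split as the checkpoints. The depth values and the gaps dictionary are choices made for this illustration, and 5-fold cross-validation on the training split is added as a less split-dependent estimate.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=300, n_features=2, n_informative=2,
    n_redundant=0, n_clusters_per_class=1, random_state=7
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=7
)

# max_depth=None lets the tree grow until its leaves are pure.
gaps = {}
for depth in (2, 4, None):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=7)
    # Cross-validated accuracy, computed on the training split only.
    cv_acc = cross_val_score(tree, X_train, y_train, cv=5).mean()
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    gaps[depth] = train_acc - test_acc
    print(
        f"max_depth={str(depth):<4} "
        f"train={train_acc:.3f} test={test_acc:.3f} cv={cv_acc:.3f}"
    )
```

If training accuracy keeps climbing with depth while test and cross-validated accuracy stall or fall, that is evidence of overfitting at the deeper settings.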

For this synthetic dataset, logistic regression often performs comparably to the tree and k-NN because the decision boundary is close to linear. If you increase n_informative (which also means raising n_features, since make_classification does not allow more informative features than total features) or add polynomial structure, the gap will usually widen. Changing the data and re-running is a fast way to build intuition about when each model earns its keep.
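
A sketch of that re-running habit, under stated assumptions: the compare helper is hypothetical, and make_moons (with noise=0.25) is swapped in here as one convenient source of a non-linear boundary. It is not the variation the lesson itself prescribes.

```python
from sklearn.datasets import make_classification, make_moons
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier


def compare(X, y, seed=7):
    """Fit the lab's three models on one split and return their test accuracies."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    models = {
        "LogisticRegression": LogisticRegression(max_iter=500, random_state=seed),
        "DecisionTree(d=4)": DecisionTreeClassifier(max_depth=4, random_state=seed),
        "KNeighbors(k=7)": KNeighborsClassifier(n_neighbors=7),
    }
    return {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}


datasets = {
    "near-linear (lab data)": make_classification(
        n_samples=300, n_features=2, n_informative=2,
        n_redundant=0, n_clusters_per_class=1, random_state=7
    ),
    # Assumption: two interleaved half-moons as a stand-in for non-linear structure.
    "two moons (non-linear)": make_moons(n_samples=300, noise=0.25, random_state=7),
}

results = {}
for label, (X, y) in datasets.items():
    results[label] = compare(X, y)
    print(label, {name: round(acc, 3) for name, acc in results[label].items()})
```

Compare the spread between the linear model and the other two on each dataset, then try other noise levels or seeds to see how stable the ranking is.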

Where to go next

The ML Concepts module is complete. Next: Sklearn in Practice — the uniform fit/predict/transform API that makes all of these models composable into pipelines.
