Lab: evaluate honestly

Build a classification pipeline, tune it with 5-fold GridSearchCV, then report the full metric suite on a held-out test set.

The goal of this lab is an honest evaluation report — the kind you would present to a stakeholder or write into a model card. Honest means: tuning was done with cross-validation on training data only, and the final numbers come from a test set that was never touched until the last step.

Setup — data and baseline

The dataset has a mild class imbalance (65/35), so a baseline that always predicts the majority class already scores about 65% accuracy. Any model you build must beat that number before it deserves the word "model". Print the baseline first so that every later metric has a reference point.
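
The code cell for this step is not reproduced on this page. The following is a minimal sketch of the setup, assuming the same make_classification and train_test_split arguments as the Checkpoint 3 listing and using scikit-learn's DummyClassifier for the majority-class baseline. The variable names are illustrative. The baseline is scored on the training labels so the test set stays untouched; because the split is stratified, the test-set figure should be nearly identical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Same synthetic data and split as the Checkpoint 3 listing.
X, y = make_classification(
  n_samples=600, n_features=12, n_informative=7,
  n_redundant=3, weights=[0.65, 0.35], random_state=11
)
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2, stratify=y, random_state=11
)

# Class proportions in the training split (roughly 65/35).
class_share = np.bincount(y_train) / len(y_train)
print(f"Class share (train): {class_share.round(3)}")

# Baseline: always predict the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_preds = baseline.predict(X_train)

baseline_acc = accuracy_score(y_train, baseline_preds)
# The baseline never predicts the positive class (label 1), so its F1 is 0.
# zero_division=0 silences the warning about undefined precision.
baseline_f1 = f1_score(y_train, baseline_preds, zero_division=0)

print(f"Baseline accuracy: {baseline_acc:.4f}")
print(f"Baseline F1:       {baseline_f1:.4f}")
```

The zero F1 is worth noticing: a predictor can reach about 65% accuracy here while never finding a single positive example, which is why accuracy alone is a weak yardstick on this dataset.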

Checkpoint 2 — build and tune the pipeline

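Note that scoring="f1" is used here rather than accuracy. That suits an imbalanced dataset where false negatives and false positives both matter, because F1 (computed for the positive class, label 1) depends on precision and recall rather than on easy majority-class predictions. GridSearchCV selects the hyperparameters with the highest mean F1 across the five cross-validation folds, not the ones that maximise accuracy.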
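
The code cell for this step is also missing from the page. As a sketch, the pipeline and search below reuse the estimator, grid, and scoring from the Checkpoint 3 listing; param_grid is an illustrative name, and n_jobs is left at its default here for portability.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
  n_samples=600, n_features=12, n_informative=7,
  n_redundant=3, weights=[0.65, 0.35], random_state=11
)
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2, stratify=y, random_state=11
)

# The scaler sits inside the pipeline, so it is re-fitted on the training
# part of every CV fold and never sees that fold's validation rows.
pipe = Pipeline([
  ("scaler", StandardScaler()),
  ("clf",    DecisionTreeClassifier(random_state=11)),
])

# 4 depths x 3 leaf sizes = 12 candidates, each scored on 5 folds.
param_grid = {
  "clf__max_depth": [2, 4, 6, 8],
  "clf__min_samples_leaf": [1, 5, 10],
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
grid.fit(X_train, y_train)  # training data only; X_test is not used

print(f"Best params: {grid.best_params_}")
print(f"Best CV F1:  {grid.best_score_:.4f}")
```

Keeping the scaler inside the pipeline matters more for scale-sensitive models than for a decision tree, but it is the habit that prevents preprocessing from leaking validation data into training.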

Checkpoint 3 — full evaluation on the test set

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
  accuracy_score, precision_score, recall_score,
  f1_score, confusion_matrix
)

X, y = make_classification(
  n_samples=600, n_features=12, n_informative=7,
  n_redundant=3, weights=[0.65, 0.35], random_state=11
)

X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2, stratify=y, random_state=11
)

pipe = Pipeline([
  ("scaler", StandardScaler()),
  ("clf",    DecisionTreeClassifier(random_state=11)),
])

grid = GridSearchCV(
  pipe,
  {"clf__max_depth": [2, 4, 6, 8], "clf__min_samples_leaf": [1, 5, 10]},
  cv=5, scoring="f1", n_jobs=-1
)
grid.fit(X_train, y_train)

best = grid.best_estimator_
preds = best.predict(X_test)

print("=== Final test-set evaluation ===")
print(f"Accuracy:  {accuracy_score(y_test, preds):.4f}")
print(f"Precision: {precision_score(y_test, preds):.4f}")
print(f"Recall:    {recall_score(y_test, preds):.4f}")
print(f"F1:        {f1_score(y_test, preds):.4f}")

cm = confusion_matrix(y_test, preds)
print("\nConfusion matrix:")
print(f"  TN={cm[0,0]}  FP={cm[0,1]}")
print(f"  FN={cm[1,0]}  TP={cm[1,1]}")
print(f"\nBest params: {grid.best_params_}")
print(f"Best CV F1:  {grid.best_score_:.4f}")

Interpretation guide

Work through these questions after running the code:

Is accuracy above the majority-class baseline? If not, the model has learned nothing useful.
Compare precision and recall. If precision is much higher than recall, the model is conservative — it only predicts positive when very confident, missing many real positives. The reverse means it is aggressive, producing many false alarms.
Read the confusion matrix. How many positive examples were missed (FN)? In a medical context, FN = missed diagnoses — the most dangerous failure. In a spam filter, FP = important email in spam — the most annoying failure.
Is the test F1 close to the CV F1? A large drop suggests that the search overfit to the training folds, or that the test set differs from the training data. With only 120 test examples, some gap is expected from sampling noise alone.
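
The last question can be checked numerically. The sketch below repeats the search from the listing above, then compares the selected model's mean CV F1 with its test F1 and reports the fold-to-fold standard deviation for scale. Reading the gap against that spread is an informal rule of thumb assumed here, not a formal test.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
  n_samples=600, n_features=12, n_informative=7,
  n_redundant=3, weights=[0.65, 0.35], random_state=11
)
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2, stratify=y, random_state=11
)

pipe = Pipeline([
  ("scaler", StandardScaler()),
  ("clf",    DecisionTreeClassifier(random_state=11)),
])
grid = GridSearchCV(
  pipe,
  {"clf__max_depth": [2, 4, 6, 8], "clf__min_samples_leaf": [1, 5, 10]},
  cv=5, scoring="f1"
)
grid.fit(X_train, y_train)

# Mean and spread of the fold scores for the selected candidate.
i = grid.best_index_
cv_mean = grid.cv_results_["mean_test_score"][i]  # same value as grid.best_score_
cv_std = grid.cv_results_["std_test_score"][i]

# grid.predict uses the best estimator refitted on all training data.
test_f1 = f1_score(y_test, grid.predict(X_test))
gap = cv_mean - test_f1

print(f"CV F1:   {cv_mean:.4f} (std across folds {cv_std:.4f})")
print(f"Test F1: {test_f1:.4f}")
print(f"Gap:     {gap:+.4f}")
```

A gap comparable to the fold standard deviation is hard to distinguish from noise; a gap several times larger deserves a closer look.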

A complete evaluation report includes: the metric table above, the confusion matrix, the best hyperparameters, the CV score used for selection, and a sentence explaining which metric was chosen for tuning and why. That context is what lets a future reader trust the numbers.
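
One way to keep those pieces together is to collect them in a single structure. This is a sketch only: build_report and its dictionary keys are hypothetical names chosen for illustration, not part of scikit-learn or of this lab's original code.

```python
import json

from sklearn.datasets import make_classification
from sklearn.metrics import (
  accuracy_score, precision_score, recall_score,
  f1_score, confusion_matrix
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def build_report(grid, X_test, y_test, tuning_metric, rationale):
    """Collect the report items listed above from a fitted GridSearchCV."""
    preds = grid.predict(X_test)
    # For binary labels [0, 1], ravel() yields TN, FP, FN, TP in that order.
    tn, fp, fn, tp = confusion_matrix(y_test, preds, labels=[0, 1]).ravel()
    return {
        "test_metrics": {
            "accuracy": float(accuracy_score(y_test, preds)),
            "precision": float(precision_score(y_test, preds)),
            "recall": float(recall_score(y_test, preds)),
            "f1": float(f1_score(y_test, preds)),
        },
        "confusion_matrix": {
            "TN": int(tn), "FP": int(fp), "FN": int(fn), "TP": int(tp),
        },
        "best_params": grid.best_params_,
        "cv_score": float(grid.best_score_),
        "tuning_metric": tuning_metric,
        "rationale": rationale,
    }


X, y = make_classification(
  n_samples=600, n_features=12, n_informative=7,
  n_redundant=3, weights=[0.65, 0.35], random_state=11
)
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2, stratify=y, random_state=11
)
pipe = Pipeline([
  ("scaler", StandardScaler()),
  ("clf",    DecisionTreeClassifier(random_state=11)),
])
grid = GridSearchCV(
  pipe,
  {"clf__max_depth": [2, 4, 6, 8], "clf__min_samples_leaf": [1, 5, 10]},
  cv=5, scoring="f1"
)
grid.fit(X_train, y_train)

report = build_report(
  grid, X_test, y_test,
  tuning_metric="f1",
  rationale="Classes are imbalanced (65/35) and both error types matter.",
)
print(json.dumps(report, indent=2))
```

Serialising the report as JSON makes it easy to store next to the model artifact, so the numbers and the reasoning behind the metric choice travel together.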

Where to go next

The Model Evaluation module is complete. Next: Time Series — data indexed by time has structure that standard ML ignores, and different tools are required to exploit it correctly.
