Lab: evaluate honestly

Build a classification pipeline, tune it with 5-fold GridSearchCV, then report the full metric suite on a held-out test set.

Lab · optional · Data Science · Advanced · 30 min
By the end of this lesson you will be able to:
  • Build a preprocessing + classifier pipeline
  • Tune it with 5-fold GridSearchCV
  • Report accuracy, precision, recall, F1, and confusion matrix on the held-out test set
  • Interpret each metric in the context of the problem

The goal of this lab is an honest evaluation report — the kind you would present to a stakeholder or write into a model card. Honest means: tuning was done with cross-validation on training data only, and the final numbers come from a test set that was never touched until the last step.

Setup — data and baseline

Python — editable, runs in your browser

The dataset has a mild class imbalance (65/35). The baseline predictor always predicts the majority class, so it scores roughly 65% accuracy without learning anything; any model you build must exceed this before it deserves the word "model". Print the baseline first so that every later number has a reference point.
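The lab's own dataset is not reproduced here, so the following is only a minimal sketch of the setup step under stated assumptions: a synthetic stand-in generated with make_classification at roughly 65/35, an 80/20 stratified split, and scikit-learn's DummyClassifier as the majority-class baseline. The sample size, feature counts, and random seeds are illustrative choices, not the lab's values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in for the lab's data, with roughly 65/35 classes.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.65, 0.35], random_state=0,
)

# Stratify so the train and test splits keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Majority-class baseline: ignores the features and always predicts
# the most frequent class seen in the training labels.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

majority_share = np.bincount(y_test).max() / len(y_test)
print(f"Majority-class share in test set: {majority_share:.3f}")
print(f"Baseline accuracy on test set:    {baseline_acc:.3f}")
```

The two printed numbers should match, because the baseline's accuracy is simply the share of the majority class in the set it is scored on.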

Checkpoint 2 — build and tune the pipeline

Python — editable, runs in your browser

Note that scoring="f1" is used here rather than accuracy, which is appropriate for an imbalanced dataset where false negatives and false positives both matter. GridSearchCV will select the hyperparameter combination with the highest mean F1 across the five cross-validation folds, not the one with the highest accuracy.
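As a rough illustration of this checkpoint, the sketch below tunes a small pipeline with 5-fold GridSearchCV and scoring="f1". The specific choices are assumptions made for the example, not the lab's code: StandardScaler as the preprocessing step, LogisticRegression as the classifier, the grid over C, and the synthetic data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for the lab's data (roughly 65/35 classes).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.65, 0.35], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Preprocessing lives inside the pipeline, so the scaler is re-fit on the
# training part of every CV fold and never sees that fold's validation rows.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: "clf__C" addresses the C parameter of the "clf" step.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)  # training data only; the test set stays untouched

print("Best params:", search.best_params_)
print(f"Best mean CV F1: {search.best_score_:.3f}")
```

Only X_train and y_train are passed to fit, which is what keeps the later test-set numbers honest.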

Checkpoint 3 — full evaluation on the test set

Python — editable, runs in your browser

Interpretation guide

Work through these questions after running the code:

  1. Is accuracy above the majority-class baseline? If not, the model has learned nothing useful.

  2. Compare precision and recall. If precision is much higher than recall, the model is conservative — it only predicts positive when very confident, missing many real positives. The reverse means it is aggressive, producing many false alarms.

  3. Read the confusion matrix. How many positive examples were missed (FN)? In a medical context, FN = missed diagnoses — the most dangerous failure. In a spam filter, FP = important email in spam — the most annoying failure.

  4. Is the test F1 close to the CV F1? A large drop suggests either overfitting to the training data or a test set drawn from a slightly different distribution.
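The four questions above reduce to a few lines of arithmetic on the confusion matrix. The counts and the CV score below are made up purely for illustration (a 200-row test set with a 65/35 split); they are not results from the lab.

```python
# Hypothetical confusion-matrix counts (made up for illustration only).
tn, fp, fn, tp = 120, 10, 30, 40
total = tn + fp + fn + tp

# Question 1: accuracy versus the majority-class baseline.
majority_baseline = max(tn + fp, fn + tp) / total
accuracy = (tp + tn) / total

# Question 2: precision versus recall.
precision = tp / (tp + fp)  # of the predicted positives, how many were real
recall = tp / (tp + fn)     # of the real positives, how many were found

# Question 3: the missed positives are the FN cell.
missed_positives = fn

# Question 4: gap between the CV score and the test score.
f1 = 2 * precision * recall / (precision + recall)
cv_f1 = 0.70  # hypothetical CV F1 from the tuning step
f1_gap = cv_f1 - f1

print(f"baseline={majority_baseline:.2f} accuracy={accuracy:.2f}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"missed positives (FN)={missed_positives} CV-to-test F1 gap={f1_gap:.2f}")
```

In this made-up case accuracy (0.80) beats the baseline (0.65), while precision (0.80) is well above recall (about 0.57), so the model would count as conservative: it misses 30 of the 70 real positives.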

A complete evaluation report includes: the metric table above, the confusion matrix, the best hyperparameters, the CV score used for selection, and a sentence explaining which metric was chosen for tuning and why. That context is what lets a future reader trust the numbers.
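One possible way to assemble such a report is sketched below. It repeats a tuning step so the block runs on its own; the data, pipeline, and grid are the same kind of illustrative assumptions as before (synthetic data, StandardScaler plus LogisticRegression, a small grid over C), and the report dictionary and its key names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in for the lab's data (roughly 65/35 classes).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.65, 0.35], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=5, scoring="f1")
search.fit(X_train, y_train)

# The test set is used exactly once, here. With the default refit=True,
# predict() uses the best pipeline re-fit on all of the training data.
y_pred = search.predict(X_test)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class

# Hypothetical report structure covering the items listed above.
report = {
    "tuning_metric": "F1, 5-fold CV on training data only",
    "best_params": search.best_params_,
    "cv_f1": search.best_score_,
    "test_accuracy": accuracy_score(y_test, y_pred),
    "test_precision": precision_score(y_test, y_pred),
    "test_recall": recall_score(y_test, y_pred),
    "test_f1": f1_score(y_test, y_pred),
    "confusion_matrix": cm.tolist(),
}

for key, value in report.items():
    print(f"{key}: {value}")
```

Keeping the tuning metric and the CV score next to the test metrics makes the gap between them visible to whoever reads the report later.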

Where to go next

The Model Evaluation module is complete. Next: Time Series — data indexed by time has structure that standard ML ignores, and different tools are required to exploit it correctly.
