Lab: evaluate honestly
Build a classification pipeline, tune it with 5-fold GridSearchCV, then report the full metric suite on a held-out test set.
- Build a preprocessing + classifier pipeline
- Tune it with 5-fold GridSearchCV
- Report accuracy, precision, recall, F1, and the confusion matrix on the held-out test set
- Interpret each metric in the context of the problem
The goal of this lab is an honest evaluation report — the kind you would present to a stakeholder or write into a model card. Honest means: tuning was done with cross-validation on training data only, and the final numbers come from a test set that was never touched until the last step.
Setup — data and baseline
The dataset has a mild class imbalance (65/35). The baseline predictor always predicts the majority class, so its accuracy is about 65%. Any model you build must exceed this before it deserves the word "model". Printing the baseline first is good practice because it gives every later number a reference point.
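The baseline described above can be sketched as follows. This is a minimal illustration, not the lab's own code: the data is a synthetic stand-in generated with make_classification at roughly a 65/35 split, and the variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in data with roughly a 65/35 class split.
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    weights=[0.65, 0.35],
    random_state=0,
)

# Hold out the test set now; it is not used again until final evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)
print(f"Majority-class baseline accuracy: {baseline_acc:.3f}")
```

Stratifying the split keeps the class ratio the same in train and test, so the baseline accuracy on the test set stays close to the majority-class share.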
Checkpoint 2 — build and tune the pipeline
Note that scoring="f1" is used here rather than accuracy. That is appropriate for an imbalanced dataset where false negatives and false positives both matter. GridSearchCV therefore selects the hyperparameter combination with the highest mean F1 across the five cross-validation folds, not the one that maximises accuracy.
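A minimal sketch of this checkpoint, under stated assumptions: the data is synthetic, and the pipeline (StandardScaler followed by LogisticRegression) and the grid over C are stand-ins chosen for illustration, not necessarily the lab's actual model or grid.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in data with roughly a 65/35 class split.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.65, 0.35], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Preprocessing and classifier live in one pipeline, so the scaler is
# re-fit inside every CV fold and never sees that fold's validation rows.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid; step names prefix the parameter with "clf__".
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold search on the training data only, selecting by F1.
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

print("Best params:", search.best_params_)
print(f"Best CV F1: {search.best_score_:.3f}")
```

Only X_train and y_train are passed to fit, which is what keeps the test set out of the search loop.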
Checkpoint 3 — full evaluation on the test set
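This checkpoint calls for a single pass over the held-out test set with the tuned model. The sketch below is one way that might look; the synthetic data, the StandardScaler plus LogisticRegression pipeline, and the grid are all assumptions made so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic stand-in data and a stand-in pipeline.
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.65, 0.35], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
search = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]},
                      scoring="f1", cv=5)
search.fit(X_train, y_train)

# First and only use of the test set: predict with the refit best model.
y_pred = search.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
for name, value in metrics.items():
    print(f"{name:<10}{value:.3f}")

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(f"CV F1 used for selection: {search.best_score_:.3f}")
```

Printing the CV F1 next to the test F1 makes the gap between selection score and final score visible in the same report.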
Interpretation guide
Work through these questions after running the code:
- Is accuracy above the majority-class baseline? If not, the model offers no improvement over always guessing the majority class, at least as measured by accuracy.
- Compare precision and recall. If precision is much higher than recall, the model is conservative: it only predicts positive when very confident, missing many real positives. The reverse means it is aggressive, producing many false alarms.
- Read the confusion matrix. How many positive examples were missed (FN)? In a medical context, FN means missed diagnoses, the most dangerous failure. In a spam filter, FP means important email sent to spam, the most annoying failure.
- Is the test F1 close to the CV F1? A large drop suggests overfitting to the training distribution or that the test set is drawn from a slightly different distribution.
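The questions above can be worked through by hand from the four confusion-matrix cells. The sketch below uses made-up counts for a 200-row test set and a made-up CV F1; none of these numbers come from the lab's data.

```python
# Hypothetical confusion-matrix counts (illustrative only, not real results).
tn, fp, fn, tp = 120, 10, 25, 45
cv_f1 = 0.75  # hypothetical cross-validation F1 from tuning

total = tn + fp + fn + tp

# Question 1: accuracy versus always predicting the larger class.
accuracy = (tp + tn) / total
baseline_accuracy = max(tn + fp, fn + tp) / total

# Question 2: precision versus recall.
precision = tp / (tp + fp)   # of predicted positives, how many were real
recall = tp / (tp + fn)      # of real positives, how many were found

# Question 4: test F1 versus the CV F1 used for selection.
f1 = 2 * precision * recall / (precision + recall)
f1_gap = cv_f1 - f1

print(f"accuracy  {accuracy:.3f} (baseline {baseline_accuracy:.3f})")
print(f"precision {precision:.3f}")
print(f"recall    {recall:.3f}")
print(f"test F1   {f1:.3f} (CV F1 {cv_f1:.3f}, gap {f1_gap:+.3f})")
print(f"missed positives (FN): {fn}, false alarms (FP): {fp}")
```

With these made-up counts, precision (about 0.82) exceeds recall (about 0.64), which is the conservative pattern: 25 real positives are missed while only 10 false alarms are raised.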
A complete evaluation report includes: the metric table above, the confusion matrix, the best hyperparameters, the CV score used for selection, and a sentence explaining which metric was chosen for tuning and why. That context is what lets a future reader trust the numbers.
Where to go next
The Model Evaluation module is complete. Next: Time Series — data indexed by time has structure that standard ML ignores, and different tools are required to exploit it correctly.