
Hyperparameter tuning

Use GridSearchCV to sweep max_depth values, read best_params_ and best_score_, and understand why the test set must stay out of the search loop.

By the end of this lesson you will be able to:
  • Run GridSearchCV over a single hyperparameter and read the results
  • Explain why the test set must not be used during hyperparameter selection
  • Interpret best_params_ and best_score_ correctly

Cross-validation tells you how well a model generalises with a specific set of hyperparameters. GridSearchCV automates the obvious next step: try many combinations, use CV to score each, and return the best one. The critical constraint is that the test set must never appear in this loop.

Why the test set must stay out

Suppose you tune max_depth by evaluating each candidate directly on the test set, here 60 held-out examples. You pick the depth that scores highest on those 60 examples, so they have shaped your choice: you have effectively trained on them. The test accuracy you then report is optimistically biased by the search itself, because part of it reflects a depth that happens to suit those particular examples.

GridSearchCV avoids this by using cross-validation on the training set only. The test set remains completely untouched until a single final evaluation after the best model is selected.
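The contamination can be made visible with a minimal sketch. Everything in it is an assumption for illustration, not taken from the lesson's own code: synthetic pure-noise data, a 60-example test set, a DecisionTreeClassifier, and an arbitrary depth grid. Because the labels are random, no depth has real predictive power, so whatever lift the "winning" depth shows over the average comes from selecting on the test set.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Assumption: pure-noise data, so the features say nothing about the labels.
X_train = rng.normal(size=(240, 5))
y_train = rng.integers(0, 2, size=240)
X_test = rng.normal(size=(60, 5))
y_test = rng.integers(0, 2, size=60)

depths = [1, 2, 3, 4, 5, 6, 8, 10]  # illustrative grid
test_scores = []
for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    # The mistake being illustrated: scoring every candidate on the test set.
    test_scores.append(float(tree.score(X_test, y_test)))

picked_depth = depths[int(np.argmax(test_scores))]
reported = max(test_scores)
average = float(np.mean(test_scores))

print("test accuracy per depth:", {d: round(s, 3) for d, s in zip(depths, test_scores)})
print("depth picked on the test set:", picked_depth)
print("reported accuracy:", round(reported, 3), "| average over depths:", round(average, 3))
```

The reported number is the maximum of several noisy measurements, so it can never fall below their average. That selection effect is what leaks into a test score when the test set is used for tuning.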

Running GridSearchCV

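The following is a sketch of such a sweep, not the lesson's original code. The dataset (synthetic, from make_classification), the depth grid, and the random seeds are assumptions chosen so that the hold-out contains 60 examples and the search uses 5-fold CV as described in the text. The column labels mean_train and mean_cv are printed names for cv_results_["mean_train_score"] and cv_results_["mean_test_score"].

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data; 300 samples with 60 held out for the final test.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=60, stratify=y, random_state=0
)

param_grid = {"max_depth": [1, 2, 3, 4, 5, 6, 8, 10]}  # illustrative grid
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,                     # 5-fold cross-validation on the training set
    scoring="accuracy",
    return_train_score=True,  # needed to get the training-fold scores
)
search.fit(X_train, y_train)  # the test set is never passed to the search

print("best_params_:", search.best_params_)
print("best_score_: ", round(search.best_score_, 3))

# Full sweep table: one row per max_depth value.
res = search.cv_results_
print(f"{'max_depth':>9} {'mean_train':>10} {'mean_cv':>8}")
for params, train, cv in zip(res["params"], res["mean_train_score"], res["mean_test_score"]):
    print(f"{params['max_depth']:>9} {train:>10.3f} {cv:>8.3f}")
```

Note that in cv_results_ the word "test" in mean_test_score refers to the held-out fold inside cross-validation, not to the final test set.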

Reading the output

best_params_ tells you which combination GridSearchCV selected. best_score_ is the mean cross-validation score for that combination — computed entirely on the training set, using 5-fold CV.
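One way to check this reading of best_score_ is to recompute it by hand. The sketch below assumes the same synthetic setup as an illustration (dataset, grid, and seeds are not from the lesson): it reruns 5-fold CV with cross_val_score for the selected max_depth and compares the mean of the fold scores with best_score_.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data with 60 examples held out.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=60, stratify=y, random_state=0
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [1, 2, 3, 4, 5, 6, 8, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Rebuild a tree with the selected depth and score it with the same 5-fold CV.
best_tree = DecisionTreeClassifier(random_state=0, **search.best_params_)
fold_scores = cross_val_score(best_tree, X_train, y_train, cv=5, scoring="accuracy")

print("fold scores:       ", fold_scores.round(3))
print("mean of fold scores:", round(float(fold_scores.mean()), 3))
print("best_score_:        ", round(float(search.best_score_), 3))
```

The two numbers agree because an integer cv with a classifier uses the same unshuffled stratified folds in both calls. No test-set example is involved in either.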

The full sweep table is revealing. Watch the mean_train column: it rises as depth increases (more depth = better fit to training data) and flattens once the tree fits the training folds almost perfectly. The mean_cv column typically peaks somewhere in the middle: beyond that depth, variance grows faster than bias shrinks, and CV performance degrades.

Final evaluation on the test set

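A sketch of the final step, again under assumed data, grid, and seeds rather than the lesson's original code. With the default refit=True, GridSearchCV retrains the best configuration on the whole training set and exposes it as best_estimator_; calling search.score on the test set evaluates that refitted model once.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data with 60 examples held out.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=60, stratify=y, random_state=0
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [1, 2, 3, 4, 5, 6, 8, 10]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)  # selection uses the training set only

# Single, final use of the test set, after the search has finished.
test_accuracy = search.score(X_test, y_test)

print("selected params:", search.best_params_)
print("CV score (best_score_):", round(float(search.best_score_), 3))
print("test accuracy:         ", round(float(test_accuracy), 3))
```

Evaluating other depths on the test set afterwards and switching to whichever scores best would reintroduce the contamination this protocol avoids.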

The test accuracy is the number you report and trust. It was computed exactly once, on data that played no role in training or hyperparameter selection.

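If the test accuracy is substantially higher than the CV score, treat it as a warning sign: check whether the test set leaked into the search, although on a test set this small the difference can also be noise. If it is lower than the CV score, that is normal: best_score_ is slightly optimistic because it is the highest of all the candidate scores, so part of it reflects luck in the selection rather than real performance. (The smaller training folds used in CV pull the other way, since the final refit sees the full training set.) A gap of a few percentage points is unremarkable; with 60 test examples, a single prediction moves test accuracy by about 1.7 points.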
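To put a gap between the two scores in context, it helps to estimate how noisy an accuracy measured on 60 examples is. The sketch below uses the standard binomial standard error; the accuracy of 0.90 is a hypothetical value chosen for illustration, not a result from the lesson.

```python
import math

n_test = 60      # size of the held-out test set
accuracy = 0.90  # hypothetical test accuracy, for illustration only

# How far one changed prediction moves the test accuracy, in percentage points.
one_example_pp = 100 / n_test

# Binomial standard error of an accuracy estimated from n_test examples.
std_error = math.sqrt(accuracy * (1 - accuracy) / n_test)

print(f"one example = {one_example_pp:.2f} percentage points")
print(f"standard error = {100 * std_error:.2f} percentage points")
```

Under these assumptions the sampling noise of the test accuracy is itself a few percentage points, so a gap of that size between CV and test scores does not by itself signal a problem.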

Where to go next

The lab puts all of this together: build a classification pipeline, tune it with GridSearchCV, and report the full suite of metrics on the held-out test set — the honest evaluation protocol end to end.
