Hyperparameter tuning

Use GridSearchCV to sweep max_depth values, read best_params_ and best_score_, and understand why the test set must stay out of the search loop.

Cross-validation tells you how well a model generalises with a specific set of hyperparameters. GridSearchCV automates the obvious next step: try many combinations, use CV to score each, and return the best one. The critical constraint is that the test set must never appear in this loop.

Why the test set must stay out

Suppose you tune max_depth by evaluating on the test set directly. You pick the depth that scores highest on those specific 100 examples (20% of the 500 samples used below). Those examples have now shaped your choice, so in effect you have fitted a hyperparameter to them. The test accuracy you then report is contaminated by the search itself: it is the best of several candidate scores on the same data, which biases it upward.
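A minimal sketch of that contamination, under stated assumptions: the data is one large synthetic pool (5000 samples, an illustrative choice) sliced into a training set, a small test set, and a "fresh" remainder that plays no part in any decision. The names leaky_depth, test_scores, and fresh_scores are made up for this example. The score on the test set is, by construction, the maximum over all candidates; the score on fresh data is the less flattering reference, although in any single run the two may happen to be close.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Assumption: one large synthetic pool, sliced into train / small test / fresh data
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=6,
    n_redundant=2, random_state=0
)
X_train, y_train = X[:400], y[:400]
X_test, y_test = X[400:500], y[400:500]   # the small test set being misused
X_fresh, y_fresh = X[500:], y[500:]       # data that never influences any choice

depths = [2, 4, 6, 8, 10, None]
test_scores = {}
fresh_scores = {}
for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    test_scores[depth] = tree.score(X_test, y_test)
    fresh_scores[depth] = tree.score(X_fresh, y_fresh)

# The mistake: choose the depth that looks best on the test set itself
leaky_depth = max(test_scores, key=test_scores.get)

print("Depth chosen on the test set:      ", leaky_depth)
print("Test accuracy (used for selection):", round(test_scores[leaky_depth], 4))
print("Accuracy on fresh data:            ", round(fresh_scores[leaky_depth], 4))
```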

GridSearchCV avoids this as long as you fit it on the training set only: every candidate is scored by cross-validation inside the training data. The test set remains completely untouched until a single final evaluation after the best model is selected.
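Conceptually, the search is a loop over candidates with cross-validation inside it. The sketch below writes that loop by hand with cross_val_score, purely as an illustration; it is not how GridSearchCV is implemented, and the real class also handles refitting, parallelism, and result bookkeeping. The variable names are made up for this example. Note that X_test and y_test are created but never used.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=500, n_features=10, n_informative=6,
    n_redundant=2, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

depths = [2, 4, 6, 8, 10, None]
mean_cv_by_depth = {}
for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # 5-fold CV on the training set only; the test set is not involved
    fold_scores = cross_val_score(
        tree, X_train, y_train, cv=5, scoring="accuracy"
    )
    mean_cv_by_depth[depth] = fold_scores.mean()

# Pick the candidate with the highest mean CV accuracy
best_depth = max(mean_cv_by_depth, key=mean_cv_by_depth.get)
print("Best depth by manual CV loop:", best_depth)
print("Its mean CV accuracy:        ", round(mean_cv_by_depth[best_depth], 4))
```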

Running GridSearchCV

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV

# Synthetic binary classification problem: 500 samples, 10 features
X, y = make_classification(
  n_samples=500, n_features=10, n_informative=6,
  n_redundant=2, random_state=42
)

# Hold out 20% (100 samples) as the test set; the search never sees it
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2, random_state=42
)

# Candidate depths; None means no depth limit
param_grid = {"max_depth": [2, 4, 6, 8, 10, None]}

grid = GridSearchCV(
  DecisionTreeClassifier(random_state=42),
  param_grid,
  cv=5,
  scoring="accuracy",
  return_train_score=True,  # needed for mean_train_score below
)
grid.fit(X_train, y_train)  # training data only

print("Best depth:    ", grid.best_params_["max_depth"])
print("Best CV score: ", round(grid.best_score_, 4))

# Show the full sweep (str() keeps None readable instead of turning into NaN)
results = pd.DataFrame({
  "depth":      [str(p["max_depth"]) for p in grid.cv_results_["params"]],
  "mean_cv":    grid.cv_results_["mean_test_score"].round(4),
  "mean_train": grid.cv_results_["mean_train_score"].round(4),
})
print("\nFull sweep:")
print(results.to_string(index=False))

Reading the output

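best_params_ tells you which combination GridSearchCV selected. best_score_ is the mean cross-validation score for that combination, computed entirely on the training set using 5-fold CV. With the default refit=True, GridSearchCV then refits that combination on the whole training set and exposes the result as best_estimator_.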

The full sweep table is revealing. Watch the mean_train column: it rises as depth increases (more depth = better fit to training data) and levels off once the tree fits the training folds almost perfectly. The mean_cv column typically peaks at a moderate depth. Beyond that point, variance grows faster than bias shrinks, and CV performance degrades or flattens out.
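One way to make the pattern easier to see is to add a column for the difference between the two scores. The sketch below repeats the same search and adds a gap column (a name chosen for this example, not something GridSearchCV provides); a gap that keeps widening with depth is the usual signature of overfitting.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=500, n_features=10, n_informative=6,
    n_redundant=2, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [2, 4, 6, 8, 10, None]},
    cv=5,
    scoring="accuracy",
    return_train_score=True,
)
grid.fit(X_train, y_train)

sweep = pd.DataFrame({
    "depth": [str(p["max_depth"]) for p in grid.cv_results_["params"]],
    "mean_train": grid.cv_results_["mean_train_score"],
    "mean_cv": grid.cv_results_["mean_test_score"],
})
# Hypothetical helper column: how much better the tree does on data it has seen
sweep["gap"] = sweep["mean_train"] - sweep["mean_cv"]

print(sweep.round(4).to_string(index=False))
```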

Final evaluation on the test set

The test accuracy is the number you report and trust. It was computed exactly once, on data that played no role in training or hyperparameter selection.
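The code for this step is not shown on the page, so what follows is a minimal sketch of one way to do it, assuming the same data, split, and search as in the earlier example. It relies on best_estimator_, which GridSearchCV refits on the full training set when refit=True (the default), and touches X_test exactly once.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(
    n_samples=500, n_features=10, n_informative=6,
    n_redundant=2, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [2, 4, 6, 8, 10, None]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train, y_train)  # the search sees training data only

# Best configuration, already refit on all of X_train
best_model = grid.best_estimator_

# The single, final use of the test set
test_accuracy = accuracy_score(y_test, best_model.predict(X_test))

print("Best depth:    ", grid.best_params_["max_depth"])
print("Best CV score: ", round(grid.best_score_, 4))
print("Test accuracy: ", round(test_accuracy, 4))
```

Calling grid.score(X_test, y_test) would give the same number here, since the search was configured with scoring="accuracy".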

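If the test accuracy is far higher than the CV score, check for a problem such as the test set leaking into the search. If it is somewhat lower than the CV score, that is normal: best_score_ is slightly optimistic because it is the highest of several noisy CV estimates, and picking the maximum biases it upward. (The fact that each CV fold trains on less data than the final refit pushes the other way, making individual CV scores slightly pessimistic.) A gap of a few percentage points is expected, and with only 100 test examples, sampling noise alone can move the test accuracy by that much in either direction.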

Where to go next

The lab puts all of this together: build a classification pipeline, tune it with GridSearchCV, and report the full suite of metrics on the held-out test set — the honest evaluation protocol end to end.
