Cross-validation
Why a single train/test split is a noisy estimate — and how k-fold CV uses all the data for evaluation without peeking at the test set.
- Explain why a single split produces a high-variance accuracy estimate
- Describe k-fold cross-validation and why it gives a more stable estimate
- Explain stratified k-fold and when it is necessary
You split your dataset 80/20, fit a model, and get 84% accuracy. Is that good? More importantly: is it stable? Split differently and you might get 79% or 89%, depending entirely on which examples landed in the test set. A single split is one sample from the distribution of possible evaluation results. Cross-validation takes many samples and averages them.
The problem with a single split
A 20% test set on a 300-sample dataset is 60 examples. Accuracy measured on only 60 examples has high variance: each example is worth about 1.7 percentage points, so if just 3 of the 60 happen to be unusually hard (or unusually easy), your estimate shifts by 5 points.
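To put a rough number on that variance, the sketch below computes the binomial standard error of an accuracy measured on 60 test examples. It assumes the test examples are independent draws and uses a normal approximation for the interval; the variable names are illustrative only.

```python
import math

n_test = 60        # 20% of a 300-sample dataset
accuracy = 0.84    # the observed accuracy from the example above

# Each test example moves the accuracy by 1 / n_test.
per_example_shift = 1 / n_test

# Binomial standard error of an accuracy estimated from n_test examples
# (assumption: test examples are independent draws).
std_error = math.sqrt(accuracy * (1 - accuracy) / n_test)

# Rough 95% interval via the normal approximation.
low = accuracy - 1.96 * std_error
high = accuracy + 1.96 * std_error

print(f"one example shifts accuracy by {per_example_shift:.1%}")
print(f"standard error: {std_error:.1%}")
print(f"approx. 95% interval: {low:.1%} to {high:.1%}")
```

Under these assumptions the standard error is a little under 5 percentage points, so an observed 84% is compatible with a true accuracy anywhere from the mid 70s to the low 90s.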
Beyond variance, a single split wastes data: those 60 examples contribute nothing to training. On small datasets, this matters.
k-fold cross-validation
k-fold CV solves both problems:
- Divide the dataset into k folds of roughly equal size.
- For each fold i (1 to k): train on all folds except i, evaluate on fold i.
- Report the mean and standard deviation of the k scores.
Every example is used for training k−1 times and for evaluation exactly once, so no example is confined to the test set and none is left out of evaluation. The mean score over k folds is a substantially lower-variance estimate of generalisation performance than the score from a single split.
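The procedure can be sketched as an explicit loop with scikit-learn's KFold. The synthetic dataset from make_classification and the choice of LogisticRegression are assumptions made for illustration; any estimator with fit and score would work the same way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in dataset (an assumption; substitute your own X, y).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
times_tested = np.zeros(len(X), dtype=int)  # how often each example is evaluated

for train_idx, test_idx in kf.split(X):
    # Train on the k-1 remaining folds, evaluate on the held-out fold.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
    times_tested[test_idx] += 1

scores = np.array(scores)
print("fold scores:", np.round(scores, 3))
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
print("every example evaluated exactly once:", bool((times_tested == 1).all()))
```

A fresh model is created inside the loop so that nothing learned on one fold carries over to the next.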
Typical values of k are 5 and 10. Larger k gives lower bias (each model trains on a larger share of the data) at the price of more computation and, typically, higher variance (each test fold is smaller, so individual fold scores are noisier). k=5 is a practical default; k=10 is common when the extra model fits are affordable and each fold still contains enough examples.
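A quick way to see the trade-off is to run the same model with both values of k. This is a sketch on a synthetic dataset (an assumption), using cross_val_score with an integer cv; the exact numbers depend on the data and the model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in dataset (an assumption; substitute your own X, y).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

results = {}
for k in (5, 10):
    # One score per fold; k model fits in total.
    scores = cross_val_score(model, X, y, cv=k)
    results[k] = scores
    print(
        f"k={k}: about {len(X) // k} test examples per fold, "
        f"mean={scores.mean():.3f}, std across folds={scores.std():.3f}"
    )
```

With k=10 each fold has half as many test examples as with k=5, so the individual fold scores usually spread out more even when the means are close.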
Stratified k-fold
Standard k-fold assigns examples to folds randomly without regard to class distribution. On imbalanced datasets, some folds may have very few positive examples, making the fold's accuracy estimate unreliable.
Stratified k-fold preserves the class proportion in every fold. If the overall dataset is 20% positive, each fold will also be approximately 20% positive. This is almost always the right choice for classification tasks. scikit-learn's cross_val_score uses StratifiedKFold automatically when cv is an integer (or left at its default) and the estimator is a classifier with binary or multiclass targets.
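The difference is easy to see by counting positives per test fold. The sketch below builds a hypothetical label vector with 20 positives out of 100 and compares KFold with StratifiedKFold; the feature matrix is a placeholder because only the labels affect how stratified folds are built.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 20 positives, 80 negatives, shuffled.
rng = np.random.default_rng(0)
y = np.array([1] * 20 + [0] * 80)
rng.shuffle(y)
X = np.zeros((len(y), 1))  # placeholder features

plain = KFold(n_splits=5, shuffle=True, random_state=0)
strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Count the positive examples that land in each test fold.
plain_pos = [int(y[test].sum()) for _, test in plain.split(X, y)]
strat_pos = [int(y[test].sum()) for _, test in strat.split(X, y)]

print("positives per test fold, KFold:          ", plain_pos)
print("positives per test fold, StratifiedKFold:", strat_pos)
```

StratifiedKFold places 4 positives in every fold here, while the plain KFold counts vary from fold to fold.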
k-fold CV gives you an estimate of generalisation performance, not a trained model ready for deployment. After CV, you typically refit the chosen model on the full training set to get the best possible artefact. sklearn's GridSearchCV does this automatically via its refit=True default.
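As a minimal sketch of that refit step, the example below runs GridSearchCV on a training split and then uses best_estimator_, which has been refit on the whole training split. The synthetic data, the DecisionTreeClassifier and the max_depth grid are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in dataset (an assumption; substitute your own X, y).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8]},  # illustrative grid
    cv=5,
)  # refit=True is the default
search.fit(X_train, y_train)  # CV on the training split, then a refit on all of it

final_model = search.best_estimator_  # already trained on the full training split
print("best params:", search.best_params_)
print(f"mean CV score of best params: {search.best_score_:.3f}")
print(f"held-out test score: {final_model.score(X_test, y_test):.3f}")
```

The test split is touched only once, after the search has finished, so it stays out of the model selection loop.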
When CV is not enough
Leave-one-out CV (k = n) is the extreme case: each test fold is a single example. It produces very low-bias estimates, but the estimate can have very high variance, and it requires n model fits, which is computationally expensive. It is rarely used in practice except on very small datasets (roughly n < 50).
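A small sketch with scikit-learn's LeaveOneOut makes the cost and the shape of the scores concrete. The 40-example synthetic dataset and the LogisticRegression model are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Small synthetic dataset (an assumption; substitute your own X, y).
X, y = make_classification(n_samples=40, n_features=5, random_state=0)

loo = LeaveOneOut()
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=loo)

print("number of model fits:", loo.get_n_splits(X))
print("distinct fold scores:", sorted(set(scores.tolist())))
print(f"leave-one-out accuracy: {scores.mean():.3f}")
```

Each fold score is either 0 or 1 because the test fold holds a single example, so only the mean over all n folds is meaningful.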
For time series data, standard k-fold is incorrect: some folds would train on future data to predict the past. Use TimeSeriesSplit, which respects temporal order: each fold's training set is strictly earlier than its test set.
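The following sketch prints the indices that TimeSeriesSplit produces for 12 time-ordered observations. The data is a placeholder (an assumption); only the row order matters.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve observations, assumed to be sorted by time already.
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    # Training indices always precede the test indices.
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

The training window grows from fold to fold while each test window sits immediately after it, so no model is ever evaluated on data older than what it was trained on.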
Where to go next
Cross-validation gives you a score for a fixed set of hyperparameters. The next lesson automates the search: hyperparameter tuning with GridSearchCV.