Cross-validation

Why a single train/test split is a noisy estimate — and how k-fold CV uses all the data for evaluation without peeking at the test set.

You split your dataset 80/20, fit a model, and get 84% accuracy. Is that good? More importantly: is it stable? Split differently and you might get 79% or 89%, depending entirely on which examples landed in the test set. A single split is one sample from the distribution of possible evaluation results. Cross-validation takes many samples and averages them.

The problem with a single split

A 20% test set on a 300-sample dataset is 60 examples, so each example is worth about 1.7 percentage points of accuracy. An estimate built from so few examples has high variance: if 3 of the 60 happen to be unusually hard (or unusually easy), your estimate shifts by 5 percentage points.
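To put a rough number on that variance, the sketch below treats each test prediction as an independent coin flip and uses the binomial standard error with a normal approximation. Both are simplifying assumptions, and the helper name accuracy_standard_error is made up for this illustration.

```python
import math

def accuracy_standard_error(accuracy: float, n_test: int) -> float:
    """Binomial standard error of an accuracy measured on n_test examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)

n_test = 60       # 20% of a 300-sample dataset
accuracy = 0.84   # the score from the opening example

se = accuracy_standard_error(accuracy, n_test)

# Rough 95% interval using the normal approximation (an assumption).
low = accuracy - 1.96 * se
high = accuracy + 1.96 * se

print(f"standard error: {se:.3f}")
print(f"approximate 95% interval: {low:.2f} to {high:.2f}")

# Three examples flipping from right to wrong (or back) move the score by 3/60.
print(f"shift from 3 examples: {3 / n_test:.2f}")
```

Under these assumptions the interval is roughly nine points wide on either side of 84%, which is consistent with the 79% to 89% spread described in the introduction.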

Beyond variance, a single split wastes data: those 60 examples contribute nothing to training. On small datasets, this matters.

k-fold cross-validation

k-fold CV solves both problems:

1. Divide the dataset into k folds of (roughly) equal size.
2. For each fold i (1 to k): train on all folds except i, evaluate on fold i.
3. Report the mean and standard deviation of the k scores.
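The splitting step can be sketched in a few lines of plain Python. This is an illustrative version only, not how scikit-learn implements it; the function name k_fold_indices and the seed value are assumptions made for the example.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k folds of roughly equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # When n_samples is not divisible by k, the first few folds get one extra example.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

n_samples, k = 10, 3
test_counts = [0] * n_samples   # how often each example lands in a test fold
fold_sizes = []

for train_idx, test_idx in k_fold_indices(n_samples, k):
    fold_sizes.append(len(test_idx))
    for i in test_idx:
        test_counts[i] += 1
    # Train and test indices never overlap within a fold.
    assert not set(train_idx) & set(test_idx)

print("test fold sizes:", fold_sizes)
print("times each example was tested:", test_counts)
```

In a real workflow you would fit a model on train_idx and score it on test_idx inside the loop, then summarise the k scores.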

Every example is used for training k−1 times and for evaluation exactly once. No example is always in the test set; no example is never in it. The mean score over k folds is a substantially lower-variance estimate of generalisation performance than a single split.
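With scikit-learn, the whole loop is one call. The sketch below assumes a synthetic dataset from make_classification and a logistic regression model; both are placeholders chosen for illustration, not part of the lesson's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic 300-sample binary classification problem (assumed for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

model = LogisticRegression(max_iter=1000)

# One accuracy score per fold: five fits, five evaluations.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

print("per-fold accuracy:", scores.round(3))
print(f"mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

The standard deviation across folds gives a feel for how much a single split could have misled you.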

Typical values of k are 5 and 10. Larger k gives lower bias, because each model trains on a larger share of the data, but it costs more computation (k model fits), and each individual fold score is noisier because each test fold is smaller. k=5 is a practical default; k=10 is common when the compute budget allows it and each fold still contains enough test examples.
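The trade-off is easy to see by working out the fold sizes for the 300-sample dataset used earlier. The snippet is plain arithmetic; it assumes n is divisible by k so every fold has the same size.

```python
n_samples = 300  # dataset size from the single-split example

fold_sizes = {}
for k in (5, 10):
    test_size = n_samples // k           # examples in each test fold
    train_size = n_samples - test_size   # examples each model trains on
    fold_sizes[k] = (train_size, test_size)
    print(f"k={k}: {k} model fits, train on {train_size}, test on {test_size}")
```

Going from k=5 to k=10 doubles the number of fits and halves each test fold, in exchange for 30 more training examples per model.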

Stratified k-fold

Standard k-fold assigns examples to folds randomly without regard to class distribution. On imbalanced datasets, some folds may have very few positive examples, making the fold's accuracy estimate unreliable.

Stratified k-fold preserves the class proportion in every fold. If the overall dataset is 20% positive, each fold will also be approximately 20% positive. This is almost always the right choice for classification tasks. scikit-learn's cross_val_score uses StratifiedKFold automatically when cv is an integer (or left as None), the estimator is a classifier, and the target is binary or multiclass; otherwise it falls back to KFold.
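A quick way to check the proportions is to split a toy label vector and inspect each test fold. The labels below are made up to match the 20% figure, and the feature matrix is a dummy, since only y affects how StratifiedKFold assigns folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels: 20 positives and 80 negatives (assumed for illustration).
y = np.array([1] * 20 + [0] * 80)
X = np.zeros((len(y), 1))  # placeholder features; the split depends only on y

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_positive_rates = []
for train_idx, test_idx in skf.split(X, y):
    # Fraction of positives in this test fold.
    fold_positive_rates.append(float(y[test_idx].mean()))

print("positive rate per test fold:", fold_positive_rates)
```

Swapping in KFold(n_splits=5, shuffle=True) on the same labels is a simple way to see how much the per-fold rate can drift without stratification.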

k-fold CV gives you an estimate of generalisation performance, not a trained model ready for deployment. After CV, you typically refit the chosen model on the full training set to get the best possible artefact. sklearn's GridSearchCV does this automatically via its refit=True default.
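The manual version of that workflow might look like the sketch below. It assumes a synthetic dataset and a logistic regression model as placeholders; the point is only the order of operations: estimate with CV on the training set, refit on all of it, then touch the held-out test set once.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data (assumed for illustration).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)

# Step 1: estimate generalisation performance using the training set only.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# cross_val_score fits clones of the estimator, so `model` is still unfitted here.
# Step 2: refit on the full training set to get the final artefact.
model.fit(X_train, y_train)

# Step 3: a single, final check on the untouched test set.
test_accuracy = model.score(X_test, y_test)
print(f"held-out test accuracy: {test_accuracy:.3f}")
```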

When CV is not enough

Leave-one-out CV (k = n) is the extreme case: each test fold is a single example. Its estimate has very low bias, because each model trains on n−1 examples, but it can have high variance, and it requires n model fits. It is rarely used in practice except on very small datasets (n < 50).
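scikit-learn exposes this as LeaveOneOut. The tiny array below is a made-up example that just shows the shape of the splits: one fit per example, one test example per fit.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Six toy samples with two features each (assumed for illustration).
X = np.arange(12).reshape(6, 2)

loo = LeaveOneOut()
splits = list(loo.split(X))

print("number of fits required:", len(splits))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```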

For time series data, standard k-fold is incorrect — you would be training on future data to predict the past. Use TimeSeriesSplit, which respects temporal order: each fold's training set is strictly earlier than its test set.
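A small sketch of how TimeSeriesSplit orders its folds, using twelve consecutive time steps as stand-in data (an assumption for illustration). The training window grows with each split and always ends before the test window begins.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve observations in temporal order (placeholder data).
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
    # Every training index precedes every test index.
    assert train_idx.max() < test_idx.min()
```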

Where to go next

Cross-validation gives you a score for a fixed set of hyperparameters. The next lesson automates the search: hyperparameter tuning with GridSearchCV.
