Decision trees
How trees split data, why max_depth is the single most important hyperparameter, and how to recognise overfitting before it reaches production.
- Explain how a decision tree chooses a split using gini impurity or entropy
- Describe max_depth as the primary regularisation knob
- Identify the signs of an overfitting tree from train vs test accuracy
A decision tree partitions the feature space by asking a sequence of binary questions: "Is feature X greater than threshold T?" Each question splits the data into two groups; the process recurses until a stopping criterion is met. The appeal is that the resulting logic is human-readable — you can print a shallow tree and follow the path from any input to its prediction by hand.
How splits are chosen
At each node, the algorithm considers every feature and every possible threshold and picks the split that maximally reduces impurity in the resulting child nodes. Two impurity measures dominate in practice:
Gini impurity measures the probability that a randomly chosen element from a node would be misclassified if labelled randomly according to the class distribution. A pure node (all one class) has gini = 0. A maximally mixed binary node has gini = 0.5. Sklearn's classifiers use gini by default.
Entropy (information gain) measures the expected number of bits needed to describe the class label. Conceptually similar to gini; in practice the difference in split quality is small. Entropy is slightly more expensive to compute.
For regressors, the criterion is variance reduction: the split that minimises the weighted sum of variance in the two child nodes.
The algorithm is greedy — it picks the locally best split at each node without backtracking. This means the globally optimal tree is not guaranteed, but in practice the greedy solution is close and the computation is tractable.
max_depth as a regularisation knob
Left unconstrained, a decision tree will grow until every leaf is pure — which means one leaf per training sample. The model has memorised every point, including the noise. Bias is near zero; variance is enormous.
max_depth is the primary lever. It stops the tree after a fixed number of
splits. Shallow trees (depth 2–4) have high bias but low variance: the model
is simple enough to generalise, but may miss fine-grained patterns. Deeper
trees capture more complexity at the cost of overfitting risk.
Other constraints serve similar purposes: min_samples_leaf prevents tiny
leaves from forming; min_samples_split prevents splitting a node with too
few examples. All of these are regularisation mechanisms that trade off bias
for variance reduction.
Decision trees are the building block of two of the most powerful ensemble methods in practice: Random Forests (many deep trees averaging their predictions) and Gradient Boosted Trees (many shallow trees correcting each other's errors). Understanding the individual tree makes ensembles interpretable.
Recognising overfitting
The tell-tale pattern is a large gap between training accuracy and test accuracy. A tree with max_depth=None might score 1.000 on training data and 0.780 on test data — the extra 0.220 is pure memorisation of noise. Constraining depth closes the gap at the cost of some training accuracy. When training accuracy and test accuracy converge to roughly the same value, the model is in the right zone.
Where to go next
The next lesson makes this concrete: fit DecisionTreeClassifier at depths 1,
3, and 10 on the same split, print both scores, and watch overfitting emerge in
the numbers.