Scaling and normalisation
Min-max scaling and z-score standardisation — what each does, which models need scaling, and why the scaler must be fit on training data only.
- Explain what min-max scaling and z-score standardisation each produce
- Identify which types of model are sensitive to feature scale
- State why the scaler must be fit on training data and not on the full dataset
A DataFrame with columns age (range 18–80) and income (range 20,000–200,000)
has a problem. The income column spans a range about 3,000 times wider than
the age column (180,000 versus 62). That does not mean income is more
important; it just means the two columns live on incomparable scales. Some
models handle this fine. Others are misled: they treat income as dominant
purely because its numbers are bigger.
Min-max scaling
Min-max scaling maps every value to the range [0, 1]:
x_scaled = (x - x_min) / (x_max - x_min)
The smallest value becomes 0, the largest becomes 1, and everything else is proportional in between. The distribution's shape is preserved: if the original data was right-skewed, the scaled version is still right-skewed.
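As a minimal sketch of the formula, the function below applies min-max scaling to a plain Python list. The name `min_max_scale` and the sample ages are made up for illustration, and returning zeros for a constant column is one possible convention, not a standard one.

```python
def min_max_scale(values):
    """Map each value to [0, 1] using the min and max of `values`."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: the formula would divide by zero.
        # Returning zeros is an assumption made for this sketch.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [18, 25, 40, 62, 80]  # hypothetical sample
scaled_ages = min_max_scale(ages)
print(scaled_ages)  # first value is 0.0, last value is 1.0
```

The order of the values and their relative spacing are unchanged; only the units are.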
Min-max scaling is sensitive to outliers. A single extreme value shifts the minimum or maximum, which compresses everything else into a narrow range. If your data has outliers, z-score standardisation or robust scaling (using percentiles instead of min/max) may be more appropriate.
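The compression effect can be seen in a small sketch. The incomes and the single extreme value below are invented for illustration, and `min_max_scale` is a hypothetical helper that implements the formula given earlier.

```python
def min_max_scale(values):
    """Min-max scale a list of numbers (assumes the values are not all equal)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


incomes = [20_000, 35_000, 50_000, 65_000, 80_000]  # hypothetical sample
with_outlier = incomes + [2_000_000]                 # one extreme value added

scaled_clean = min_max_scale(incomes)
scaled_outlier = min_max_scale(with_outlier)

# How much of [0, 1] do the five ordinary incomes occupy in each case?
spread_clean = max(scaled_clean) - min(scaled_clean)
spread_outlier = max(scaled_outlier[:5]) - min(scaled_outlier[:5])

print(spread_clean)    # the ordinary incomes fill the whole range
print(spread_outlier)  # the same incomes are squeezed into about 3% of it
```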
Z-score standardisation
Standardisation shifts the data so it has mean 0 and standard deviation 1:
x_std = (x - mean) / std
There is no fixed output range: values can be negative, and outliers still appear as large positive or negative numbers. But the units are now standard deviations from the mean, which is a meaningful scale regardless of the original units. A score of 2.5 means "2.5 standard deviations above the mean" for any feature.
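A minimal sketch of the formula using only the standard library follows. The function name `standardise` and the sample ages are illustrative. The sketch assumes the population standard deviation (`pstdev`); some tools use the sample version instead, which gives slightly different numbers on small datasets.

```python
from statistics import mean, pstdev


def standardise(values):
    """Shift and rescale `values` to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population std: an assumption of this sketch
    if sigma == 0:
        # Constant column: nothing to rescale, so return zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


ages = [18, 25, 40, 62, 80]  # hypothetical sample
z_ages = standardise(ages)
print(z_ages)  # negative below the mean age of 45, positive above it
```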
Which models need scaling?
Distance-based models measure similarity using Euclidean distance or similar metrics. k-Nearest Neighbours compares points by distance directly, and Support Vector Machines rely on distances or dot products between points. If one feature spans 0–100,000 and another spans 0–1, the large one dominates every such calculation. These models need scaling.
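To illustrate the dominance effect, the sketch below compares three made-up (age, income) points before and after min-max scaling. The points are hypothetical; the minimum and maximum used for scaling are the age and income ranges quoted at the top of this lesson.

```python
from math import dist

# Hypothetical (age, income) points.
a = (25, 50_000)
b = (60, 51_000)  # very different age, almost the same income
c = (26, 58_000)  # almost the same age, different income

# Raw Euclidean distances: the income difference swamps the age difference.
raw_ab = dist(a, b)
raw_ac = dist(a, c)


def scale(point):
    """Min-max scale a point using age 18-80 and income 20,000-200,000."""
    age, income = point
    return ((age - 18) / (80 - 18), (income - 20_000) / (200_000 - 20_000))


scaled_ab = dist(scale(a), scale(b))
scaled_ac = dist(scale(a), scale(c))

print(raw_ab < raw_ac)        # True: on raw data, b looks like a's nearest neighbour
print(scaled_ac < scaled_ab)  # True: after scaling, c is nearer
```

On the raw data a 35-year age gap barely registers next to a 1,000 difference in income. After scaling, both features contribute to the distance.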
Gradient-based models (linear and logistic regression fitted by gradient descent, neural networks) use an optimiser that takes steps along the gradient of the loss. Unscaled features produce an elongated loss surface where the optimiser zig-zags rather than converging cleanly. Scaling makes training faster and more stable.
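The effect on convergence can be sketched with plain gradient descent on a one-feature linear model. Everything here is illustrative: the data, the helper `gradient_descent`, the step count, and both learning rates are assumptions. The raw-feature learning rate is deliberately tiny because much larger values make the updates diverge on this data.

```python
from statistics import mean, pstdev


def gradient_descent(xs, ys, lr, steps):
    """Fit y = w0 + w1 * x by gradient descent on mean squared error.

    Returns the final mean squared error.
    """
    w0, w1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [w0 + w1 * x - y for x, y in zip(xs, ys)]
        grad_w0 = 2 * sum(errors) / n
        grad_w1 = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        w0 -= lr * grad_w0
        w1 -= lr * grad_w1
    return mean([(w0 + w1 * x - y) ** 2 for x, y in zip(xs, ys)])


# Hypothetical data following y = 3 + 0.0001 * income exactly.
incomes = [20_000, 50_000, 80_000, 120_000, 200_000]
targets = [5.0, 8.0, 11.0, 15.0, 23.0]

# Standardised copy of the feature.
mu, sigma = mean(incomes), pstdev(incomes)
z_incomes = [(x - mu) / sigma for x in incomes]

loss_raw = gradient_descent(incomes, targets, lr=1e-11, steps=200)
loss_std = gradient_descent(z_incomes, targets, lr=0.1, steps=200)

print(loss_raw)  # still well above zero after 200 steps
print(loss_std)  # effectively zero after the same number of steps
```

With the raw feature, the slope settles quickly but the intercept barely moves, so the loss stalls. With the standardised feature, both parameters converge at the same rate.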
Tree-based models (decision trees, random forests, gradient boosting) split
on thresholds. Scaling preserves the order of the values, so a split such as
age > 35 on the original scale separates exactly the same rows as the
corresponding threshold on the [0, 1] scale. Trees are not sensitive to
feature scale.
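A short sketch makes the invariance concrete: a threshold split on raw ages and the equivalent split on min-max scaled ages send every row to the same side. The ages are a made-up sample.

```python
ages = [18, 22, 35, 36, 50, 80]  # hypothetical sample

# Min-max scale the column and the threshold with the same min and max.
lo, hi = min(ages), max(ages)
scaled_ages = [(a - lo) / (hi - lo) for a in ages]

threshold = 35
scaled_threshold = (threshold - lo) / (hi - lo)

# Which rows go to the right-hand branch of the split?
right_raw = [a > threshold for a in ages]
right_scaled = [s > scaled_threshold for s in scaled_ages]

print(right_raw == right_scaled)  # True: the partition is identical
```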
| Model type | Needs scaling? |
|---|---|
| k-NN, SVM | Yes |
| Linear/logistic regression, neural nets | Yes |
| Decision tree, random forest, XGBoost | No |
The data leakage rule
Fitting the scaler means computing the min, max, mean, or standard deviation of the training data. If you fit the scaler on the full dataset (train + test), test data statistics leak into the training process — the model indirectly "sees" information from the test set before evaluation. Always:
- Split the data into train and test first.
- Fit the scaler on the training set only.
- Apply (transform) the fitted scaler to both train and test.
Fitting a scaler on the full dataset before splitting is one of the most common sources of data leakage. It inflates test performance and gives a falsely optimistic picture of how the model will perform on new data.
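The split-fit-transform order can be sketched without any library. The helpers `fit_standardiser` and `transform` are hypothetical names, the incomes are invented, and the positional split stands in for a proper shuffled split.

```python
from statistics import mean, pstdev


def fit_standardiser(train_values):
    """Learn the mean and standard deviation from the given data only."""
    return mean(train_values), pstdev(train_values)


def transform(values, params):
    """Apply previously learned statistics to any set of values."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


incomes = [20_000, 30_000, 45_000, 60_000, 80_000, 200_000]  # hypothetical

# 1. Split first (a simple positional split, for illustration only).
train, test = incomes[:4], incomes[4:]

# 2. Fit on the training set only.
params = fit_standardiser(train)

# 3. Transform both splits with the same fitted statistics.
train_scaled = transform(train, params)
test_scaled = transform(test, params)

# For comparison: statistics computed on the full dataset (the leaky version).
leaky_params = fit_standardiser(incomes)

print(params)        # learned from the four training rows
print(leaky_params)  # different, because the test rows influenced them
```

The scaled test values need not have mean 0 or standard deviation 1. That is expected: they are measured against the training distribution, which is what the model was fitted on.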
Where to go next
Next: scaling in practice — applying MinMaxScaler and StandardScaler
from scikit-learn, fitting on training data, and transforming both splits.