fit(), predict(), transform()
The three-method contract that makes every sklearn estimator composable — and why that uniformity matters at scale.
- Explain why sklearn's uniform API enables pipelines
- Distinguish between estimators (predict) and transformers (transform)
- Describe what fit() does in each case
- Use fit_transform() correctly and explain when it should not be used on test data
Most ML libraries expose dozens of models, each with its own interface. Sklearn made a different bet: every object, regardless of what it does, speaks the same small vocabulary of fit() plus either predict() or transform(). That constraint is not a limitation; it is what makes composability possible.
The contract
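Every sklearn model or preprocessing object has a fit() method. What fit() does
depends on the object type, but the signature keeps the same shape: fit(X, y)
for supervised models and fit(X, y=None) for transformers. That y is optional
because transformers do not need a target, yet they still accept one so that a
pipeline can call every step the same way.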
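Estimators (classifiers and regressors) learn to map inputs to outputs.
After fit(X_train, y_train), they expose predict(X), which applies the
learned mapping to new data. The model artefact lives in attributes ending with
_ (a sklearn convention): coef_, feature_importances_, n_iter_. Strictly,
sklearn's glossary calls anything with fit() an estimator and calls objects with
predict() predictors; this lesson uses "estimator" in the narrower sense.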
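
A minimal sketch of the estimator half of the contract, assuming a toy dataset invented here for illustration (y = 2x + 1). It shows that the constructor takes no data, that fit() does the learning, and that the result appears in trailing-underscore attributes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data made up for this example: y = 2 * x + 1 exactly.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression()     # constructor: configuration only, no data
model.fit(X_train, y_train)    # learning happens here

# Fitted state lives in trailing-underscore attributes.
slope = model.coef_[0]
intercept = model.intercept_

# predict() applies the learned mapping to inputs the model has not seen.
y_new = model.predict(np.array([[4.0]]))
print(slope, intercept, y_new)
```

Any other classifier or regressor could be swapped in for LinearRegression without changing the fit() and predict() calls, although the fitted attributes it exposes will differ.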
Transformers — scalers, encoders, imputers — learn statistics about the
data and apply a transformation. After fit(X_train), they expose transform(X),
which applies the learned statistics. A StandardScaler learns mean and standard
deviation during fit, then uses those to standardise during transform. The
key rule: always fit on training data, then transform both training and test
data using those same statistics. Fitting again on test data leaks information.
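
The rule above can be sketched as follows, again with made-up numbers. The scaler is fitted once on the training array, and the statistics it learned (mean_ and scale_) are then reused for both arrays.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative arrays; a real project would get these from a train/test split.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0]])

scaler = StandardScaler()
scaler.fit(X_train)            # learns mean_ and scale_ from training data only

# Both sets are transformed with the training statistics.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_, scaler.scale_)
print(X_test_scaled)
```

The test value lands far from zero after scaling because it is measured against the training mean, which is the intended behaviour: the test set is treated exactly as unseen data would be.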
fit_transform(X) is a convenience method equivalent to calling fit(X) and then
transform(X), sometimes with a more efficient implementation. It is correct to
use on training data; it must not be used on test data, because it would
recompute the statistics from the test set and break the train/test separation.
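
To make the failure mode concrete, here is a small sketch on invented data that contrasts the correct call sequence with the mistaken one. The variable name leaky is a label chosen for this example, not sklearn terminology.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0], [20.0]])

# Correct: fit_transform on training data, transform only on test data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# fit_transform gives the same result as fit followed by transform.
two_step = StandardScaler().fit(X_train).transform(X_train)
same_result = np.allclose(X_train_scaled, two_step)

# Mistake: fit_transform on the test set recomputes mean and std from it,
# so the test data is centred on itself rather than on the training mean.
leaky = StandardScaler().fit_transform(X_test)

print(same_result)
print(X_test_scaled.ravel())
print(leaky.ravel())
```

In the mistaken version the two test points come out symmetric around zero, which hides how far they actually sit from the training distribution.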
Why uniformity enables pipelines
Because every step speaks the same language, sklearn can chain them. When you
call fit() on a Pipeline, it calls fit_transform() on each intermediate step in
sequence, passing the output of one step as the input to the next, and then
calls fit() on the final estimator. During predict(), it calls transform() on
each intermediate step and predict() on the final estimator. You do not have
to manage that orchestration; the Pipeline does, and it can only because every
step honours the contract.
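
A short sketch of that chaining, assuming a two-step pipeline and toy data chosen for illustration. The step names "scale" and "clf" are arbitrary labels. The last lines redo the prediction by hand to show what the pipeline does internally.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Two well-separated clusters, invented for this example.
X_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_new = np.array([[0.5], [11.5]])

pipe = Pipeline([
    ("scale", StandardScaler()),     # intermediate step: a transformer
    ("clf", LogisticRegression()),   # final step: an estimator
])
pipe.fit(X_train, y_train)           # fits the scaler, then the classifier
preds = pipe.predict(X_new)          # scales X_new, then predicts

# Manual equivalent of pipe.predict, using the fitted steps directly.
scaler = pipe.named_steps["scale"]
clf = pipe.named_steps["clf"]
manual = clf.predict(scaler.transform(X_new))

print(preds, manual)
```

Because the pipeline itself exposes fit() and predict(), it can be passed anywhere a single estimator is expected, for example to cross-validation utilities.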
This becomes critical at deployment. A pipeline bundles preprocessing and model
together. Serialise the pipeline once with joblib.dump, load it anywhere, and
call predict() — the same scaling and encoding that ran during training runs
automatically at inference.
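
The save-and-reload step might look like the sketch below. The file name pipeline.joblib and the temporary directory are assumptions made for the example; a real deployment would write to a durable location. As a general caution, joblib files are pickle-based, so they should be loaded only from trusted sources and ideally with the same sklearn version that produced them.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_new = np.array([[0.5], [11.5]])

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.joblib")  # hypothetical file name
    joblib.dump(pipe, path)      # preprocessing and model saved together
    loaded = joblib.load(path)   # e.g. inside the serving process

# The reloaded pipeline applies the same scaling before predicting.
same = np.array_equal(pipe.predict(X_new), loaded.predict(X_new))
print(same)
```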
The _ suffix on fitted attributes (like scaler.mean_, tree.feature_importances_)
is a sklearn convention signalling "this was computed during fit". If you see
an attribute without the suffix, it was passed in as a constructor argument
(a hyperparameter). This distinction matters when debugging: constructor
arguments are your choices; underscore attributes are what the data produced.
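
One way to see the distinction in practice, sketched with StandardScaler: get_params() returns the constructor arguments, while the underscore attributes do not exist until fit() has run.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler(with_mean=True)

# Constructor arguments (your choices) are available immediately.
params = scaler.get_params()

# Fitted attributes (what the data produced) are absent before fit().
has_mean_before = hasattr(scaler, "mean_")

scaler.fit(np.array([[1.0], [3.0]]))
has_mean_after = hasattr(scaler, "mean_")

print(params)
print(has_mean_before, has_mean_after, scaler.mean_)
```

Checking for an underscore attribute in this way is a quick debugging aid when it is unclear whether an object has been fitted yet.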
Where to go next
With the API contract understood, the next lesson applies it concretely: linear regression — fitting a model, inspecting its coefficients, and measuring how well it explains variance in the target.