Lab: build a sklearn pipeline
Encode categoricals, scale numerics, and add a model inside one Pipeline, then evaluate it with cross-validation and tune max_depth.
- Construct a ColumnTransformer that handles categorical and numeric features separately
- Wrap it with a DecisionTreeClassifier in a Pipeline
- Evaluate with 5-fold cross-validation
- Run GridSearchCV over max_depth and interpret best_params_ and best_score_
The previous lessons each handled one concern: scaling, cross-validation, trees. Real datasets mix categorical and numeric features, and a production pipeline must handle both correctly without leaking information across folds. This lab assembles all the pieces.
The dataset
We build a synthetic dataset with two numeric features and one categorical feature, plus a binary target. This pattern — mixed types — is the norm in practice (think: age + income as numerics, region as categorical).
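A dataset of this shape might be generated as in the sketch below. The column names (age, income, region) follow the example in the text. The sample size, the seed, the region labels, and the rule that produces the target are all assumptions made for illustration, not the lab's actual values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary fixed seed for reproducibility
n = 300                          # assumed sample size

# Two numeric features and one categorical feature
X = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
    "region": rng.choice(["north", "south", "east", "west"], size=n),
})

# Hypothetical target rule: higher income and the "north" region push
# the label towards 1; Gaussian noise keeps the classes from being
# perfectly separable.
signal = (X["income"] > 50_000).astype(int) + (X["region"] == "north").astype(int)
noise = rng.normal(0, 0.5, size=n)
y = ((signal + noise) > 0.75).astype(int)

print(X.head())
print(y.value_counts())
```

Keeping the features in a DataFrame matters later: it lets the preprocessing step select columns by name.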
Checkpoint 2 — build the column transformer
ColumnTransformer applies different transformations to different columns.
Numeric columns get StandardScaler; the categorical column gets
OneHotEncoder. The remainder='drop' setting discards any column not
explicitly listed.
Checkpoint 3 — full pipeline with cross-validation
Because the ColumnTransformer is a step of the pipeline, cross-validation
refits it on the training portion of every fold, so the scaler and encoder
never see the validation fold during fitting. That is the leak-free
guarantee, applied independently in each of the five folds.
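The checkpoint can be sketched as follows. The make_data helper, the "prep" step name, and the fixed max_depth=3 are assumptions for illustration; the "clf" step name matches the clf__ prefix used in the tuning checkpoint.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier


def make_data(n=300, seed=0):
    """Hypothetical synthetic data: two numerics, one categorical, binary target."""
    rng = np.random.default_rng(seed)
    X = pd.DataFrame({
        "age": rng.integers(18, 70, size=n),
        "income": rng.normal(50_000, 15_000, size=n),
        "region": rng.choice(["north", "south", "east", "west"], size=n),
    })
    signal = (X["income"] > 50_000).astype(int) + (X["region"] == "north").astype(int)
    y = ((signal + rng.normal(0, 0.5, size=n)) > 0.75).astype(int)
    return X, y


X, y = make_data()

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
    ],
    remainder="drop",
)

# Preprocessing and model live in one estimator, so each CV fold
# refits the scaler and encoder on that fold's training rows only.
pipe = Pipeline([
    ("prep", preprocessor),
    ("clf", DecisionTreeClassifier(max_depth=3, random_state=0)),
])

scores = cross_val_score(pipe, X, y, cv=5)  # accuracy by default for classifiers
print("fold accuracies:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))
```

Passing the whole pipeline to cross_val_score, rather than pre-transforming X, is what keeps validation rows out of the fitted scaler and encoder.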
Checkpoint 4 — tune max_depth with GridSearchCV
Notice the clf__ prefix in the parameter grid: it is the pipeline's
step-name syntax (step name, double underscore, parameter name) for
reaching a parameter of the classifier step. GridSearchCV runs
cross-validation on the training set to select the best depth, then you
evaluate grid.best_estimator_ on the held-out test set exactly once.
grid.best_estimator_ is already fitted on the full training set (because
GridSearchCV sets refit=True by default). You can call .predict() on it
directly without refitting.
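A sketch of the tuning step under stated assumptions: the make_data helper, the "prep" step name, the 80/20 split, and the candidate depths are hypothetical choices, not the lab's actual values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier


def make_data(n=300, seed=0):
    """Hypothetical synthetic data: two numerics, one categorical, binary target."""
    rng = np.random.default_rng(seed)
    X = pd.DataFrame({
        "age": rng.integers(18, 70, size=n),
        "income": rng.normal(50_000, 15_000, size=n),
        "region": rng.choice(["north", "south", "east", "west"], size=n),
    })
    signal = (X["income"] > 50_000).astype(int) + (X["region"] == "north").astype(int)
    y = ((signal + rng.normal(0, 0.5, size=n)) > 0.75).astype(int)
    return X, y


X, y = make_data()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipe = Pipeline([
    ("prep", ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), ["age", "income"]),
            ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
        ],
        remainder="drop",
    )),
    ("clf", DecisionTreeClassifier(random_state=0)),
])

# "clf__max_depth" means: parameter max_depth of the step named "clf"
param_grid = {"clf__max_depth": [2, 3, 5, 8, None]}

grid = GridSearchCV(pipe, param_grid, cv=5)  # refit=True by default
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("best CV accuracy:", round(grid.best_score_, 3))

# best_estimator_ was refitted on all of X_train; score the test set once
test_acc = grid.best_estimator_.score(X_test, y_test)
print("held-out test accuracy:", round(test_acc, 3))
```

best_score_ is the mean cross-validated accuracy of the winning depth on the training set, so it is a model-selection estimate; the single test-set score is the figure to report.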
Where to go next
The Sklearn in Practice module is complete. Next: Model Evaluation — a full treatment of classification and regression metrics, cross-validation strategy, and honest evaluation protocols.
Saving and loading models
Serialise a fitted sklearn pipeline with joblib, load it back, and verify the predictions are identical — the minimal deployment handoff.
Classification metrics
Why accuracy alone misleads on imbalanced data — and how precision, recall, F1, and the confusion matrix give a complete picture.