Lab: build a sklearn pipeline

Encode categoricals, scale numerics, add a model, evaluate with cross-validation, and tune max_depth — all inside one Pipeline.

Lab · optional · Data Science · Advanced · 30 min
By the end of this lesson you will be able to:
  • Construct a ColumnTransformer that handles categorical and numeric features separately
  • Wrap it with a DecisionTreeClassifier in a Pipeline
  • Evaluate with 5-fold cross-validation
  • Run GridSearchCV over max_depth and interpret best_params_ and best_score_

The previous lessons each handled one concern: scaling, cross-validation, trees. Real datasets mix categorical and numeric features, and a production pipeline must handle both correctly without leaking information across folds. This lab assembles all the pieces.

The dataset

We build a synthetic dataset with two numeric features and one categorical feature, plus a binary target. This pattern — mixed types — is the norm in practice (think: age + income as numerics, region as categorical).
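
A dataset of this shape could be generated as in the sketch below. The column names (age, income, region) come from the example in the text; the sample size, the random seed, the rule that produces the target, and the 75/25 split are assumptions made up for illustration, not the lab's actual data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)  # assumed seed, for reproducibility only
n = 400                          # assumed sample size

# Two numeric features and one categorical feature.
X = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
    "region": rng.choice(["north", "south", "east", "west"], size=n),
})

# Hypothetical rule for the binary target: it depends on both feature
# types, with noise added so the problem is not trivially separable.
signal = (
    0.03 * (X["age"] - 44)
    + 0.00004 * (X["income"] - 50_000)
    + np.where(X["region"] == "north", 0.8, 0.0)
    + rng.normal(0, 0.5, size=n)
)
y = (signal > 0).astype(int)

# Hold out a test set now; it is only touched at the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
print(y.value_counts())
```

Stratifying the split keeps the class balance roughly equal in the training and test portions.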

Checkpoint 2 — build the column transformer

ColumnTransformer applies different transformations to different columns. Numeric columns get StandardScaler; the categorical column gets OneHotEncoder. The remainder='drop' setting (which is also the default) discards any column not explicitly listed.

Checkpoint 3 — full pipeline with cross-validation

Cross-validation refits the whole pipeline, ColumnTransformer included, on the training portion of each fold, so the scaler and encoder never see that fold's validation rows during fitting. That is the leak-free guarantee, applied independently to each of the five folds.
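
A minimal sketch of this checkpoint follows. The synthetic data, the target rule, the step names "prep" and "clf", and the fixed max_depth=3 are assumptions for illustration. Passing the whole Pipeline to cross_val_score is what makes the preprocessing refit on each fold's training portion.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumed synthetic data with the mixed-type layout used in this lab.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
    "region": rng.choice(["north", "south", "east", "west"], size=n),
})
# Hypothetical target rule, chosen only so there is something to learn.
y = ((X["age"] > 45) | (X["region"] == "north")).astype(int)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
    ],
    remainder="drop",
)

# Preprocessing and model live in one estimator, so every fit
# (including each cross-validation fit) runs both steps together.
pipe = Pipeline(steps=[
    ("prep", preprocessor),
    ("clf", DecisionTreeClassifier(max_depth=3, random_state=0)),
])

# cv=5 gives five folds; for a classifier, the folds are stratified.
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print("fold accuracies:", scores.round(3))
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Scaling X once up front and then cross-validating only the tree would let validation rows influence the scaler's mean and variance, which is the leak the pipeline avoids.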

Checkpoint 4 — tune max_depth with GridSearchCV

Notice the clf__ prefix in the parameter grid — that is the pipeline's step-name syntax for reaching inside the classifier. GridSearchCV uses cross-validation on the training set to select the best depth, then you evaluate grid.best_estimator_ on the held-out test set exactly once.
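
The step-name syntax can be seen in isolation in the short sketch below. The two-step pipeline and the step name "scale" are assumptions for illustration; the name "clf" matches the prefix used in the text.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("clf", DecisionTreeClassifier(random_state=0)),
])

# Every parameter of a step is addressable as <step name>__<parameter>.
clf_params = sorted(k for k in pipe.get_params() if k.startswith("clf__"))
print(clf_params)

# set_params accepts the same names that a GridSearchCV grid uses.
pipe.set_params(clf__max_depth=4)
print(pipe.named_steps["clf"].max_depth)  # 4
```

A misspelled key in the grid (for example a single underscore) is rejected as an invalid parameter, so the names printed by get_params() are a handy reference.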

grid.best_estimator_ is already fitted on the full training set (because GridSearchCV sets refit=True by default). You can call .predict() on it directly without refitting.
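
Putting the checkpoint together, a sketch under assumed data might look like this. The synthetic dataset, the target rule, the candidate depths, and the 75/25 split are all assumptions for illustration. The parts that follow the text are the clf__max_depth grid key, the default refit=True, and the single evaluation on the held-out test set.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumed synthetic data with a noisy, hypothetical target rule.
rng = np.random.default_rng(1)
n = 400
X = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
    "region": rng.choice(["north", "south", "east", "west"], size=n),
})
signal = (
    0.05 * (X["age"] - 44)
    + (X["region"] == "north").astype(float)
    + rng.normal(0, 0.5, size=n)
)
y = (signal > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=1
)

pipe = Pipeline(steps=[
    ("prep", ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), ["age", "income"]),
            ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
        ],
        remainder="drop",
    )),
    ("clf", DecisionTreeClassifier(random_state=1)),
])

# Candidate depths are an assumption; the key uses the step name "clf".
param_grid = {"clf__max_depth": [1, 2, 3, 5, 8]}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)  # cross-validates on the training set only

print("best params:", grid.best_params_)
print(f"best mean CV accuracy: {grid.best_score_:.3f}")

# refit=True (the default) means best_estimator_ was already refitted
# on all of X_train, so it can predict without another fit call.
test_pred = grid.best_estimator_.predict(X_test)
test_accuracy = accuracy_score(y_test, test_pred)
print(f"held-out test accuracy: {test_accuracy:.3f}")
```

best_score_ is the mean cross-validated accuracy of the winning depth on the training data. Because that depth was chosen for having the highest score, best_score_ tends to be slightly optimistic, which is why the test set is evaluated separately and only once.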

Where to go next

The Sklearn in Practice module is complete. Next: Model Evaluation — a full treatment of classification and regression metrics, cross-validation strategy, and honest evaluation protocols.
