Lab: build a sklearn pipeline

Encode categoricals, scale numerics, add a model, evaluate with cross-validation, and tune max_depth — all inside one Pipeline.

The previous lessons each handled one concern: scaling, cross-validation, trees. Real datasets mix categorical and numeric features, and a production pipeline must handle both correctly without leaking information across folds. This lab assembles all the pieces.

The dataset

We build a synthetic dataset with two numeric features and one categorical feature, plus a binary target. This pattern — mixed types — is the norm in practice (think: age + income as numerics, region as categorical).
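
As a sketch of what that dataset looks like, the snippet below repeats the construction used in the Checkpoint 2 setup code and prints a few sanity checks. The recipe (seed 99, 400 rows, the signal rule) comes from that setup code; the inspection lines at the end are additions for illustration. Note that region does not enter the signal, so in this synthetic data it acts as a pure noise feature.

```python
import numpy as np
import pandas as pd

# Same recipe as the setup code in Checkpoint 2
rng = np.random.default_rng(99)
n = 400
age = rng.integers(18, 65, n).astype(float)        # whole years, 18 to 64
income = rng.normal(50000, 15000, n)               # roughly bell-shaped incomes
region = rng.choice(["north", "south", "east", "west"], n)

# The target depends only on age and income, plus noise
signal = (age < 35).astype(float) + (income > 50000).astype(float)
y = (signal + rng.normal(0, 0.5, n) > 1).astype(int)

df = pd.DataFrame({"age": age, "income": income, "region": region})

# Illustrative checks (not part of the original lab code)
print(df.dtypes)
print(df["region"].value_counts())
print("Positive rate:", round(y.mean(), 3))
```

The dtypes output is what drives the preprocessing plan: two float columns to scale and one string column to encode.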

Checkpoint 2 — build the column transformer

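ColumnTransformer applies different transformations to different columns. Numeric columns get StandardScaler; the categorical column gets OneHotEncoder. Passing drop="first" to the encoder omits one indicator column per feature (the first category in sorted order), so four regions become three columns. The remainder="drop" setting, which is also the default, discards any column not explicitly listed.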

Python — editable, runs in your browser

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(99)
n = 400
age    = rng.integers(18, 65, n).astype(float)
income = rng.normal(50000, 15000, n)
region = rng.choice(["north", "south", "east", "west"], n)
signal = (age < 35).astype(float) + (income > 50000).astype(float)
y = (signal + rng.normal(0, 0.5, n) > 1).astype(int)
df = pd.DataFrame({"age": age, "income": income, "region": region})
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=99)

preprocessor = ColumnTransformer([
  ("num", StandardScaler(),              ["age", "income"]),
  ("cat", OneHotEncoder(drop="first"),   ["region"]),
], remainder="drop")

preprocessor.fit(X_train)
X_transformed = preprocessor.transform(X_train)
print("Shape after transform:", X_transformed.shape)
print("Column count: 2 numeric + 3 OHE columns (4 regions - 1 dropped reference) = 5")

Checkpoint 3 — full pipeline with cross-validation

Python — editable, runs in your browser

# Assumes X_train and y_train from Checkpoint 2 are still in scope.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

preprocessor = ColumnTransformer([
  ("num", StandardScaler(),              ["age", "income"]),
  ("cat", OneHotEncoder(drop="first"),   ["region"]),
], remainder="drop")

pipe = Pipeline([
  ("prep", preprocessor),
  ("clf",  DecisionTreeClassifier(max_depth=4, random_state=99)),
])

scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="accuracy")
print("CV scores:", scores.round(3))
print("Mean:     ", round(scores.mean(), 4))
print("Std:      ", round(scores.std(), 4))

For each fold, cross_val_score clones the pipeline and fits the copy, ColumnTransformer included, on that fold's training rows only, so the scaler and encoder never see the validation rows during fitting. That is the leak-free guarantee, applied independently in each of the five folds.
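
To make the per-fold fitting concrete, here is a rough sketch of the loop that cross_val_score runs for you. It is an illustration under stated assumptions, not scikit-learn's actual implementation: it rebuilds the lab's synthetic data so it runs on its own, and it passes the same StratifiedKFold splitter to both the manual loop and cross_val_score so the two are directly comparable.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Same synthetic recipe as the lab
rng = np.random.default_rng(99)
n = 400
age = rng.integers(18, 65, n).astype(float)
income = rng.normal(50000, 15000, n)
region = rng.choice(["north", "south", "east", "west"], n)
signal = (age < 35).astype(float) + (income > 50000).astype(float)
y = (signal + rng.normal(0, 0.5, n) > 1).astype(int)
X = pd.DataFrame({"age": age, "income": income, "region": region})

pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["age", "income"]),
        ("cat", OneHotEncoder(drop="first"), ["region"]),
    ], remainder="drop")),
    ("clf", DecisionTreeClassifier(max_depth=4, random_state=99)),
])

cv = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, val_idx in cv.split(X, y):
    fold_pipe = clone(pipe)                          # fresh, unfitted copy
    fold_pipe.fit(X.iloc[train_idx], y[train_idx])   # scaler and encoder see training rows only
    scaler = fold_pipe.named_steps["prep"].named_transformers_["num"]
    print("Fold mean age learned by scaler:", round(scaler.mean_[0], 2))
    manual_scores.append(fold_pipe.score(X.iloc[val_idx], y[val_idx]))
manual_scores = np.array(manual_scores)

auto_scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
print("Manual:", manual_scores.round(3))
print("Auto:  ", auto_scores.round(3))
```

The scaler's learned mean is printed per fold because it is computed from that fold's training rows alone; the validation rows play no part in it.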

Checkpoint 4 — tune max_depth with GridSearchCV

Python — editable, runs in your browser

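# Assumes the imports and the X_train/X_test/y_train/y_test split from the
# earlier checkpoints are still in scope.
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

preprocessor = ColumnTransformer([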
  ("num", StandardScaler(),              ["age", "income"]),
  ("cat", OneHotEncoder(drop="first"),   ["region"]),
], remainder="drop")

pipe = Pipeline([
  ("prep", preprocessor),
  ("clf",  DecisionTreeClassifier(random_state=99)),
])

param_grid = {"clf__max_depth": [2, 3, 4, 6, 8, None]}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X_train, y_train)

print("Best depth:   ", grid.best_params_["clf__max_depth"])
print("Best CV acc:  ", round(grid.best_score_, 4))

test_acc = accuracy_score(y_test, grid.best_estimator_.predict(X_test))
print("Test acc:     ", round(test_acc, 4))

Notice the clf__ prefix in the parameter grid: that is the Pipeline's naming convention for reaching inside a step (step name, double underscore, parameter name). GridSearchCV uses cross-validation on the training set to select the best depth, then you evaluate grid.best_estimator_ on the held-out test set exactly once.
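
If you are unsure which names a grid can target, the pipeline will list them. The sketch below builds the same two-step pipeline, unfitted, and inspects it with get_params() and set_params(), both standard estimator methods. The nested name prep__cat__drop is an example of chaining the convention through the ColumnTransformer.

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["age", "income"]),
        ("cat", OneHotEncoder(drop="first"), ["region"]),
    ], remainder="drop")),
    ("clf", DecisionTreeClassifier(random_state=99)),
])

params = pipe.get_params()

# Every tunable classifier parameter, addressed as clf__<name>
clf_params = sorted(k for k in params if k.startswith("clf__"))
print(clf_params)

# The convention nests: step name, transformer name, parameter name
print("prep__cat__drop =", params["prep__cat__drop"])

# set_params uses the same names; this is what GridSearchCV does per candidate
pipe.set_params(clf__max_depth=3)
print("max_depth is now", pipe.named_steps["clf"].max_depth)
```

Any key printed here is a valid key for param_grid, so the same search could also tune preprocessing choices alongside the tree depth.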

grid.best_estimator_ is already fitted on the full training set (because GridSearchCV sets refit=True by default). You can call .predict() on it directly without refitting.
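
A small self-contained check of the refit behaviour, as a sketch: it rebuilds the lab's data, runs a shortened grid (three depths instead of six, to keep it quick), and confirms that predicting through the search object gives the same labels as predicting through best_estimator_. It also prints the per-depth results table from cv_results_, which is worth reading before trusting a single best value.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Same synthetic recipe and split as the lab
rng = np.random.default_rng(99)
n = 400
age = rng.integers(18, 65, n).astype(float)
income = rng.normal(50000, 15000, n)
region = rng.choice(["north", "south", "east", "west"], n)
signal = (age < 35).astype(float) + (income > 50000).astype(float)
y = (signal + rng.normal(0, 0.5, n) > 1).astype(int)
df = pd.DataFrame({"age": age, "income": income, "region": region})
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=99
)

pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), ["age", "income"]),
        ("cat", OneHotEncoder(drop="first"), ["region"]),
    ], remainder="drop")),
    ("clf", DecisionTreeClassifier(random_state=99)),
])

# Shortened grid for illustration
grid = GridSearchCV(pipe, {"clf__max_depth": [2, 4, None]}, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

# One row per candidate depth
results = pd.DataFrame(grid.cv_results_)[
    ["param_clf__max_depth", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results)

# With refit=True, grid.predict delegates to the refitted best pipeline
same = np.array_equal(grid.predict(X_test), grid.best_estimator_.predict(X_test))
print("refit:", grid.refit, "| identical predictions:", same)
```

If the mean scores of neighbouring depths differ by less than their standard deviations, the choice between them is mostly noise, and the shallower tree is usually the safer pick.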

Where to go next

The Sklearn in Practice module is complete. Next: Model Evaluation — a full treatment of classification and regression metrics, cross-validation strategy, and honest evaluation protocols.
