Lab: build a complete ML pipeline
Ingest, clean, engineer features, split, scale, train, evaluate, and persist — each step as a tested, reproducible function.
- Implement each pipeline stage as a pure function with an associated test
- Assemble the stages into an end-to-end pipeline that runs reproducibly
- Evaluate the final model and produce a summary report
- Persist the fitted pipeline with joblib
A notebook is not a pipeline. A pipeline is a sequence of composable, tested, deterministic functions that can be run again from scratch and produce the same output. This lab builds one, step by step.
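As a minimal sketch of what "composable, tested, deterministic" can mean in practice, the example below defines two pure stages, a runner that chains them, and a test. The stage names, the column names (`tenure_months`, `total_charge`) and the list-of-dicts data shape are illustrative assumptions, not the lab's actual dataset or code.

```python
from functools import reduce
from typing import Callable

Rows = list[dict]  # hypothetical in-memory record format used for illustration


def drop_missing(rows: Rows) -> Rows:
    """Return a new list without rows that contain None; the input is not mutated."""
    return [row for row in rows if None not in row.values()]


def add_charge_per_month(rows: Rows) -> Rows:
    """Return copies of the rows with a derived feature added."""
    return [
        {**row, "charge_per_month": row["total_charge"] / row["tenure_months"]}
        for row in rows
    ]


def run_pipeline(rows: Rows, stages: list[Callable[[Rows], Rows]]) -> Rows:
    """Apply each stage in order, feeding the output of one into the next."""
    return reduce(lambda data, stage: stage(data), stages, rows)


def test_drop_missing_is_pure() -> None:
    raw = [{"a": 1}, {"a": None}]
    assert drop_missing(raw) == [{"a": 1}]
    assert raw == [{"a": 1}, {"a": None}]  # the input is left unchanged


# Usage: same input and same stages give the same output on every run.
raw = [
    {"tenure_months": 10, "total_charge": 500.0},
    {"tenure_months": None, "total_charge": 120.0},
]
stages = [drop_missing, add_charge_per_month]
result = run_pipeline(raw, stages)
test_drop_missing_is_pure()
```

Because no stage mutates its input or reads hidden state, each one can be tested in isolation and the whole chain can be re-run from the raw data at any time.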
Stage 1 — ingest and inspect
Stage 2 — clean
Stage 3 — feature engineering and split
Stage 4 — build pipeline, tune, and evaluate
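A minimal sketch of how this stage could produce the `best` estimator and the `X_test` split that Stage 5 relies on. It assumes synthetic stand-in data with hypothetical column names (`tenure_months`, `monthly_charge`, `plan`, `churned`), a one-parameter grid, and a fixed seed; the lab's real data and search space may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

SEED = 0  # fixed seed so the split, the tree and the search are repeatable

# Synthetic stand-in data; every column name here is hypothetical.
rng = np.random.default_rng(SEED)
n = 200
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 72, size=n),
    "monthly_charge": rng.normal(60.0, 15.0, size=n),
    "plan": rng.choice(["basic", "plus", "premium"], size=n),
})
df["churned"] = (df["tenure_months"] < 12).astype(int)

X = df.drop(columns="churned")
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=SEED
)

# Scale numeric columns and one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["tenure_months", "monthly_charge"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])
pipe = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=SEED)),
])

# Tune the tree depth with cross-validation on the training split only.
search = GridSearchCV(pipe, {"tree__max_depth": [2, 4, 6]}, cv=3)
search.fit(X_train, y_train)
best = search.best_estimator_  # refitted on all of X_train

print(f"test accuracy: {best.score(X_test, y_test):.3f}")
```

Keeping the preprocessing inside the `Pipeline` means the scaler and encoder are fitted only on training folds during the search, so the test split stays untouched until the final score.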
Stage 5 — persist the pipeline
Once you have the fitted pipeline, serialise it for later use. In a real project, this step writes to a versioned artefact store. Here it writes to a temporary file and verifies the predictions are identical after loading:
import joblib
import numpy as np
# Save
joblib.dump(best, "/tmp/churn_pipeline_v1.pkl")
# Load and verify
loaded = joblib.load("/tmp/churn_pipeline_v1.pkl")
assert np.array_equal(best.predict(X_test), loaded.predict(X_test))
print("Serialisation verified.")
The complete pipeline (ColumnTransformer, scalers, encoders, and fitted tree) is bundled in a single object. Loading it on a different machine (with compatible sklearn/numpy versions) produces exactly the same predictions, which is the minimal requirement for a reproducible deployment.
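One way to make the "compatible versions" condition checkable is to store a small version manifest next to the artefact and compare it at load time. The sketch below uses only the standard library; the helper names, the manifest filename, and the exact-match policy are assumptions for illustration (exact equality is stricter than compatibility, so treat a mismatch as a warning to investigate, not proof of breakage).

```python
import json
import platform
import tempfile
from importlib import metadata
from pathlib import Path


def capture_versions(packages):
    """Record the Python version and each package's installed version (None if absent)."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def version_mismatches(saved, current):
    """Return {name: (saved, current)} for every entry whose version differs."""
    return {
        name: (version, current.get(name))
        for name, version in saved.items()
        if current.get(name) != version
    }


def write_manifest(path, packages):
    """Write the current environment's versions as JSON beside the artefact."""
    Path(path).write_text(json.dumps(capture_versions(packages), indent=2))


def check_manifest(path, packages):
    """Compare a saved manifest against the current environment."""
    saved = json.loads(Path(path).read_text())
    return version_mismatches(saved, capture_versions(packages))


# Usage: written and checked in the same environment, so nothing should differ.
packages = ["scikit-learn", "numpy", "joblib"]
manifest = Path(tempfile.gettempdir()) / "churn_pipeline_v1.versions.json"  # hypothetical name
write_manifest(manifest, packages)
mismatches = check_manifest(manifest, packages)
```

On the loading machine, a non-empty result from `check_manifest` is a signal to rebuild the environment or re-verify predictions before trusting the loaded pipeline.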
Where to go next
You have completed the Advanced Data Science track. The skills built across these five modules — model selection, sklearn pipelines, rigorous evaluation, time series analysis, and pipeline engineering — form the core toolkit for taking analysis from notebook to production.