Saving and loading models
Serialise a fitted sklearn pipeline with joblib, load it back, and verify the predictions are identical — the minimal deployment handoff.
- Serialise a fitted pipeline with joblib.dump
- Load it with joblib.load and confirm predictions are byte-for-byte identical
- Understand what is and is not serialised (fitted state vs training data)
A trained model has no value if it exists only in the memory of a running process. The moment that process ends, the fit is lost. Serialisation converts the fitted object into bytes that can be written to disk, transferred to another machine, or loaded into a serving container — without repeating training.
Serialising with joblib
Sklearn's recommended serialiser is joblib, which stores objects that carry
large numpy arrays more efficiently than the standard pickle module:
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Assume pipe has been fitted
joblib.dump(pipe, "model.pkl")
Loading is symmetric:
loaded_pipe = joblib.load("model.pkl")
predictions = loaded_pipe.predict(X_new)
The loaded pipeline is an exact copy of the fitted one: all fitted attributes
(coef_, mean_, scale_, etc.) are preserved. The training data is not
stored, only the parameters derived from it.
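The round trip above can be tried end to end with a small sketch. The synthetic dataset from make_classification, the step names "scaler" and "clf", and the temporary directory are assumptions made for illustration; they do not come from a real project.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real training set (assumption of this sketch).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(pipe, model_path)
    loaded_pipe = joblib.load(model_path)

# Compare fitted attributes and predictions of the original and the loaded copy.
same_coef = np.array_equal(
    pipe.named_steps["clf"].coef_, loaded_pipe.named_steps["clf"].coef_
)
same_mean = np.array_equal(
    pipe.named_steps["scaler"].mean_, loaded_pipe.named_steps["scaler"].mean_
)
same_preds = np.array_equal(pipe.predict(X), loaded_pipe.predict(X))
print(same_coef, same_mean, same_preds)
```

The loaded object is a separate Python object from the original, so changing one after loading does not affect the other.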
What is serialised
joblib.dump serialises the Python object graph rooted at the pipeline. This
includes:
- The pipeline structure (step names and step classes).
- All hyperparameters passed to constructors.
- All fitted attributes (the attributes with a trailing underscore, set during fit).
It does not include the training data, the Python interpreter, or any libraries. The loading environment must therefore have compatible versions of sklearn, numpy, and joblib installed; otherwise deserialisation can fail or the loaded model can behave differently. Pin your dependencies.
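One way to see that only fitted parameters are stored is to compare the serialised size of the same pipeline fitted on very different amounts of data. The sketch below is illustrative: the helper name serialised_size and the synthetic datasets are assumptions, and it writes to an in-memory buffer rather than a file.

```python
import io

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def serialised_size(n_samples):
    """Fit a small pipeline on n_samples rows.

    Returns (bytes of training features, bytes of the serialised pipeline).
    Hypothetical helper, written only for this illustration.
    """
    X, y = make_classification(n_samples=n_samples, n_features=5, random_state=0)
    pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])
    pipe.fit(X, y)
    buffer = io.BytesIO()
    joblib.dump(pipe, buffer)  # joblib.dump also accepts a file-like object
    return X.nbytes, buffer.getbuffer().nbytes


small_data, small_model = serialised_size(100)
large_data, large_model = serialised_size(10_000)
print("training bytes:  ", small_data, "vs", large_data)
print("serialised bytes:", small_model, "vs", large_model)
```

The training set grows by a factor of 100 while the serialised pipeline should stay roughly the same size, because its fitted attributes depend on the number of features, not the number of rows. This holds for the scaler and linear model used here; estimators that keep training samples as part of their fitted state (nearest-neighbour models, for example) behave differently.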
Pickle-based serialisation (which joblib uses) can execute arbitrary Python
code on load. Never load a .pkl file from an untrusted source. For
production serving with security requirements, consider ONNX or PMML
formats, which are data-format specifications rather than executable
serialisations.
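The risk is easy to demonstrate with the standard pickle module. In the deliberately harmless sketch below, the class name Payload is hypothetical; its __reduce__ method tells pickle which callable to invoke during loading, and a malicious file could name any callable in place of print.

```python
import pickle


class Payload:
    """Hypothetical class whose pickle instructs the loader to call print()."""

    def __reduce__(self):
        # pickle stores (callable, args); the loader calls callable(*args).
        return (print, ("this ran during unpickling",))


blob = pickle.dumps(Payload())

# Loading calls print(...) and returns its result, not a Payload instance.
result = pickle.loads(blob)
print(type(result).__name__)
```

The call happens inside pickle.loads itself, before the caller has any chance to inspect the object, which is why checking the loaded model afterwards is not a defence.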
Verifying the load
After loading, confirm that predictions on a held-out sample match exactly:
import numpy as np
# array_equal checks exact equality; allclose would accept small numeric differences
assert np.array_equal(
    pipe.predict(X_test),
    loaded_pipe.predict(X_test),
), "Loaded model produces different predictions"
print("Verification passed.")
This is worth doing in a deployment pipeline as a smoke test: it catches version mismatches and file corruption before they reach users.
Versioning artefacts
A saved model without metadata is hard to trust. At minimum, store alongside the model file:
- The sklearn version (sklearn.__version__).
- The training dataset identifier or hash.
- The evaluation metrics from the held-out test set.
- The date of training.
A simple approach is to save a companion JSON file with these fields. More sophisticated ML platforms (MLflow, DVC) handle this automatically.
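A companion JSON file might be written along the following lines. This is a minimal sketch: the function names, the field names, the file name model.meta.json, and the placeholder dataset bytes and metric value are all assumptions chosen for illustration, not a standard format.

```python
import hashlib
import json
import os
import tempfile
from datetime import datetime, timezone

import sklearn


def dataset_fingerprint(raw_bytes):
    """Return a SHA-256 hex digest identifying the training data."""
    return hashlib.sha256(raw_bytes).hexdigest()


def write_model_metadata(path, dataset_hash, metrics):
    """Write the metadata fields listed above to a JSON file and return them."""
    metadata = {
        "sklearn_version": sklearn.__version__,
        "dataset_sha256": dataset_hash,
        "metrics": metrics,
        "trained_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(metadata, fh, indent=2)
    return metadata


with tempfile.TemporaryDirectory() as tmp_dir:
    meta_path = os.path.join(tmp_dir, "model.meta.json")
    written = write_model_metadata(
        meta_path,
        dataset_fingerprint(b"placeholder training data"),  # stand-in bytes
        {"accuracy": 0.0},  # placeholder value, not a real result
    )
    with open(meta_path, encoding="utf-8") as fh:
        reloaded = json.load(fh)

print(sorted(reloaded))
```

At load time, the stored sklearn_version can be compared with the installed sklearn.__version__ before the model file is opened.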
Where to go next
Next: the lab — build a complete pipeline with column-level preprocessing, evaluate it with cross-validation, then tune a hyperparameter with GridSearchCV.
Sklearn pipelines
Chain a scaler and a model into one object — and why Pipeline.fit() is the only safe way to avoid training-set leakage into preprocessing.
Lab: build a sklearn pipeline
Encode categoricals, scale numerics, add a model, evaluate with cross-validation, and tune max_depth — all inside one Pipeline.