Reproducible pipelines

Pin seeds, hash intermediates, version your data — the checklist for pipelines that produce the same output from the same input, every time.

Data Science · Advanced · 6 min read
By the end of this lesson you will be able to:
  • Apply idempotency to data pipeline steps
  • Pin random seeds in Python and NumPy to ensure deterministic outputs
  • Use file checksums to detect when intermediate data has changed

A reproducible pipeline produces the same outputs from the same inputs, regardless of when, where, or by whom it is run. That is a stricter requirement than it sounds. Most notebooks and scripts that "work" are not reproducible — they depend on implicit state, random seeds that change each run, or intermediate files whose provenance is unknown.

Reproducibility matters for three practical reasons: debugging (if two runs produce different results, which is correct?), collaboration (a colleague cannot verify your work if they cannot replicate it), and trust (a model whose training cannot be reproduced cannot be audited).

Idempotency

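An idempotent operation produces the same result whether it runs once or a thousand times on the same input: rerunning a step leaves its outputs exactly as the first run left them, with no duplicated rows and no accumulated state. The simplest way to get there is to write each pipeline step as a pure function: given the same input data, it always produces the same output, with no side effects on shared state.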
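Steps that write files are where idempotency usually breaks, because appending on every run makes the output depend on how many times the step has run. The sketch below is one possible pattern, not a prescribed one: the hypothetical helper write_output always replaces the whole output file, so a rerun leaves the same bytes on disk.

```python
import os
import tempfile


def write_output(path: str, lines: list[str]) -> None:
    """Write lines to path, replacing any previous output.

    Hypothetical helper for illustration. The data goes to a temporary
    file in the same directory first and is then moved into place, so a
    rerun overwrites the old output instead of appending to it.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    os.replace(tmp_path, path)  # swap the finished file into place
```

Calling write_output twice with the same lines leaves one copy of the data, whereas opening the file in append mode would leave two.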

The most common violations are functions that depend on the current date, system environment variables, or network resources at call time. These make the output a function of context, not just input. Wherever possible, make that context explicit: pass the date as an argument instead of calling datetime.now() inside the function.
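As a minimal sketch of making context explicit, the two functions below take the run date as a parameter. The names build_report_path and filter_recent, the "reports" directory, and the "day" key are assumptions made up for this illustration.

```python
from datetime import date


def build_report_path(run_date: date, base_dir: str = "reports") -> str:
    """Return the output path for a given run date.

    The date is an explicit argument, so the same input always gives the
    same path, no matter when the function is actually called.
    """
    return f"{base_dir}/{run_date.isoformat()}/summary.csv"


def filter_recent(rows: list[dict], run_date: date, max_age_days: int) -> list[dict]:
    """Keep rows whose 'day' is at most max_age_days before run_date."""
    return [r for r in rows if 0 <= (run_date - r["day"]).days <= max_age_days]
```

Only the entry point of the pipeline should decide what "today" is. Every step below it receives that value as data, which also makes it easy to rerun the pipeline for a past date.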

Pinning random seeds

Any step that uses randomness must pin a seed. In Python:

import random
import numpy as np

random.seed(42)     # seeds Python's built-in global generator
np.random.seed(42)  # seeds NumPy's legacy global generator

For NumPy's newer Generator API (preferred), create a generator object and draw from it instead of relying on global state:

rng = np.random.default_rng(42)

For scikit-learn, pass random_state=42 to every function and estimator that accepts it: train_test_split, RandomForestClassifier, KMeans, etc. A single unseeded step anywhere in the pipeline breaks end-to-end reproducibility.
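To see what a pinned seed buys you, here is a standard-library sketch of a seeded split. It is a stand-in written for illustration, not scikit-learn's train_test_split, and the function name seeded_split is an assumption. It uses a local random.Random instance so that no global state is involved.

```python
import random


def seeded_split(items: list, test_fraction: float, seed: int) -> tuple[list, list]:
    """Shuffle a copy of items with a local seeded RNG and split it.

    Returns (train, test). Illustrative stand-in, not a library API.
    """
    rng = random.Random(seed)   # local generator: global state is untouched
    shuffled = list(items)      # copy so the caller's list is not mutated
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train_a, test_a = seeded_split(list(range(10)), 0.2, seed=42)
train_b, test_b = seeded_split(list(range(10)), 0.2, seed=42)
assert (train_a, test_a) == (train_b, test_b)  # same seed, same split
```

Passing the generator or the seed explicitly, rather than seeding once at the top of a script, means a step stays deterministic even if another step is added or reordered around it.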

Hashing intermediate files

If a pipeline step is expensive (loading 50 GB, training a model for 2 hours), you want to cache its output and skip it on subsequent runs when the input has not changed. A reliable way to detect "input has not changed" is a checksum of the file contents, which, unlike a modification time, is unaffected by copying or re-downloading the file:

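import hashlib

def file_hash(path: str) -> str:
    """Return the SHA-256 hex digest of the file at path."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 64 KiB chunks so large files never have to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()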

Store the hash alongside the cached output. Before running the step, check whether the input hash matches the stored hash. If it matches, load the cache. If not, rerun and update the hash. This is the logic that tools like DVC implement; Make applies the same skip-if-unchanged idea, but by default it compares file modification times rather than content hashes.
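The check-then-run logic described above can be sketched in a few lines. This is one possible layout under stated assumptions: the helper name run_if_changed and the ".sha256" sidecar file are choices made for this example, and file_hash is repeated so that the block runs on its own.

```python
import hashlib
import os
from typing import Callable


def file_hash(path: str) -> str:
    """Return the SHA-256 hex digest of the file at path."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def run_if_changed(input_path: str, output_path: str,
                   step: Callable[[str, str], None]) -> bool:
    """Run step(input_path, output_path) unless the cached output is current.

    Returns True if the step ran, False if the cache was reused.
    """
    hash_path = output_path + ".sha256"  # sidecar file holding the input hash
    current = file_hash(input_path)

    if os.path.exists(output_path) and os.path.exists(hash_path):
        with open(hash_path, "r", encoding="utf-8") as f:
            if f.read().strip() == current:
                return False             # input unchanged: keep cached output

    step(input_path, output_path)        # (re)compute the expensive output
    with open(hash_path, "w", encoding="utf-8") as f:
        f.write(current)                 # record the hash only after success
    return True
```

Writing the hash after the step finishes means a crashed run leaves no matching hash behind, so the next run recomputes instead of trusting a partial output. A fuller version would also hash the step's code and parameters, since a change to either invalidates the cache just as a change to the input does.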

Versioning training data

A model is only as reproducible as the data it was trained on. Store a reference to the exact version of the training dataset alongside the saved model: a hash of the data file, a DVC data version, or an immutable S3 object version. Without this, "retrain with the same data" is no longer possible six months later, once the source database has been modified.

The most common reproducibility failure is forgetting to pin the seed on a train/test split. If the split is different, every downstream metric is different. Always pass random_state= to train_test_split, and store the seed value in your model metadata.
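A small metadata file saved next to the model can carry both pieces of information: the data version and the split seed. The sketch below assumes the dataset is a single local file identified by its SHA-256 hash. The function names and the JSON keys are made up for this example, not a standard format.

```python
import hashlib
import json


def _sha256_of(path: str) -> str:
    """Return the SHA-256 hex digest of the file at path."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def write_model_metadata(data_path: str, seed: int, metadata_path: str) -> dict:
    """Record the training data hash and the split seed in a JSON file."""
    metadata = {
        "training_data_sha256": _sha256_of(data_path),
        "random_state": seed,
    }
    with open(metadata_path, "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2, sort_keys=True)
    return metadata


def data_matches(data_path: str, metadata_path: str) -> bool:
    """Check that data_path is still the file the model was trained on."""
    with open(metadata_path, "r", encoding="utf-8") as f:
        metadata = json.load(f)
    return metadata["training_data_sha256"] == _sha256_of(data_path)
```

Running data_matches before a retrain turns "the data has silently changed" from a surprise in the metrics into an explicit failure at the start of the run.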

Where to go next

With reproducibility principles established, the next lesson operationalises them: testing pipelines — unit tests for transformation functions, schema validation with pandera, and assertions that detect data leakage.
