Testing data pipelines

Unit-test transformation functions, validate DataFrame schemas with pandera, and assert that no test-set statistics leaked into the preprocessing fitted for training.

Data Science · Advanced · 10 min read
By the end of this lesson you will be able to:
  • Write unit tests for a data transformation function using plain assert
  • Validate DataFrame column types and value ranges with pandera
  • Write an assertion that confirms preprocessing statistics were computed on training data only, never on test data

Data pipelines often fail silently. A transformation bug frequently raises no exception; instead it produces a subtly wrong DataFrame that propagates through training and degrades model quality in ways that show up only at evaluation time or, worse, in production. Testing is how you catch those failures before they propagate.

Unit-testing transformation functions

The first rule: write transformations as pure functions that take a DataFrame and return a DataFrame. Pure functions are testable — you can construct a small input, call the function, and assert the output matches expectations.
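
As a minimal sketch of that pattern, the block below defines one pure transformation and three plain-assert tests for it. The function name add_price_per_unit and the columns total_price and quantity are hypothetical, chosen only for illustration.

```python
import pandas as pd


def add_price_per_unit(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transformation: add total_price / quantity as a new column."""
    out = df.copy()  # copy so the caller's DataFrame is never mutated
    out["price_per_unit"] = out["total_price"] / out["quantity"]
    return out


def test_computes_expected_values():
    df = pd.DataFrame({"total_price": [10.0, 9.0], "quantity": [2, 3]})
    result = add_price_per_unit(df)
    assert result["price_per_unit"].tolist() == [5.0, 3.0]


def test_does_not_mutate_input():
    df = pd.DataFrame({"total_price": [10.0], "quantity": [2]})
    add_price_per_unit(df)
    assert list(df.columns) == ["total_price", "quantity"]


def test_preserves_row_count_and_index():
    df = pd.DataFrame({"total_price": [4.0, 6.0], "quantity": [1, 2]}, index=[7, 8])
    result = add_price_per_unit(df)
    assert len(result) == len(df)
    assert result.index.tolist() == [7, 8]


# Framework-free runner: call each test function directly.
for test in (
    test_computes_expected_values,
    test_does_not_mutate_input,
    test_preserves_row_count_and_index,
):
    test()
print("all transformation tests passed")
```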

These tests are framework-free (no pytest import needed in the runner). In a real project, these would be in a tests/ directory and run with pytest. The key habit is the same: construct a small, controlled input; call the function; assert the output is exactly what you expect.

Schema validation with pandera

Transformation tests check logic; schema tests check structure. pandera lets you declare the expected schema of a DataFrame — column names, types, value ranges — and raise an error when data violates it.

Asserting no data leakage

Data leakage in a preprocessing step means the test set's statistics influenced the transformation applied to the training data. The canonical example: fitting a StandardScaler on the combined train+test set instead of on the training set only. The check is simple: the scaler's fitted mean should equal the mean of the training rows alone, not the mean of the combined data.
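
One way to express that check is sketched below, using a tiny hand-built array so the numbers are easy to follow. The helper assert_fitted_on_train_only is a hypothetical name, and the fixed train/test arrays stand in for a real split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: test rows are deliberately far from the training rows.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[10.0], [20.0]])
X_all = np.vstack([X_train, X_test])


def assert_fitted_on_train_only(scaler, train, full):
    """Fail if the scaler's statistics do not match the training rows alone."""
    assert np.allclose(scaler.mean_, train.mean(axis=0)), "mean is not the train mean"
    assert np.allclose(scaler.scale_, train.std(axis=0)), "scale is not the train std"
    assert not np.allclose(scaler.mean_, full.mean(axis=0)), "mean matches full data"


# Correct usage: fit on training rows, then only transform the test rows.
scaler = StandardScaler().fit(X_train)
assert_fitted_on_train_only(scaler, X_train, X_all)
X_test_scaled = scaler.transform(X_test)

# Leaky usage: fitting on train+test should trip the assertion.
leaky = StandardScaler().fit(X_all)
try:
    assert_fitted_on_train_only(leaky, X_train, X_all)
    leak_detected = False
except AssertionError:
    leak_detected = True

assert leak_detected
print("leakage check behaves as expected")
```

The third assertion only carries information when the train and full-data means actually differ, which is why the toy test rows are placed well outside the training range.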

In practice, leakage is harder to detect than this minimal example suggests. Subtle leakage occurs when a feature is engineered using a statistic computed over the full dataset before splitting — for example, computing a target-mean encoding before calling train_test_split. The rule: all statistics must be computed after splitting, on training data only.
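
The target-mean case can be sketched the same way. The column names city and target are hypothetical, and a fixed positional split replaces train_test_split here only to keep the numbers deterministic.

```python
import pandas as pd

# Hypothetical toy data; in a real pipeline the split would come from train_test_split.
df = pd.DataFrame(
    {
        "city": ["a", "a", "b", "b", "a", "b"],
        "target": [1, 0, 1, 1, 0, 0],
    }
)
train = df.iloc[:4].copy()  # first four rows act as the training set
test = df.iloc[4:].copy()   # last two rows act as the test set

# Compute the encoding from training rows only, after the split.
city_means = train.groupby("city")["target"].mean()
global_mean = train["target"].mean()  # fallback for categories unseen in train

train["city_te"] = train["city"].map(city_means)
test["city_te"] = test["city"].map(city_means).fillna(global_mean)

# The test encoding must come from the train-only means (a -> 0.5, b -> 1.0).
assert test["city_te"].tolist() == [0.5, 1.0]

# A leaky encoding computed on the full data would give different values.
leaky_means = df.groupby("city")["target"].mean()
assert not city_means.equals(leaky_means)
print("target-mean encoding uses training statistics only")
```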

Where to go next

The lab brings everything together: a complete, tested, reproducible ML pipeline in a single notebook — ingest, clean, engineer, split, scale, train, evaluate, and persist.
