Lab: containerise a pipeline

Write a Dockerfile and compose.yaml for the hardened pipeline from Module 1, run it with Docker Compose, and verify that output files appear on the host filesystem.

The pipeline from Module 1 is tested, hardened with checkpoints and retry logic, and ready to ship. This lab packages it into a Docker image so it runs identically on any machine with Docker installed — no Python version mismatch, no missing tenacity install, no "works on my machine" surprises.

The project layout

Start with this directory structure:

my-pipeline/
  pipeline.py          # the hardened script from Module 1
  requirements.txt
  Dockerfile
  compose.yaml
  output/              # created by docker compose run; add to .gitignore
  .env                 # local secrets; add to .gitignore

Step 1 — requirements.txt

tenacity==8.3.0
requests==2.32.3

Pin exact versions. Floating requirements (tenacity>=8) produce non-reproducible images: two builds a month apart can produce different package sets.
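One way to enforce the pinning rule is a small pre-build check. The sketch below is an illustration, not part of the Module 1 pipeline: the function name unpinned_requirements is hypothetical, and the regular expression is a deliberate simplification of the full requirement syntax (it ignores environment markers and URL requirements).

```python
import re

# Matches "name==version", optionally with extras such as "pkg[extra]==1.0".
# Simplified on purpose; it does not cover the full requirement grammar.
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?==[^=\s]+$")


def unpinned_requirements(text: str) -> list[str]:
    """Return the requirement lines that are not pinned with '=='."""
    problems = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not PINNED.match(line):
            problems.append(line)
    return problems


if __name__ == "__main__":
    sample = "tenacity==8.3.0\nrequests>=2.32\n"
    print(unpinned_requirements(sample))  # ['requests>=2.32']
```

Running a check like this before docker build would flag a floating requirement before it reaches an image.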

Step 2 — Dockerfile

# syntax=docker/dockerfile:1

FROM python:3.12-slim

WORKDIR /app

# Copy and install dependencies first (cache the expensive layer)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy source
COPY pipeline.py .

# Security: drop root
RUN useradd --system --no-create-home pipeline
USER pipeline

# Output directory (created at runtime via volume mount)
ENV OUTPUT_DIR=/app/output

ENTRYPOINT ["python", "pipeline.py"]

Checkpoint 1 — build the image

docker build -t my-pipeline:dev .

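Confirm the build succeeds. Then change a comment in pipeline.py and rebuild. Observe that every step up to and including the pip install is served from cache, while COPY pipeline.py . and the steps after it run again, because a changed layer invalidates every layer that follows it. Those later steps are cheap, so the rebuild should finish in a few seconds.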
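The caching behaviour can be pictured with a toy model. The sketch below is not how Docker computes its cache keys internally; it only assumes that each step's key is chained to its parent's key and that a COPY step's key also depends on the copied file's content. The names layer_keys, rebuilt_steps, and DOCKERFILE_STEPS are hypothetical.

```python
import hashlib

# The build steps from the Dockerfile in Step 2, in order.
DOCKERFILE_STEPS = [
    "FROM python:3.12-slim",
    "WORKDIR /app",
    "COPY requirements.txt .",
    "RUN pip install --no-cache-dir -r requirements.txt",
    "COPY pipeline.py .",
    "RUN useradd --system --no-create-home pipeline",
    "USER pipeline",
    "ENV OUTPUT_DIR=/app/output",
    'ENTRYPOINT ["python", "pipeline.py"]',
]


def layer_keys(steps, files):
    """Compute one chained cache key per step (toy model, not Docker's algorithm)."""
    keys, parent = [], ""
    for step in steps:
        material = parent + "\n" + step
        if step.startswith("COPY "):
            src = step.split()[1]  # COPY <src> <dest>
            material += "\n" + files.get(src, "")  # file content feeds the key
        parent = hashlib.sha256(material.encode()).hexdigest()
        keys.append(parent)
    return keys


def rebuilt_steps(steps, files_before, files_after):
    """Return the steps whose cache key changed between two builds."""
    old = layer_keys(steps, files_before)
    new = layer_keys(steps, files_after)
    return [s for s, a, b in zip(steps, old, new) if a != b]


if __name__ == "__main__":
    before = {"requirements.txt": "tenacity==8.3.0", "pipeline.py": "# v1"}
    after = {"requirements.txt": "tenacity==8.3.0", "pipeline.py": "# v2"}
    for step in rebuilt_steps(DOCKERFILE_STEPS, before, after):
        print("rebuild:", step)
```

In this model, editing pipeline.py leaves the pip install step untouched, which is why requirements.txt is copied and installed before the source file.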

Step 3 — compose.yaml

# compose.yaml

services:
  pipeline:
    build: .
    environment:
      API_KEY: "${API_KEY}"
      OUTPUT_DIR: /app/output
    volumes:
      - ./output:/app/output

Checkpoint 2 — run with Compose

Create a .env file:

API_KEY=dev-placeholder

Then run:

docker compose run --rm pipeline

The pipeline should execute and write its output to ./output/ on the host. Verify:

ls -la output/
cat output/transformed.json

If the files are not there, check that the OUTPUT_DIR environment variable matches the path used in pipeline.py and that the volume mount is correct.

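If ./output does not exist, Docker creates it automatically, owned by root on Linux, which can cause permission errors when the container (running as the pipeline user) tries to write to it. Pre-create the directory with mkdir -p output on the host to avoid this. If writes still fail on a Linux host, the pipeline user's UID inside the container probably does not match the owner of ./output; adjust the directory's ownership or permissions so that user can write to it.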
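A permission problem on the mounted directory is easier to diagnose if the pipeline checks write access at startup. The sketch below is one possible preflight check, not code from Module 1; the function name ensure_writable is hypothetical, and it assumes the pipeline reads its target directory from OUTPUT_DIR as in compose.yaml.

```python
import os
import tempfile
from pathlib import Path


def ensure_writable(output_dir: Path) -> None:
    """Exit with a clear message if output_dir cannot be created or written to."""
    try:
        output_dir.mkdir(parents=True, exist_ok=True)
        # Creating (and automatically deleting) a temp file tests write access directly.
        with tempfile.NamedTemporaryFile(dir=output_dir):
            pass
    except OSError as exc:
        uid = os.getuid() if hasattr(os, "getuid") else "n/a"
        raise SystemExit(
            f"Cannot write to {output_dir} as uid {uid}: {exc}. "
            "Check the ownership of the mounted host directory."
        ) from exc


if __name__ == "__main__":
    ensure_writable(Path(os.environ.get("OUTPUT_DIR", "output")))
    print("output directory is writable")
```

Calling a check like this at the top of main() turns a mid-run PermissionError into an immediate, readable failure.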

Step 4 — pass the API key as a secret

In production you would not store real secrets in .env. Three common patterns:

CI/CD injection — the pipeline runs in GitHub Actions or another CI system that injects secrets as environment variables:

# .github/workflows/pipeline.yml  (covered in Module 5)
env:
  API_KEY: ${{ secrets.API_KEY }}

Docker secret — for Docker Swarm or Compose v2 with secrets:

services:
  pipeline:
    secrets:
      - api_key
    environment:
      API_KEY_FILE: /run/secrets/api_key

secrets:
  api_key:
    environment: "API_KEY"   # reads from host env at deploy time

Then in pipeline.py, read the key from API_KEY if it is set, and otherwise from the file named by API_KEY_FILE: os.environ.get("API_KEY") or open(os.environ["API_KEY_FILE"]).read().strip().
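The lookup order can be wrapped in a small helper. This is a minimal sketch under the assumption that the pipeline accepts either variable; the function name load_api_key is hypothetical and is not part of the Module 1 script.

```python
import os
from pathlib import Path


def load_api_key(env=os.environ) -> str:
    """Return the key from API_KEY, else from the file named by API_KEY_FILE."""
    key = env.get("API_KEY")
    if key:
        return key
    key_file = env.get("API_KEY_FILE")
    if key_file:
        # Secret files often end with a newline; strip it.
        return Path(key_file).read_text().strip()
    raise RuntimeError("Set API_KEY or API_KEY_FILE")


if __name__ == "__main__":
    try:
        api_key = load_api_key()
        print(f"loaded a key of length {len(api_key)}")
    except RuntimeError as exc:
        print(f"no key configured: {exc}")
```

Passing the environment as a parameter keeps the helper easy to test without touching the real process environment.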

AWS/GCP/Azure secrets manager — the container IAM role grants permission to fetch the secret at startup.

For this lab, the .env approach is fine. Document the switch to a secrets manager as a follow-up task.

Runnable demo

The demo below simulates the containerised pipeline execution — showing what you would see in docker compose run output without requiring Docker to be installed.


import json
import os
from pathlib import Path

# ── Simulated containerised pipeline ────────────────────────────────────────
# Reads config from environment (injected by compose.yaml)
# Writes output to OUTPUT_DIR (mounted as a volume)

def main():
    api_key = os.environ.get("API_KEY", "dev-placeholder")
    output_dir = Path(os.environ.get("OUTPUT_DIR", "/tmp/pipeline_output"))
    output_dir.mkdir(parents=True, exist_ok=True)

    # Show only the first four characters of the key; mask the rest
    print(f"API_KEY  = {api_key[:4]}{'*' * (len(api_key) - 4)}")
    print(f"OUTPUT   = {output_dir}")
    print()

    # Step 1: fetch (simulated)
    records = [{"id": i, "value": i * 10} for i in range(1, 6)]
    raw_path = output_dir / "raw.json"
    raw_path.write_text(json.dumps(records, indent=2))
    print(f"[fetch]     wrote {len(records)} records to {raw_path.name}")

    # Step 2: transform
    transformed = [{"id": r["id"], "value": r["value"] * 2} for r in records]
    out_path = output_dir / "transformed.json"
    out_path.write_text(json.dumps(transformed, indent=2))
    print(f"[transform] wrote {len(transformed)} records to {out_path.name}")

    # Step 3: verify output is readable (simulates volume mount check)
    loaded = json.loads(out_path.read_text())
    assert len(loaded) == len(transformed)
    print(f"[verify]    output file readable, {len(loaded)} records confirmed")

    print()
    print("Pipeline completed. Output files:")
    for f in sorted(output_dir.iterdir()):
        print(f"  {f.name}  ({f.stat().st_size} bytes)")

# Inject environment variables for the demo
os.environ["API_KEY"] = "sk-dev-placeholder-0000"
os.environ["OUTPUT_DIR"] = "/tmp/demo_output"

main()

Checkpoint 3 — multi-stage build (extension)

For production images, a multi-stage build separates the build environment from the runtime environment:

# Stage 1: install packages
FROM python:3.12-slim AS builder
WORKDIR /build
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Stage 2: runtime image
FROM python:3.12-slim
COPY --from=builder /install /usr/local
WORKDIR /app
COPY pipeline.py .
RUN useradd --system --no-create-home pipeline
USER pipeline
ENTRYPOINT ["python", "pipeline.py"]

This keeps build-time tooling out of the runtime image. For most pipeline scripts the size reduction is modest, but it is the standard pattern for production-grade images.

Where to go next

Module complete. Next up: CI/CD for Automation — automate the Docker build and pipeline run inside GitHub Actions so every push to main triggers a tested, containerised deployment.
