DAG thinking

Express a data pipeline as a directed acyclic graph — identify dependencies, find the critical path, and understand why cycles must be forbidden.

Most pipelines are described informally as a sequence of steps. That works until the pipeline grows: some steps can run in parallel, some depend on two predecessors, and a failed step should not necessarily abort unrelated branches. To reason about these properties precisely, the industry converged on a single abstraction: the directed acyclic graph, or DAG.

What a DAG is

A graph is a set of nodes connected by edges. In a pipeline DAG:

Each node is a task — a discrete unit of work with clear inputs and outputs.
Each directed edge from A to B means "B cannot start until A has succeeded".

Directed means edges have a direction (A → B, not A — B). Acyclic means there are no cycles: you cannot follow edges from a node back to itself.

   fetch_data
       │
       ▼
  validate_schema
       │
    ┌──┴──┐
    ▼     ▼
 clean  enrich           ← these two can run in parallel
    └──┬──┘
       ▼
  write_report

In this DAG, clean and enrich both depend on validate_schema but not on each other, so they can execute simultaneously. write_report depends on both, so it cannot start until the slower of the two has finished.
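One way to make the diagram concrete is to write it down as a mapping from each task to the tasks it depends on. The sketch below does that and groups the tasks into "waves" that could run at the same time. It assumes Python 3.9+ for the standard-library graphlib module; the names dag and execution_waves are illustrative, not part of any orchestration tool.

```python
from graphlib import TopologicalSorter

# Each key is a task; its value is the set of tasks it depends on.
dag = {
    "fetch_data": set(),
    "validate_schema": {"fetch_data"},
    "clean": {"validate_schema"},
    "enrich": {"validate_schema"},
    "write_report": {"clean", "enrich"},
}


def execution_waves(dependencies):
    """Group tasks into waves; tasks in the same wave can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        # Every task whose dependencies have all been marked done.
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        # Pretend the whole wave succeeded so the next one is unlocked.
        sorter.done(*ready)
    return waves


waves = execution_waves(dag)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

The third wave contains both clean and enrich, which matches the annotation in the diagram. A real scheduler would start each task as soon as its own dependencies finish rather than waiting for a whole wave.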

The critical path

The critical path is the longest sequence of dependent tasks from start to finish, measured by total duration. It determines the minimum possible runtime for the pipeline, regardless of how much parallelism you add.

Suppose the tasks in the example above have these durations:

fetch_data takes 10 s
validate_schema takes 5 s
clean takes 20 s, enrich takes 8 s
write_report takes 3 s

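The critical path is fetch_data → validate_schema → clean → write_report, which takes 10 + 5 + 20 + 3 = 38 s. The other path, through enrich, takes only 10 + 5 + 8 + 3 = 26 s. Even if you run enrich on a thousand machines in parallel, the pipeline still takes at least 38 s. Optimising enrich alone buys nothing.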

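When a pipeline is too slow, profile the critical path first. Speeding up or parallelising tasks that are not on the critical path yields no wall-clock improvement. Once you shorten a task that is on it, recompute: a different path may have become the longest.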
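The arithmetic can be automated. The following is a minimal sketch, assuming Python 3.9+ and the example durations from this section: it visits tasks in dependency order, records the earliest time each one can finish, and then walks back from the task that finishes last. The function name critical_path and the two dictionaries are illustrative.

```python
from graphlib import TopologicalSorter

# Task -> set of tasks it depends on (the example DAG).
dependencies = {
    "fetch_data": set(),
    "validate_schema": {"fetch_data"},
    "clean": {"validate_schema"},
    "enrich": {"validate_schema"},
    "write_report": {"clean", "enrich"},
}

# Task -> duration in seconds (the example figures).
durations = {
    "fetch_data": 10,
    "validate_schema": 5,
    "clean": 20,
    "enrich": 8,
    "write_report": 3,
}


def critical_path(dependencies, durations):
    """Return (total_seconds, tasks) for the longest-duration path."""
    finish = {}    # earliest finish time of each task
    previous = {}  # predecessor on the longest path into each task
    for task in TopologicalSorter(dependencies).static_order():
        deps = dependencies.get(task, ())
        # A task can start only when its slowest dependency has finished.
        slowest = max(deps, key=finish.__getitem__, default=None)
        start = finish[slowest] if slowest is not None else 0
        finish[task] = start + durations[task]
        previous[task] = slowest
    # Walk back from the task that finishes last.
    task = max(finish, key=finish.__getitem__)
    total = finish[task]
    path = []
    while task is not None:
        path.append(task)
        task = previous[task]
    return total, path[::-1]


total, path = critical_path(dependencies, durations)
print(total, " -> ".join(path))
```

Changing the duration of enrich to any value below 20 leaves the result unchanged, which is the point made above: work off the critical path does not shorten the pipeline.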

Why cycles are forbidden

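A cycle means a task depends, directly or through a chain of other tasks, on itself. In the simplest case, task A depends on B and B depends on A. At runtime, each task waits for its dependencies to complete before starting. With a cycle, A waits for B and B waits for A: a classic deadlock. Neither task, nor anything downstream of them, can ever start.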

More subtly, a cycle also means there is no consistent definition of "inputs" and "outputs": the data produced by A is an input to B, but B's output is also an input to A. This circular dependency makes the computation undefined.

Orchestration tools typically check for cycles when the DAG is defined or loaded, and reject an offending DAG before a single task executes.
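A definition-time check of this kind can be sketched with Python's standard library. This is an illustration of the idea, not how any particular orchestrator implements it; validate_acyclic is a hypothetical helper, and graphlib requires Python 3.9+.

```python
from graphlib import CycleError, TopologicalSorter


def validate_acyclic(dependencies):
    """Return None if the graph is a DAG, else a list of tasks forming a cycle."""
    try:
        # prepare() raises CycleError if the graph contains any cycle.
        TopologicalSorter(dependencies).prepare()
    except CycleError as error:
        # The second element of args holds the detected cycle as a list
        # whose first and last entries are the same task.
        return error.args[1]
    return None


valid = {"A": set(), "B": {"A"}}
cyclic = {"A": {"B"}, "B": {"A"}}

print(validate_acyclic(valid))   # no cycle found
print(validate_acyclic(cyclic))  # the tasks that depend on each other
```

Because the check needs only the dependency mapping, it can run before any task code is loaded or executed.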

Expressing a pipeline as a DAG

Translating an informal pipeline into a DAG is a design exercise:

List every discrete task and name it after what it produces, not what it does. raw_records, validated_records, enriched_records are better node names than step1, step2, step3.
For each task, ask: what does it need as input? Draw an edge from each input task.
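Look for tasks that do not depend on each other, directly or through intermediate tasks. Those are candidates for parallelism, even when they share an upstream dependency, as clean and enrich do.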
Trace all paths from source to sink and add up the task durations along each. The path with the largest total is your critical path.
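The first three steps can be sketched in code. The task table below is hypothetical: it renames the example pipeline after what each task produces and lists the inputs each one needs. The sketch reads "candidates for parallelism" as pairs of tasks where neither is upstream of the other, directly or transitively; ancestors and parallel_candidates are illustrative helper names.

```python
from itertools import combinations

# Hypothetical task table: each task is named after what it produces
# and lists the outputs of other tasks that it needs as input.
inputs = {
    "raw_records": [],
    "validated_records": ["raw_records"],
    "cleaned_records": ["validated_records"],
    "enriched_records": ["validated_records"],
    "report": ["cleaned_records", "enriched_records"],
}


def ancestors(task, inputs):
    """All tasks reachable by following input edges backwards from task."""
    seen = set()
    stack = list(inputs[task])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(inputs[current])
    return seen


def parallel_candidates(inputs):
    """Pairs of tasks where neither depends on the other, even transitively."""
    upstream = {task: ancestors(task, inputs) for task in inputs}
    return [
        (a, b)
        for a, b in combinations(sorted(inputs), 2)
        if a not in upstream[b] and b not in upstream[a]
    ]


candidates = parallel_candidates(inputs)
print(candidates)
```

For this table the only candidate pair is cleaned_records and enriched_records, the renamed counterparts of clean and enrich.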

Where to go next

Next: Makefiles for pipelines — Make is the oldest widely-used DAG executor. Its target/dependency syntax maps directly onto the DAG model you just learned, and it is already installed on almost every developer machine.
