Skip to content

Execution Model

Understanding how Datum executes streams helps you reason about performance, CPU usage, and the cost of different operator patterns.

Push vocabulary, pull implementation

The public API uses push-shaped names (push, pull, demand, PushOutlet). The internal execution of a linear fused chain is pull-based: the sink drives the loop by requesting elements from the chain one at a time through a Box<dyn Iterator<Item = StreamResult<T>>> (BoxStream). No element is produced until it is demanded.

This distinction matters for reading benchmark numbers and understanding where allocations occur. The "push" vocabulary describes the logical protocol for building GraphStage stages; the "pull" implementation is what makes linear chains allocation-free in the hot path.

The thread pool

Runtime (alias: Materializer) owns a small thread pool (StreamExecutor):

  • Workers are spawned on demand when a stream is materialized.
  • Idle workers are parked (not busy-spinning) and reaped after a 10-second timeout.
  • A spawn failure falls back to running the job inline on the calling thread.
  • The pool also owns the timer wheel (schedule_once, schedule_with_fixed_delay, schedule_at_fixed_rate).

For small, synchronous, bounded chains (e.g., a folded range), the executor may elect to run the chain inline rather than spawning a thread. .wait() handles both cases correctly.

Fused linear execution

A chain of synchronous operators is compiled into a single fused iterator. No per-element boxing, no inter-stage queues, no synchronization overhead. The chain runs entirely on one worker thread until the source is exhausted.

This is why simple operator chains benchmark at 5–8× warmed Akka: the Akka execution model processes elements through actor mailboxes with scheduling overhead per batch; Datum's fused path is a tight Rust iterator loop.

The graph executor: two tiers

The GraphDSL layer has two execution tiers:

Typed-linear fast path

When a graph consists only of a single linear chain (source → operators → sink) with no junctions, the executor detects this at materialization time and runs it through the same fused-iterator path as the linear DSL. This is monomorphized — no Box<dyn DatumElement>, no type erasure. Benchmark results: 16–46× warmed Akka on sync chains.

Erased executor (fallback)

Graphs that include junctions (Broadcast, Merge, Balance, Zip, etc.) or other structures the typed path cannot specialize fall back to the erased executor. Each element is boxed as Box<dyn DatumElement> — one allocation per element through the graph. At current benchmarks, the erased path runs at roughly 0.5–0.7× Akka throughput.

Widening the typed-linear fast path to cover more graph shapes is the primary open optimization tracked in roadmap/M1-v0.1.0-foundation.md.

Spin-then-park (actor ask path)

The ActorFlow::ask path (actor.rs) and the stream poller (BlockingPoller in stream.rs) use a spin-then-park strategy:

  1. Spin briefly (a small fixed number of iterations) to catch fast reply completions without going to sleep.
  2. If no completion arrives, park_timeout and wait to be woken by the reply gate or a FuturesUnordered completion.

The spin budget is load-bearing: it was tuned against benchmarks to hit sub-microsecond latency on the ordered_sum actor-ask path. The ordered_sum path spends approximately 2× CPU vs. wall-clock as a consequence — it wins wall-clock by spinning where Akka parks. This is a real cost. The benchmark tables report the CPU column explicitly so it is visible, not hidden.

Forcing earlier parking was measured at ~18× worse wall-clock on this path. The current tuning constants (STREAM_READY_SPINS, ASK_IDLE_YIELDS, ASK_MAX_PARK) represent the benchmark-optimal tradeoff for the actor ask use-case. Do not adjust them without before/after numbers on both wall-clock and CPU columns.

Tokio and async operators

Tokio is a hard dependency (Ractor runs on it). The pure-synchronous fused path uses no async runtime — it runs entirely on Datum's own thread pool. Async operators (map_async, actor interop, IO) dispatch onto the Tokio runtime. Standard-library future support is an optional portability surface.

Allocation profile

  • Linear fused chain: zero per-element allocations in the hot path.
  • Actor ask path: one heap allocation per delivered message from Ractor (~832 bytes/element), dominated by Ractor's internal process_message boxing. This is pinned upstream and cannot be eliminated through the public ActorRef<Msg> API without an upstream Ractor change.
  • Erased graph executor: one Box<dyn DatumElement> allocation per element through the graph.

Summary

PathMechanismTypical relative perf vs Akka
Linear fused (sync)Rust iterator loop, no boxing5–8× faster
Graph typed-linearMonomorphized, no type erasure16–46× faster
Graph erased executorBox<dyn DatumElement> per element0.5–0.7× (slower)
Actor ask (ordered_sum)Spin-then-park, Ractor boxing~parity at p1 (noisy), ~2–3× at p2–p4, ~parity at p16, ~2× CPU — see actor-ask.md

All numbers are against warmed Akka/Pekko under fair JMH warmup. See roadmap/benchmarks/ for the full per-scenario tables including the CPU column.