{
  "id": "HRN-010",
  "slug": "orchestration",
  "title": "Orchestration",
  "category": "Orchestration",
  "status": "Draft",
  "summary": "Orchestration is the harness layer that drives execution: it covers single- versus multi-agent topologies, supervisor/worker delegation, routing, state machines, and durable workflows, and turns a plan into reliable, resumable action.",
  "updated": "Sun Jun 21 2026 00:00:00 GMT+0000 (Coordinated Universal Time)",
  "url": "https://santismm.com/en/handbook/orchestration",
  "evidence": {
    "evidenceLevel": "theoretical",
    "confidenceLevel": "medium",
    "sourceType": [
      "industry_observation",
      "personal_experience"
    ]
  },
  "related": [
    "HRN-003",
    "HRN-009",
    "PAT-002",
    "PAT-005"
  ],
  "tags": [
    "orchestration",
    "multi-agent",
    "supervisor-worker",
    "state-machine",
    "durable-execution"
  ],
  "headings": [
    "Executive Summary",
    "Key Concepts",
    "Definition",
    "Architecture Diagram",
    "Detailed Explanation",
    "Production Evidence",
    "Observed Failure Modes",
    "KPIs",
    "Cost Metrics",
    "Scaling Characteristics",
    "Related Content",
    "References",
    "FAQs"
  ],
  "markdown": "# Orchestration\n\n## Executive Summary\n\nOrchestration is the engine room of the harness: the layer that decides *who acts, in what order, and what happens when a step fails*. It spans the spectrum from a single agent running a loop to fleets of specialized agents coordinated by a supervisor. This chapter frames orchestration as the bridge between a represented plan (HRN-009) and reliable execution, and argues that the central engineering problem is not intelligence but **durability**: long-running, non-deterministic, partially-failing workflows must survive crashes, resume cleanly, and never silently lose or duplicate effects. The right default is the simplest topology that meets the requirement — complexity in orchestration is a cost, not a virtue.\n\n## Key Concepts\n\n- **Topology:** the arrangement of agents — single, pipeline, supervisor/worker, or network.\n- **Supervisor / orchestrator agent:** an agent that plans and delegates to workers (see PAT-002).\n- **Worker agent:** a specialized agent that executes a delegated sub-task (see PAT-005).\n- **Routing:** selecting the next agent, tool, or branch based on state.\n- **State machine:** an explicit graph of states and transitions governing execution.\n- **Durable execution:** workflow semantics where progress is checkpointed and resumable.\n- **Handoff:** transferring control and context from one agent to another.\n\n## Definition\n\n> **Orchestration** is the harness discipline of executing a plan across one or more agents and tools — selecting topology, routing control, coordinating state, and guaranteeing durable, exactly-the-right-number-of-times execution under failure.\n\n## Architecture Diagram\n\n```mermaid\nflowchart TD\n    subgraph Durable Workflow Engine\n      SUP[Supervisor Agent] -->|delegate| R{Router}\n      R -->|task A| W1[Worker: Retrieval]\n      R -->|task B| W2[Worker: Code/Tool]\n      R -->|task C| W3[Worker: Drafting]\n      W1 --> AGG[Aggregator / Reducer]\n      
W2 --> AGG\n      W3 --> AGG\n      AGG --> SUP\n    end\n    SUP -->|checkpoint| ST[(Durable State Store)]\n    ST -->|resume after crash| SUP\n    SUP --> OUT[Verified Result]\n```\n\n## Detailed Explanation\n\n**Topology selection** is the first and most consequential decision. A *single agent* with tools is the correct default for most tasks: it is cheapest, easiest to observe, and has the fewest coordination failure modes. Reach for multi-agent only when the task genuinely benefits — when sub-tasks need *different* tool permissions, *different* context windows, or *parallel* independent execution. The common topologies are: **pipeline** (fixed sequence of stages), **supervisor/worker** (PAT-002 + PAT-005: a planner delegates to specialists and aggregates), and **network/peer** (agents hand off freely). Coordination cost rises sharply with topology freedom; peer networks are powerful but hardest to make reliable, govern, and debug.\n\n**Routing** is how control moves through the system. Routing can be *model-driven* (the supervisor chooses the next worker via tool-calling), *rule-driven* (deterministic transitions in a state machine), or *hybrid*. Deterministic routing is preferred wherever the path is known, because it is governable and testable; model-driven routing is reserved for genuinely open-ended branching. Encoding the workflow as an explicit **state machine** — states, allowed transitions, and guards — is the single highest-leverage reliability technique in orchestration: it bounds the space of behaviors, makes the system inspectable, and lets governance (HRN-008) attach controls to transitions.\n\n**Durability** is the property that separates a demo from a production system. Agentic workflows are long-running (seconds to hours), call flaky external tools, and may crash mid-flight. A durable execution engine checkpoints progress after each step so that on failure the workflow *resumes* from the last completed step rather than restarting. 
This demands careful effect semantics: tool calls with side effects must be **idempotent** or guarded by dedup keys so a resume does not double-charge a card or re-send an email. The hard cases are the *non-idempotent external effects*; the harness handles them with the saga pattern — record intent, execute, confirm, and provide compensating actions for partial failure.\n\n**State and context management** across agents is where multi-agent systems leak reliability. Each handoff (PAT-005) must transfer *exactly* the context the worker needs — too little and it fails, too much and it is expensive and prone to distraction. Shared state belongs in a durable store with clear ownership, not in a free-floating shared context window. Aggregation of worker outputs needs an explicit reducer with conflict resolution, because parallel workers will produce overlapping or contradictory results.\n\nFinally, orchestration owns **concurrency and failure isolation**. Parallel branches (exposed by the DAG plan from HRN-009) improve latency but require backpressure, rate-limit coordination across shared tools, and bulkheading so one failing worker cannot exhaust the budget or block siblings. Timeouts, circuit breakers, and per-worker budgets are orchestration concerns, not application concerns.\n\n## Production Evidence\n\n> **Illustrative / representative scenario.** Evidence level: theoretical · Confidence: medium · Source: industry_observation, personal_experience. 
The results below are representative, not measurements from one verified deployment.\n\n- **Context:** A research-and-synthesis agent answering complex enterprise questions.\n- **Scenario:** A supervisor decomposes a question, dispatches parallel retrieval/analysis workers, and aggregates a cited answer.\n- **Technology:** Durable workflow engine, supervisor/worker topology, deterministic router for known stages, dedup keys on side-effecting tools.\n- **Load:** Concurrent multi-worker runs; each run lasts minutes and makes several external tool calls.\n- **Results (representative):** Parallel fan-out cuts wall-clock latency from roughly the sum of the worker durations toward the duration of the slowest worker (plus supervisor and aggregation overhead), while durable checkpointing lowers the failed-run rate by eliminating crash-induced full restarts. The cost is higher token spend (more agents, more context) and added coordination complexity.\n\n### Lessons Learned\n\nMost teams reach for multi-agent too early. The reliable progression is: make a single agent work, encode it as a state machine, add durability, *then* split into workers only where parallelism or permission isolation pays for the coordination cost.\n\n## Observed Failure Modes\n\n| Failure Mode | Trigger | Mitigation |\n|---|---|---|\n| Duplicate side effects | Resume re-runs a non-idempotent step | Idempotency keys / saga compensation |\n| Lost progress on crash | No checkpointing | Durable execution engine |\n| Context loss at handoff | Worker under-receives state | Explicit, typed handoff contracts |\n| Coordination deadlock | Workers wait on each other | Acyclic routing, timeouts, supervisor arbitration |\n| Cost explosion | Recursive/peer delegation unbounded | Per-run agent budget + delegation depth cap |\n| Conflicting aggregation | Parallel workers disagree | Explicit reducer with conflict resolution |\n| Shared-tool throttling | Workers hammer one rate-limited API | Centralized rate-limit + backpressure |\n\n## KPIs\n\n| Metric | Target | Notes 
|\n|---|---|---|\n| Task completion rate | High | End-to-end, verified |\n| Latency p50/p95/p99 | Minimized | Parallelism improves p50; tails dominated by slow workers |\n| Resume success rate | → 100% | Workflows that recover after a crash |\n| Duplicate-effect rate | → 0 | Idempotency correctness |\n| Cost per task | Bounded | Caps on agents/depth/tokens |\n| Throughput | Scales with concurrency | Limited by shared-tool rate limits |\n\n## Cost Metrics\n\n- **Token cost** grows with agent count and per-agent context; multi-agent is materially more expensive than single-agent for the same task.\n- **Orchestration overhead:** supervisor planning + aggregation inference per run.\n- **Durability overhead:** checkpoint writes (cheap) vs. the large savings from not restarting failed runs.\n\n## Scaling Characteristics\n\nSingle-agent throughput scales horizontally and statelessly. Supervisor/worker scales sub-tasks in parallel up to shared-tool rate limits, which become the true ceiling. Durable workflow engines scale with the number of in-flight workflows; checkpoint storage and the dispatcher are the components to size. Peer/network topologies scale worst — coordination overhead and failure surface grow super-linearly with agent count, which is why bounded supervisor topologies are the enterprise default.\n\n## Related Content\n\n- HRN-003 — Orchestration's place in the harness taxonomy.\n- HRN-009 — The plan that orchestration executes.\n- PAT-002 — Supervisor Agent pattern.\n- PAT-005 — Multi-Agent Delegation pattern.\n\n## References\n\n- Temporal / durable-execution workflow engines (Saga pattern, workflow durability).\n- Anthropic, \"Building Effective Agents\" (single-agent-first, topology guidance).\n- LangGraph and state-machine orchestration for agents.\n\n## FAQs\n\n**Q: Single agent or multi-agent?**\nA: Default to single agent. 
Add agents only for parallelism or permission/context isolation that pays back the coordination cost.\n\n**Q: Why a state machine instead of free-form agent loops?**\nA: State machines bound behavior, are testable, and let governance attach controls to transitions. Free-form loops are powerful but hard to make reliable or auditable.\n\n**Q: How do I avoid double-charging a customer on retry?**\nA: Make side-effecting tool calls idempotent (dedup keys) or wrap them in a saga with compensating actions, and run on a durable engine that resumes rather than restarts."
}