## The production problem
Most AI orchestration demos fail in the same place: long-running partial failures. A tool call succeeds, the process crashes, and nobody knows which side effects already happened.
If your system cannot resume safely from mid-run checkpoints, or cannot reconstruct which policy and approval states applied at each step, incident response becomes guesswork.
## What top-ranking sources cover vs miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Microsoft AI agent orchestration patterns | Excellent pattern taxonomy for sequential, concurrent, handoff, and chat-style multi-agent designs. | No concrete contract for policy snapshots, approval binding, and decision audit APIs in runtime control planes. |
| Temporal durable execution docs | Clear durability framing for long-running workflows and crash recovery through persisted workflow state. | General orchestration guidance without built-in AI governance semantics like policy rule lineage. |
| Seven hosting patterns for AI agents | Practical hosting tradeoffs across cron, event-driven, daemon, API, workflow, and multi-agent mesh patterns. | Limited depth on approval queue architecture and compliance-grade evidence joins. |
## Execution model that works
Reliable orchestration is a layered contract. Each layer answers one operational question clearly.
| Layer | Guarantee | Operational surface |
|---|---|---|
| DAG dependency graph | Explicit execution ordering with parallel fan-out where safe | Workflow definition + run timeline |
| Policy gate | Submit and dispatch decisions before worker side effects | Policy evaluate and job decisions APIs |
| Approval pause | Human checkpoint with bound snapshot and job hash | Approvals queue APIs |
| Retry contract | Automatic retries and backoff without losing run context | Workflow engine run state + rerun API |
| Run timeline | Append-only event chronology for audits and incident forensics | GET /api/v1/workflow-runs/{id}/timeline |
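As a minimal sketch of what the DAG layer's guarantee means in practice, the batching below groups steps whose dependencies are all satisfied, so each batch can fan out in parallel while ordering stays explicit. This is illustrative Python, not Cordum's engine; the step names come from the incident workflow in the next section.

```python
def dispatch_batches(steps: dict[str, list[str]]) -> list[list[str]]:
    """Group DAG steps into batches: every step in a batch has all its
    dependencies already completed, so batch members can run in parallel."""
    remaining = dict(steps)   # step -> unmet dependencies
    done: set[str] = set()
    batches: list[list[str]] = []
    while remaining:
        ready = sorted(s for s, deps in remaining.items()
                       if all(d in done for d in deps))
        if not ready:         # nothing can make progress: dependency cycle
            raise ValueError(f"dependency cycle among {sorted(remaining)}")
        batches.append(ready)
        done.update(ready)
        for s in ready:
            del remaining[s]
    return batches

dag = {
    "detect": [],
    "classify": ["detect"],
    "patch_plan": ["classify"],
    "approval_gate": ["patch_plan"],
    "apply_patch": ["approval_gate"],
    "notify": ["apply_patch"],
}
print(dispatch_batches(dag))
# [['detect'], ['classify'], ['patch_plan'], ['approval_gate'], ['apply_patch'], ['notify']]
```

A linear chain yields single-step batches; a workflow with independent branches would produce multi-step batches, which is where safe fan-out pays off.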
## Workflow design patterns
Start with explicit DAG dependencies and add governance nodes where risk rises. This keeps happy-path latency reasonable while protecting high-impact actions.
```yaml
id: incident-remediation
name: Incident remediation workflow
version: 1.0.0
steps:
  detect:
    type: job
    topic: job.incident.detect
  classify:
    type: job
    topic: job.incident.classify
    depends_on: [detect]
  patch_plan:
    type: job
    topic: job.incident.patch.plan
    depends_on: [classify]
  approval_gate:
    type: approval
    depends_on: [patch_plan]
  apply_patch:
    type: job
    topic: job.incident.patch.apply
    depends_on: [approval_gate]
    retries:
      max_attempts: 2
      backoff_ms: 5000
  notify:
    type: notify
    depends_on: [apply_patch]
```

In Cordum, approval steps pause run progress, and denied outcomes are first-class terminal statuses. That distinction matters because policy denial and runtime failure require different response paths.
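A small routing sketch makes the denial-vs-failure distinction concrete. The status names and response strings here are illustrative assumptions, not Cordum's exact enum or API.

```python
def next_action(run_status: str) -> str:
    """Map a run state to an operator response path.
    Status names are illustrative, not a documented Cordum enum."""
    routes = {
        "pending_approval": "notify approvers; run stays paused at the gate",
        "denied": "policy terminal: open a policy review, do not blindly rerun",
        "failed": "runtime fault: inspect the timeline, then rerun from the failed step",
        "succeeded": "terminal: archive the timeline for audit",
    }
    return routes.get(run_status, "unknown status: escalate to the run owner")

print(next_action("denied"))
print(next_action("failed"))
```

The key point: a denied run routes to governance review, while a failed run routes to the rerun API; collapsing both into generic "failure" hides which loop you are in.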
## Run operations and recovery
Most platform teams only test success flows. Production reliability depends on rehearsed recovery paths.
```bash
# Start run with idempotency key
curl -sS -X POST http://localhost:8081/api/v1/workflows/WF_ID/runs \
  -H 'X-API-Key: YOUR_API_KEY' \
  -H 'X-Tenant-ID: default' \
  -H 'Idempotency-Key: run-incident-001' \
  -H 'Content-Type: application/json' \
  -d '{"input":{"incident_id":"INC-7781"}}'

# Inspect run status
curl -sS http://localhost:8081/api/v1/workflow-runs/RUN_ID

# Read append-only timeline
curl -sS "http://localhost:8081/api/v1/workflow-runs/RUN_ID/timeline?limit=200"

# If needed, rerun from a failed step
curl -sS -X POST http://localhost:8081/api/v1/workflow-runs/RUN_ID/rerun \
  -H 'X-API-Key: YOUR_API_KEY' \
  -H 'X-Tenant-ID: default' \
  -H 'Content-Type: application/json' \
  -d '{"from_step":"apply_patch","dry_run":false}'
```

| Control | Default | Why it exists |
|---|---|---|
| Step job id format | run_id:step_id@attempt | Makes retries and timeline events unambiguous |
| Run timeline store | wf:run:timeline:<run_id> | Supports deterministic post-mortem reconstruction |
| Pending replayer | enabled | Retries stale PENDING jobs past dispatch timeout |
| Denied status handling | first-class terminal status | Separates policy denial from generic failure |
| Reconciler | retries stuck runs | Prevents long-running workflows from silently stalling |
| Approval gate endpoint | GET /api/v1/approvals | Unified queue for workflow and policy approval decisions |
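The `run_id:step_id@attempt` job id format from the table is easy to parse deterministically, which is what makes retries and timeline events unambiguous. A minimal sketch (the example id value is hypothetical):

```python
def parse_step_job_id(job_id: str) -> tuple[str, str, int]:
    """Split a 'run_id:step_id@attempt' job id into its parts.
    rsplit/split with maxsplit keeps ids stable even if run_id
    itself ever contains extra ':' characters."""
    run_and_step, attempt = job_id.rsplit("@", 1)
    run_id, step_id = run_and_step.split(":", 1)
    return run_id, step_id, int(attempt)

print(parse_step_job_id("run-7781:apply_patch@2"))
# ('run-7781', 'apply_patch', 2)
```

With this shape, a timeline event for attempt 2 of `apply_patch` can never be confused with attempt 1, so post-mortems reconstruct exactly which side effects each attempt produced.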
## Limitations and tradeoffs
- DAG orchestration, retries, and approvals add moving parts that need active ownership.
- Checkpointing and governance gates increase latency compared to direct in-process loops.
- Poor dependency modeling can create deadlocks or expensive retry storms under load.