The production problem
Most AI agent orchestration articles explain patterns as diagrams.
That helps for design reviews. It does not keep incidents out of your on-call rotation.
In production, failures come from ordering and state guarantees: duplicate dispatch on retry, stale approvals, failed dependencies that leak into downstream steps, and fan-out that overwhelms workers.
If your orchestration layer cannot explain exactly what happened at each transition, your architecture is still a prototype.
What top AI agent orchestration articles miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Microsoft Learn: AI Agent Orchestration Patterns | Strong taxonomy: sequential, concurrent, handoff, group chat, and manager-led patterns. | No implementation-level guidance on pre-dispatch policy, run idempotency, and queue redelivery behavior. |
| LangChain docs: Multi-agent | Practical controller and handoff structures for tool and specialist agents. | Limited detail on deterministic approval, DLQ strategy, and state-recovery under partial failure. |
| OpenAI practical guide to building agents | Clear manager vs decentralized orchestration and a sane incremental adoption path. | Does not define control-plane contracts for dispatch ordering, dependency fences, and policy snapshot lineage. |
AI agent orchestration pattern map in Cordum
This is where pattern names meet real execution behavior.
| Pattern | Cordum model | Reliability control |
|---|---|---|
| Sequential DAG | `depends_on` + `scheduleReady()` dependency checks | Failed, denied, or timed-out dependencies stop downstream unless an `on_error` handler succeeds. |
| Parallel fan-out | `parallel` step type and `for_each` child expansion | `max_parallel` throttles active work. `for_each` expansion is hard-limited (default 1000 items) to cap blast radius. |
| Human-gated orchestration | `approval` step dispatches `sys.approval.gate` jobs | Workflow pauses in waiting state until approval result returns through the normal result path. |
| Nested orchestration | `subworkflow` step starts child runs | Child terminal statuses map back to parent with explicit error propagation and loop protection. |
| Specialist dispatch | Scheduler `PickSubject()` routes by topic, labels, requires, and load | Preferred worker hint is optional. Least-loaded routing and overload checks still apply. |
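The specialist-dispatch row can be read as a small selection rule: honor the preferred worker only when it is usable, otherwise fall back to the least-loaded one. The following is a minimal sketch of that rule, not Cordum's scheduler. The `Worker` type, its fields, and `pickWorker` are hypothetical, and topic and `requires` filtering are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// Worker is a hypothetical, simplified stand-in for a worker heartbeat.
type Worker struct {
	ID         string
	Healthy    bool
	ActiveJobs int
	MaxJobs    int
}

// overloaded reports whether the worker has no spare capacity.
func (w Worker) overloaded() bool { return w.ActiveJobs >= w.MaxJobs }

// pickWorker returns the preferred worker when it is healthy and not
// overloaded; otherwise it returns the least-loaded usable worker.
func pickWorker(workers []Worker, preferredID string) (string, error) {
	best := -1
	for i, w := range workers {
		if !w.Healthy || w.overloaded() {
			continue // health and overload checks apply to every candidate
		}
		if w.ID == preferredID {
			return w.ID, nil
		}
		if best == -1 || w.ActiveJobs < workers[best].ActiveJobs {
			best = i
		}
	}
	if best == -1 {
		return "", errors.New("no eligible worker")
	}
	return workers[best].ID, nil
}

func main() {
	workers := []Worker{
		{ID: "w1", Healthy: true, ActiveJobs: 4, MaxJobs: 4},
		{ID: "w2", Healthy: true, ActiveJobs: 1, MaxJobs: 4},
		{ID: "w3", Healthy: true, ActiveJobs: 3, MaxJobs: 4},
	}
	// w1 is preferred but full, so the hint is ignored.
	id, err := pickWorker(workers, "w1")
	fmt.Println(id, err) // prints "w2 <nil>"
}
```

Treating the hint as advisory keeps a stale or saturated worker from becoming a single point of failure for a workflow that names it.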
Concrete code paths
Step type model
// core/workflow/models.go (excerpt)
const (
	StepTypeWorker    StepType = "worker"
	StepTypeApproval  StepType = "approval"
	StepTypeCondition StepType = "condition"
	StepTypeDelay     StepType = "delay"
	StepTypeNotify    StepType = "notify"
)

const (
	StepTypeSwitch      StepType = "switch"
	StepTypeParallel    StepType = "parallel"
	StepTypeLoop        StepType = "loop"
	StepTypeTransform   StepType = "transform"
	StepTypeStorage     StepType = "storage"
	StepTypeSubWorkflow StepType = "subworkflow"
)

// generic-dispatch types
const (
	StepTypeLLM       StepType = "llm"
	StepTypeHTTP      StepType = "http"
	StepTypeContainer StepType = "container"
	StepTypeScript    StepType = "script"
)
Dependency gating logic
// core/workflow/engine_helpers.go (excerpt)
func depsSatisfied(step *Step, run *WorkflowRun, wfDef *Workflow) bool {
	if step == nil || len(step.DependsOn) == 0 {
		return true
	}
	for _, dep := range step.DependsOn {
		sr, ok := run.Steps[dep]
		if !ok || sr.Status == "" {
			return false
		}
		if sr.Status == StepStatusSucceeded {
			continue
		}
		if (sr.Status == StepStatusFailed || sr.Status == StepStatusDenied || sr.Status == StepStatusTimedOut) && wfDef != nil {
			depDef := wfDef.Steps[dep]
			if depDef != nil && depDef.OnError != "" {
				handlerSR := run.Steps[depDef.OnError]
				if handlerSR != nil && handlerSR.Status == StepStatusSucceeded {
					continue
				}
			}
		}
		return false
	}
	return true
}
Crash-safe dispatch ordering
// core/workflow/engine.go (excerpt)
// Persist state BEFORE dispatch for crash safety.
parentSR.Status = StepStatusRunning
parentSR.Attempts++
parentSR.JobID = jobID
run.Steps[stepID] = parentSR
if err := e.store.UpdateRun(ctx, run); err != nil {
	// revert and retry
}
packet := makeJobPacket(run.ID, req)
if err := e.bus.Publish(capsdk.SubjectSubmit, packet); err != nil {
	// revert to pending; idempotency key prevents duplicate execution
}
Scheduler worker selection behavior
// core/controlplane/scheduler/strategy_least_loaded.go (excerpt)
func (s *LeastLoadedStrategy) PickSubject(req *pb.JobRequest, workers map[string]*pb.Heartbeat) (string, error) {
	topicPools := routing.Topics[req.Topic]
	eligiblePools := filterEligiblePools(topicPools, req.GetMeta().GetRequires(), routing.Pools)
	if preferredWorker := labels["preferred_worker_id"]; preferredWorker != "" {
		// used only if healthy + eligible + not overloaded
	}
	// otherwise choose least-loaded matching worker
	if subject := bus.DirectSubject(selected.WorkerId); subject != "" {
		return subject, nil
	}
	return req.Topic, nil
}
Working workflow example
This workflow combines sequential, parallel, and approval orchestration. It is a practical baseline for incident response automation where rollback is sensitive.
id: incident_triage_parallel
name: Incident Triage With Governance
steps:
  classify:
    type: worker
    topic: job.sre.classify
    input:
      incident_id: "${input.incident_id}"
      summary: "${input.summary}"
  fanout_diagnostics:
    type: parallel
    depends_on: [classify]
    steps: [logs_scan, metric_scan]
    strategy: all
  logs_scan:
    type: worker
    topic: job.sre.logs
    retry:
      max_retries: 2
      initial_backoff_sec: 1
      max_backoff_sec: 8
      multiplier: 2
  metric_scan:
    type: worker
    topic: job.sre.metrics
  approval_gate:
    type: approval
    depends_on: [fanout_diagnostics]
    input:
      approval_reason: "Apply production rollback"
      next_effect: "Traffic shifts to last known good build"
  rollback:
    type: worker
    depends_on: [approval_gate]
    topic: job.sre.rollback
Validation runbook
Use this before broad rollout of any AI agent orchestration workflow:
# 1) Validate workflow engine behavior
cd D:/Cordum/cordum  # adjust to your local checkout of the Cordum repository
go test ./core/workflow -run TestDepsSatisfiedWithFailedDepAndOnError -count=1
go test ./core/workflow -run TestScheduleReady -count=1
# 2) Validate scheduler routing behavior
go test ./core/controlplane/scheduler -run TestLeastLoadedStrategy -count=1
# 3) Start a run with idempotency key
curl -sS -X POST "http://localhost:8081/api/v1/workflows/incident_triage_parallel/runs" \
-H "Content-Type: application/json" \
-H "X-Tenant-ID: default" \
-H "Idempotency-Key: triage-2026-04-01-001" \
-d '{"incident_id":"INC-4821","summary":"p95 latency up 4x"}'
# 4) Re-send same idempotency key and confirm same run_id is returned
curl -sS -X POST "http://localhost:8081/api/v1/workflows/incident_triage_parallel/runs" \
-H "Content-Type: application/json" \
-H "X-Tenant-ID: default" \
-H "Idempotency-Key: triage-2026-04-01-001" \
-d '{"incident_id":"INC-4821","summary":"p95 latency up 4x"}'
Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Single-agent loop only | Fast to ship for low-risk tasks. | Hard to isolate responsibility and recover cleanly during partial failure. |
| Multi-agent orchestration without control-plane guarantees | High flexibility and rapid experimentation. | Inconsistent behavior under retries, redelivery, and approval lag. |
| Pattern + governance coupling (Cordum model) | Predictable outcomes with auditable transitions and controlled blast radius. | Higher design discipline and more upfront test coverage required. |
- More orchestration power means more responsibility for runbook quality.
- Parallelism reduces latency but increases failure-surface area and observability load.
- Approval gates reduce risk but can become throughput bottlenecks if policy scope is too broad.
FAQ
What is the most important AI agent orchestration pattern for production?
Start with explicit DAG dependencies plus approval gates. This gives you a predictable baseline before adding complex parallel or handoff logic.
Why is idempotency critical in AI agent orchestration?
Without idempotency, retries and duplicate submits can create duplicate side effects. Cordum supports run idempotency through the `Idempotency-Key` header.
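To make the duplicate-submit guarantee concrete, here is a minimal in-memory sketch of idempotent run creation. It is an illustration, not Cordum's implementation. `RunStore` and `CreateRun` are hypothetical names, and scoping the key by tenant is an assumption.

```go
package main

import (
	"fmt"
	"sync"
)

// RunStore is a hypothetical in-memory store that maps
// (tenant, idempotency key) pairs to run IDs.
type RunStore struct {
	mu     sync.Mutex
	byKey  map[string]string
	nextID int
}

func NewRunStore() *RunStore { return &RunStore{byKey: make(map[string]string)} }

// CreateRun returns the existing run ID when the same tenant repeats a key,
// with created=false. Otherwise it allocates a new run ID.
func (s *RunStore) CreateRun(tenant, key string) (runID string, created bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	k := tenant + "\x00" + key // assumption: keys are scoped per tenant
	if id, ok := s.byKey[k]; ok {
		return id, false
	}
	s.nextID++
	id := fmt.Sprintf("run-%d", s.nextID)
	s.byKey[k] = id
	return id, true
}

func main() {
	store := NewRunStore()
	first, created1 := store.CreateRun("default", "triage-2026-04-01-001")
	second, created2 := store.CreateRun("default", "triage-2026-04-01-001")
	fmt.Println(first == second, created1, created2) // prints "true true false"
}
```

The lookup and the insert happen under one lock, so two concurrent submits with the same key cannot both create a run. A durable store would need the equivalent atomic check-and-set.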
Can I run parallel orchestration safely?
Yes, if you cap fan-out and enforce dependency and approval boundaries. Cordum throttles active work with `max_parallel`, limits `for_each` expansion (default 1000 items), and tracks lifecycle per step.
How does governance differ from orchestration?
Orchestration decides execution order. Governance decides whether execution is allowed, delayed for approval, or denied under policy.
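The split can be sketched as two functions: one that evaluates policy and one that acts on the result when a step is ready to run. This is a minimal illustration only. `Decision`, `Step`, the `RiskTags` field, and the tag names are hypothetical and do not describe Cordum's policy model.

```go
package main

import "fmt"

// Decision is the hypothetical outcome of a pre-dispatch policy check.
type Decision int

const (
	Allow Decision = iota
	RequireApproval
	Deny
)

// Step is a simplified step descriptor. RiskTags is an assumed field.
type Step struct {
	ID       string
	Topic    string
	RiskTags []string
}

// evaluate is the governance side: it decides whether a step may run.
func evaluate(s Step) Decision {
	d := Allow
	for _, tag := range s.RiskTags {
		switch tag {
		case "forbidden":
			return Deny // the most restrictive outcome wins immediately
		case "prod-write":
			d = RequireApproval
		}
	}
	return d
}

// dispatch is the orchestration side: it is reached only for steps whose
// dependencies are ready, and it acts on the governance decision.
func dispatch(s Step) string {
	switch evaluate(s) {
	case Deny:
		return "denied"
	case RequireApproval:
		return "waiting_approval"
	default:
		return "dispatched"
	}
}

func main() {
	fmt.Println(dispatch(Step{ID: "classify", Topic: "job.sre.classify"}))
	fmt.Println(dispatch(Step{ID: "rollback", Topic: "job.sre.rollback", RiskTags: []string{"prod-write"}}))
}
```

Keeping `evaluate` free of ordering logic means the same policy can be applied to any workflow shape, and the orchestrator never has to encode what is allowed.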
Next step
Do this next sprint:
1. Pick one production workflow and map each step to a control-plane guarantee.
2. Add idempotency keys on run creation and verify duplicate-submit behavior.
3. Add at least one approval gate for a high-impact action and measure queue latency.
4. Run fault-injection tests for safety-kernel unavailability and stale worker routing.
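The fault-injection approach in step 4 also applies to the message bus. The sketch below injects a publish failure and checks that a step persisted as running is reverted to pending, mirroring the persist-before-dispatch ordering shown earlier. The `Store` and `Bus` interfaces, `memStore`, `failingBus`, and `dispatchStep` are hypothetical stand-ins, not Cordum types.

```go
package main

import (
	"errors"
	"fmt"
)

type StepStatus string

const (
	StatusPending StepStatus = "pending"
	StatusRunning StepStatus = "running"
)

// Store and Bus are hypothetical interfaces, narrowed to what the sketch needs.
type Store interface {
	Save(stepID string, st StepStatus) error
}
type Bus interface {
	Publish(jobID string) error
}

// memStore keeps step status in memory.
type memStore struct{ steps map[string]StepStatus }

func (m *memStore) Save(id string, st StepStatus) error { m.steps[id] = st; return nil }

// failingBus injects a publish failure on every call.
type failingBus struct{ calls int }

func (b *failingBus) Publish(string) error {
	b.calls++
	return errors.New("bus unavailable")
}

// dispatchStep persists the running state first, then publishes. If the
// publish fails, it reverts the step to pending so a later pass can retry.
func dispatchStep(s Store, b Bus, stepID, jobID string) error {
	if err := s.Save(stepID, StatusRunning); err != nil {
		return fmt.Errorf("persist before dispatch: %w", err)
	}
	if err := b.Publish(jobID); err != nil {
		if rerr := s.Save(stepID, StatusPending); rerr != nil {
			return fmt.Errorf("publish: %v; revert: %w", err, rerr)
		}
		return fmt.Errorf("publish: %w", err)
	}
	return nil
}

func main() {
	store := &memStore{steps: map[string]StepStatus{"rollback": StatusPending}}
	bus := &failingBus{}
	err := dispatchStep(store, bus, "rollback", "job-1")
	fmt.Println(err != nil, store.steps["rollback"], bus.calls) // prints "true pending 1"
}
```

The assertion that matters is the final status: after a failed publish the step must be schedulable again, with no record claiming it is still running.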
Continue with Building Custom Safety Policies for AI Agents and AI Agent Production Deployment Checklist.