The failure pattern
The common failure is not a bad model prediction. It is an unsafe delegation from agent A to agent B where the governance boundary vanishes.
Local guardrails help. They do not solve cross-agent drift. Two agents can each pass local checks and still produce an unsafe side effect together.
Production systems need one runtime control layer that evaluates every action intent, enforces policy at the same decision point, and records an evidence trail you can audit later.
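As a sketch of that control layer (all type and function names here are illustrative, not taken from any cited system), every action intent flows through one evaluation function that returns both a decision and an auditable evidence record:

```go
package main

import "fmt"

// ActionIntent is a hypothetical envelope for what an agent wants to do.
type ActionIntent struct {
	Agent  string            // requesting agent
	Topic  string            // e.g. "infra.delete"
	Labels map[string]string // e.g. environment=production
}

type Decision string

const (
	Allow        Decision = "allow"
	Deny         Decision = "deny"
	RequireHuman Decision = "require_human"
)

// Evidence is the auditable record written for every evaluation,
// so post-incident reconstruction does not depend on agent logs.
type Evidence struct {
	Intent   ActionIntent
	Decision Decision
	RuleID   string
}

// Evaluate applies one shared policy to every intent, regardless of which
// agent submitted it. Illustrative rule: production deletes need a human.
func Evaluate(in ActionIntent) Evidence {
	if in.Topic == "infra.delete" && in.Labels["environment"] == "production" {
		return Evidence{Intent: in, Decision: RequireHuman, RuleID: "prod-delete-needs-approval"}
	}
	return Evidence{Intent: in, Decision: Allow, RuleID: "default-allow"}
}

func main() {
	ev := Evaluate(ActionIntent{
		Agent:  "ops-agent",
		Topic:  "infra.delete",
		Labels: map[string]string{"environment": "production"},
	})
	fmt.Println(ev.Decision, ev.RuleID)
}
```

The point of the single function is that agent A and agent B cannot diverge: the same snapshot of rules evaluates both sides of a delegation.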
What top articles cover
| Source | Strong coverage | Missing piece |
|---|---|---|
| IBM: What is a Multi-Agent System? | Good architecture breakdown (centralized vs decentralized, hierarchies, coordination complexity). | No concrete pre-dispatch policy flow, approval-state behavior, or retry/idempotency safeguards. |
| Architecture & Governance: Enterprise Blueprint | Strong framing for registry, interaction governance, observability, and resilience controls. | No implementation details on fail-open vs fail-closed behavior, breaker thresholds, or dispatch semantics. |
| IMDA Model Governance Framework for Agentic AI | Clear guidance for significant human checkpoints, traceability, delegated authority records, and monitoring. | Policy guidance is strong, but it does not map directly to control-plane code paths and operational defaults. |
The missing runtime layer
The gap is usually implementation detail. Teams know they need governance, but they rarely define the exact checkpoints and defaults that decide what happens during failure.
| Layer | What must happen |
|---|---|
| Submit-time gate (gateway) | Policy is evaluated before state persistence and before bus publish. `deny` returns 403; `throttle` returns 429; `require_human` creates approval state with no dispatch. |
| Dispatch-time gate (scheduler) | Policy is evaluated again before dispatch. This catches drift between submit and execution windows. |
| Approval replay guard | Approved jobs re-enter the queue with explicit `approval_granted=true` labeling and job-hash verification. |
| Execution evidence | Run timeline and safety decision records make post-incident reconstruction possible without log archaeology. |
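The submit-time row can be sketched as a status mapping (the handler shape is an assumption, but the status codes come from the table above). The key property: nothing is persisted and nothing reaches the bus until the verdict is known.

```go
package main

import (
	"fmt"
	"net/http"
)

// gateDecision translates a policy verdict into an HTTP status *before*
// any state is persisted or any message is published to the bus.
func gateDecision(decision string) (status int, dispatch bool) {
	switch decision {
	case "deny":
		return http.StatusForbidden, false // 403, nothing persisted
	case "throttle":
		return http.StatusTooManyRequests, false // 429, caller backs off
	case "require_human":
		// Accepted into approval state, but explicitly NOT dispatched.
		return http.StatusAccepted, false
	default: // "allow"
		return http.StatusOK, true
	}
}

func main() {
	for _, d := range []string{"deny", "throttle", "require_human", "allow"} {
		status, dispatch := gateDecision(d)
		fmt.Printf("%s -> %d dispatch=%v\n", d, status, dispatch)
	}
}
```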
Reference architecture
1. Agent emits an action intent to the control plane.
2. Gateway evaluates policy before persisting state or publishing to the bus.
3. High-risk decisions move to approval state instead of dispatch.
4. Scheduler re-checks policy before selecting a worker pool and dispatching.
5. Worker executes and returns a result pointer; scheduler writes terminal state and routes to the DLQ if needed.
6. Timeline links intent, decision, approver, and result for incident replay.
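Steps 3 and 4 can be sketched as a dispatch guard (a hypothetical shape, not the actual scheduler API): policy is re-evaluated at dispatch time, and a previously approved job is only dispatched if both the approval label and the job hash recorded at approval time still match.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Job is an illustrative queued unit of work.
type Job struct {
	Topic   string
	Payload string
	Labels  map[string]string
	Hash    string // hash recorded when approval was granted
}

func jobHash(j Job) string {
	sum := sha256.Sum256([]byte(j.Topic + "|" + j.Payload))
	return hex.EncodeToString(sum[:])
}

// canDispatch re-checks policy at dispatch time. A job that needed approval
// is only dispatched if approval_granted=true AND the payload still matches
// the hash captured at approval time, so a mutated job cannot ride an old
// approval through the queue.
func canDispatch(j Job, policyDecision string) bool {
	switch policyDecision {
	case "deny":
		return false
	case "require_human":
		return j.Labels["approval_granted"] == "true" && j.Hash == jobHash(j)
	default:
		return true
	}
}

func main() {
	j := Job{Topic: "infra.delete", Payload: "vm-42",
		Labels: map[string]string{"approval_granted": "true"}}
	j.Hash = jobHash(j)
	fmt.Println(canDispatch(j, "require_human")) // approved and unmodified

	j.Payload = "vm-43" // payload drifted after approval was granted
	fmt.Println(canDispatch(j, "require_human"))
}
```

The second check is what catches drift between the submit and execution windows: approval attaches to a specific job content, not to a job ID.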
| Control | Current value | Why it matters |
|---|---|---|
| Safety check timeout (scheduler client) | 2s | Bounds policy-check latency on the hot path before worker dispatch. |
| Circuit breaker open threshold | 3 failures | Trips quickly when the safety dependency is unhealthy. |
| Circuit breaker open duration | 30s | Prevents request storms while safety recovers. |
| Dispatch retry budget | 50 attempts | Caps retry storms; with 1s-30s backoff this is roughly 25 minutes max retry window. |
| Fail mode when safety is unavailable | `POLICY_CHECK_FAIL_MODE=open|closed` | Forces an explicit availability-vs-safety decision instead of accidental behavior. |
Failure matrix
| Failure mode | Decentralized controls | Centralized controls |
|---|---|---|
| Planner delegates delete action to ops agent | Ops agent local policy drift can permit an action planner should never approve. | Both submit and dispatch checkpoints evaluate the same policy snapshot and risk rules. |
| Network flap during approval publish | Duplicate retries can trigger duplicate side effects across pools. | Idempotency keys plus approval-gated requeue keep replay deterministic. |
| Safety kernel outage | Different agents pick different fallback behavior, usually undocumented. | One explicit fail mode (`open` or `closed`) and one circuit breaker policy. |
| Incident investigation after cross-agent cascade | Correlating logs across agents and tools is slow and often incomplete. | Run timeline links action, decision, approval, and result in one chain. |
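The idempotency behavior in the second row can be sketched as a seen-set keyed by job ID. An in-memory map stands in for a durable store such as Redis here (an assumption about the wiring, not the actual implementation):

```go
package main

import "fmt"

// executeOnce suppresses duplicate side effects when a network flap causes
// the same approved job to be republished. In production the seen-set would
// live in a durable store shared across worker pools.
func executeOnce(seen map[string]bool, idempotencyKey string, effect func()) bool {
	if seen[idempotencyKey] {
		return false // duplicate delivery: replay is a no-op
	}
	seen[idempotencyKey] = true
	effect()
	return true
}

func main() {
	seen := map[string]bool{}
	key := "run_42:delete_prod_vm@1" // the job_id doubles as the idempotency key
	ran := 0
	for i := 0; i < 3; i++ { // simulate duplicate retries after a flap
		executeOnce(seen, key, func() { ran++ })
	}
	fmt.Println("side effects executed:", ran)
}
```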
Code: policy + dispatch guard
1) Policy rule for high-risk action
```yaml
# safety.yaml
version: v1
rules:
  - id: prod-delete-needs-approval
    when:
      topic: infra.delete
      labels:
        environment: production
    decision: require_human
  - id: deny-customer-notify-without-scope
    when:
      topic: customer.notify
      labels:
        recipient_scope: unverified
    decision: deny
```
2) Scheduler-side guard wiring
```go
// scheduler bootstrap (simplified)
safetyClient, err := scheduler.NewSafetyClient(os.Getenv("SAFETY_KERNEL_ADDR"))
if err != nil {
	return err
}
safetyClient = safetyClient.WithRedis(redisClient)
engine := scheduler.NewEngine(bus, safetyClient, registry, strategy, jobStore, metrics).
	WithInputFailMode(os.Getenv("POLICY_CHECK_FAIL_MODE")) // open | closed

// Current runtime defaults in code:
// - safety timeout: 2s
// - breaker: 3 failures -> open for 30s
// - max scheduling retries: 50
```
3) Evidence record needed for incident replay
```json
{
  "job_id": "run_42:delete_prod_vm@1",
  "topic": "infra.delete",
  "policy_snapshot": "sha256:8f6f...",
  "decision": "REQUIRE_HUMAN",
  "rule_id": "prod-delete-needs-approval",
  "approval_required": true,
  "labels": {
    "approval_granted": "true",
    "environment": "production"
  },
  "run_timeline_event": "step_dispatched"
}
```
Limitations and tradeoffs
- Centralized control creates another critical dependency. You must design for high availability.
- Extra policy checks add latency. Keep rules simple on the hot path and test p99 regularly.
- Approval volume can explode if risk tiers are coarse. Calibrate thresholds with real incident data.
- Fail-open mode improves availability but weakens safety guarantees. Use it intentionally, not by accident.
- Local guardrails still matter. Centralized control is a coordinator, not a replacement for worker hygiene.
Next step
Run this rollout in 14 days:
1. Pick one risky topic family (for example `infra.*`) and force policy-before-dispatch there first.
2. Set `POLICY_CHECK_FAIL_MODE=closed` in production and document why.
3. Require approval for one irreversible action and measure queue time + false positive rate.
4. Simulate one safety dependency outage and verify breaker behavior and operator runbook.
5. Run one replay drill from run timeline data and confirm incident reconstruction time is under 30 minutes.
Continue with Multi-Agent Orchestration Needs a Control Plane and AI Agent Incident Report.