The production problem
Multi-agent stacks now have interoperability standards for tools and delegation. That still leaves one unresolved question: which protocol surface controls execution risk before a job is published?
Teams usually bolt this on late. Then an autonomous AI agent emits a valid write request with incomplete context, no approval binding, and no compensation metadata. The protocol call succeeds. Operations fail.
What top-ranking sources cover vs. miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| StackOne MCP vs A2A | Good architectural separation between tool integration and agent coordination, including failure mode discussion. | No typed governance envelope that binds policy decisions to approval and dispatch semantics. |
| WorkOS MCP vs A2A | Clear explanation of where MCP and A2A each stop in real systems. | Governance remains conceptual. No explicit wire-level decision contract or rollback primitive. |
| DigitalOcean A2A vs MCP | Layer comparison, pros/cons, and strong security framing for protocol boundaries. | No production blueprint for policy check outcomes, approval binding, and result-pointer auditability. |
The gap is consistent. Most articles stop at layer separation. Few explain what concrete capabilities a governance protocol must expose to be operationally useful.
Core CAP capabilities
CAP's value is not a single feature. It is the combination of typed transport, policy-enforced dispatch, pointer-based payload handling, and recovery-safe metadata.
| Capability | Why it matters | CAP surface |
|---|---|---|
| Typed envelope for all bus events | Prevents ad-hoc message drift between gateway, scheduler, and workers. | BusPacket with oneof payload types |
| Pre-dispatch policy outcomes | Blocks unsafe actions before they run, instead of only logging them after. | ALLOW, DENY, REQUIRE_APPROVAL, THROTTLE, ALLOW_WITH_CONSTRAINTS |
| Approval binding | Ensures human approvals map to the reviewed policy snapshot and job intent. | approval_required + approval_ref |
| Pointer-first payload handling | Keeps transport payloads small while preserving full context and result data. | context_ptr, result_ptr, artifact pointers |
| Checkpoint heartbeats | Provides progress visibility and better operator decisions during long-running jobs. | Heartbeat and JobProgress messages |
| Rollback metadata | Gives orchestrators enough data to run deterministic compensation paths. | Compensation templates tied to terminal failure semantics |
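The pointer-first row can be made concrete with a small sketch. It assumes a hypothetical `InMemoryStore` standing in for Redis or object storage, and a hypothetical `build_job_request` helper; neither is part of CAP. The ticket ID and context contents are made up for illustration.

```python
import json


class InMemoryStore:
    """Hypothetical stand-in for Redis or object storage, keyed by pointer."""

    def __init__(self) -> None:
        self._blobs: dict[str, str] = {}

    def put(self, pointer: str, payload: dict) -> str:
        self._blobs[pointer] = json.dumps(payload)
        return pointer

    def get(self, pointer: str) -> dict:
        return json.loads(self._blobs[pointer])


def build_job_request(store: InMemoryStore, job_id: str, topic: str, context: dict) -> dict:
    """Store the full context out of band and put only its pointer on the bus."""
    context_ptr = store.put(f"redis://ctx:{job_id}", context)
    return {"job_id": job_id, "topic": topic, "context_ptr": context_ptr}


store = InMemoryStore()
request = build_job_request(
    store,
    job_id="job-9f0f3a",
    topic="job.mcp-bridge.write.update_ticket",
    # Illustrative context only; a real job would carry its own data.
    context={"ticket": "OPS-1", "fields": {"status": "done"}, "history": ["..."] * 500},
)

# The bus message stays small no matter how large the stored context is.
wire_size = len(json.dumps(request))
context_size = len(json.dumps(store.get(request["context_ptr"])))
```

A worker that receives the request resolves `context_ptr` against the same store, so the transport never carries the full context.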
Wire contract details
CAP keeps one envelope (`BusPacket`) and several typed payloads. That decision alone removes a lot of schema drift between components.
```json
{
  "trace_id": "trace-ops-2026-04-01-001",
  "sender_id": "api-gateway-1",
  "created_at": "2026-04-01T11:32:10Z",
  "protocol_version": 1,
  "job_request": {
    "job_id": "job-9f0f3a",
    "topic": "job.mcp-bridge.write.update_ticket",
    "tenant_id": "default",
    "context_ptr": "redis://ctx:job-9f0f3a",
    "labels": {
      "mcp.server": "jira",
      "mcp.tool": "update_ticket",
      "mcp.action": "write"
    }
  }
}
```

| Message | Purpose | Common fields |
|---|---|---|
| JobRequest | Describe the unit of work and where to load context. | job_id, topic, context_ptr, labels, tenant_id |
| JobResult | Report terminal or in-flight state back to scheduler/workflow engine. | status, result_ptr, worker_id, error_code |
| Heartbeat | Report worker health and running pressure. | worker_id, cpu_load, active_jobs, pool |
| JobProgress | Expose partial progress for long operations. | percent, message, optional pointers |
| JobCancel | Signal cooperative cancellation for in-flight work. | job_id, reason, requested_by |
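One way to picture the single-envelope rule is a Python sketch that mimics a protobuf `oneof`: exactly one payload may be set per packet. Only `job_request` appears as a key in the example above; the other payload field names (`job_result`, `heartbeat`, `job_progress`, `job_cancel`) and the `payload_kind` method are assumptions made for illustration, not the actual CAP schema.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed field names; only "job_request" is confirmed by the example packet.
PAYLOAD_FIELDS = ("job_request", "job_result", "heartbeat", "job_progress", "job_cancel")


@dataclass
class BusPacket:
    trace_id: str
    sender_id: str
    created_at: str
    protocol_version: int
    job_request: Optional[dict] = None
    job_result: Optional[dict] = None
    heartbeat: Optional[dict] = None
    job_progress: Optional[dict] = None
    job_cancel: Optional[dict] = None

    def payload_kind(self) -> str:
        """Return the name of the single payload that is set, mimicking a oneof."""
        set_fields = [name for name in PAYLOAD_FIELDS if getattr(self, name) is not None]
        if len(set_fields) != 1:
            raise ValueError(f"expected exactly one payload, got {set_fields}")
        return set_fields[0]


packet = BusPacket(
    trace_id="trace-ops-2026-04-01-001",
    sender_id="api-gateway-1",
    created_at="2026-04-01T11:32:10Z",
    protocol_version=1,
    job_request={"job_id": "job-9f0f3a", "topic": "job.mcp-bridge.write.update_ticket"},
)
kind = packet.payload_kind()
```

A consumer can branch on `payload_kind()` and reject packets that carry zero or several payloads, which is the kind of drift the typed envelope is meant to prevent.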
Policy and approval flow
CAP decisions are useful because they are executable states, not advisory labels. For example, `REQUIRE_APPROVAL` means the job enters approval flow before dispatch.
```yaml
version: v1
rules:
  - id: allow-read-path
    match:
      labels:
        mcp.action: "read"
    decision: allow
    reason: "Read operations are allowed"
  - id: approval-prod-write
    match:
      labels:
        mcp.action: "write"
      risk_tags: ["prod"]
    decision: require_approval
    reason: "Production writes require human review"
  - id: deny-destructive
    match:
      labels:
        mcp.action: "delete"
    decision: deny
    reason: "Destructive actions blocked"
  - id: constrain-heavy-jobs
    match:
      topics: ["job.batch.*"]
    decision: allow_with_constraints
    constraints:
      max_runtime_sec: 120
      max_retries: 1
```

```bash
# Submit a write job that should require approval
curl -sS -X POST http://localhost:8081/api/v1/jobs -H "Content-Type: application/json" -d '{
  "topic":"job.mcp-bridge.write.update_ticket",
  "tenant_id":"default",
  "risk_tags":["prod"],
  "labels":{"mcp.action":"write"}
}'

# List pending approvals
curl -sS "http://localhost:8081/api/v1/approvals?include_resolved=false"

# Approve one job after review
curl -sS -X POST "http://localhost:8081/api/v1/approvals/<job_id>/approve" -H "Content-Type: application/json" -d '{"note":"approved in maintenance window"}'
```

```json
{
  "job_id": "job-9f0f3a",
  "status": "FAILED_FATAL",
  "error_code": "ERROR_CODE_JOB_TIMEOUT",
  "compensation": {
    "topic": "job.mcp-bridge.write.revert_ticket",
    "context_ptr": "redis://ctx:job-9f0f3a:undo",
    "meta": {
      "idempotency_key": "job-9f0f3a/undo"
    }
  }
}
```

A small but important number: safety checks run on hot paths with short client timeouts (`2s` in the current safety-kernel reference), so policy enforcement stays synchronous without turning the scheduler into a queueing bottleneck.
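To show how rules of this shape could turn into an executable decision, here is a minimal evaluator sketch. It is not the safety kernel's implementation. It assumes first-match-wins ordering, a default of `deny` when no rule matches, glob matching on topics, and subset matching on `risk_tags`; the source does not specify any of these, and `rule_matches` and `evaluate` are hypothetical names.

```python
from fnmatch import fnmatchcase

# The four rules from the policy file, transcribed as plain dicts.
RULES = [
    {"id": "allow-read-path",
     "match": {"labels": {"mcp.action": "read"}},
     "decision": "allow"},
    {"id": "approval-prod-write",
     "match": {"labels": {"mcp.action": "write"}, "risk_tags": ["prod"]},
     "decision": "require_approval"},
    {"id": "deny-destructive",
     "match": {"labels": {"mcp.action": "delete"}},
     "decision": "deny"},
    {"id": "constrain-heavy-jobs",
     "match": {"topics": ["job.batch.*"]},
     "decision": "allow_with_constraints",
     "constraints": {"max_runtime_sec": 120, "max_retries": 1}},
]


def rule_matches(match: dict, job: dict) -> bool:
    """Every condition present in the rule must hold (assumed AND semantics)."""
    labels = job.get("labels", {})
    if any(labels.get(key) != value for key, value in match.get("labels", {}).items()):
        return False
    if not set(match.get("risk_tags", [])) <= set(job.get("risk_tags", [])):
        return False
    topics = match.get("topics")
    if topics and not any(fnmatchcase(job.get("topic", ""), pattern) for pattern in topics):
        return False
    return True


def evaluate(job: dict, rules: list = RULES, default: str = "deny") -> dict:
    """Return the first matching rule's decision (assumed first-match-wins)."""
    for rule in rules:
        if rule_matches(rule["match"], job):
            return {"decision": rule["decision"], "rule_id": rule["id"],
                    "constraints": rule.get("constraints", {})}
    return {"decision": default, "rule_id": None, "constraints": {}}


prod_write = evaluate({
    "topic": "job.mcp-bridge.write.update_ticket",
    "risk_tags": ["prod"],
    "labels": {"mcp.action": "write"},
})
batch = evaluate({"topic": "job.batch.nightly", "labels": {}})
```

Under these assumptions, the production write job from the `curl` example resolves to `require_approval`, and a `job.batch.*` topic picks up the runtime and retry constraints.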
Limitations and tradeoffs
Teams must keep message contracts stable across services and versions. That takes governance work.
Human gates improve safety but add waiting time for high-impact write paths.
Heartbeat and progress signals are useful only when operators define thresholds and response runbooks.
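As a rough illustration of that last point, the sketch below maps a heartbeat to a named runbook step. The thresholds, the action names, the `received_at` field, and the assumption that `cpu_load` is a 0.0 to 1.0 fraction are all hypothetical; operators would substitute their own values and steps.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    worker_id: str
    cpu_load: float     # assumed to be a 0.0-1.0 fraction
    active_jobs: int
    received_at: float  # receiver-side timestamp in seconds (hypothetical field)


def runbook_action(hb: Heartbeat, now: float,
                   stale_after_sec: float = 30.0,
                   max_cpu_load: float = 0.9) -> str:
    """Map one heartbeat to a named runbook step using operator-set thresholds."""
    if now - hb.received_at > stale_after_sec:
        return "mark_worker_unhealthy"
    if hb.cpu_load > max_cpu_load:
        return "stop_routing_new_jobs"
    return "no_action"


healthy = Heartbeat("worker-1", cpu_load=0.35, active_jobs=2, received_at=100.0)
silent = Heartbeat("worker-2", cpu_load=0.20, active_jobs=1, received_at=40.0)
hot = Heartbeat("worker-3", cpu_load=0.97, active_jobs=9, received_at=99.0)
actions = [runbook_action(hb, now=105.0) for hb in (healthy, silent, hot)]
```

Without explicit thresholds like `stale_after_sec`, the same heartbeat stream gives operators data but no decision.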