The Cordum Agent Protocol (CAP) Specification
CAP is a state-driven protocol for ensuring deterministic agent execution, offering a safer alternative to unstructured ReAct loops. It defines the wire contract between AI agents and the Cordum control plane.
CAP is a distributed, open-source wire contract for the AI agent job lifecycle, protocol-first via protobuf. Standardized envelopes, typed payloads, and opaque pointers let schedulers, workers, orchestrators, and gateways interoperate without custom glue.
Defined via protobuf contracts. These pages define semantics and required behavior for compatibility. Language-agnostic by design.
Large data stays in external memory. Only opaque pointers travel on the wire, keeping the bus fast and messages small.
Every job is evaluated by the Safety Kernel before dispatch. Policy enforcement is built into the protocol, not bolted on after the fact.
Works with NATS (primary), Kafka, or any pub/sub with subjects and queue groups. The protocol defines semantics, not transport.
BusPacket
All CAP traffic is wrapped in a BusPacket envelope. It provides tracing, sender identity, and protocol negotiation around a single typed payload.
message BusPacket {
  string trace_id = 1;
  string sender_id = 2;
  google.protobuf.Timestamp created_at = 3;
  int32 protocol_version = 4;

  oneof payload {
    JobRequest  job_request  = 10;
    JobResult   job_result   = 11;
    Heartbeat   heartbeat    = 12;
    JobProgress job_progress = 13;
    JobCancel   job_cancel   = 14;
    SystemAlert system_alert = 15;
  }

  bytes signature = 99; // optional digital signature
}

Required Fields
- trace_id — Correlates all packets for a request or workflow
- sender_id — Stable identifier for the emitting component
- created_at — Timestamp of emission (UTC)
- protocol_version — CAP wire version for negotiation
- payload — Exactly one of the six typed message types
- signature — Optional digital signature for authenticity
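As a rough illustration of these envelope rules, here is a minimal sketch in Python that validates a BusPacket-shaped dict before publishing. It assumes no generated protobuf classes; the function name and dict layout are illustrative, not part of the spec.

```python
from datetime import datetime, timezone

# The six payload keys mirror the oneof fields in the proto above.
PAYLOAD_KINDS = {
    "job_request", "job_result", "heartbeat",
    "job_progress", "job_cancel", "system_alert",
}

def validate_packet(packet: dict) -> None:
    """Raise ValueError if the envelope violates CAP's required fields."""
    for field in ("trace_id", "sender_id", "created_at", "protocol_version"):
        if not packet.get(field):
            raise ValueError(f"missing required field: {field}")
    # Exactly one typed payload must be present (the oneof constraint).
    present = PAYLOAD_KINDS & packet.keys()
    if len(present) != 1:
        raise ValueError("exactly one payload type must be set")

packet = {
    "trace_id": "tr-123",
    "sender_id": "gateway-1",
    "created_at": datetime.now(timezone.utc).isoformat(),
    "protocol_version": 2,
    "job_request": {"job_id": "j-1", "topic": "sre.patch.apply"},
}
validate_packet(packet)  # passes; raises on a malformed envelope
```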
- JobRequest — Job submission with job_id, topic, priority, context_ptr, budget, meta (capability, risk_tags, requires), and compensation template.
- JobResult — Job completion with job_id, status, result_ptr, worker_id, execution_ms, error_code, and artifact_ptrs.
- Heartbeat — Worker liveness with worker_id, region, type, cpu/gpu load, active_jobs, capabilities, pool, and max_parallel_jobs.
- JobProgress — Checkpoint with percent complete, message, and partial result_ptr for long-running tasks.
- JobCancel — Cancellation signal with reason and requested_by for graceful job termination.
- SystemAlert — Alerts with level, message, component, and code for operational monitoring.
Structured Identity and Routing
Every JobRequest carries structured metadata for policy evaluation, routing, and observability:
- tenant_id — Multi-tenancy isolation
- actor_id — Human or service actor
- capability — Semantic action label (e.g., "sre.patch.apply")
- risk_tags — Policy hints (prod, write, network, secrets)
- requires — Capabilities for routing (kubectl, GPU, network)
- pack_id — Originating pack for observability
- idempotency_key — Dedupe key for retries
- labels — Free-form routing and observability hints
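A minimal sketch of assembling this metadata before submission, assuming plain dicts rather than generated protobuf classes; the helper name and defaults are assumptions.

```python
def build_meta(tenant_id: str, actor_id: str, capability: str, *,
               risk_tags=(), requires=(), labels=None) -> dict:
    """Assemble JobRequest metadata for policy evaluation and routing."""
    return {
        "tenant_id": tenant_id,
        "actor_id": actor_id,
        "capability": capability,        # semantic action, e.g. "sre.patch.apply"
        "risk_tags": sorted(risk_tags),  # policy hints: prod, write, network, ...
        "requires": sorted(requires),    # routing capabilities: kubectl, GPU, ...
        "labels": dict(labels or {}),
    }

meta = build_meta("acme", "alice", "sre.patch.apply",
                  risk_tags={"prod", "write"}, requires={"kubectl"})
```

Sorting the tag sets gives the metadata a deterministic shape, which helps when the result feeds a dedupe or idempotency key.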
Payloads Stay Off the Bus
CAP keeps large payloads off the bus by referencing external memory through opaque URI pointers. Pointers are stable and immutable for the lifetime of the job.
- context_ptr — Location of the input payload, written by the gateway or client.
- result_ptr — Location of the output payload, written by the worker.
- redacted_context_ptr — Sanitized input produced by the Safety Kernel on deny or throttle.
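Since pointers are opaque URIs, a consumer typically picks a dereferencing strategy by scheme. A minimal sketch of that dispatch, where the registry and resolver names are assumptions, not part of the spec:

```python
from urllib.parse import urlparse

RESOLVERS = {}

def resolver(scheme):
    """Register a dereferencing function for one URI scheme."""
    def register(fn):
        RESOLVERS[scheme] = fn
        return fn
    return register

@resolver("redis")
def _fetch_redis(ptr: str) -> bytes:
    # Placeholder: a real worker would read the blob from Redis here.
    return b"payload-bytes"

def dereference(ptr: str) -> bytes:
    """Look up the resolver for the pointer's scheme and invoke it."""
    scheme = urlparse(ptr).scheme
    if scheme not in RESOLVERS:
        raise ValueError(f"no resolver registered for {scheme!r}")
    return RESOLVERS[scheme](ptr)

dereference("redis://memory/ctx/j-1")  # -> b"payload-bytes"
```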
Pointer schemes include redis://, s3://, and https://. Consumers treat them as opaque; dereferencing is implementation-specific. Gateways set TTL on context, workers set TTL on results.

First-Class Policy Hook
CAP makes safety a first-class control-plane hook. The Safety Kernel is called before every job dispatch, returning a decision with reason, constraints, and optional redacted context.
service SafetyKernel {
  rpc Check(PolicyCheckRequest) returns (PolicyCheckResponse);
  rpc Evaluate(PolicyCheckRequest) returns (PolicyCheckResponse);
  rpc Explain(PolicyCheckRequest) returns (PolicyCheckResponse);
  rpc Simulate(PolicyCheckRequest) returns (PolicyCheckResponse);
  rpc ListSnapshots(ListSnapshotsRequest) returns (ListSnapshotsResponse);
}

Policy Decisions
- Allow — Job proceeds immediately
- Deny — Job rejected, reason logged
- Require Human — Paused for out-of-band approval
- Throttle — Delayed and retried after backoff
- Constrain — Allowed with enforced limits
- Unavailable — Kernel unreachable, falls back to last-known-good policy
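A scheduler acting on these decisions can be sketched as a small dispatch function. This is a hedged illustration, assuming string decision codes; the enum names and return values are assumptions.

```python
def apply_decision(decision: str, job: dict) -> str:
    """Map a Safety Kernel decision to a scheduler action."""
    if decision == "ALLOW":
        return "dispatch"
    if decision == "DENY":
        return "reject"                      # reason is logged
    if decision == "REQUIRE_HUMAN":
        return "pause"                       # wait for out-of-band approval
    if decision == "THROTTLE":
        return "retry_after_backoff"
    if decision == "CONSTRAIN":
        job.setdefault("constraints", {})    # enforced limits attach here
        return "dispatch"
    if decision == "UNAVAILABLE":
        return "use_last_known_good_policy"
    raise ValueError(f"unknown decision: {decision}")

apply_decision("ALLOW", {})  # -> "dispatch"
```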
Job Lifecycle States
CAP standardizes job lifecycle states to keep schedulers and workers interoperable. The state history is append-only and transitions are forward-only — backwards transitions are rejected.
- Gateway — sets PENDING, publishes to sys.job.submit
- Scheduler — PENDING → SCHEDULED → DISPATCHED, calls the Safety Kernel
- Worker — DISPATCHED → RUNNING → SUCCEEDED / FAILED
- Timeout handling — TIMEOUT or CANCELLED based on SLAs
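The forward-only rule can be enforced with a small transition map. A minimal sketch, where the exact set of allowed transitions is an assumption inferred from the flow above:

```python
# Allowed forward transitions; anything else (including backwards moves)
# is rejected. This map is illustrative, not normative.
TRANSITIONS = {
    "PENDING":    {"SCHEDULED", "CANCELLED"},
    "SCHEDULED":  {"DISPATCHED", "CANCELLED"},
    "DISPATCHED": {"RUNNING", "TIMEOUT", "CANCELLED"},
    "RUNNING":    {"SUCCEEDED", "FAILED", "TIMEOUT", "CANCELLED"},
}

def advance(current: str, target: str) -> str:
    """Return the new state, or raise on an illegal transition."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target

state = advance("PENDING", "SCHEDULED")   # ok
# advance("RUNNING", "PENDING")           # would raise ValueError
```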
Subject Conventions
CAP is transport-agnostic but provides recommended subject mappings. NATS is the primary transport; Kafka profiles are also supported.
| Subject | Purpose |
|---|---|
| sys.job.submit | Job submission by gateways |
| sys.job.result | Job results from workers |
| sys.job.progress | Job progress updates |
| sys.job.dlq | Dead letter queue entries |
| sys.job.cancel | Job cancellation signals |
| sys.heartbeat | Worker liveness (no queue groups) |
| sys.workflow.event | Workflow engine events |
| job.<pool> | Work distribution with queue groups |
| worker.<id>.jobs | Direct worker dispatch |
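The parameterized subjects are simple string templates. A small sketch of helpers a sender might use; the function names are assumptions, not part of the spec.

```python
def pool_subject(pool: str) -> str:
    """Subject for work distribution; consumed with a queue group."""
    return f"job.{pool}"

def worker_subject(worker_id: str) -> str:
    """Subject for direct dispatch to one worker."""
    return f"worker.{worker_id}.jobs"

pool_subject("gpu-a100")   # -> "job.gpu-a100"
worker_subject("w-17")     # -> "worker.w-17.jobs"
```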
Compensation & Rollback
JobRequest includes an optional compensation template — an inverse action dispatched on workflow rollback. Orchestrators log compensations after success and dispatch them in LIFO order on failure.
- FAILED_RETRYABLE — transient errors (rate limits, network)
- FAILED_FATAL — triggers saga rollback with the compensation stack
- idempotency_key — durable re-entry and deduplication
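The log-then-rollback pattern above can be sketched as a compensation stack: record an inverse action after each successful step, then pop the stack in LIFO order on a fatal failure. A minimal sketch; the class and method names are assumptions.

```python
class Saga:
    """Tracks compensations for completed steps and runs them LIFO."""

    def __init__(self):
        self._stack = []

    def record(self, compensation):
        """Called by the orchestrator after a step succeeds."""
        self._stack.append(compensation)

    def rollback(self):
        """Dispatch compensations newest-first on FAILED_FATAL."""
        while self._stack:
            self._stack.pop()()

undone = []
saga = Saga()
saga.record(lambda: undone.append("undo-step-1"))
saga.record(lambda: undone.append("undo-step-2"))
saga.rollback()
# undone == ["undo-step-2", "undo-step-1"]  (LIFO order)
```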
Tracing & Workflows
CAP supports hierarchical orchestration via workflow metadata fields. trace_id remains stable across entire workflow trees for end-to-end observability.
- workflow_id — DAG identifier
- parent_job_id — Parent in the tree
- step_index — Position in DAG
- trace_id — Stable across entire workflow
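Spawning a child job in the workflow tree then amounts to inheriting the stable fields and recording the position. A hedged sketch using plain dicts; the helper name is an assumption.

```python
import uuid

def spawn_child(parent: dict, step_index: int) -> dict:
    """Derive a child job's workflow metadata from its parent."""
    return {
        "job_id": str(uuid.uuid4()),            # fresh identity per job
        "workflow_id": parent["workflow_id"],   # same DAG
        "parent_job_id": parent["job_id"],      # position in the tree
        "step_index": step_index,
        "trace_id": parent["trace_id"],         # stable across the whole tree
    }

root = {"job_id": "j-root", "workflow_id": "wf-1", "trace_id": "tr-1"}
child = spawn_child(root, step_index=0)
```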
Multi-Language SDK Support
CAP SDKs provide typed handlers, runtime helpers, and Redis pointer hydration. Current protocol generation: CAP v2.
- Full runtime SDK with pointer hydration and typed handlers
- Reference worker implementation in examples/python-worker
- Reference worker implementation in examples/node-worker
CAP vs MCP
CAP:
- Distributed multi-agent control plane
- Job lifecycle, scheduling, policy enforcement
- Operates across clusters and nodes
- Protobuf wire contract with typed payloads

MCP:
- Single-model tool calling protocol
- Tool discovery, invocation, and results
- Operates within a single model session
- Can be the tool layer inside a CAP worker
