
The Cordum Agent Protocol (CAP) Specification

CAP is a state-driven protocol for ensuring deterministic agent execution, offering a safer alternative to unstructured ReAct loops. It defines the wire contract between AI agents and the Cordum control plane.

CAP is a distributed, open-source wire contract for the AI agent job lifecycle. It is protocol-first via protobuf, with standardized envelopes, typed payloads, and opaque pointers, so schedulers, workers, orchestrators, and gateways can interoperate without custom glue.

Core Principles
Protocol-First

Defined via protobuf contracts. These pages define semantics and required behavior for compatibility. Language-agnostic by design.

Payload Off the Bus

Large data stays in external memory. Only opaque pointers travel on the wire, keeping the bus fast and messages small.

Safety as First-Class Hook

Every job is evaluated by the Safety Kernel before dispatch. Policy enforcement is built into the protocol, not bolted on after the fact.

Transport Agnostic

Works with NATS (primary), Kafka, or any pub/sub with subjects and queue groups. The protocol defines semantics, not transport.

Envelope

BusPacket

All CAP traffic is wrapped in a BusPacket envelope. It provides tracing, sender identity, and protocol negotiation around a single typed payload.

buspacket.proto
message BusPacket {
  string trace_id = 1;
  string sender_id = 2;
  google.protobuf.Timestamp created_at = 3;
  int32 protocol_version = 4;

  oneof payload {
    JobRequest  job_request  = 10;
    JobResult   job_result   = 11;
    Heartbeat   heartbeat    = 12;
    JobProgress job_progress = 13;
    JobCancel   job_cancel   = 14;
    SystemAlert system_alert = 15;
  }

  bytes signature = 99; // optional digital signature
}

Required Fields

  • trace_id — Correlates all packets for a request or workflow
  • sender_id — Stable identifier for the emitting component
  • created_at — Timestamp of emission (UTC)
  • protocol_version — CAP wire version for negotiation
  • payload — Exactly one of the six typed message types
  • signature — Optional digital signature for authenticity
Message Types
JobRequest

Job submission with job_id, topic, priority, context_ptr, budget, meta (capability, risk_tags, requires), and compensation template.

JobResult

Job completion with job_id, status, result_ptr, worker_id, execution_ms, error_code, and artifact_ptrs.

Heartbeat

Worker liveness with worker_id, region, type, cpu/gpu load, active_jobs, capabilities, pool, and max_parallel_jobs.

JobProgress

Checkpoint with percent complete, message, and partial result_ptr for long-running tasks.

JobCancel

Cancellation signal with reason and requested_by for graceful job termination.

SystemAlert

Alerts with level, message, component, and code for operational monitoring.

Job Metadata

Structured Identity and Routing

Every JobRequest carries structured metadata for policy evaluation, routing, and observability:

  • tenant_id — Multi-tenancy isolation
  • actor_id — Human or service actor
  • capability — Semantic action label (e.g., "sre.patch.apply")
  • risk_tags — Policy hints (prod, write, network, secrets)
  • requires — Capabilities for routing (kubectl, GPU, network)
  • pack_id — Originating pack for observability
  • idempotency_key — Dedupe key for retries
  • labels — Free-form routing and observability hints
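The metadata fields above can be assembled as a plain dict; the idempotency_key derivation below is a hypothetical scheme (hash of tenant, capability, and arguments), shown only to illustrate that the key must be deterministic across retries. The spec does not prescribe how the key is computed.

```python
import hashlib
import json

def idempotency_key(tenant_id: str, capability: str, args: dict) -> str:
    """Deterministic dedupe key: same tenant + capability + args -> same key."""
    canonical = json.dumps(
        {"tenant": tenant_id, "cap": capability, "args": args},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode()).hexdigest()[:32]

# Example JobRequest metadata (values are illustrative).
meta = {
    "tenant_id": "acme",
    "actor_id": "svc-deploy",
    "capability": "sre.patch.apply",
    "risk_tags": ["prod", "write"],
    "requires": ["kubectl"],
    "idempotency_key": idempotency_key("acme", "sre.patch.apply", {"node": "n1"}),
}
```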
Pointer System

Payloads Stay Off the Bus

CAP keeps large payloads off the bus by referencing external memory through opaque URI pointers. Pointers are stable and immutable for the lifetime of the job.

context_ptr

Location of input payload written by the gateway or client.

redis://ctx/job-123
result_ptr

Location of output payload written by the worker.

redis://res/job-123
redacted_context_ptr

Sanitized input produced by the Safety Kernel on deny or throttle.

redis://ctx/job-123:redacted
Format: Pointers are opaque URIs — redis://, s3://, https://. Consumers treat them as opaque; dereferencing is implementation-specific. Gateways set TTL on context, workers set TTL on results.
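A consumer only needs to dispatch on the pointer's scheme; everything after it stays opaque. A minimal sketch of that split, assuming the three schemes named above:

```python
from urllib.parse import urlparse

SUPPORTED_SCHEMES = {"redis", "s3", "https"}

def parse_pointer(ptr: str) -> tuple[str, str]:
    """Split an opaque pointer into (scheme, opaque remainder).

    Consumers dispatch on the scheme to pick a backend; they never
    interpret the remainder, which is implementation-specific.
    """
    parsed = urlparse(ptr)
    if parsed.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported pointer scheme: {ptr}")
    # Everything after "scheme://" stays opaque to the consumer.
    return parsed.scheme, ptr.split("://", 1)[1]

scheme, key = parse_pointer("redis://ctx/job-123")
```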
Safety Integration

First-Class Policy Hook

CAP makes safety a first-class control-plane hook. The Safety Kernel is called before every job dispatch, returning a decision with reason, constraints, and optional redacted context.

safety.proto
service SafetyKernel {
  rpc Check(PolicyCheckRequest)
      returns (PolicyCheckResponse);
  rpc Evaluate(PolicyCheckRequest)
      returns (PolicyCheckResponse);
  rpc Explain(PolicyCheckRequest)
      returns (PolicyCheckResponse);
  rpc Simulate(PolicyCheckRequest)
      returns (PolicyCheckResponse);
  rpc ListSnapshots(ListSnapshotsRequest)
      returns (ListSnapshotsResponse);
}

Policy Decisions

  • Allow — Job proceeds immediately
  • Deny — Job rejected, reason logged
  • Require Human — Paused for out-of-band approval
  • Throttle — Delayed and retried after backoff
  • Constrain — Allowed with enforced limits
  • Unavailable — Kernel unreachable, falls back to last-known-good policy
Performance: Safety checks complete in < 5ms. On outage, schedulers fail-closed by default. Every decision is logged with trace_id, job_id, decision, and reason.
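A scheduler's handling of these decisions can be sketched as a mapping to the next lifecycle state. The mapping below is an assumption consistent with the decisions and state machine described in this spec, not a normative table; in particular, the fail-closed default on Unavailable follows the performance note above.

```python
from enum import Enum

class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    REQUIRE_HUMAN = "require_human"
    THROTTLE = "throttle"
    CONSTRAIN = "constrain"
    UNAVAILABLE = "unavailable"

def next_state(decision: Decision, fail_closed: bool = True) -> str:
    """Map a Safety Kernel decision to the job's next lifecycle state."""
    if decision in (Decision.ALLOW, Decision.CONSTRAIN):
        return "SCHEDULED"          # Constrain proceeds with enforced limits
    if decision is Decision.DENY:
        return "DENIED"
    if decision is Decision.REQUIRE_HUMAN:
        return "APPROVAL_REQUIRED"
    if decision is Decision.THROTTLE:
        return "PENDING"            # retried after backoff
    # Kernel unreachable: fail closed by default.
    return "DENIED" if fail_closed else "SCHEDULED"
```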
State Machine

Job Lifecycle States

CAP standardizes job lifecycle states to keep schedulers and workers interoperable. Transitions are append-only — backwards transitions are rejected.

PENDING
APPROVAL_REQUIRED
SCHEDULED
DISPATCHED
RUNNING
SUCCEEDED
FAILED
TIMEOUT
CANCELLED
DENIED
→ PENDING → APPROVAL_REQUIRED (if policy requires human approval)
→ PENDING → SCHEDULED → DISPATCHED → RUNNING → SUCCEEDED / FAILED
→ DISPATCHED / RUNNING → TIMEOUT (reconciler marks stale)
→ PENDING → DENIED (safety rejects)
→ Any non-terminal → CANCELLED
Note: JobResult also supports FAILED_RETRYABLE and FAILED_FATAL status codes for worker-level failure categorization.
Gateway

Sets PENDING, publishes to sys.job.submit

Scheduler

PENDING → SCHEDULED → DISPATCHED, calls safety

Worker

DISPATCHED → RUNNING → SUCCEEDED / FAILED

Reconciler

TIMEOUT or CANCELLED based on SLAs
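The append-only rule can be enforced with a forward-transition table. The table below is reconstructed from the transitions listed above; edges not shown there (for example, out of terminal states) are assumed absent.

```python
# Forward transitions reconstructed from the lifecycle above.
# Terminal states (SUCCEEDED, FAILED, TIMEOUT, CANCELLED, DENIED) have no edges.
TRANSITIONS = {
    "PENDING": {"APPROVAL_REQUIRED", "SCHEDULED", "DENIED", "CANCELLED"},
    "APPROVAL_REQUIRED": {"SCHEDULED", "DENIED", "CANCELLED"},
    "SCHEDULED": {"DISPATCHED", "CANCELLED"},
    "DISPATCHED": {"RUNNING", "TIMEOUT", "CANCELLED"},
    "RUNNING": {"SUCCEEDED", "FAILED", "TIMEOUT", "CANCELLED"},
    "SUCCEEDED": set(), "FAILED": set(), "TIMEOUT": set(),
    "CANCELLED": set(), "DENIED": set(),
}

def advance(state: str, target: str) -> str:
    """Append-only: reject any transition not in the forward set."""
    if target not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {target}")
    return target
```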

Transport

Subject Conventions

CAP is transport-agnostic but provides recommended subject mappings. NATS is the primary transport; Kafka profiles are also supported.

  • sys.job.submit — Job submission by gateways
  • sys.job.result — Job results from workers
  • sys.job.progress — Job progress updates
  • sys.job.dlq — Dead-letter queue entries
  • sys.job.cancel — Job cancellation signals
  • sys.heartbeat — Worker liveness (no queue groups)
  • sys.workflow.event — Workflow engine events
  • job.<pool> — Work distribution with queue groups
  • worker.<id>.jobs — Direct worker dispatch
NATS profile: Use queue groups for pool subjects. Disable queue groups for sys.heartbeat. Enable JetStream for durability on sys.job.submit and sys.job.result. At-least-once delivery assumed — workers and schedulers must be idempotent.
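Building the parameterized subjects safely means keeping pool and worker identifiers free of NATS token delimiters. A small sketch, with a token rule that is an assumption (the spec does not define allowed characters):

```python
import re

# Assumed token rule: no '.', '*', or '>' so tokens cannot alter subject routing.
_TOKEN = re.compile(r"^[A-Za-z0-9-]+$")

def pool_subject(pool: str) -> str:
    """Queue-group work distribution: job.<pool>."""
    if not _TOKEN.match(pool):
        raise ValueError(f"bad subject token: {pool}")
    return f"job.{pool}"

def worker_subject(worker_id: str) -> str:
    """Direct dispatch to one worker: worker.<id>.jobs."""
    if not _TOKEN.match(worker_id):
        raise ValueError(f"bad subject token: {worker_id}")
    return f"worker.{worker_id}.jobs"
```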
Reliability

Compensation & Rollback

JobRequest includes an optional compensation template — an inverse action dispatched on workflow rollback. Orchestrators log compensations after success and dispatch them in LIFO order on failure.

  • FAILED_RETRYABLE for transient errors (rate limits, network)
  • FAILED_FATAL triggers saga rollback with compensation stack
  • idempotency_key for durable re-entry and deduplication
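The LIFO compensation behavior can be sketched with a simple stack: log each inverse after its step succeeds, then unwind in reverse on fatal failure. The step names and inverse capabilities below are illustrative, not part of the spec.

```python
compensations: list[dict] = []

def record(step: str, inverse: dict) -> None:
    """After a step succeeds, log its inverse (compensation) action."""
    compensations.append({"step": step, **inverse})

def rollback() -> list[str]:
    """On FAILED_FATAL, unwind logged compensations in LIFO order."""
    order = []
    while compensations:
        comp = compensations.pop()
        order.append(comp["step"])   # a real orchestrator would dispatch here
    return order

record("create-vm", {"capability": "vm.delete"})
record("attach-disk", {"capability": "disk.detach"})
undo_order = rollback()
```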

Tracing & Workflows

CAP supports hierarchical orchestration via workflow metadata fields. trace_id remains stable across entire workflow trees for end-to-end observability.

  • workflow_id — DAG identifier
  • parent_job_id — Parent in the tree
  • step_index — Position in DAG
  • trace_id — Stable across entire workflow
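Spawning a child step then means minting a fresh job_id while inheriting the workflow identity. A minimal sketch over dicts (field names from the list above; the helper itself is hypothetical):

```python
import uuid

def child_job(parent: dict, step_index: int) -> dict:
    """Spawn a child step: new job_id, inherited trace and workflow identity."""
    return {
        "job_id": str(uuid.uuid4()),
        "trace_id": parent["trace_id"],        # stable across the whole tree
        "workflow_id": parent["workflow_id"],  # DAG identifier
        "parent_job_id": parent["job_id"],     # parent in the tree
        "step_index": step_index,              # position in the DAG
    }

root = {"job_id": "job-root", "trace_id": "tr-1", "workflow_id": "wf-1"}
step = child_job(root, 0)
```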
SDKs

Multi-Language SDK Support

CAP SDKs provide typed handlers, runtime helpers, and Redis pointer hydration. Current protocol generation: CAP v2.

Go
Production SDK

Full runtime SDK with pointer hydration and typed handlers

Python
Example Worker

Reference worker implementation in examples/python-worker

Node / TypeScript
Example Worker

Reference worker implementation in examples/node-worker

Comparison

CAP vs MCP

CAP — Cordum Agent Protocol
  • Distributed multi-agent control plane
  • Job lifecycle, scheduling, policy enforcement
  • Operates across clusters and nodes
  • Protobuf wire contract with typed payloads
MCP — Model Context Protocol
  • Single-model tool calling protocol
  • Tool discovery, invocation, and results
  • Operates within a single model session
  • Can be the tool layer inside a CAP worker

Frequently Asked Questions

What is the Cordum Agent Protocol (CAP)?
CAP is a language-agnostic wire protocol for AI agent communication and governance. It defines a BusPacket envelope with six typed payloads (JobRequest, JobResult, Heartbeat, JobProgress, JobCancel, SystemAlert) and context pointers, enabling deterministic execution across distributed agent systems.
How does CAP compare to MCP (Model Context Protocol)?
MCP is a single-model tool protocol that connects one LLM to local tools and data sources. CAP is a distributed multi-agent protocol for cluster-level governance. MCP can serve as the tool layer inside a CAP worker — they are complementary, not competing.
What languages have CAP SDKs?
CAP has a production Go SDK (sdk/runtime) with full pointer hydration and typed handlers. Python and Node/TypeScript have reference worker implementations in the examples/ directory.
Why does CAP use pointers instead of inline payloads?
Large payloads (context, results, artifacts) are stored in Redis and referenced by pointer (e.g., redis://ctx/job-123). Only the pointer travels on the NATS bus, keeping messages small and enabling efficient routing without payload copying.