The Cordum Agent Protocol (CAP) Specification
CAP is a state-driven protocol for ensuring deterministic agent execution, offering a safer alternative to unstructured ReAct loops. It defines the wire contract between AI agents and the Cordum control plane.
CAP is a distributed, open-source wire contract for the AI agent job lifecycle, protocol-first via protobuf. Standardized envelopes, typed payloads, and opaque pointers let schedulers, workers, orchestrators, and gateways interoperate without custom glue.
Defined via protobuf contracts. These pages define semantics and required behavior for compatibility. Language-agnostic by design.
Large data stays in external memory. Only opaque pointers travel on the wire, keeping the bus fast and messages small.
Every job is evaluated by the Safety Kernel before dispatch. Policy enforcement is built into the protocol, not bolted on after the fact.
Works with NATS (primary), Kafka, or any pub/sub with subjects and queue groups. The protocol defines semantics, not transport.
BusPacket
All CAP traffic is wrapped in a BusPacket envelope. It provides tracing, sender identity, and protocol negotiation around a single typed payload.
message BusPacket {
  string trace_id = 1;
  string sender_id = 2;
  google.protobuf.Timestamp created_at = 3;
  int32 protocol_version = 4;

  oneof payload {
    JobRequest  job_request  = 10;
    JobResult   job_result   = 11;
    Heartbeat   heartbeat    = 12;
    JobProgress job_progress = 13;
    JobCancel   job_cancel   = 14;
    SystemAlert system_alert = 15;
  }

  bytes signature = 99; // optional digital signature
}

Required Fields
- trace_id — Correlates all packets for a request or workflow
- sender_id — Stable identifier for the emitting component
- created_at — Timestamp of emission (UTC)
- protocol_version — CAP wire version for negotiation
- payload — Exactly one of the six typed message types
- signature — Optional digital signature for authenticity
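As a rough illustration of these envelope rules, here is a minimal sketch in Python that validates a BusPacket-shaped dict before publishing. It assumes no generated protobuf classes; the function name and dict layout are illustrative, not part of the spec.

```python
from datetime import datetime, timezone

# The six payload keys mirror the oneof fields in the proto above.
PAYLOAD_KINDS = {
    "job_request", "job_result", "heartbeat",
    "job_progress", "job_cancel", "system_alert",
}

def validate_packet(packet: dict) -> None:
    """Raise ValueError if the envelope violates CAP's required fields."""
    for field in ("trace_id", "sender_id", "created_at", "protocol_version"):
        if not packet.get(field):
            raise ValueError(f"missing required field: {field}")
    # Exactly one typed payload must be present (the oneof constraint).
    present = PAYLOAD_KINDS & packet.keys()
    if len(present) != 1:
        raise ValueError("exactly one payload type must be set")

packet = {
    "trace_id": "tr-123",
    "sender_id": "gateway-1",
    "created_at": datetime.now(timezone.utc).isoformat(),
    "protocol_version": 2,
    "job_request": {"job_id": "j-1", "topic": "sre.patch.apply"},
}
validate_packet(packet)  # passes; raises on a malformed envelope
```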
- JobRequest — Job submission with job_id, topic, priority, context_ptr, budget, meta (capability, risk_tags, requires), and compensation template.
- JobResult — Job completion with job_id, status, result_ptr, worker_id, execution_ms, error_code, and artifact_ptrs.
- Heartbeat — Worker liveness with worker_id, region, type, cpu/gpu load, active_jobs, capabilities, pool, and max_parallel_jobs.
- JobProgress — Checkpoint with percent complete, message, and partial result_ptr for long-running tasks.
- JobCancel — Cancellation signal with reason and requested_by for graceful job termination.
- SystemAlert — Alerts with level, message, component, and code for operational monitoring.
Structured Identity and Routing
Every JobRequest carries structured metadata for policy evaluation, routing, and observability:
- tenant_id — Multi-tenancy isolation
- actor_id — Human or service actor
- capability — Semantic action label (e.g., "sre.patch.apply")
- risk_tags — Policy hints (prod, write, network, secrets)
- requires — Capabilities for routing (kubectl, GPU, network)
- pack_id — Originating pack for observability
- idempotency_key — Dedupe key for retries
- labels — Free-form routing and observability hints
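A minimal sketch of assembling this metadata before submission, assuming plain dicts rather than generated protobuf classes; the helper name and defaults are assumptions.

```python
def build_meta(tenant_id: str, actor_id: str, capability: str, *,
               risk_tags=(), requires=(), labels=None) -> dict:
    """Assemble JobRequest metadata for policy evaluation and routing."""
    return {
        "tenant_id": tenant_id,
        "actor_id": actor_id,
        "capability": capability,        # semantic action, e.g. "sre.patch.apply"
        "risk_tags": sorted(risk_tags),  # policy hints: prod, write, network, ...
        "requires": sorted(requires),    # routing capabilities: kubectl, GPU, ...
        "labels": dict(labels or {}),
    }

meta = build_meta("acme", "alice", "sre.patch.apply",
                  risk_tags={"prod", "write"}, requires={"kubectl"})
```

Sorting the tag sets gives the metadata a deterministic shape, which helps when the result feeds a dedupe or idempotency key.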
Payloads Stay Off the Bus
CAP keeps large payloads off the bus by referencing external memory through opaque URI pointers. Pointers are stable and immutable for the lifetime of the job.
- context_ptr — Location of the input payload, written by the gateway or client.
- result_ptr — Location of the output payload, written by the worker.
- redacted_context_ptr — Sanitized input produced by the Safety Kernel on deny or throttle.
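Since pointers are opaque URIs, a consumer typically picks a dereferencing strategy by scheme. A minimal sketch of that dispatch, where the registry and resolver names are assumptions, not part of the spec:

```python
from urllib.parse import urlparse

RESOLVERS = {}

def resolver(scheme):
    """Register a dereferencing function for one URI scheme."""
    def register(fn):
        RESOLVERS[scheme] = fn
        return fn
    return register

@resolver("redis")
def _fetch_redis(ptr: str) -> bytes:
    # Placeholder: a real worker would read the blob from Redis here.
    return b"payload-bytes"

def dereference(ptr: str) -> bytes:
    """Look up the resolver for the pointer's scheme and invoke it."""
    scheme = urlparse(ptr).scheme
    if scheme not in RESOLVERS:
        raise ValueError(f"no resolver registered for {scheme!r}")
    return RESOLVERS[scheme](ptr)

dereference("redis://memory/ctx/j-1")  # -> b"payload-bytes"
```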
Pointer schemes include redis://, s3://, and https://. Consumers treat them as opaque; dereferencing is implementation-specific. Gateways set TTL on context, workers set TTL on results.

First-Class Policy Hook
CAP makes safety a first-class control-plane hook. The Safety Kernel is called before every job dispatch, returning a decision with reason, constraints, and optional redacted context.
service SafetyKernel {
  rpc Check(PolicyCheckRequest) returns (PolicyCheckResponse);
  rpc Evaluate(PolicyCheckRequest) returns (PolicyCheckResponse);
  rpc Explain(PolicyCheckRequest) returns (PolicyCheckResponse);
  rpc Simulate(PolicyCheckRequest) returns (PolicyCheckResponse);
  rpc ListSnapshots(ListSnapshotsRequest) returns (ListSnapshotsResponse);
}

Policy Decisions
- Allow — Job proceeds immediately
- Deny — Job rejected, reason logged
- Require Human — Paused for out-of-band approval
- Throttle — Delayed and retried after backoff
- Constrain — Allowed with enforced limits
- Unavailable — Kernel unreachable, falls back to last-known-good policy
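A scheduler acting on these decisions can be sketched as a small dispatch function. This is a hedged illustration, assuming string decision codes; the enum names and return values are assumptions.

```python
def apply_decision(decision: str, job: dict) -> str:
    """Map a Safety Kernel decision to a scheduler action."""
    if decision == "ALLOW":
        return "dispatch"
    if decision == "DENY":
        return "reject"                      # reason is logged
    if decision == "REQUIRE_HUMAN":
        return "pause"                       # wait for out-of-band approval
    if decision == "THROTTLE":
        return "retry_after_backoff"
    if decision == "CONSTRAIN":
        job.setdefault("constraints", {})    # enforced limits attach here
        return "dispatch"
    if decision == "UNAVAILABLE":
        return "use_last_known_good_policy"
    raise ValueError(f"unknown decision: {decision}")

apply_decision("ALLOW", {})  # -> "dispatch"
```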
Job Lifecycle States
CAP standardizes job lifecycle states to keep schedulers and workers interoperable. The state history is append-only and transitions are forward-only — backwards transitions are rejected.
- Gateway — sets PENDING, publishes to sys.job.submit
- Scheduler — PENDING → SCHEDULED → DISPATCHED, calls the Safety Kernel
- Worker — DISPATCHED → RUNNING → SUCCEEDED / FAILED
- Timeout handling — TIMEOUT or CANCELLED based on SLAs
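The forward-only rule can be enforced with a small transition map. A minimal sketch, where the exact set of allowed transitions is an assumption inferred from the flow above:

```python
# Allowed forward transitions; anything else (including backwards moves)
# is rejected. This map is illustrative, not normative.
TRANSITIONS = {
    "PENDING":    {"SCHEDULED", "CANCELLED"},
    "SCHEDULED":  {"DISPATCHED", "CANCELLED"},
    "DISPATCHED": {"RUNNING", "TIMEOUT", "CANCELLED"},
    "RUNNING":    {"SUCCEEDED", "FAILED", "TIMEOUT", "CANCELLED"},
}

def advance(current: str, target: str) -> str:
    """Return the new state, or raise on an illegal transition."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target

state = advance("PENDING", "SCHEDULED")   # ok
# advance("RUNNING", "PENDING")           # would raise ValueError
```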
Subject Conventions
CAP is transport-agnostic but provides recommended subject mappings. NATS is the primary transport; Kafka profiles are also supported.
| Subject | Purpose |
|---|---|
| sys.job.submit | Job submission by gateways |
| sys.job.result | Job results from workers |
| sys.job.progress | Job progress updates |
| sys.job.dlq | Dead letter queue entries |
| sys.job.cancel | Job cancellation signals |
| sys.heartbeat | Worker liveness (no queue groups) |
| sys.workflow.event | Workflow engine events |
| job.<pool> | Work distribution with queue groups |
| worker.<id>.jobs | Direct worker dispatch |
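The parameterized subjects are simple string templates. A small sketch of helpers a sender might use; the function names are assumptions, not part of the spec.

```python
def pool_subject(pool: str) -> str:
    """Subject for work distribution; consumed with a queue group."""
    return f"job.{pool}"

def worker_subject(worker_id: str) -> str:
    """Subject for direct dispatch to one worker."""
    return f"worker.{worker_id}.jobs"

pool_subject("gpu-a100")   # -> "job.gpu-a100"
worker_subject("w-17")     # -> "worker.w-17.jobs"
```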
Compensation & Rollback
JobRequest includes an optional compensation template — an inverse action dispatched on workflow rollback. Orchestrators log compensations after success and dispatch them in LIFO order on failure.
- FAILED_RETRYABLE — transient errors (rate limits, network)
- FAILED_FATAL — triggers saga rollback with the compensation stack
- idempotency_key — durable re-entry and deduplication
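The log-then-rollback pattern above can be sketched as a compensation stack: record an inverse action after each successful step, then pop the stack in LIFO order on a fatal failure. A minimal sketch; the class and method names are assumptions.

```python
class Saga:
    """Tracks compensations for completed steps and runs them LIFO."""

    def __init__(self):
        self._stack = []

    def record(self, compensation):
        """Called by the orchestrator after a step succeeds."""
        self._stack.append(compensation)

    def rollback(self):
        """Dispatch compensations newest-first on FAILED_FATAL."""
        while self._stack:
            self._stack.pop()()

undone = []
saga = Saga()
saga.record(lambda: undone.append("undo-step-1"))
saga.record(lambda: undone.append("undo-step-2"))
saga.rollback()
# undone == ["undo-step-2", "undo-step-1"]  (LIFO order)
```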
Tracing & Workflows
CAP supports hierarchical orchestration via workflow metadata fields. trace_id remains stable across entire workflow trees for end-to-end observability.
- workflow_id — DAG identifier
- parent_job_id — Parent in the tree
- step_index — Position in DAG
- trace_id — Stable across entire workflow
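Spawning a child job in the workflow tree then amounts to inheriting the stable fields and recording the position. A hedged sketch using plain dicts; the helper name is an assumption.

```python
import uuid

def spawn_child(parent: dict, step_index: int) -> dict:
    """Derive a child job's workflow metadata from its parent."""
    return {
        "job_id": str(uuid.uuid4()),            # fresh identity per job
        "workflow_id": parent["workflow_id"],   # same DAG
        "parent_job_id": parent["job_id"],      # position in the tree
        "step_index": step_index,
        "trace_id": parent["trace_id"],         # stable across the whole tree
    }

root = {"job_id": "j-root", "workflow_id": "wf-1", "trace_id": "tr-1"}
child = spawn_child(root, step_index=0)
```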
Multi-Language SDK Support
CAP SDKs provide typed handlers, runtime helpers, and Redis pointer hydration. Current protocol generation: CAP v2.
- Full runtime SDK with pointer hydration and typed handlers
- Reference worker implementation in examples/python-worker
- Reference worker implementation in examples/node-worker
CAP vs MCP
CAP:
- Distributed multi-agent control plane
- Job lifecycle, scheduling, policy enforcement
- Operates across clusters and nodes
- Protobuf wire contract with typed payloads

MCP:
- Single-model tool calling protocol
- Tool discovery, invocation, and results
- Operates within a single model session
- Can be the tool layer inside a CAP worker
