Architecture
Production-grade control plane with intelligent scheduling, policy-before-dispatch, and full observability. Built on NATS + Redis with CAP v2 protocol.
```
Clients/UI
     │
     ▼
API Gateway (HTTP/WS + gRPC)
     │  writes ctx/res/artifact pointers
     ▼
Redis (state, config, DLQ)
     │
     ▼
NATS bus (sys.* + job.*)
     │
     ├──▶ Scheduler (safety gate + routing)
     │        │
     │        └──▶ Safety Kernel (gRPC policy check)
     │
     ├──▶ External Workers
     │
     └──▶ Workflow Engine (run orchestration)
```
Concept: The Agency-Control Tradeoff
Raw agents (ReAct loops) are non-deterministic and hard to audit. Every LLM call produces probabilistic output — fine for suggestions, dangerous for production actions.
Cordum decouples the Reasoning Loop (LLM) from the Execution Loop (Safety Kernel). The LLM decides what to do. Cordum decides if it is allowed, then executes deterministically.
Deterministic, auditable production AI. Every action is policy-evaluated, every decision is recorded, every workflow is reproducible. Agents get autonomy; operators get control.
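The split between the two loops can be pictured as a thin gate: the reasoning side proposes an action, and nothing executes until the control side returns an explicit decision. A minimal sketch of that shape (function names and the policy rule are hypothetical, not Cordum's API):

```python
# Hypothetical sketch of the Reasoning Loop / Execution Loop split.
def reasoning_loop(observation):
    # Stand-in for an LLM call: proposes an action (non-deterministic in reality).
    return {"capability": "sre.patch.apply", "risk_tags": ["prod", "write"]}

def policy_check(action):
    # Stand-in for the Safety Kernel: a deterministic decision over the proposal.
    if "prod" in action["risk_tags"]:
        return "REQUIRE_APPROVAL"
    return "ALLOW"

def execute(action):
    return f"executed {action['capability']}"

proposal = reasoning_loop(observation=None)
decision = policy_check(proposal)
# The LLM decided *what*; the gate decided *if*; execution is deterministic.
result = execute(proposal) if decision == "ALLOW" else f"blocked: {decision}"
```

The point of the shape: every execution path passes through `policy_check`, so the probabilistic component can only ever propose, never act.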
- API Gateway: HTTP/WS + gRPC for jobs, workflows, approvals, config.
- Scheduler: Safety gate, routing, job state in Redis, retries, DLQ.
- Safety Kernel: Check, evaluate, explain, and simulate policy decisions.
- Workflow Engine: DAG orchestration, approvals, retries, and run timeline.
- Memory Service: Optional gRPC memory service backed by Redis.
- Workers: Subscribe to job topics and publish results.
NATS subjects:
- sys.job.submit
- sys.job.result
- sys.job.progress
- sys.job.dlq
- sys.job.cancel
- sys.heartbeat
- sys.workflow.event
- job.* (pool subjects)
- worker.<id>.jobs (direct)
- ctx:<job_id> and res:<job_id> pointers
- job:meta:<job_id> and job indexes
- wf:def:<workflow_id> and wf:run:<run_id>
- wf:run:timeline:<run_id>
- saga:<workflow_id>:stack (compensation LIFO)
- cfg:system:* config and policy bundles
- dlq:entry:<job_id> and dlq:index
Intelligent Scheduling
The scheduler uses least-loaded worker selection with capability-based pool routing and overload detection.
```
// Least-loaded score calculation
score = active_jobs + cpu_load/100 + gpu_utilization/100

// Worker selection priority:
// 1. Direct dispatch to worker.<id>.jobs (if preferred_worker_id set)
// 2. Capability filtering (requires: kubectl, GPU, network)
// 3. Lowest-score worker in the eligible pool
// 4. Fallback to topic queue (job.*) if no workers available

// Overload detection:
if (active_jobs > max_parallel_jobs * 0.9) {
    reason = "pool_overloaded"
    // Scheduler waits or DLQs
}
```

Scheduler Features
- Direct worker routing: Low-latency dispatch to worker.<id>.jobs for specific workers
- Capability filtering: Jobs with requires=[kubectl, GPU] only go to capable workers
- Label hints: preferred_pool and preferred_worker_id for affinity
- Reconciler: Marks stale DISPATCHED/RUNNING jobs as TIMEOUT based on config/timeouts.yaml
- Pending replayer: Retries stuck PENDING jobs past dispatch timeout to avoid deadlocks
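The selection priority can be sketched end to end. This is an illustrative reimplementation of the documented rules, not the scheduler's actual Go code; the worker record fields (`active_jobs`, `cpu_load`, `gpu_utilization`, `capabilities`) follow the score formula above.

```python
def score(w):
    # Least-loaded score from the formula above.
    return w["active_jobs"] + w["cpu_load"] / 100 + w["gpu_utilization"] / 100

def select_subject(job, workers):
    # 1. Direct dispatch if the job pins a specific worker.
    preferred = job.get("preferred_worker_id")
    if preferred and any(w["id"] == preferred for w in workers):
        return f"worker.{preferred}.jobs"
    # 2. Capability filtering: a worker must advertise everything the job requires.
    eligible = [w for w in workers
                if set(job.get("requires", [])) <= set(w["capabilities"])]
    # 4. Fallback to the pool topic queue when no worker is eligible.
    if not eligible:
        return f"job.{job['pool']}"
    # 3. Otherwise the lowest-score (least-loaded) worker wins.
    best = min(eligible, key=score)
    return f"worker.{best['id']}.jobs"

workers = [
    {"id": "w1", "capabilities": ["kubectl"], "active_jobs": 3,
     "cpu_load": 40, "gpu_utilization": 0},
    {"id": "w2", "capabilities": ["kubectl", "GPU"], "active_jobs": 1,
     "cpu_load": 80, "gpu_utilization": 20},
]
subject = select_subject({"pool": "default", "requires": ["GPU"]}, workers)
```

Here only `w2` passes the GPU capability filter, so the job is dispatched to its direct subject; a job requiring a capability no worker has would fall back to `job.default`.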
Job Lifecycle
Every job follows a deterministic state machine with policy evaluation before dispatch.
1. Client writes input JSON to Redis at ctx:<job_id>
2. Publishes BusPacket{JobRequest} to sys.job.submit
3. Scheduler sets state PENDING
4. Scheduler calls Safety Kernel for policy evaluation
• ALLOW → proceed to dispatch
• DENY → state DENIED, create DLQ entry
• REQUIRE_APPROVAL → state APPROVAL_REQUIRED, wait
• ALLOW_WITH_CONSTRAINTS → apply constraints, proceed
• THROTTLE → retry after backoff delay (rate limiting)
• UNAVAILABLE → Safety Kernel unreachable, use last-known-good policy
5. Scheduler picks subject and dispatches:
PENDING → SCHEDULED → DISPATCHED → RUNNING
6. Worker loads context, executes, writes res:<job_id>
7. Worker publishes BusPacket{JobResult} to sys.job.result
8. Scheduler updates terminal state:
• SUCCEEDED (exit 0)
• FAILED (worker reported failure)
• CANCELLED (sys.job.cancel received)
• TIMEOUT (reconciler marks stale jobs)
9. DLQ entry created for non-SUCCEEDED results
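Step 4's six decision values map one-to-one onto scheduler moves, which can be sketched as a dispatch table (state names from the lifecycle above; the action labels are illustrative placeholders, not Cordum's internal function names):

```python
def handle_decision(decision):
    # Maps a Safety Kernel decision to the scheduler's next state and action.
    table = {
        "ALLOW":                  ("SCHEDULED", "dispatch"),
        "DENY":                   ("DENIED", "create_dlq_entry"),
        "REQUIRE_APPROVAL":       ("APPROVAL_REQUIRED", "wait_for_approval"),
        "ALLOW_WITH_CONSTRAINTS": ("SCHEDULED", "apply_constraints_then_dispatch"),
        "THROTTLE":               ("PENDING", "retry_after_backoff"),
        "UNAVAILABLE":            ("PENDING", "use_last_known_good_policy"),
    }
    if decision not in table:
        raise ValueError(f"unknown decision: {decision}")
    return table[decision]
```

The table makes the policy-before-dispatch invariant visible: only ALLOW and ALLOW_WITH_CONSTRAINTS ever reach the dispatch path.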
Note: JobResult also supports FAILED_RETRYABLE and FAILED_FATAL status codes for worker-level failure categorization.

State Machine Details
- PENDING: Job received, awaiting policy check and dispatch
- APPROVAL_REQUIRED: Policy requires human approval before dispatch (bound to policy snapshot + job hash)
- SCHEDULED: Policy check passed, awaiting worker assignment
- DISPATCHED: Sent to worker subject, awaiting worker pickup
- RUNNING: Worker executing the job
- SUCCEEDED: Job completed successfully (exit 0)
- FAILED: Worker reported failure → DLQ entry created
- TIMEOUT: Job exceeded timeout, marked by reconciler → DLQ entry
- CANCELLED: Cancelled via sys.job.cancel → DLQ entry
- DENIED: Blocked by Safety Kernel → DLQ entry with policy reason
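A deterministic state machine is one where every legal transition can be written down; anything not in the table is a bug. A sketch of an allowed-transitions table for the states above (this is an inferred reading of the lifecycle, not the scheduler's actual internal table; in particular the edges out of APPROVAL_REQUIRED are assumptions):

```python
# Allowed transitions inferred from the lifecycle description above.
TRANSITIONS = {
    "PENDING":           {"SCHEDULED", "APPROVAL_REQUIRED", "DENIED"},
    "APPROVAL_REQUIRED": {"SCHEDULED", "DENIED", "CANCELLED"},  # assumed edges
    "SCHEDULED":         {"DISPATCHED", "CANCELLED"},
    "DISPATCHED":        {"RUNNING", "TIMEOUT", "CANCELLED"},
    "RUNNING":           {"SUCCEEDED", "FAILED", "TIMEOUT", "CANCELLED"},
}

# Terminal states have no outgoing edges; all but SUCCEEDED produce a DLQ entry.
TERMINAL = {"SUCCEEDED", "FAILED", "TIMEOUT", "CANCELLED", "DENIED"}

def can_transition(src, dst):
    return dst in TRANSITIONS.get(src, set())
```

Encoding the machine this way lets the scheduler reject out-of-order bus messages (e.g. a late JobResult for a job already marked TIMEOUT) instead of silently overwriting state.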
Component Details
Each core service runs as a separate binary and communicates via NATS + Redis.
API Gateway (cordum-api-gateway)
HTTP/WebSocket/gRPC interface for jobs, workflows, runs, approvals, policy, DLQ, schemas, locks, artifacts.
- → Streams real-time bus events over /api/v1/stream (WebSocket)
- → API key authentication with per-tenant scoping
- → RBAC enforcement (viewer, admin) when enterprise license present
- → Cursor pagination with rich filters (state, topic, tenant, time, trace)
Scheduler (cordum-scheduler)
Subscribes to sys.job.submit, calls Safety Kernel, routes via least-loaded strategy, persists state in Redis.
- → Policy-before-dispatch: no job executes without Safety Kernel approval
- → Pool mapping from config service (config/pools.yaml bootstrapped on startup)
- → Reconciler scans for stale jobs and marks TIMEOUT based on config/timeouts.yaml
- → Pending replayer retries old PENDING jobs to avoid stuck runs
- → Per-job locks for idempotency with JetStream at-least-once delivery
- → Saga rollback: compensation stack in saga:<workflow_id>:stack, FAILED_FATAL triggers undo dispatch
Safety Kernel (cordum-safety-kernel)
gRPC service with Check, Evaluate, Explain, Simulate methods. Policy-as-code with snapshot versioning.
- → Returns ALLOW / DENY / REQUIRE_APPROVAL / ALLOW_WITH_CONSTRAINTS / THROTTLE / UNAVAILABLE
- → Loads policy from config/safety.yaml + config service bundles (cfg:system:policy)
- → Hot reload with snapshot hashing and rollback capability
- → Last-known-good fallback if verification fails
- → Decision audit record for every request (rule_id, version, decision, reason)
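Hot reload with snapshot hashing amounts to content-addressing the policy bundle: if a new bundle fails verification, the kernel keeps serving the last-known-good snapshot. A sketch of that shape (illustrative only; the kernel's actual hashing and verification logic is not shown here):

```python
import hashlib
import json

def snapshot_hash(policy_bundle: dict) -> str:
    # Canonical serialization so the same bundle always produces the same hash.
    canonical = json.dumps(policy_bundle, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:12]

class PolicyStore:
    def __init__(self):
        self.current = None          # (hash, bundle) being served
        self.last_known_good = None  # fallback if a reload fails verification

    def reload(self, bundle: dict, verify) -> str:
        h = snapshot_hash(bundle)
        if verify(bundle):
            self.last_known_good = self.current = (h, bundle)
            return h
        # Verification failed: keep serving the last-known-good snapshot.
        if self.last_known_good is not None:
            self.current = self.last_known_good
        return self.current[0] if self.current else ""

store = PolicyStore()
good = store.reload({"rules": [{"id": "r1", "effect": "ALLOW"}]}, verify=lambda b: True)
bad = store.reload({"rules": "corrupt"},
                   verify=lambda b: isinstance(b["rules"], list))
```

After the failed reload, `bad` equals `good`: the kernel reports the hash it is still serving, which doubles as the snapshot ID recorded in decision audit records.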
Workflow Engine (cordum-workflow-engine)
Stores workflow definitions and runs in Redis. Dispatches ready steps as jobs. Maintains append-only run timeline.
- → 16 step types across 5 categories: execution (worker, llm, http, container, script), control flow, gates, data, and composition
- → DAG execution with depends_on: independent steps run in parallel
- → Failed/cancelled/timed-out deps block downstream (no implicit continue-on-error)
- → Schema validation for workflow input and step I/O
- → Rerun-from-step, dry-run mode, idempotency keys on run creation
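The depends_on semantics above reduce to a readiness rule: a step is dispatchable only when every dependency has SUCCEEDED, and any failed, cancelled, or timed-out dependency blocks it permanently. A sketch of that rule (a reimplementation of the stated semantics, not the engine's code):

```python
BLOCKING = {"FAILED", "CANCELLED", "TIMEOUT"}

def ready_steps(steps, states):
    """steps: {step_id: [dependency ids]}; states: {step_id: terminal/running state}.
    A step absent from `states` has not started."""
    ready = []
    for step, deps in steps.items():
        if step in states:
            continue  # already started or finished
        if any(states.get(d) in BLOCKING for d in deps):
            continue  # blocked downstream: no implicit continue-on-error
        if all(states.get(d) == "SUCCEEDED" for d in deps):
            ready.append(step)  # independent steps surface together -> parallelism
    return sorted(ready)

steps = {"build": [], "test": ["build"], "deploy": ["test"], "notify": ["build"]}
first = ready_steps(steps, {})
after_build = ready_steps(steps, {"build": "SUCCEEDED"})
```

Once `build` succeeds, `test` and `notify` become ready in the same pass, which is exactly where the DAG's parallelism comes from; a FAILED `build` would leave the ready set empty forever.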
CAP v2 Protocol
Cordum Automation Protocol (CAP v2) is the wire contract from github.com/cordum-io/cap/v2/cordum/agent/v1. See full CAP Protocol documentation →
```protobuf
message BusPacket {
  string trace_id = 1;                       // request trace ID
  string sender_id = 2;                      // component or worker ID
  google.protobuf.Timestamp created_at = 3;
  int32 protocol_version = 4;                // currently 1

  oneof payload {
    JobRequest job_request = 10;
    JobResult job_result = 11;
    Heartbeat heartbeat = 12;
    JobProgress job_progress = 13;
    JobCancel job_cancel = 14;
    SystemAlert system_alert = 15;
  }

  bytes signature = 99;                      // optional, not enforced yet
}
```

Job Metadata
Jobs include metadata for policy evaluation, routing, and observability:
- tenant_id: Multi-tenancy isolation
- actor_id: Human or service actor
- capability: Semantic action label (e.g., "sre.patch.apply")
- risk_tags: Policy hints (prod, write, network, secrets, exec)
- requires: Capabilities for routing (kubectl, GPU, network)
- pack_id: Originating pack for observability
- idempotency_key: Dedupe key for retries
- labels: Free-form routing + observability hints
Saga Add-ons in CAP
CAP includes explicit primitives for rollback, progress, and failure semantics.
- JobRequest.compensation: Inverse action template for rollback.
- Heartbeat.progress_pct / last_memo: Checkpoint heartbeats for long tasks.
- FAILED_RETRYABLE: Transient failure (safe to retry).
- FAILED_FATAL: Non-recoverable failure (trigger rollback).
Output Safety Layer
In addition to pre-dispatch input policy checks, the scheduler can run output-safety checks on job results before release. Output decisions can allow, redact, or quarantine results and persist decision metadata for API/dashboard retrieval. See output safety details →
Production Features
Built-in durability, idempotency, and failure handling for production workloads.
JetStream Durability
Optional at-least-once delivery for durable subjects when NATS_USE_JETSTREAM=1:
- ✓ sys.job.submit, sys.job.result, sys.job.dlq
- ✓ job.* and worker.<id>.jobs
- — sys.heartbeat (best-effort fan-out)
- — sys.job.cancel (best-effort)
Idempotency
Handlers use Redis locks to handle at-least-once delivery:
- → Scheduler acquires per-job lock before state mutation
- → Workers cache published JobResult metadata
- → Workflow runs support idempotency_key on creation
- → Retryable errors NAK with delay; non-retryable ACK
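The lock-then-mutate pattern above can be sketched with an in-memory stand-in for the per-job Redis lock (in Cordum this is a Redis lock with a TTL; the class and exception names here are illustrative):

```python
class DuplicateDelivery(Exception):
    """Raised when an already-applied JobResult is redelivered."""

class JobStateStore:
    def __init__(self):
        self.locks = set()    # stand-in for per-job Redis locks
        self.applied = set()  # jobs whose result has already been applied

    def apply_result_once(self, job_id, apply):
        # At-least-once delivery means the same JobResult can arrive twice;
        # the lock plus the applied-set makes the mutation effectively once.
        if job_id in self.applied:
            raise DuplicateDelivery(job_id)  # caller ACKs without re-applying
        if job_id in self.locks:
            return "NAK"  # another consumer holds the lock; redeliver later
        self.locks.add(job_id)
        try:
            apply(job_id)
            self.applied.add(job_id)
            return "ACK"
        finally:
            self.locks.discard(job_id)

store = JobStateStore()
seen = []
first = store.apply_result_once("job-1", seen.append)
try:
    store.apply_result_once("job-1", seen.append)
    second = "applied-twice"
except DuplicateDelivery:
    second = "ACK-duplicate"
```

The second delivery is acknowledged without touching state, so the state mutation happened exactly once even though the message arrived twice.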
Resource Locks
Distributed locks prevent concurrent mutations:
- → Shared/exclusive modes with TTL and owner tracking
- → Pack install acquires packs:global + pack:<id> locks
- → Workflows can hold locks on resources (repo, cluster, service)
- → Gateway APIs: acquire, release, renew, get lock status
Dead Letter Queue
All non-SUCCEEDED jobs go to DLQ for observability:
- → Stores error_code, error_message, last_state, attempts
- → Gateway APIs: list, get, retry (creates new job), delete
- → DENIED jobs include policy reason and matched rule_id
- → Indexed by job_id and sortable by created_at for debugging
Saga Rollback
Durable rollback for workflows with compensation templates:
- → Completed compensations stored in saga:<workflow_id>:stack
- → FAILED_FATAL triggers reverse-order undo dispatch
- → Compensation jobs dispatched at CRITICAL priority
- → Metrics track rollback duration and failures
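Reverse-order undo dispatch is a LIFO replay of the compensation stack: compensations were pushed as steps completed, so popping them undoes the workflow in reverse order of execution. A sketch, with a Python list standing in for the saga:<workflow_id>:stack Redis list and a hypothetical dispatch callback:

```python
def rollback(compensation_stack, dispatch):
    # Pop the stack (LIFO) so the most recently completed step is undone first,
    # dispatching each compensation job at CRITICAL priority.
    undone = []
    while compensation_stack:
        comp = compensation_stack.pop()
        dispatch(comp, priority="CRITICAL")
        undone.append(comp["step"])
    return undone

# Stack as it would look after three steps completed in this order:
stack = [
    {"step": "provision", "action": "deprovision"},
    {"step": "configure", "action": "unconfigure"},
    {"step": "deploy",    "action": "undeploy"},
]
dispatched = []
order = rollback(stack, lambda comp, priority: dispatched.append((comp["action"], priority)))
```

A FAILED_FATAL after `deploy` therefore undoes `deploy`, then `configure`, then `provision`, mirroring the forward execution in reverse.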
