Architecture
Production-grade control plane with intelligent scheduling, policy-before-dispatch, and full observability. Built on NATS + Redis with CAP v2 protocol.
```
Clients/UI
     │
     ▼
API Gateway (HTTP/WS + gRPC)
     │  writes ctx/res/artifact pointers
     ▼
Redis (state, config, DLQ)
     │
     ▼
NATS bus (sys.* + job.*)
     │
     ├──▶ Scheduler (safety gate + routing)
     │        │
     │        └──▶ Safety Kernel (gRPC policy check)
     │
     ├──▶ External Workers
     │
     └──▶ Workflow Engine (run orchestration)
```
Concept: The Agency-Control Tradeoff
Raw agents (ReAct loops) are non-deterministic and hard to audit. Every LLM call produces probabilistic output — fine for suggestions, dangerous for production actions.
Cordum decouples the Reasoning Loop (LLM) from the Execution Loop (Safety Kernel). The LLM decides what to do. Cordum decides if it is allowed, then executes deterministically.
Deterministic, auditable production AI. Every action is policy-evaluated, every decision is recorded, every workflow is reproducible. Agents get autonomy; operators get control.
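The split between the two loops can be pictured as a thin gate: the reasoning side proposes an action, and nothing executes until the control side returns an explicit decision. A minimal sketch of that shape (function names and the policy rule are hypothetical, not Cordum's API):

```python
# Hypothetical sketch of the Reasoning Loop / Execution Loop split.
def reasoning_loop(observation):
    # Stand-in for an LLM call: proposes an action (non-deterministic in reality).
    return {"capability": "sre.patch.apply", "risk_tags": ["prod", "write"]}

def policy_check(action):
    # Stand-in for the Safety Kernel: a deterministic decision over the proposal.
    if "prod" in action["risk_tags"]:
        return "REQUIRE_APPROVAL"
    return "ALLOW"

def execute(action):
    return f"executed {action['capability']}"

proposal = reasoning_loop(observation=None)
decision = policy_check(proposal)
# The LLM decided *what*; the gate decided *if*; execution is deterministic.
result = execute(proposal) if decision == "ALLOW" else f"blocked: {decision}"
```

The point of the shape: every execution path passes through `policy_check`, so the probabilistic component can only ever propose, never act.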
- API Gateway: HTTP/WS + gRPC for jobs, workflows, approvals, config.
- Scheduler: Safety gate, routing, job state in Redis, retries, DLQ.
- Safety Kernel: Check, evaluate, explain, and simulate policy decisions.
- Workflow Engine: DAG orchestration, approvals, retries, and run timeline.
- Memory Service: Optional gRPC memory service backed by Redis.
- Workers: Subscribe to job topics and publish results.
NATS subjects:
- sys.job.submit
- sys.job.result
- sys.job.progress
- sys.job.dlq
- sys.job.cancel
- sys.heartbeat
- sys.workflow.event
- job.* (pool subjects)
- worker.<id>.jobs (direct)
- ctx:<job_id> and res:<job_id> pointers
- job:meta:<job_id> and job indexes
- wf:def:<workflow_id> and wf:run:<run_id>
- wf:run:timeline:<run_id>
- saga:<workflow_id>:stack (compensation LIFO)
- cfg:system:* config and policy bundles
- dlq:entry:<job_id> and dlq:index
Intelligent Scheduling
The scheduler uses least-loaded worker selection with capability-based pool routing and overload detection.
```
// Least-loaded score calculation
score = active_jobs + cpu_load/100 + gpu_utilization/100

// Worker selection priority:
// 1. Direct dispatch to worker.<id>.jobs (if preferred_worker_id set)
// 2. Capability filtering (requires: kubectl, GPU, network)
// 3. Lowest-score worker in the eligible pool
// 4. Fallback to topic queue (job.*) if no workers available

// Overload detection:
if (active_jobs > max_parallel_jobs * 0.9) {
    reason = "pool_overloaded"
    // Scheduler waits or DLQs
}
```

Scheduler Features
- Direct worker routing: Low-latency dispatch to worker.<id>.jobs for specific workers
- Capability filtering: Jobs with requires=[kubectl, GPU] only go to capable workers
- Label hints: preferred_pool and preferred_worker_id for affinity
- Reconciler: Marks stale DISPATCHED/RUNNING jobs as TIMEOUT based on config/timeouts.yaml
- Pending replayer: Retries stuck PENDING jobs past dispatch timeout to avoid deadlocks
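The selection priority can be sketched end to end. This is an illustrative reimplementation of the documented rules, not the scheduler's actual Go code; the worker record fields (`active_jobs`, `cpu_load`, `gpu_utilization`, `capabilities`) follow the score formula above.

```python
def score(w):
    # Least-loaded score from the formula above.
    return w["active_jobs"] + w["cpu_load"] / 100 + w["gpu_utilization"] / 100

def select_subject(job, workers):
    # 1. Direct dispatch if the job pins a specific worker.
    preferred = job.get("preferred_worker_id")
    if preferred and any(w["id"] == preferred for w in workers):
        return f"worker.{preferred}.jobs"
    # 2. Capability filtering: a worker must advertise everything the job requires.
    eligible = [w for w in workers
                if set(job.get("requires", [])) <= set(w["capabilities"])]
    # 4. Fallback to the pool topic queue when no worker is eligible.
    if not eligible:
        return f"job.{job['pool']}"
    # 3. Otherwise the lowest-score (least-loaded) worker wins.
    best = min(eligible, key=score)
    return f"worker.{best['id']}.jobs"

workers = [
    {"id": "w1", "capabilities": ["kubectl"], "active_jobs": 3,
     "cpu_load": 40, "gpu_utilization": 0},
    {"id": "w2", "capabilities": ["kubectl", "GPU"], "active_jobs": 1,
     "cpu_load": 80, "gpu_utilization": 20},
]
subject = select_subject({"pool": "default", "requires": ["GPU"]}, workers)
```

Here only `w2` passes the GPU capability filter, so the job is dispatched to its direct subject; a job requiring a capability no worker has would fall back to `job.default`.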
Job Lifecycle
Every job follows a deterministic state machine with policy evaluation before dispatch.
1. Client writes input JSON to Redis at ctx:<job_id>
2. Publishes BusPacket{JobRequest} to sys.job.submit
3. Scheduler sets state PENDING
4. Scheduler calls Safety Kernel for policy evaluation
• ALLOW → proceed to dispatch
• DENY → state DENIED, create DLQ entry
• REQUIRE_APPROVAL → state APPROVAL_REQUIRED, wait
• ALLOW_WITH_CONSTRAINTS → apply constraints, proceed
• THROTTLE → retry after backoff delay (rate limiting)
• UNAVAILABLE → Safety Kernel unreachable, use last-known-good policy
5. Scheduler picks subject and dispatches:
PENDING → SCHEDULED → DISPATCHED → RUNNING
6. Worker loads context, executes, writes res:<job_id>
7. Worker publishes BusPacket{JobResult} to sys.job.result
8. Scheduler updates terminal state:
• SUCCEEDED (exit 0)
• FAILED (worker reported failure)
• CANCELLED (sys.job.cancel received)
• TIMEOUT (reconciler marks stale jobs)
9. DLQ entry created for non-SUCCEEDED results
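Step 4's six decision values map one-to-one onto scheduler moves, which can be sketched as a dispatch table (state names from the lifecycle above; the action labels are illustrative placeholders, not Cordum's internal function names):

```python
def handle_decision(decision):
    # Maps a Safety Kernel decision to the scheduler's next state and action.
    table = {
        "ALLOW":                  ("SCHEDULED", "dispatch"),
        "DENY":                   ("DENIED", "create_dlq_entry"),
        "REQUIRE_APPROVAL":       ("APPROVAL_REQUIRED", "wait_for_approval"),
        "ALLOW_WITH_CONSTRAINTS": ("SCHEDULED", "apply_constraints_then_dispatch"),
        "THROTTLE":               ("PENDING", "retry_after_backoff"),
        "UNAVAILABLE":            ("PENDING", "use_last_known_good_policy"),
    }
    if decision not in table:
        raise ValueError(f"unknown decision: {decision}")
    return table[decision]
```

The table makes the policy-before-dispatch invariant visible: only ALLOW and ALLOW_WITH_CONSTRAINTS ever reach the dispatch path.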
Note: JobResult also supports FAILED_RETRYABLE and FAILED_FATAL status codes for worker-level failure categorization.

State Machine Details
- PENDING: Job received, awaiting policy check and dispatch
- APPROVAL_REQUIRED: Policy requires human approval before dispatch (bound to policy snapshot + job hash)
- SCHEDULED: Policy check passed, awaiting worker assignment
- DISPATCHED: Sent to worker subject, awaiting worker pickup
- RUNNING: Worker executing the job
- SUCCEEDED: Job completed successfully (exit 0)
- FAILED: Worker reported failure → DLQ entry created
- TIMEOUT: Job exceeded timeout, marked by reconciler → DLQ entry
- CANCELLED: Cancelled via sys.job.cancel → DLQ entry
- DENIED: Blocked by Safety Kernel → DLQ entry with policy reason
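A deterministic state machine is one where every legal transition can be written down; anything not in the table is a bug. A sketch of an allowed-transitions table for the states above (this is an inferred reading of the lifecycle, not the scheduler's actual internal table; in particular the edges out of APPROVAL_REQUIRED are assumptions):

```python
# Allowed transitions inferred from the lifecycle description above.
TRANSITIONS = {
    "PENDING":           {"SCHEDULED", "APPROVAL_REQUIRED", "DENIED"},
    "APPROVAL_REQUIRED": {"SCHEDULED", "DENIED", "CANCELLED"},  # assumed edges
    "SCHEDULED":         {"DISPATCHED", "CANCELLED"},
    "DISPATCHED":        {"RUNNING", "TIMEOUT", "CANCELLED"},
    "RUNNING":           {"SUCCEEDED", "FAILED", "TIMEOUT", "CANCELLED"},
}

# Terminal states have no outgoing edges; all but SUCCEEDED produce a DLQ entry.
TERMINAL = {"SUCCEEDED", "FAILED", "TIMEOUT", "CANCELLED", "DENIED"}

def can_transition(src, dst):
    return dst in TRANSITIONS.get(src, set())
```

Encoding the machine this way lets the scheduler reject out-of-order bus messages (e.g. a late JobResult for a job already marked TIMEOUT) instead of silently overwriting state.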
Component Details
Each core service runs as a separate binary and communicates via NATS + Redis.
API Gateway (cordum-api-gateway)
HTTP/WebSocket/gRPC interface for jobs, workflows, runs, approvals, policy, DLQ, schemas, locks, artifacts.
- → Streams real-time bus events over /api/v1/stream (WebSocket)
- → API key authentication with per-tenant scoping
- → RBAC enforcement (viewer, admin) when enterprise license present
- → Cursor pagination with rich filters (state, topic, tenant, time, trace)
Scheduler (cordum-scheduler)
Subscribes to sys.job.submit, calls Safety Kernel, routes via least-loaded strategy, persists state in Redis.
- → Policy-before-dispatch: no job executes without Safety Kernel approval
- → Pool mapping from config service (config/pools.yaml bootstrapped on startup)
- → Reconciler scans for stale jobs and marks TIMEOUT based on config/timeouts.yaml
- → Pending replayer retries old PENDING jobs to avoid stuck runs
- → Per-job locks for idempotency with JetStream at-least-once delivery
- → Saga rollback: compensation stack in saga:<workflow_id>:stack, FAILED_FATAL triggers undo dispatch
Safety Kernel (cordum-safety-kernel)
gRPC service with Check, Evaluate, Explain, Simulate methods. Policy-as-code with snapshot versioning.
- → Returns ALLOW / DENY / REQUIRE_APPROVAL / ALLOW_WITH_CONSTRAINTS / THROTTLE / UNAVAILABLE
- → Loads policy from config/safety.yaml + config service bundles (cfg:system:policy)
- → Hot reload with snapshot hashing and rollback capability
- → Last-known-good fallback if verification fails
- → Decision audit record for every request (rule_id, version, decision, reason)
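Hot reload with snapshot hashing amounts to content-addressing the policy bundle: if a new bundle fails verification, the kernel keeps serving the last-known-good snapshot. A sketch of that shape (illustrative only; the kernel's actual hashing and verification logic is not shown here):

```python
import hashlib
import json

def snapshot_hash(policy_bundle: dict) -> str:
    # Canonical serialization so the same bundle always produces the same hash.
    canonical = json.dumps(policy_bundle, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:12]

class PolicyStore:
    def __init__(self):
        self.current = None          # (hash, bundle) being served
        self.last_known_good = None  # fallback if a reload fails verification

    def reload(self, bundle: dict, verify) -> str:
        h = snapshot_hash(bundle)
        if verify(bundle):
            self.last_known_good = self.current = (h, bundle)
            return h
        # Verification failed: keep serving the last-known-good snapshot.
        if self.last_known_good is not None:
            self.current = self.last_known_good
        return self.current[0] if self.current else ""

store = PolicyStore()
good = store.reload({"rules": [{"id": "r1", "effect": "ALLOW"}]}, verify=lambda b: True)
bad = store.reload({"rules": "corrupt"},
                   verify=lambda b: isinstance(b["rules"], list))
```

After the failed reload, `bad` equals `good`: the kernel reports the hash it is still serving, which doubles as the snapshot ID recorded in decision audit records.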
Workflow Engine (cordum-workflow-engine)
Stores workflow definitions and runs in Redis. Dispatches ready steps as jobs. Maintains append-only run timeline.
- → 16 step types across 5 categories: execution (worker, llm, http, container, script), control flow, gates, data, and composition
- → DAG execution with depends_on: independent steps run in parallel
- → Failed/cancelled/timed-out deps block downstream (no implicit continue-on-error)
- → Schema validation for workflow input and step I/O
- → Rerun-from-step, dry-run mode, idempotency keys on run creation
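The depends_on semantics above reduce to a readiness rule: a step is dispatchable only when every dependency has SUCCEEDED, and any failed, cancelled, or timed-out dependency blocks it permanently. A sketch of that rule (a reimplementation of the stated semantics, not the engine's code):

```python
BLOCKING = {"FAILED", "CANCELLED", "TIMEOUT"}

def ready_steps(steps, states):
    """steps: {step_id: [dependency ids]}; states: {step_id: terminal/running state}.
    A step absent from `states` has not started."""
    ready = []
    for step, deps in steps.items():
        if step in states:
            continue  # already started or finished
        if any(states.get(d) in BLOCKING for d in deps):
            continue  # blocked downstream: no implicit continue-on-error
        if all(states.get(d) == "SUCCEEDED" for d in deps):
            ready.append(step)  # independent steps surface together -> parallelism
    return sorted(ready)

steps = {"build": [], "test": ["build"], "deploy": ["test"], "notify": ["build"]}
first = ready_steps(steps, {})
after_build = ready_steps(steps, {"build": "SUCCEEDED"})
```

Once `build` succeeds, `test` and `notify` become ready in the same pass, which is exactly where the DAG's parallelism comes from; a FAILED `build` would leave the ready set empty forever.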
CAP v2 Protocol
Cordum Automation Protocol (CAP v2) is the wire contract from github.com/cordum-io/cap/v2/cordum/agent/v1. See full CAP Protocol documentation →
```protobuf
message BusPacket {
  string trace_id = 1;                       // request trace ID
  string sender_id = 2;                      // component or worker ID
  google.protobuf.Timestamp created_at = 3;
  int32 protocol_version = 4;                // currently 1

  oneof payload {
    JobRequest job_request = 10;
    JobResult job_result = 11;
    Heartbeat heartbeat = 12;
    JobProgress job_progress = 13;
    JobCancel job_cancel = 14;
    SystemAlert system_alert = 15;
  }

  bytes signature = 99;                      // optional, not enforced yet
}
```

Job Metadata
Jobs include metadata for policy evaluation, routing, and observability:
- tenant_id: Multi-tenancy isolation
- actor_id: Human or service actor
- capability: Semantic action label (e.g., "sre.patch.apply")
- risk_tags: Policy hints (prod, write, network, secrets, exec)
- requires: Capabilities for routing (kubectl, GPU, network)
- pack_id: Originating pack for observability
- idempotency_key: Dedupe key for retries
- labels: Free-form routing + observability hints
Saga Add-ons in CAP
CAP includes explicit primitives for rollback, progress, and failure semantics.
- JobRequest.compensation: Inverse action template for rollback.
- Heartbeat.progress_pct / last_memo: Checkpoint heartbeats for long tasks.
- FAILED_RETRYABLE: Transient failure (safe to retry).
- FAILED_FATAL: Non-recoverable failure (trigger rollback).
Output Safety Layer
In addition to pre-dispatch input policy checks, the scheduler can run output-safety checks on job results before release. Output decisions can allow, redact, or quarantine results and persist decision metadata for API/dashboard retrieval. See output safety details →
Production Features
Built-in durability, idempotency, and failure handling for production workloads.
JetStream Durability
Optional at-least-once delivery for durable subjects when NATS_USE_JETSTREAM=1:
- ✓ sys.job.submit, sys.job.result, sys.job.dlq
- ✓ job.* and worker.<id>.jobs
- — sys.heartbeat (best-effort fan-out)
- — sys.job.cancel (best-effort)
Idempotency
Handlers use Redis locks to handle at-least-once delivery:
- → Scheduler acquires per-job lock before state mutation
- → Workers cache published JobResult metadata
- → Workflow runs support idempotency_key on creation
- → Retryable errors NAK with delay; non-retryable ACK
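The lock-then-mutate pattern above can be sketched with an in-memory stand-in for the per-job Redis lock (in Cordum this is a Redis lock with a TTL; the class and exception names here are illustrative):

```python
class DuplicateDelivery(Exception):
    """Raised when an already-applied JobResult is redelivered."""

class JobStateStore:
    def __init__(self):
        self.locks = set()    # stand-in for per-job Redis locks
        self.applied = set()  # jobs whose result has already been applied

    def apply_result_once(self, job_id, apply):
        # At-least-once delivery means the same JobResult can arrive twice;
        # the lock plus the applied-set makes the mutation effectively once.
        if job_id in self.applied:
            raise DuplicateDelivery(job_id)  # caller ACKs without re-applying
        if job_id in self.locks:
            return "NAK"  # another consumer holds the lock; redeliver later
        self.locks.add(job_id)
        try:
            apply(job_id)
            self.applied.add(job_id)
            return "ACK"
        finally:
            self.locks.discard(job_id)

store = JobStateStore()
seen = []
first = store.apply_result_once("job-1", seen.append)
try:
    store.apply_result_once("job-1", seen.append)
    second = "applied-twice"
except DuplicateDelivery:
    second = "ACK-duplicate"
```

The second delivery is acknowledged without touching state, so the state mutation happened exactly once even though the message arrived twice.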
Resource Locks
Distributed locks prevent concurrent mutations:
- → Shared/exclusive modes with TTL and owner tracking
- → Pack install acquires packs:global + pack:<id> locks
- → Workflows can hold locks on resources (repo, cluster, service)
- → Gateway APIs: acquire, release, renew, get lock status
Dead Letter Queue
All non-SUCCEEDED jobs go to DLQ for observability:
- → Stores error_code, error_message, last_state, attempts
- → Gateway APIs: list, get, retry (creates new job), delete
- → DENIED jobs include policy reason and matched rule_id
- → Indexed by job_id and sortable by created_at for debugging
Saga Rollback
Durable rollback for workflows with compensation templates:
- → Completed compensations stored in saga:<workflow_id>:stack
- → FAILED_FATAL triggers reverse-order undo dispatch
- → Compensation jobs dispatched at CRITICAL priority
- → Metrics track rollback duration and failures
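Reverse-order undo dispatch is a LIFO replay of the compensation stack: compensations were pushed as steps completed, so popping them undoes the workflow in reverse order of execution. A sketch, with a Python list standing in for the saga:<workflow_id>:stack Redis list and a hypothetical dispatch callback:

```python
def rollback(compensation_stack, dispatch):
    # Pop the stack (LIFO) so the most recently completed step is undone first,
    # dispatching each compensation job at CRITICAL priority.
    undone = []
    while compensation_stack:
        comp = compensation_stack.pop()
        dispatch(comp, priority="CRITICAL")
        undone.append(comp["step"])
    return undone

# Stack as it would look after three steps completed in this order:
stack = [
    {"step": "provision", "action": "deprovision"},
    {"step": "configure", "action": "unconfigure"},
    {"step": "deploy",    "action": "undeploy"},
]
dispatched = []
order = rollback(stack, lambda comp, priority: dispatched.append((comp["action"], priority)))
```

A FAILED_FATAL after `deploy` therefore undoes `deploy`, then `configure`, then `provision`, mirroring the forward execution in reverse.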
