Configuration Reference

Cordum uses environment variables and YAML config files to configure every service. Config files are validated against embedded JSON schemas at startup: invalid configs return errors, and timeout settings fall back to defaults.

Environment Variables

Shared (all services)

Variable | Description | Default
NATS_URL | NATS server connection URL | nats://nats:4222
REDIS_URL | Redis connection URL | redis://redis:6379
NATS_USE_JETSTREAM | Enable JetStream durable delivery (0 or 1) | 0
CORDUM_ENV | Set to production for strict security defaults | -
CORDUM_PRODUCTION | Alternative production flag (true/false) | -
CORDUM_LOG_FORMAT | Log output format | text
CORDUM_TLS_MIN_VERSION | Minimum TLS version (1.2 or 1.3) | 1.3 in prod
CORDUM_GRPC_REFLECTION | Enable gRPC reflection (dev only, set to 1) | -
POOL_CONFIG_PATH | Path to pools.yaml | -
TIMEOUT_CONFIG_PATH | Path to timeouts.yaml | -
SAFETY_KERNEL_ADDR | gRPC address of the Safety Kernel | -
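
As an illustration of how a client script might resolve these shared settings, here is a minimal Python sketch. It is not Cordum's own loader: the function names are hypothetical, and the rule that either production flag enables production mode is an assumption, since the table does not say how the two flags interact.

```python
import os

# Defaults copied from the table above; variables without a documented
# default are left out of this sketch.
SHARED_DEFAULTS = {
    "NATS_URL": "nats://nats:4222",
    "REDIS_URL": "redis://redis:6379",
    "NATS_USE_JETSTREAM": "0",
    "CORDUM_LOG_FORMAT": "text",
}


def load_shared_config(env=None):
    """Return the shared settings, using the documented default when unset."""
    env = os.environ if env is None else env
    return {name: env.get(name, default) for name, default in SHARED_DEFAULTS.items()}


def is_production(env=None):
    """Assumption: either production flag on its own enables production mode."""
    env = os.environ if env is None else env
    return (
        env.get("CORDUM_ENV", "").lower() == "production"
        or env.get("CORDUM_PRODUCTION", "").lower() == "true"
    )
```

Passing a plain dict instead of os.environ keeps the helpers easy to exercise in tests.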

TLS & Redis Clustering

Variable | Description | Default
NATS_TLS_CA / CERT / KEY | NATS client TLS certificate chain | -
NATS_TLS_INSECURE | Skip NATS TLS verification (dev only) | -
NATS_TLS_SERVER_NAME | Expected NATS server CN for verification | -
REDIS_TLS_CA / CERT / KEY | Redis client TLS certificate chain | -
REDIS_TLS_INSECURE | Skip Redis TLS verification (dev only) | -
REDIS_CLUSTER_ADDRESSES | Comma-separated host:port seeds for Redis Cluster | -
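
The REDIS_CLUSTER_ADDRESSES format (comma-separated host:port seeds) can be validated before deployment with a few lines of Python. This is a sketch only; the function name is hypothetical and the tolerance for blank entries is an assumption, not documented Cordum behavior.

```python
def parse_cluster_addresses(raw):
    """Split a comma-separated seed list into (host, port) tuples."""
    seeds = []
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate trailing commas and blank entries
        host, sep, port = item.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {item!r}")
        seeds.append((host, int(port)))
    return seeds
```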

API Gateway

Core

Variable | Description | Default
GATEWAY_HTTP_ADDR | HTTP listen address | :8081
GATEWAY_GRPC_ADDR | gRPC listen address | :8080
GATEWAY_WS_METRICS_ADDR | WebSocket and Prometheus metrics address | :9092
CORDUM_API_KEY | Single API key for authentication | -
CORDUM_API_KEYS | Comma-separated or JSON array of API keys | -
CORDUM_API_KEYS_PATH | File path to API keys (hot-reloads on change) | -
CORDUM_ALLOWED_ORIGINS | CORS allowed origins (also CORDUM_CORS_ALLOW_ORIGINS) | -
TENANT_ID | Default tenant ID for single-tenant mode | -
API_RATE_LIMIT_RPS | Rate limit requests per second (per tenant) | -
API_RATE_LIMIT_BURST | Rate limit burst size (per tenant) | -
ARTIFACT_MAX_BYTES | Max artifact upload/download size | -
CORDUM_ALLOW_INSECURE_NO_AUTH | Allow anonymous auth (dev only, set to 1) | -
CORDUM_ALLOW_HEADER_PRINCIPAL | Trust X-Principal header (disabled in prod) | -
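
CORDUM_API_KEYS accepts either a comma-separated string or a JSON array. The sketch below shows one way to handle both forms in Python; detecting the JSON form by a leading "[" is an assumption about how the two are told apart, and parse_api_keys is a hypothetical helper, not part of Cordum.

```python
import json


def parse_api_keys(raw):
    """Parse a key list given as a JSON array or as comma-separated text."""
    raw = raw.strip()
    if raw.startswith("["):
        keys = json.loads(raw)
        if not isinstance(keys, list) or not all(isinstance(k, str) for k in keys):
            raise ValueError("JSON form must be an array of strings")
    else:
        keys = raw.split(",")
    # Drop surrounding whitespace and empty entries.
    return [k.strip() for k in keys if k.strip()]
```
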

TLS

Variable | Description | Default
GATEWAY_HTTP_TLS_CERT / KEY | HTTP server TLS certificate and key | -
GRPC_TLS_CERT / KEY | gRPC server TLS certificate and key | -

JWT Authentication

Variable | Description | Default
CORDUM_JWT_HMAC_SECRET | HMAC secret for JWT verification | -
CORDUM_JWT_PUBLIC_KEY | RSA/ECDSA public key (inline PEM) | -
CORDUM_JWT_PUBLIC_KEY_PATH | Path to public key file | -
CORDUM_JWT_ISSUER | Expected JWT issuer claim | -
CORDUM_JWT_AUDIENCE | Expected JWT audience claim | -
CORDUM_JWT_DEFAULT_ROLE | Default role for JWT-authenticated users | -
CORDUM_JWT_REQUIRED | Require JWT auth on all requests | -

User Authentication

Variable | Description | Default
CORDUM_USER_AUTH_ENABLED | Enable user/password auth (stores users in Redis) | -
CORDUM_ADMIN_USERNAME | Default admin username | admin
CORDUM_ADMIN_PASSWORD | Default admin password (created on first startup) | -
CORDUM_ADMIN_EMAIL | Optional admin email | -

Scheduler

Variable | Description | Default
JOB_META_TTL | TTL for job metadata in Redis (also JOB_META_TTL_SECONDS) | -
WORKER_SNAPSHOT_INTERVAL | Interval for worker snapshot updates | -
SCHEDULER_CONFIG_RELOAD_INTERVAL | Config overlay reload interval | 30s
NATS_JS_ACK_WAIT | JetStream acknowledgment wait timeout | -
NATS_JS_MAX_AGE | JetStream message max age | -
NATS_JS_REPLICAS | JetStream stream replication factor | -
SCHEDULER_METRICS_ADDR | Prometheus metrics address | :9090

Safety Kernel

Variable | Description | Default
SAFETY_KERNEL_ADDR | gRPC listen address for the Safety Kernel | -
SAFETY_POLICY_PATH | Path to safety.yaml policy file | -
SAFETY_POLICY_URL | URL to fetch policy (alternative to file path) | -
SAFETY_POLICY_URL_ALLOWLIST | Comma-separated allowed hostnames for policy URLs | -
SAFETY_DECISION_CACHE_TTL | Decision cache TTL | 0 (disabled)
SAFETY_POLICY_RELOAD_INTERVAL | Interval for policy file hot reload | -
SAFETY_POLICY_PUBLIC_KEY | Ed25519 public key for policy signature verification | -
SAFETY_POLICY_SIGNATURE_REQUIRED | Require signed policy files (true/false) | -
SAFETY_POLICY_CONFIG_SCOPE | Config-service scope for policy fragments | -
SAFETY_POLICY_CONFIG_DISABLE | Disable config-service policy overlays | -

TLS

Variable | Description | Default
SAFETY_KERNEL_TLS_CERT / KEY | gRPC server TLS certificate and key | -
SAFETY_KERNEL_TLS_CA | Client CA for mTLS verification | -
SAFETY_KERNEL_TLS_REQUIRED | Require TLS for all connections | -

The Safety Kernel reads policy bundle fragments from the config service in Redis. Ensure REDIS_URL is set when using pack policy overlays.

Workflow Engine

Variable | Description | Default
WORKFLOW_ENGINE_HTTP_ADDR | HTTP listen address for the workflow engine | -
WORKFLOW_ENGINE_SCAN_INTERVAL | Interval for scanning pending workflow runs | -
WORKFLOW_ENGINE_RUN_SCAN_LIMIT | Max runs to process per scan cycle | -

Context Engine

Variable | Description | Default
CONTEXT_ENGINE_ADDR | gRPC listen address | -
CONTEXT_ENGINE_TLS_CERT / KEY | gRPC server TLS certificate and key | -
CONTEXT_ENGINE_TLS_CA | Client CA for mTLS | -
CONTEXT_ENGINE_TLS_REQUIRED | Require TLS for all connections | -

Enterprise

Variable | Description | Default
CORDUM_LICENSE_PATH | Path to signed license file | -
CORDUM_LICENSE_KEY | Signed license key/token | -
CORDUM_REQUIRE_RBAC | Enable role checks on policy/config/approvals/packs | -

Enterprise auth/licensing features are delivered by the separate cordum-enterprise repository.

Config Files

Docker Compose mounts these files from config/. The control plane validates them against embedded JSON schemas at startup. Invalid configs return errors; timeouts fall back to defaults.

pools.yaml — Topic-to-Pool Routing

Maps job topics to worker pools. Each pool can declare requires: capability labels that a worker must advertise to join the pool. The scheduler uses this mapping plus a least-loaded strategy to dispatch jobs.

config/pools.yaml
topics:
  job.default: default
  job.sre-investigator.collect.k8s: sre-investigators
  job.sre-investigator.collect.logs: sre-investigators
  job.deploy-agents.apply: deploy-agents
  job.compliance-agents.process: compliance-agents

pools:
  default:
    requires: []
  sre-investigators:
    requires: [k8s, logs]
  deploy-agents:
    requires: [deploy]
  compliance-agents:
    requires: []
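
The routing described above can be sketched in Python as follows. This is an illustration, not the scheduler's implementation: the worker record shape ("id", "capabilities", "active_jobs") and the use of the active job count as the load metric are assumptions, and the config dict mirrors only part of the sample file.

```python
# A trimmed copy of the sample pools.yaml, expressed as a Python dict.
POOLS_CONFIG = {
    "topics": {
        "job.default": "default",
        "job.sre-investigator.collect.k8s": "sre-investigators",
        "job.deploy-agents.apply": "deploy-agents",
    },
    "pools": {
        "default": {"requires": []},
        "sre-investigators": {"requires": ["k8s", "logs"]},
        "deploy-agents": {"requires": ["deploy"]},
    },
}


def pick_worker(topic, workers, config=POOLS_CONFIG):
    """Return the id of the least-loaded eligible worker, or None.

    `workers` is a hypothetical list of dicts with "id", "capabilities",
    and "active_jobs" keys.
    """
    pool = config["topics"].get(topic)
    if pool is None:
        return None  # no route configured for this topic
    required = set(config["pools"][pool]["requires"])
    # A worker qualifies only if it advertises every required capability.
    eligible = [w for w in workers if required <= set(w["capabilities"])]
    if not eligible:
        return None
    return min(eligible, key=lambda w: w["active_jobs"])["id"]
```

A pool with an empty requires list accepts any worker, so job.default goes to whichever worker currently has the fewest active jobs.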

safety.yaml — Safety Kernel Policy

Per-tenant policy configuration for the Safety Kernel. Controls topic allow/deny lists, repository host restrictions, and MCP (Model Context Protocol) server/tool/resource filtering.

config/safety.yaml
default_tenant: default
tenants:
  default:
    allow_topics:
      - "job.*"
    deny_topics:
      - "sys.*"
    allowed_repo_hosts: []
    denied_repo_hosts: []
    mcp:
      allow_servers: []
      deny_servers: []
      allow_tools: []
      deny_tools: []
      allow_resources: []
      deny_resources: []
      allow_actions: []
      deny_actions: []

MCP Filtering: When a job declares MCP labels, the Safety Kernel evaluates these allow/deny lists. Leave lists empty to allow all MCP calls by default.
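
To make the allow/deny evaluation concrete, here is a small Python sketch. Two things in it are assumptions rather than documented behavior: patterns are treated as shell-style globs (so "job.*" also matches multi-segment topics such as job.a.b), and a deny match takes precedence over an allow match. The function names are hypothetical.

```python
from fnmatch import fnmatchcase


def topic_allowed(topic, policy):
    """Check a topic against a tenant's allow_topics and deny_topics lists."""
    if any(fnmatchcase(topic, p) for p in policy.get("deny_topics", [])):
        return False  # assumption: deny wins over allow
    return any(fnmatchcase(topic, p) for p in policy.get("allow_topics", []))


def mcp_allowed(name, allow, deny):
    """Check an MCP server, tool, or resource name.

    An empty allow list permits everything, per the note above;
    the deny list still applies.
    """
    if any(fnmatchcase(name, p) for p in deny):
        return False
    return not allow or any(fnmatchcase(name, p) for p in allow)
```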

system.yaml — System-Wide Configuration

Sample payload for the config service. Not mounted by default — use POST /api/v1/config to store it. Controls budgets, rate limits, retry policy, resource limits, model access, context windows, SLOs, and integrations.

config/system.yaml
safety:
  pii_detection_enabled: true
  pii_action: "block"
  injection_detection: true
  injection_sensitivity: "high"
  content_filter_enabled: true

budget:
  daily_limit_usd: 1000.0
  monthly_limit_usd: 10000.0
  per_job_max_usd: 5.0
  per_workflow_max_usd: 50.0
  alert_at_percent: [50, 75, 90, 100]
  action_at_limit: "throttle"

rate_limits:
  requests_per_minute: 120000
  concurrent_jobs: 10000
  concurrent_workflows: 5
  queue_size: 5000

retry:
  max_retries: 3
  initial_backoff: 1s
  max_backoff: 30s
  backoff_multiplier: 2.0

resources:
  default_priority: "interactive"
  max_timeout_seconds: 300
  default_timeout_seconds: 60
  max_parallel_steps: 10

models:
  allowed_models: ["gpt-4", "llama-3", "claude-3"]
  default_model: "gpt-4"
  fallback_models: ["llama-3"]

context:
  max_context_tokens: 4000
  max_retrieved_chunks: 10
  cross_tenant_access: false

slo:
  target_p95_latency_ms: 1000
  error_rate_budget: 0.01
  timeout_seconds: 60

  • Budget: Daily/monthly/per-job/per-workflow USD limits with alert thresholds and throttle action
  • Rate Limits: RPM, burst, concurrent jobs/workflows, and queue depth limits per tenant
  • Retry: Max retries, exponential backoff (initial, max, multiplier), retryable error classes
  • Resources: Priority, timeouts, max parallel steps, preemption settings
  • Models: Allowed/default/fallback model lists for LLM-backed jobs
  • Context: Token limits, chunk retrieval, cross-tenant access, allowed connectors
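
As a worked example of the retry settings in the sample (max_retries: 3, initial_backoff: 1s, max_backoff: 30s, backoff_multiplier: 2.0), the following Python sketch computes the delay before each retry. It assumes plain exponential backoff with a cap and no jitter; whether Cordum adds jitter or applies the cap differently is not stated here.

```python
def backoff_schedule(max_retries=3, initial=1.0, maximum=30.0, multiplier=2.0):
    """Return the delay in seconds before each retry attempt."""
    delays = []
    delay = initial
    for _ in range(max_retries):
        delays.append(min(delay, maximum))  # cap each delay at max_backoff
        delay *= multiplier
    return delays
```

With the sample values the delays are 1, 2, and 4 seconds. The 30-second cap only comes into play from the sixth retry onward, so it matters only if max_retries is raised.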

timeouts.yaml — Timeout Configuration

Per-topic and per-workflow timeout overrides. The reconciler uses these values to mark stale DISPATCHED and RUNNING jobs as TIMEOUT.

config/timeouts.yaml
# Per-workflow timeouts (keyed by workflow ID)
workflows: {}

# Per-topic timeouts (keyed by topic pattern)
topics: {}

# Reconciler settings
reconciler:
  dispatch_timeout_seconds: 300   # 5 min for DISPATCHED → TIMEOUT
  running_timeout_seconds: 9000   # 2.5 hrs for RUNNING → TIMEOUT
  scan_interval_seconds: 30       # How often reconciler scans
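
The reconciler's staleness rule can be sketched in Python like this. It is a simplified illustration: per-topic and per-workflow overrides are left out, and the function name and the idea of passing the timestamp at which the job entered its current state are assumptions.

```python
import time

# Values from the reconciler section of the sample timeouts.yaml.
RECONCILER = {
    "dispatch_timeout_seconds": 300,
    "running_timeout_seconds": 9000,
}


def is_stale(state, state_entered_at, now=None, cfg=RECONCILER):
    """Return True if a job has sat in DISPATCHED or RUNNING for too long."""
    now = time.time() if now is None else now
    limits = {
        "DISPATCHED": cfg["dispatch_timeout_seconds"],
        "RUNNING": cfg["running_timeout_seconds"],
    }
    limit = limits.get(state)
    if limit is None:
        return False  # other states are not subject to these timeouts
    return now - state_entered_at > limit
```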

nats.conf — NATS Server Configuration

NATS server configuration for JetStream durability, cluster settings, and authorization. Mounted into the NATS container via Docker Compose or Kubernetes ConfigMap.

config/nats.conf
listen: 0.0.0.0:4222

jetstream {
  store_dir: /data/jetstream
  max_mem: 1G
  max_file: 10G
  sync_interval: "1s"   # fsync cadence (lower = safer, slower)
}

# Optional cluster config
# cluster {
#   listen: 0.0.0.0:6222
#   routes: [nats-route://nats-1:6222, nats-route://nats-2:6222]
# }

Config Scopes & Merging

Configuration follows a scope hierarchy. More specific scopes override broader ones using shallow merge — later values replace earlier values at the top level.

Scope Hierarchy (broadest → most specific)

  1. system (cfg:system:<id>): Platform-wide defaults (budgets, rate limits, models)
  2. org (cfg:org:<id>): Organization-level overrides
  3. team (cfg:team:<id>): Team-level overrides
  4. workflow (cfg:workflow:<id>): Per-workflow overrides
  5. step (cfg:step:<id>): Per-step overrides (most specific)

Merge Behavior

  • Shallow merge: more-specific scope keys replace broader scope keys
  • Arrays are replaced, not appended
  • Missing scopes are skipped — only defined scopes participate
  • Final merged result is the "effective config"
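
The merge rules above amount to a top-level dictionary update applied from the broadest scope to the most specific. A minimal Python sketch, with a hypothetical function name and None standing in for a scope that is not defined:

```python
def effective_config(*scopes):
    """Shallow-merge scope dicts ordered from broadest to most specific.

    Missing scopes (None or empty) are skipped. Each top-level key is
    replaced as a whole, so nested dicts and arrays are not combined.
    """
    merged = {}
    for scope in scopes:
        if scope:
            merged.update(scope)
    return merged


system = {
    "models": {"allowed_models": ["gpt-4", "llama-3"], "default_model": "gpt-4"},
    "retry": {"max_retries": 3},
}
workflow = {"models": {"allowed_models": ["claude-3"]}}

# org and team scopes are not defined in this example.
result = effective_config(system, None, None, workflow)
```

Because the merge is shallow, the workflow-level models key replaces the system-level one entirely: default_model does not carry over into the result, while retry is inherited unchanged. An override therefore needs to restate every nested value it wants to keep.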

REST API (Config)

REST API — config
# Get config at a scope (envelope mode)
curl "http://localhost:8081/api/v1/config?scope=system&scope_id=default&envelope=true" \
  -H "X-API-Key: $CORDUM_API_KEY" \
  -H "X-Tenant-ID: default"

# Set config at a scope
curl -X POST http://localhost:8081/api/v1/config \
  -H "X-API-Key: $CORDUM_API_KEY" \
  -H "X-Tenant-ID: default" \
  -H "Content-Type: application/json" \
  -d '{"scope":"system","scope_id":"default","data":{"timeouts":{"job_timeout_sec":300}}}'

# View merged effective config
curl "http://localhost:8081/api/v1/config/effective?workflow_id=sre.triage" \
  -H "X-API-Key: $CORDUM_API_KEY" \
  -H "X-Tenant-ID: default"
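
For scripts that prefer Python over curl, the same requests can be built with the standard library. This sketch only constructs the request objects and mirrors the URLs, headers, and payload shape from the curl examples above; the helper names are hypothetical, and the response format is not covered here.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:8081"


def build_set_config_request(api_key, scope, scope_id, data, tenant="default"):
    """Build the POST /api/v1/config request shown in the curl example."""
    body = json.dumps({"scope": scope, "scope_id": scope_id, "data": data}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/config",
        data=body,
        method="POST",
        headers={
            "X-API-Key": api_key,
            "X-Tenant-ID": tenant,
            "Content-Type": "application/json",
        },
    )


def build_effective_config_request(api_key, workflow_id, tenant="default"):
    """Build the GET /api/v1/config/effective request for one workflow."""
    query = urllib.parse.urlencode({"workflow_id": workflow_id})
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/config/effective?{query}",
        headers={"X-API-Key": api_key, "X-Tenant-ID": tenant},
    )
```

To send a request, pass it to urllib.request.urlopen against a running gateway.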

Hot Reload

Config files and policy are hot-reloaded without service restart. Each revision is tracked with a SHA256 hash for cache validation and rollback.

Config Overlays

The scheduler reloads config overlays from Redis at SCHEDULER_CONFIG_RELOAD_INTERVAL (default 30s). Changes to pool routing, timeouts, and system config take effect without restart.

Safety Policy

The Safety Kernel checks the policy file every SAFETY_POLICY_RELOAD_INTERVAL and reloads policy bundle fragments from the config service. Each reload generates a new snapshot with a SHA256 hash.

API Keys

When using CORDUM_API_KEYS_PATH, the gateway watches the file and reloads keys on change. No restart required for key rotation.

Revision Tracking

  • SHA256 Hash: each config/policy revision is identified by its content hash
  • Cache Validation: services compare hashes to skip unnecessary reprocessing
  • Snapshot Versioning: the Safety Kernel keeps versioned snapshots for audit and rollback
  • Last-Known-Good: if verification fails, the service falls back to the last valid revision
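
The hash comparison and last-known-good behavior can be sketched together in Python. This is an illustration under stated assumptions: hashing a canonical JSON encoding is one plausible way to get a stable content hash (the actual byte representation Cordum hashes is not specified here), and RevisionStore and its verify callback are hypothetical names.

```python
import hashlib
import json


def revision_hash(config):
    """SHA-256 over a canonical JSON encoding (the canonical form is assumed)."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(canonical).hexdigest()


class RevisionStore:
    """Keeps the last valid revision and skips reloads whose hash is unchanged."""

    def __init__(self):
        self.current = None
        self.current_hash = None

    def apply(self, candidate, verify):
        """Return 'unchanged', 'applied', or 'rejected' for a candidate config."""
        digest = revision_hash(candidate)
        if digest == self.current_hash:
            return "unchanged"  # cache validation: nothing to reprocess
        if not verify(candidate):
            return "rejected"  # last-known-good: keep the previous revision
        self.current, self.current_hash = candidate, digest
        return "applied"
```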

NATS JetStream Durability

JetStream fsync cadence is controlled by sync_interval in the NATS server config. Lower values improve crash durability at the cost of throughput.

  • Docker Compose: Edit config/nats.conf
  • K8s Base: Edit cordum-nats-config ConfigMap in deploy/k8s/base.yaml
  • K8s Production: Edit ConfigMap in deploy/k8s/production/nats.yaml
  • Helm: Set nats.jetstream.syncInterval in values.yaml