Operate Cordum.
Metrics endpoints, key environment variables, and operational notes for running the control plane in production.
# Check all service health
cordumctl status
# Verify gateway is reachable
curl -s http://localhost:8080/healthz | jq .
# View platform version
cordumctl version
API gateway: :8080 (HTTP), :8081 (gRPC), :9092 (WebSocket/metrics)
Scheduler: :9090/metrics (Prometheus metrics only)
Workflow engine: :9093/health
The dashboard reads /config.json for connectivity settings. Production images generate this from environment variables at runtime.
NATS_URL, REDIS_URL: Core infrastructure connections
NATS_USE_JETSTREAM=1: Enable durability (optional)
POOL_CONFIG_PATH: Worker pool definitions
SAFETY_KERNEL_ADDR: Policy service location
GATEWAY_HTTP_ADDR: Public API binding
CORDUM_API_KEY: Admin authentication key
docker compose exec redis redis-cli FLUSHALL
docker compose down -v
Use these commands to completely wipe the system state. The second command removes JetStream persistence and the Redis volume.
Warning: This is destructive.
What It Does
The reconciler runs periodically (default: 30s) and scans for stale jobs in DISPATCHED or RUNNING states.
If now - job.updated_at > timeout, the job is marked as TIMEOUT and a DLQ entry is created.
Timeouts are defined in config/timeouts.yaml (bootstrapped into config service on startup).
# Topic-specific timeouts
topics:
  job.sre-investigator.collect.k8s:
    dispatch_timeout: 30s
    execution_timeout: 300s
  job.default:
    dispatch_timeout: 10s
    execution_timeout: 120s
# Workflow-specific timeouts
workflows:
  sre-investigator.full-investigation:
    step_timeout: 600s
    run_timeout: 3600s
Reconciler Interval
Env: SCHEDULER_RECONCILER_INTERVAL=30s
The reconciler runs every 30s by default. Adjust based on your timeout requirements and scale.
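The staleness rule (now - job.updated_at > timeout) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the scheduler's actual code: the job dictionaries, the TIMEOUTS table, the mapping of DISPATCHED to dispatch_timeout and RUNNING to execution_timeout, and the fallback to job.default for unlisted topics are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical in-memory copy of the topic timeouts (seconds), taken from the example config.
TIMEOUTS = {
    "job.sre-investigator.collect.k8s": {"dispatch_timeout": 30, "execution_timeout": 300},
    "job.default": {"dispatch_timeout": 10, "execution_timeout": 120},
}

# Assumption: each scanned state is compared against one timeout key.
STATE_TIMEOUT_KEY = {"DISPATCHED": "dispatch_timeout", "RUNNING": "execution_timeout"}


def timeout_for(topic: str, state: str) -> timedelta:
    """Pick the timeout for a topic, falling back to job.default (assumed behavior)."""
    entry = TIMEOUTS.get(topic, TIMEOUTS["job.default"])
    return timedelta(seconds=entry[STATE_TIMEOUT_KEY[state]])


def is_stale(job: dict, now: datetime) -> bool:
    """Apply the documented rule: now - job.updated_at > timeout."""
    if job["state"] not in STATE_TIMEOUT_KEY:
        return False  # only DISPATCHED and RUNNING jobs are scanned
    return now - job["updated_at"] > timeout_for(job["topic"], job["state"])


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
jobs = [
    {"id": "a", "topic": "job.sre-investigator.collect.k8s", "state": "RUNNING",
     "updated_at": now - timedelta(seconds=301)},
    {"id": "b", "topic": "job.unknown", "state": "DISPATCHED",
     "updated_at": now - timedelta(seconds=5)},
]
stale_ids = [job["id"] for job in jobs if is_stale(job, now)]
print(stale_ids)  # ['a']
```

Because a stale job is only noticed on the next reconciler pass, the worst-case detection delay in this model is roughly the timeout plus one reconciler interval.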
What It Does
The pending replayer scans for jobs stuck in PENDING state past the dispatch timeout.
These jobs are republished to the scheduler to retry dispatch. This prevents deadlocks where jobs are stuck waiting for a worker that never picks them up.
Common causes: worker pool overloaded, no workers available, capability mismatch.
Configuration
SCHEDULER_PENDING_REPLAY_INTERVAL=60s
Runs every 60s by default. Jobs older than dispatch_timeout are replayed.
Note: A replay interval that is too short can cause duplicate dispatch attempts, so lower it with care.
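A minimal sketch of the selection step, assuming job age is measured from updated_at. The last_replayed map and min_gap parameter are hypothetical additions that show one way to limit duplicate dispatch attempts; nothing here is taken from the scheduler's implementation.

```python
from datetime import datetime, timedelta, timezone


def jobs_to_replay(jobs, now, dispatch_timeout, last_replayed, min_gap):
    """Return IDs of PENDING jobs older than dispatch_timeout.

    last_replayed (job id -> datetime) and min_gap are hypothetical: jobs that
    were replayed less than min_gap ago are skipped to avoid duplicate dispatch.
    """
    due = []
    for job in jobs:
        if job["state"] != "PENDING":
            continue
        if now - job["updated_at"] <= dispatch_timeout:
            continue  # still within the dispatch timeout
        previous = last_replayed.get(job["id"])
        if previous is not None and now - previous < min_gap:
            continue  # replayed recently, give the scheduler time to act
        due.append(job["id"])
    return due


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
jobs = [
    {"id": "p1", "state": "PENDING", "updated_at": now - timedelta(seconds=45)},
    {"id": "p2", "state": "PENDING", "updated_at": now - timedelta(seconds=5)},
    {"id": "p3", "state": "PENDING", "updated_at": now - timedelta(seconds=90)},
    {"id": "r1", "state": "RUNNING", "updated_at": now - timedelta(seconds=500)},
]
last_replayed = {"p3": now - timedelta(seconds=20)}
due = jobs_to_replay(jobs, now, timedelta(seconds=10), last_replayed, timedelta(seconds=60))
print(due)  # ['p1']
```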
Terminal Failures Go to DLQ
Every job with a non-SUCCEEDED terminal state creates a DLQ entry:
- FAILED: Legacy/unspecified failure
- FAILED_FATAL: Non-recoverable failure (rollback trigger)
- TIMEOUT: Marked by reconciler
- CANCELLED: Cancelled via sys.job.cancel
- DENIED: Blocked by Safety Kernel
FAILED_RETRYABLE is treated as transient when retries are enabled. If retries are exhausted, the job is marked terminal and the DLQ entry is created.
DLQ entries include: error_code, error_message, last_state, attempts, and policy decision reason (if DENIED).
If a job fails with FAILED_FATAL and a compensation template exists, rollback is dispatched before manual intervention. Compensation failures are logged and tracked via metrics for operator review.
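The DLQ rules above amount to a small decision function. The sketch below is illustrative only: the next_action name, its return values, and the attempts/max_retries parameters are assumptions, with max_retries=0 standing in for "retries disabled".

```python
# Terminal states that create a DLQ entry, as listed above.
DLQ_STATES = {"FAILED", "FAILED_FATAL", "TIMEOUT", "CANCELLED", "DENIED"}


def next_action(state: str, attempts: int, max_retries: int) -> str:
    """Return 'none', 'retry', or 'dlq' for a reported job state."""
    if state == "SUCCEEDED":
        return "none"
    if state == "FAILED_RETRYABLE":
        # Transient while retries remain; terminal (and sent to the DLQ) once exhausted.
        return "retry" if attempts < max_retries else "dlq"
    if state in DLQ_STATES:
        return "dlq"
    raise ValueError(f"unknown or non-terminal state: {state}")


print(next_action("FAILED_RETRYABLE", attempts=1, max_retries=3))  # retry
print(next_action("FAILED_RETRYABLE", attempts=3, max_retries=3))  # dlq
print(next_action("DENIED", attempts=1, max_retries=3))            # dlq
```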
GET /api/v1/dlq
# Returns paginated DLQ entries
# Filter by tenant
GET /api/v1/dlq?tenant_id=acme
# Get specific entry
GET /api/v1/dlq/{job_id}
# Retry: creates new job with same input
POST /api/v1/dlq/{job_id}/retry
# Returns new job_id
# Delete DLQ entry
DELETE /api/v1/dlq/{job_id}
# Bulk delete (use with care)
DELETE /api/v1/dlq?tenant_id=acme&before=2024-01-01
Scheduler Metrics
:9090/metrics (Prometheus only)
- Job state transitions
- Dispatch latency
- Safety Kernel call duration
- Worker pool stats
- Saga rollbacks + compensation outcomes
- Rollback duration + active rollback gauge
API Gateway
:8080 HTTP, :8081 gRPC, :9092 WS/metrics
- HTTP request latency
- WebSocket connections
- API key auth failures
- Rate limiting stats
Workflow Engine Health
:9093/health
- Run timeline writes
- Step dispatch lag
- Approval gate stats
- Reconciler loop time
Saga Metrics
:9090/metrics
- cordum_saga_rollbacks_total
- cordum_saga_compensation_failed_total
- cordum_saga_active
- cordum_saga_rollback_duration_seconds
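For a quick check without a full Prometheus setup, the saga series can be pulled out of a scraped /metrics payload. The sketch below assumes the standard Prometheus text exposition format; the sample payload, the tenant label, and the saga_samples helper are made up for illustration. It ignores optional timestamps, skips histogram _bucket series, and does not handle label values that contain a closing brace.

```python
SAGA_METRICS = (
    "cordum_saga_rollbacks_total",
    "cordum_saga_compensation_failed_total",
    "cordum_saga_active",
    "cordum_saga_rollback_duration_seconds",
)


def saga_samples(exposition: str) -> dict:
    """Sum sample values per saga series name across all label sets."""
    totals = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if not name.startswith(SAGA_METRICS) or name.endswith("_bucket"):
            continue
        # The value is the first token after the label block, or after the name.
        value_part = line.rsplit("}", 1)[-1] if "{" in line else line[len(name):]
        totals[name] = totals.get(name, 0.0) + float(value_part.split()[0])
    return totals


# Hypothetical scrape output; real label names and values will differ.
sample = """\
# TYPE cordum_saga_rollbacks_total counter
cordum_saga_rollbacks_total{tenant="acme"} 3
cordum_saga_rollbacks_total{tenant="globex"} 1
cordum_saga_compensation_failed_total 2
cordum_saga_active 0
go_goroutines 42
"""
result = saga_samples(sample)
print(result)
```

A non-zero cordum_saga_compensation_failed_total is the value most worth alerting on, since compensation failures are the cases left for operator review.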
Trace Linkage
Each job carries a trace_id in its BusPacket envelope. All jobs in a workflow run or fan-out share the same trace_id. Query jobs by trace:
GET /api/v1/jobs?trace_id=abc123
Jobs Stuck in PENDING
Symptoms: Jobs never transition from PENDING to DISPATCHED.
Causes:
- No workers available for the topic
- Worker pool overloaded (active_jobs > max_parallel_jobs)
- Capability mismatch (job requires=[kubectl] but no workers advertise kubectl)
- Safety Kernel unreachable (gRPC connection failed)
Debug:
- Check worker heartbeats: GET /api/v1/workers
- Check scheduler logs for policy check failures
- Verify pool mappings in config service
- Test policy decisions via REST API: POST /api/v1/policy/evaluate
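The capability-mismatch and overload causes listed above can be checked offline against a snapshot of worker data. In this sketch the worker record fields (id, capabilities, active_jobs, max_parallel_jobs) are assumed for illustration; the actual shape of the GET /api/v1/workers response may differ.

```python
def diagnose(job_requires, workers):
    """Explain, per worker, why a job could or could not be dispatched to it."""
    report = {}
    for worker in workers:
        missing = sorted(set(job_requires) - set(worker["capabilities"]))
        if missing:
            report[worker["id"]] = "missing: " + ", ".join(missing)
        elif worker["active_jobs"] > worker["max_parallel_jobs"]:
            # Mirrors the overload condition stated above.
            report[worker["id"]] = "overloaded"
        else:
            report[worker["id"]] = "eligible"
    return report


# Hypothetical worker snapshot.
workers = [
    {"id": "w1", "capabilities": ["kubectl", "helm"], "active_jobs": 5, "max_parallel_jobs": 4},
    {"id": "w2", "capabilities": ["helm"], "active_jobs": 0, "max_parallel_jobs": 4},
    {"id": "w3", "capabilities": ["kubectl"], "active_jobs": 1, "max_parallel_jobs": 4},
]
report = diagnose(["kubectl"], workers)
print(report)
```

If no worker comes back as eligible, the job stays in PENDING until the pending replayer finds a worker that can take it.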
Jobs Timing Out
Symptoms: Jobs marked as TIMEOUT by reconciler.
Causes:
- Worker crashed or lost connection to NATS
- Job execution time exceeds execution_timeout
- Worker didn't publish JobResult after completing work
Fix:
- Increase timeout in config/timeouts.yaml
- Check worker logs for crashes or errors
- Verify worker is using CAP SDK runtime for consistent heartbeats
Policy Changes Not Taking Effect
Symptoms: Updated safety.yaml but decisions unchanged.
Causes:
- Safety Kernel hasn't reloaded yet (reloads every 30s by default)
- Policy file has syntax error (Safety Kernel falls back to last-known-good)
- Policy bundle not published to config service
Fix:
- Wait 30s or restart Safety Kernel to force reload
- Check Safety Kernel logs for parse errors
- Use POST /api/v1/policy/simulate to test policy before deploying
- Verify bundle is enabled: GET /api/v1/policy/bundles
Must-Haves
- ✓ Enable JetStream for durability (NATS_USE_JETSTREAM=1)
- ✓ Configure per-tenant API keys (CORDUM_API_KEYS)
- ✓ Set up Prometheus scraping for :9090 and :9092
- ✓ Configure backup for Redis (RDB snapshots + AOF)
- ✓ Enable RBAC if using enterprise features
Recommended
- → Set up alerting on DLQ size growth
- → Monitor reconciler lag via metrics
- → Configure CORS allowlist for dashboard (CORS_ALLOWED_ORIGINS)
- → Test policy changes with simulate API in staging first
- → Document pack install/uninstall runbooks
