
Operate Cordum.

Metrics endpoints, key environment variables, and operational notes for running the control plane in production.

terminal — health check
# Check all service health
cordumctl status

# Verify gateway is reachable
curl -s http://localhost:8080/healthz | jq .

# View platform version
cordumctl version
Metrics & Health
  • API gateway: :8080 (HTTP), :8081 (gRPC), :9092 (WebSocket/metrics)
  • Scheduler: :9090/metrics (Prometheus metrics only)
  • Workflow engine: :9093/health
Dashboard Config

The dashboard reads /config.json for connectivity settings. Production images generate this from environment variables at runtime.
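The exact schema of /config.json is not documented here; a minimal sketch, assuming the dashboard only needs the gateway endpoints (field names are illustrative, not canonical):

```json
{
  "apiBaseUrl": "http://localhost:8080",
  "wsUrl": "ws://localhost:9092"
}
```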

Key Environment Variables
  • NATS_URL, REDIS_URL: Core infrastructure connections
  • NATS_USE_JETSTREAM=1: Enable durability (optional)
  • POOL_CONFIG_PATH: Worker pool definitions
  • SAFETY_KERNEL_ADDR: Policy service location
  • GATEWAY_HTTP_ADDR: Public API binding
  • CORDUM_API_KEY: Admin authentication key
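A minimal environment sketch wiring the variables above. Values are illustrative only; the pool config path and Safety Kernel port in particular are hypothetical, so substitute your own endpoints and secrets.

```shell
# Illustrative values only -- not canonical defaults
export NATS_URL="nats://localhost:4222"
export REDIS_URL="redis://localhost:6379"
export NATS_USE_JETSTREAM=1                       # enable durability (optional)
export POOL_CONFIG_PATH="/etc/cordum/pools.yaml"  # hypothetical path
export SAFETY_KERNEL_ADDR="localhost:50051"       # hypothetical port
export GATEWAY_HTTP_ADDR=":8080"
export CORDUM_API_KEY="change-me"
```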
Reset State (Dev Only)
terminal
docker compose exec redis redis-cli FLUSHALL
docker compose down -v

Use these commands to completely wipe the system state. The second command removes JetStream persistence and the Redis volume.

Warning: This is destructive.

Reconciler

What It Does

The reconciler runs periodically (default: 30s) and scans for stale jobs in DISPATCHED or RUNNING states.

If now - job.updated_at > timeout, the job is marked as TIMEOUT and a DLQ entry is created.
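The staleness check can be sketched in shell, assuming Unix-epoch timestamps (all values illustrative; the real reconciler logic is internal to the scheduler):

```shell
# Sketch of the reconciler's staleness check
now=1700000300
updated_at=1700000000          # job.updated_at
timeout=120                    # execution_timeout from timeouts.yaml, in seconds

if [ $((now - updated_at)) -gt "$timeout" ]; then
  state=TIMEOUT                # reconciler marks the job and writes a DLQ entry
else
  state=RUNNING
fi
echo "$state"
```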

Timeouts are defined in config/timeouts.yaml (bootstrapped into config service on startup).

config/timeouts.yaml
# Topic-specific timeouts
topics:
  job.sre-investigator.collect.k8s:
    dispatch_timeout: 30s
    execution_timeout: 300s
  job.default:
    dispatch_timeout: 10s
    execution_timeout: 120s

# Workflow-specific timeouts
workflows:
  sre-investigator.full-investigation:
    step_timeout: 600s
    run_timeout: 3600s

Reconciler Interval

Env: SCHEDULER_RECONCILER_INTERVAL=30s

The reconciler runs every 30s by default. Adjust based on your timeout requirements and scale.

Pending Replayer

What It Does

The pending replayer scans for jobs stuck in PENDING state past the dispatch timeout.

These jobs are republished to the scheduler to retry dispatch. This prevents deadlocks where jobs are stuck waiting for a worker that never picks them up.

Common causes: worker pool overloaded, no workers available, capability mismatch.

Configuration

SCHEDULER_PENDING_REPLAY_INTERVAL=60s

Runs every 60s by default. Jobs older than dispatch_timeout are replayed.

Note: Replaying too aggressively can cause duplicate dispatch attempts. Use with care.

Dead Letter Queue (DLQ)

Terminal Failures Go to DLQ

Every job with a non-SUCCEEDED terminal state creates a DLQ entry:

  • FAILED - Legacy/unspecified failure
  • FAILED_FATAL - Non-recoverable failure (rollback trigger)
  • TIMEOUT - Marked by reconciler
  • CANCELLED - Cancelled via sys.job.cancel
  • DENIED - Blocked by Safety Kernel

FAILED_RETRYABLE is treated as transient when retries are enabled. If retries are exhausted, the job is marked terminal and the DLQ entry is created.

DLQ entries include: error_code, error_message, last_state, attempts, and policy decision reason (if DENIED).

If a job fails with FAILED_FATAL and a compensation template exists, rollback is dispatched before manual intervention. Compensation failures are logged and tracked via metrics for operator review.
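A DLQ entry might look like the following. Only the field names come from the list above; the values and overall JSON shape are assumptions for illustration, parsed here with plain POSIX tools:

```shell
# Hypothetical DLQ entry -- shape assumed, values illustrative
entry='{"job_id":"job-123","last_state":"TIMEOUT","error_code":"E_TIMEOUT","error_message":"no JobResult before deadline","attempts":3}'

# Pull out the terminal state without jq
last_state=$(printf '%s' "$entry" | sed -n 's/.*"last_state":"\([^"]*\)".*/\1/p')
echo "$last_state"
```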

List DLQ Entries
GET /api/v1/dlq
# Returns paginated DLQ entries

# Filter by tenant
GET /api/v1/dlq?tenant_id=acme

# Get specific entry
GET /api/v1/dlq/{job_id}
Retry or Delete
# Retry: creates new job with same input
POST /api/v1/dlq/{job_id}/retry
# Returns new job_id

# Delete DLQ entry
DELETE /api/v1/dlq/{job_id}

# Bulk delete (use with care)
DELETE /api/v1/dlq?tenant_id=acme&before=2024-01-01
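When many jobs for one tenant have failed, the retry endpoint can be scripted. A sketch assuming jq is installed and the list response carries an entries array of objects with job_id fields (the response shape is an assumption):

```shell
BASE="http://localhost:8080"

# Only attempt when the gateway is reachable
if curl -fsS "$BASE/healthz" >/dev/null 2>&1; then
  for id in $(curl -s "$BASE/api/v1/dlq?tenant_id=acme" | jq -r '.entries[].job_id'); do
    curl -s -X POST "$BASE/api/v1/dlq/$id/retry"   # each retry returns a new job_id
  done
fi
```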
Observability & Metrics

Scheduler Metrics

:9090/metrics (Prometheus only)

  • Job state transitions
  • Dispatch latency
  • Safety Kernel call duration
  • Worker pool stats
  • Saga rollbacks + compensation outcomes
  • Rollback duration + active rollback gauge

API Gateway

:8080 HTTP, :8081 gRPC, :9092 WS/metrics

  • HTTP request latency
  • WebSocket connections
  • API key auth failures
  • Rate limiting stats

Workflow Engine Health

:9093/health

  • Run timeline writes
  • Step dispatch lag
  • Approval gate stats
  • Reconciler loop time

Saga Metrics

:9090/metrics

  • cordum_saga_rollbacks_total
  • cordum_saga_compensation_failed_total
  • cordum_saga_active
  • cordum_saga_rollback_duration_seconds

Trace Linkage

Jobs include trace_id in BusPacket envelope. All jobs in a workflow run or fan-out share the same trace_id. Query jobs by trace:

GET /api/v1/jobs?trace_id=abc123
Common Issues & Troubleshooting

Jobs Stuck in PENDING

Symptoms: Jobs never transition from PENDING to DISPATCHED.

Causes:

  • No workers available for the topic
  • Worker pool overloaded (active_jobs > max_parallel_jobs)
  • Capability mismatch (job requires=[kubectl] but no workers advertise kubectl)
  • Safety Kernel unreachable (gRPC connection failed)

Debug:

  • Check worker heartbeats: GET /api/v1/workers
  • Check scheduler logs for policy check failures
  • Verify pool mappings in config service
  • Test policy decisions via REST API: POST /api/v1/policy/evaluate
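The capability-mismatch cause can be sketched as a simple membership check (illustrative only; the scheduler's real matching is internal):

```shell
# Does any advertised worker capability satisfy the job's requirement?
required="kubectl"                 # from the job's requires=[...]
advertised="bash python"           # from a worker heartbeat (illustrative)

match=no
for cap in $advertised; do
  if [ "$cap" = "$required" ]; then
    match=yes
  fi
done
echo "$match"    # "no" -> no eligible worker, job stays PENDING
```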

Jobs Timing Out

Symptoms: Jobs marked as TIMEOUT by reconciler.

Causes:

  • Worker crashed or lost connection to NATS
  • Job execution time exceeds execution_timeout
  • Worker didn't publish JobResult after completing work

Fix:

  • Increase timeout in config/timeouts.yaml
  • Check worker logs for crashes or errors
  • Verify worker is using CAP SDK runtime for consistent heartbeats

Policy Changes Not Taking Effect

Symptoms: Updated safety.yaml but decisions unchanged.

Causes:

  • Safety Kernel hasn't reloaded yet (reloads every 30s by default)
  • Policy file has syntax error (Safety Kernel falls back to last-known-good)
  • Policy bundle not published to config service

Fix:

  • Wait 30s or restart Safety Kernel to force reload
  • Check Safety Kernel logs for parse errors
  • Use POST /api/v1/policy/simulate to test policy before deploying
  • Verify bundle is enabled: GET /api/v1/policy/bundles
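A simulate call might look like the following. The request body fields are assumptions, not a documented schema; the topic matches the timeouts example earlier on this page.

```shell
# Hypothetical request body -- field names are assumed
payload='{"topic":"job.sre-investigator.collect.k8s","tenant_id":"acme"}'

# Only attempt when the gateway is reachable
if curl -fsS http://localhost:8080/healthz >/dev/null 2>&1; then
  curl -s -X POST http://localhost:8080/api/v1/policy/simulate \
    -H "Content-Type: application/json" -d "$payload"
fi
```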
Production Checklist

Must-Haves

  • ✓ Enable JetStream for durability (NATS_USE_JETSTREAM=1)
  • ✓ Configure per-tenant API keys (CORDUM_API_KEYS)
  • ✓ Set up Prometheus scraping for :9090 and :9092
  • ✓ Configure backup for Redis (RDB snapshots + AOF)
  • ✓ Enable RBAC if using enterprise features
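A minimal Prometheus scrape sketch for the two metrics ports above (job names are arbitrary; adjust targets for your deployment):

```yaml
scrape_configs:
  - job_name: cordum-scheduler
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: cordum-gateway
    static_configs:
      - targets: ["localhost:9092"]
```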

Recommended

  • → Set up alerting on DLQ size growth
  • → Monitor reconciler lag via metrics
  • → Configure CORS allowlist for dashboard (CORS_ALLOWED_ORIGINS)
  • → Test policy changes with simulate API in staging first
  • → Document pack install/uninstall runbooks