
Operate Cordum.

Metrics endpoints, key environment variables, and operational notes for running the control plane in production.

terminal — health check
# Check all service health
cordumctl status

# Verify gateway is reachable
curl -s http://localhost:8080/healthz | jq .

# View platform version
cordumctl version
Metrics & Health
  • API gateway: :8080 (HTTP), :8081 (gRPC), :9092 (WebSocket/metrics)
  • Scheduler: :9090/metrics (Prometheus metrics only)
  • Workflow engine: :9093/health
Dashboard Config

The dashboard reads /config.json for connectivity settings. Production images generate this from environment variables at runtime.
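The exact schema of /config.json is not documented here; a minimal sketch, assuming the dashboard only needs the gateway endpoints (field names are illustrative, not canonical):

```json
{
  "apiBaseUrl": "http://localhost:8080",
  "wsUrl": "ws://localhost:9092"
}
```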

Key Environment Variables
  • NATS_URL, REDIS_URL: Core infrastructure connections
  • NATS_USE_JETSTREAM=1: Enable durability (optional)
  • POOL_CONFIG_PATH: Worker pool definitions
  • SAFETY_KERNEL_ADDR: Policy service location
  • GATEWAY_HTTP_ADDR: Public API binding
  • CORDUM_API_KEY: Admin authentication key
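A minimal environment sketch wiring the variables above. Values are illustrative only; the pool config path and Safety Kernel port in particular are hypothetical, so substitute your own endpoints and secrets.

```shell
# Illustrative values only -- not canonical defaults
export NATS_URL="nats://localhost:4222"
export REDIS_URL="redis://localhost:6379"
export NATS_USE_JETSTREAM=1                       # enable durability (optional)
export POOL_CONFIG_PATH="/etc/cordum/pools.yaml"  # hypothetical path
export SAFETY_KERNEL_ADDR="localhost:50051"       # hypothetical port
export GATEWAY_HTTP_ADDR=":8080"
export CORDUM_API_KEY="change-me"
```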
Reset State (Dev Only)
terminal
docker compose exec redis redis-cli FLUSHALL
docker compose down -v

Use these commands to completely wipe the system state. The second command removes JetStream persistence and the Redis volume.

Warning: This is destructive.

Reconciler

What It Does

The reconciler runs periodically (default: 30s) and scans for stale jobs in DISPATCHED or RUNNING states.

If now - job.updated_at > timeout, the job is marked as TIMEOUT and a DLQ entry is created.
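The staleness check can be sketched in shell, assuming Unix-epoch timestamps (all values illustrative; the real reconciler logic is internal to the scheduler):

```shell
# Sketch of the reconciler's staleness check
now=1700000300
updated_at=1700000000          # job.updated_at
timeout=120                    # execution_timeout from timeouts.yaml, in seconds

if [ $((now - updated_at)) -gt "$timeout" ]; then
  state=TIMEOUT                # reconciler marks the job and writes a DLQ entry
else
  state=RUNNING
fi
echo "$state"
```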

Timeouts are defined in config/timeouts.yaml (bootstrapped into config service on startup).

config/timeouts.yaml
# Topic-specific timeouts
topics:
  job.sre-investigator.collect.k8s:
    dispatch_timeout: 30s
    execution_timeout: 300s
  job.default:
    dispatch_timeout: 10s
    execution_timeout: 120s

# Workflow-specific timeouts
workflows:
  sre-investigator.full-investigation:
    step_timeout: 600s
    run_timeout: 3600s

Reconciler Interval

Env: SCHEDULER_RECONCILER_INTERVAL=30s

The reconciler runs every 30s by default. Adjust based on your timeout requirements and scale.

Pending Replayer

What It Does

The pending replayer scans for jobs stuck in PENDING state past the dispatch timeout.

These jobs are republished to the scheduler to retry dispatch. This prevents deadlocks where jobs are stuck waiting for a worker that never picks them up.

Common causes: worker pool overloaded, no workers available, capability mismatch.

Configuration

SCHEDULER_PENDING_REPLAY_INTERVAL=60s

Runs every 60s by default. Jobs older than dispatch_timeout are replayed.

Note: Replaying too aggressively can cause duplicate dispatch attempts. Use with care.

Dead Letter Queue (DLQ)

Terminal Failures Go to DLQ

Every job with a non-SUCCEEDED terminal state creates a DLQ entry:

  • FAILED - Legacy/unspecified failure
  • FAILED_FATAL - Non-recoverable failure (rollback trigger)
  • TIMEOUT - Marked by reconciler
  • CANCELLED - Cancelled via sys.job.cancel
  • DENIED - Blocked by Safety Kernel

FAILED_RETRYABLE is treated as transient when retries are enabled. If retries are exhausted, the job is marked terminal and the DLQ entry is created.

DLQ entries include: error_code, error_message, last_state, attempts, and policy decision reason (if DENIED).

If a job fails with FAILED_FATAL and a compensation template exists, rollback is dispatched before manual intervention. Compensation failures are logged and tracked via metrics for operator review.
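A DLQ entry might look like the following. Only the field names come from the list above; the values and overall JSON shape are assumptions for illustration, parsed here with plain POSIX tools:

```shell
# Hypothetical DLQ entry -- shape assumed, values illustrative
entry='{"job_id":"job-123","last_state":"TIMEOUT","error_code":"E_TIMEOUT","error_message":"no JobResult before deadline","attempts":3}'

# Pull out the terminal state without jq
last_state=$(printf '%s' "$entry" | sed -n 's/.*"last_state":"\([^"]*\)".*/\1/p')
echo "$last_state"
```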

List DLQ Entries
GET /api/v1/dlq
# Returns paginated DLQ entries

# Filter by tenant
GET /api/v1/dlq?tenant_id=acme

# Get specific entry
GET /api/v1/dlq/{job_id}
Retry or Delete
# Retry: creates new job with same input
POST /api/v1/dlq/{job_id}/retry
# Returns new job_id

# Delete DLQ entry
DELETE /api/v1/dlq/{job_id}

# Bulk delete (use with care)
DELETE /api/v1/dlq?tenant_id=acme&before=2024-01-01
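When many jobs for one tenant have failed, the retry endpoint can be scripted. A sketch assuming jq is installed and the list response carries an entries array of objects with job_id fields (the response shape is an assumption):

```shell
BASE="http://localhost:8080"

# Only attempt when the gateway is reachable
if curl -fsS "$BASE/healthz" >/dev/null 2>&1; then
  for id in $(curl -s "$BASE/api/v1/dlq?tenant_id=acme" | jq -r '.entries[].job_id'); do
    curl -s -X POST "$BASE/api/v1/dlq/$id/retry"   # each retry returns a new job_id
  done
fi
```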
Observability & Metrics

Scheduler Metrics

:9090/metrics (Prometheus only)

  • Job state transitions
  • Dispatch latency
  • Safety Kernel call duration
  • Worker pool stats
  • Saga rollbacks + compensation outcomes
  • Rollback duration + active rollback gauge

API Gateway

:8080 HTTP, :8081 gRPC, :9092 WS/metrics

  • HTTP request latency
  • WebSocket connections
  • API key auth failures
  • Rate limiting stats

Workflow Engine Health

:9093/health

  • Run timeline writes
  • Step dispatch lag
  • Approval gate stats
  • Reconciler loop time

Saga Metrics

:9090/metrics

  • cordum_saga_rollbacks_total
  • cordum_saga_compensation_failed_total
  • cordum_saga_active
  • cordum_saga_rollback_duration_seconds

Trace Linkage

Jobs include trace_id in BusPacket envelope. All jobs in a workflow run or fan-out share the same trace_id. Query jobs by trace:

GET /api/v1/jobs?trace_id=abc123
Common Issues & Troubleshooting

Jobs Stuck in PENDING

Symptoms: Jobs never transition from PENDING to DISPATCHED.

Causes:

  • No workers available for the topic
  • Worker pool overloaded (active_jobs > max_parallel_jobs)
  • Capability mismatch (job requires=[kubectl] but no workers advertise kubectl)
  • Safety Kernel unreachable (gRPC connection failed)

Debug:

  • Check worker heartbeats: GET /api/v1/workers
  • Check scheduler logs for policy check failures
  • Verify pool mappings in config service
  • Test policy decisions via REST API: POST /api/v1/policy/evaluate
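The capability-mismatch cause can be sketched as a simple membership check (illustrative only; the scheduler's real matching is internal):

```shell
# Does any advertised worker capability satisfy the job's requirement?
required="kubectl"                 # from the job's requires=[...]
advertised="bash python"           # from a worker heartbeat (illustrative)

match=no
for cap in $advertised; do
  if [ "$cap" = "$required" ]; then
    match=yes
  fi
done
echo "$match"    # "no" -> no eligible worker, job stays PENDING
```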

Jobs Timing Out

Symptoms: Jobs marked as TIMEOUT by reconciler.

Causes:

  • Worker crashed or lost connection to NATS
  • Job execution time exceeds execution_timeout
  • Worker didn't publish JobResult after completing work

Fix:

  • Increase timeout in config/timeouts.yaml
  • Check worker logs for crashes or errors
  • Verify worker is using CAP SDK runtime for consistent heartbeats

Policy Changes Not Taking Effect

Symptoms: Updated safety.yaml but decisions unchanged.

Causes:

  • Safety Kernel hasn't reloaded yet (reloads every 30s by default)
  • Policy file has syntax error (Safety Kernel falls back to last-known-good)
  • Policy bundle not published to config service

Fix:

  • Wait 30s or restart Safety Kernel to force reload
  • Check Safety Kernel logs for parse errors
  • Use POST /api/v1/policy/simulate to test policy before deploying
  • Verify bundle is enabled: GET /api/v1/policy/bundles
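A simulate call might look like the following. The request body fields are assumptions, not a documented schema; the topic matches the timeouts example earlier on this page.

```shell
# Hypothetical request body -- field names are assumed
payload='{"topic":"job.sre-investigator.collect.k8s","tenant_id":"acme"}'

# Only attempt when the gateway is reachable
if curl -fsS http://localhost:8080/healthz >/dev/null 2>&1; then
  curl -s -X POST http://localhost:8080/api/v1/policy/simulate \
    -H "Content-Type: application/json" -d "$payload"
fi
```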
Production Checklist

Must-Haves

  • ✓ Enable JetStream for durability (NATS_USE_JETSTREAM=1)
  • ✓ Configure per-tenant API keys (CORDUM_API_KEYS)
  • ✓ Set up Prometheus scraping for :9090 and :9092
  • ✓ Configure backup for Redis (RDB snapshots + AOF)
  • ✓ Enable RBAC if using enterprise features
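A minimal Prometheus scrape sketch for the two metrics ports above (job names are arbitrary; adjust targets for your deployment):

```yaml
scrape_configs:
  - job_name: cordum-scheduler
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: cordum-gateway
    static_configs:
      - targets: ["localhost:9092"]
```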

Recommended

  • → Set up alerting on DLQ size growth
  • → Monitor reconciler lag via metrics
  • → Configure CORS allowlist for dashboard (CORS_ALLOWED_ORIGINS)
  • → Test policy changes with simulate API in staging first
  • → Document pack install/uninstall runbooks