Operations
The most important operational signals are gateway status, service health, Compose container logs, and job lifecycle outcomes as they appear in the DLQ and in approvals.
Health endpoints
- GET /health on the gateway for basic liveness.
- GET /api/v1/status for gateway + NATS + Redis connectivity.
- Gateway metrics on :9092/metrics by default.
- Workflow engine health on WORKFLOW_ENGINE_HTTP_ADDR (default :9093).
- Scheduler metrics on SCHEDULER_METRICS_ADDR (default :9090).
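The endpoints above can be probed in one pass. The following is a minimal sketch, not part of the product tooling: the `probe` and `check_all` helpers are hypothetical, the gateway base URL and CA path are borrowed from the status check in this page, and plain HTTP on `localhost` plus a `/metrics` path for the scheduler are assumptions. The workflow engine is left out because its health path is not stated here.

```shell
# Sketch only: helper names, hosts, and schemes below are assumptions.
GATEWAY_URL="${GATEWAY_URL:-https://localhost:8081}"   # assumed gateway base URL
CA_CERT="${CA_CERT:-./certs/ca/ca.crt}"                # CA bundle used for the gateway

# probe NAME URL [extra curl args...]
# Prints "NAME ok" when the URL answers with a 2xx/3xx status, else "NAME FAIL".
probe() {
  local name="$1"
  local url="$2"
  shift 2
  if curl -s -f -o /dev/null --max-time 5 "$@" "$url" 2>/dev/null; then
    echo "$name ok"
  else
    echo "$name FAIL"
    return 1
  fi
}

# check_all: probes every endpoint and returns non-zero if any probe failed.
check_all() {
  local failed=0
  probe gateway-health "$GATEWAY_URL/health" --cacert "$CA_CERT" || failed=1
  probe gateway-status "$GATEWAY_URL/api/v1/status" --cacert "$CA_CERT" \
    -H "X-API-Key: ${CORDUM_API_KEY:-}" -H "X-Tenant-ID: default" || failed=1
  # Assumed: metrics are served over plain HTTP on localhost.
  probe gateway-metrics "http://localhost:9092/metrics" || failed=1
  probe scheduler-metrics "http://localhost:9090/metrics" || failed=1
  return "$failed"
}
```

Running `check_all` after sourcing the snippet prints one line per endpoint, so a failing service is visible at a glance.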
Status checks
cordumctl status
curl -sS https://localhost:8081/api/v1/status \
  --cacert ./certs/ca/ca.crt \
  -H "X-API-Key: $CORDUM_API_KEY" \
  -H "X-Tenant-ID: default" | jq
Logs and recovery
Compose logs
docker compose ps
docker compose logs -f api-gateway scheduler safety-kernel workflow-engine dashboard
docker compose logs --tail=200 redis
docker compose logs --tail=200 nats
Scheduler notes
- The scheduler reconciles stale jobs and marks timed-out work based on current timeout configuration.
- Config reloads use NATS notifications and Redis polling, so pool and timeout changes propagate without restarting every replica.
- DLQ inspection and retry use GET /api/v1/dlq and POST /api/v1/dlq/{job_id}/retry.
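As a rough illustration of the DLQ endpoints, the sketch below wraps them in shell helpers. The function names are hypothetical, and it assumes the DLQ routes accept the same TLS settings and `X-API-Key` / `X-Tenant-ID` headers as the status endpoint; the response format is not documented here, so the output is passed through unchanged.

```shell
# Sketch only: assumes the DLQ routes share the status endpoint's auth and TLS setup.
GATEWAY_URL="${GATEWAY_URL:-https://localhost:8081}"   # assumed gateway base URL
CA_CERT="${CA_CERT:-./certs/ca/ca.crt}"

# dlq_retry_url JOB_ID: prints the retry URL for one job.
dlq_retry_url() {
  printf '%s/api/v1/dlq/%s/retry\n' "$GATEWAY_URL" "$1"
}

# gateway_curl ARGS...: curl with the CA bundle and the two gateway headers.
gateway_curl() {
  curl -sS -f --cacert "$CA_CERT" \
    -H "X-API-Key: ${CORDUM_API_KEY:-}" \
    -H "X-Tenant-ID: ${CORDUM_TENANT:-default}" \
    "$@"
}

# dlq_list: fetches the current DLQ contents.
dlq_list() {
  gateway_curl "$GATEWAY_URL/api/v1/dlq"
}

# dlq_retry JOB_ID: asks the gateway to retry one dead-lettered job.
dlq_retry() {
  gateway_curl -X POST "$(dlq_retry_url "$1")"
}
```

A typical session would call `dlq_list` to find a job ID, then `dlq_retry <job_id>` for the job to requeue. `CORDUM_TENANT` is an illustrative variable, not a documented setting.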
Reset local state
Destructive local reset
docker compose exec redis redis-cli FLUSHALL
docker compose down -v
Warning
This wipes Redis state and the Compose volumes used by NATS and Redis. Use it only for local development and reproducible test resets.
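Because the reset is destructive, it can help to put it behind an explicit confirmation. The wrapper below is a hypothetical guard, not part of the shipped tooling: it runs the two reset commands only when called with the literal word `reset`.

```shell
# Sketch only: reset_local_state is an illustrative helper, not a documented command.
# Usage: reset_local_state reset
reset_local_state() {
  if [ "${1:-}" != "reset" ]; then
    echo "refusing to wipe local state: pass 'reset' to confirm" >&2
    return 2
  fi
  # Same two commands as the destructive local reset above.
  docker compose exec redis redis-cli FLUSHALL
  docker compose down -v
}
```

Calling `reset_local_state` with no argument, or with any other word, prints the refusal message and leaves Redis and the Compose volumes untouched.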