The scheduler that knows when to stop.
Policy-before-dispatch. Least-loaded routing. Capability filtering. This is not your typical job queue.
Intelligence at the dispatch layer.
The scheduler doesn't just route jobs—it evaluates policy, selects optimal workers, and handles failure gracefully.
Least-loaded wins.
Workers are scored in real time. The scheduler picks the worker with the lowest score for optimal load distribution.
Every worker reports heartbeat metrics: active_jobs, cpu_load, gpu_utilization.
The scheduler calculates a score and picks the lowest. If a worker is overloaded (active_jobs > 90% of max_parallel_jobs), the scheduler waits for capacity or sends the job to the dead-letter queue (DLQ).
Direct dispatch: If preferred_worker_id is set, the scheduler bypasses scoring and routes directly to worker.<id>.jobs.
Capability filtering: Jobs with requires=[kubectl, GPU] only go to workers that advertise those capabilities in their heartbeat.
// Worker score calculation
score = active_jobs + cpu_load/100 + gpu_utilization/100

// Selection priority:
1. Check preferred_worker_id label
   → Direct dispatch to worker.<id>.jobs
2. Filter workers by requires capabilities
   → Only workers with [kubectl, GPU, ...] eligible
3. Pick lowest score worker from pool
   → Least-loaded gets the job
4. Overload detection
   if (active_jobs > max_parallel_jobs * 0.9) {
     reason = "pool_overloaded"
     // Scheduler waits or DLQs
   }
5. Fallback to topic queue
   → If no workers available, publish to job.*
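Below is a minimal sketch of that selection loop in Go; the Heartbeat fields and helper names are illustrative assumptions, not the scheduler's actual types.

// Least-loaded selection with capability filtering (illustrative sketch).
package scheduler

// Heartbeat carries the metrics each worker reports.
type Heartbeat struct {
	WorkerID        string
	ActiveJobs      int
	MaxParallelJobs int
	CPULoad         float64 // percent, 0-100
	GPUUtilization  float64 // percent, 0-100
	Capabilities    map[string]bool
}

// score: lower is better (score = active_jobs + cpu_load/100 + gpu_utilization/100).
func score(h Heartbeat) float64 {
	return float64(h.ActiveJobs) + h.CPULoad/100 + h.GPUUtilization/100
}

// overloaded flags workers above 90% of their parallel-job budget.
func overloaded(h Heartbeat) bool {
	return float64(h.ActiveJobs) > 0.9*float64(h.MaxParallelJobs)
}

// hasAll reports whether a worker advertises every required capability.
func hasAll(h Heartbeat, requires []string) bool {
	for _, r := range requires {
		if !h.Capabilities[r] {
			return false
		}
	}
	return true
}

// selectWorker returns the least-loaded eligible worker ID, or "" when the
// job should fall back to the topic queue.
func selectWorker(workers []Heartbeat, requires []string, preferred string) string {
	if preferred != "" {
		return preferred // direct dispatch: bypass scoring, route to worker.<id>.jobs
	}
	bestID, bestScore := "", 0.0
	for _, w := range workers {
		if !hasAll(w, requires) || overloaded(w) {
			continue // missing capability or pool_overloaded
		}
		if s := score(w); bestID == "" || s < bestScore {
			bestID, bestScore = w.WorkerID, s
		}
	}
	return bestID
}

Direct dispatch short-circuits scoring; every other job is filtered by capability and overload before the lowest score wins.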
No job executes without approval.
The scheduler calls the Safety Kernel before every dispatch. DENY stops the job. REQUIRE_APPROVAL pauses for human review.
Policy Flow
1. Scheduler receives sys.job.submit
2. Sets job state to PENDING
3. Calls Safety Kernel gRPC Check
4. ALLOW → dispatch immediately
   REQUIRE_APPROVAL → pause and notify
   DENY → reject and DLQ
   ALLOW_WITH_CONSTRAINTS → apply limits and dispatch
5. Decision recorded in audit trail with rule_id and reason
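A minimal sketch of that gate in Go, with a SafetyChecker callback standing in for the Safety Kernel's gRPC Check; the callback and type names are assumptions, not the actual API.

// Pre-dispatch policy gate (illustrative sketch).
package scheduler

import (
	"context"
	"fmt"
)

type Decision string

const (
	Allow                Decision = "ALLOW"
	AllowWithConstraints Decision = "ALLOW_WITH_CONSTRAINTS"
	RequireApproval      Decision = "REQUIRE_APPROVAL"
	Deny                 Decision = "DENY"
)

type CheckResult struct {
	Decision Decision
	RuleID   string
	Reason   string
}

type PolicyJob struct{ ID, Topic string }

// SafetyChecker stands in for the Safety Kernel's Check RPC.
type SafetyChecker func(ctx context.Context, job PolicyJob) (CheckResult, error)

// gateAndDispatch evaluates policy before dispatch and records every decision.
func gateAndDispatch(
	ctx context.Context,
	job PolicyJob,
	check SafetyChecker,
	dispatch func(PolicyJob) error,          // send to the selected worker
	pause func(PolicyJob) error,             // hold the job and notify a reviewer
	reject func(PolicyJob, string) error,    // DLQ the job with a reason
	audit func(jobID, ruleID, reason string),
) error {
	res, err := check(ctx, job)
	if err != nil {
		return fmt.Errorf("safety kernel unavailable, not dispatching: %w", err)
	}
	audit(job.ID, res.RuleID, res.Reason) // every decision lands in the audit trail

	switch res.Decision {
	case Allow, AllowWithConstraints:
		return dispatch(job) // ALLOW_WITH_CONSTRAINTS would apply limits before dispatch
	case RequireApproval:
		return pause(job)
	case Deny:
		return reject(job, res.Reason)
	default:
		return fmt.Errorf("unknown decision %q for job %s", res.Decision, job.ID)
	}
}

The sketch fails closed: if the Check call errors, nothing is dispatched.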
Why This Matters
Temporal doesn't have this. Jobs execute immediately after dispatch. Policy checks, if any, are manual.
n8n doesn't have this. Workflows run with no policy layer. Approvals are bolt-on UI steps.
Cordum makes it foundational. Every job—whether from the API, a workflow step, or a pack—goes through the Safety Kernel before a single line of code executes.
The result: Security teams can approve autonomous workflows because the policy layer is built in, not bolted on.
// Reconciler: marks stale jobs as TIMEOUT
reconciler_interval = 30s
for job in JobStore.index("DISPATCHED", "RUNNING") {
  if (now - job.updated_at > timeout) {
    job.state = TIMEOUT
    create_dlq_entry(job)
  }
}

// Pending Replayer: retries stuck PENDING jobs
pending_replay_interval = 60s
for job in JobStore.index("PENDING") {
  if (now - job.created_at > dispatch_timeout) {
    log("Replaying stuck PENDING job", job.id)
    republish_to_scheduler(job)
  }
}
Failure handling built in.
Reconciler marks stale jobs as TIMEOUT. Pending replayer retries stuck jobs. No jobs fall through the cracks.
Config-driven routing.
Pool mappings and timeouts live in the config service. Update them without restarting the scheduler.
topics:
  job.sre-investigator.collect.k8s:
    - sre-investigator-pool
  job.incident-enricher.triage:
    - incident-pool
  job.default:
    - default-pool
pools:
  sre-investigator-pool:
    requires:
      - kubectl
      - network:egress
  incident-pool:
    requires:
      - network:egress
  default-pool:
    requires: []

# Topic-specific timeouts
topics:
  job.sre-investigator.collect.k8s:
    dispatch_timeout: 30s
    execution_timeout: 300s
  job.default:
    dispatch_timeout: 10s
    execution_timeout: 120s

# Workflow-specific timeouts
workflows:
  sre-investigator.full-investigation:
    step_timeout: 600s
    run_timeout: 3600s

Hot reload: The scheduler reloads pools and timeouts from the config service every SCHEDULER_CONFIG_RELOAD_INTERVAL (default 30s). Update config, wait 30 seconds, and the scheduler picks up the changes. No restart required.
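A minimal sketch of that reload loop in Go; the ConfigClient interface, RoutingConfig shape, and field names are assumptions for illustration.

// Periodic config reload (illustrative sketch).
package scheduler

import (
	"context"
	"log"
	"sync/atomic"
	"time"
)

// RoutingConfig mirrors the pool mappings and timeouts shown above.
type RoutingConfig struct {
	Topics   map[string][]string // topic -> eligible pools
	Pools    map[string][]string // pool -> required capabilities
	Timeouts map[string]TopicTimeouts
}

type TopicTimeouts struct {
	DispatchTimeout  time.Duration
	ExecutionTimeout time.Duration
}

// ConfigClient is a stand-in for the config service API.
type ConfigClient interface {
	Fetch(ctx context.Context) (*RoutingConfig, error)
}

type Scheduler struct {
	configClient ConfigClient
	routing      atomic.Pointer[RoutingConfig] // current snapshot, swapped atomically
}

// reloadLoop refreshes routing config on a fixed interval; no restart needed.
func (s *Scheduler) reloadLoop(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval) // SCHEDULER_CONFIG_RELOAD_INTERVAL, default 30s
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			cfg, err := s.configClient.Fetch(ctx)
			if err != nil {
				log.Printf("config reload failed, keeping previous config: %v", err)
				continue
			}
			s.routing.Store(cfg) // dispatches after this point see the new pools and timeouts
		}
	}
}

Storing the snapshot atomically lets in-flight dispatches finish against the old config while new dispatches pick up the update.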