The scheduler that knows when to stop.
Policy-before-dispatch. Least-loaded routing. Capability filtering. This is not your typical job queue.
Intelligence at the dispatch layer.
The scheduler doesn't just route jobs—it evaluates policy, selects optimal workers, and handles failure gracefully.
Least-loaded wins.
Workers are scored in real time. The scheduler picks the worker with the lowest score for optimal load distribution.
Every worker reports heartbeat metrics: active_jobs, cpu_load, gpu_utilization.
The scheduler calculates a score and picks the lowest. If a worker is overloaded (active_jobs > 90% of max_parallel_jobs), the scheduler waits for capacity or sends the job to the dead-letter queue (DLQ).
Direct dispatch: If preferred_worker_id is set, the scheduler bypasses scoring and routes directly to worker.<id>.jobs.
Capability filtering: Jobs with requires=[kubectl, GPU] only go to workers that advertise those capabilities in their heartbeat.
// Worker score calculation
score = active_jobs + cpu_load/100 + gpu_utilization/100

// Selection priority:
1. Check preferred_worker_id label
   → Direct dispatch to worker.<id>.jobs
2. Filter workers by requires capabilities
   → Only workers with [kubectl, GPU, ...] eligible
3. Pick lowest score worker from pool
   → Least-loaded gets the job
4. Overload detection
   if (active_jobs > max_parallel_jobs * 0.9) {
     reason = "pool_overloaded"
     // Scheduler waits or DLQs
   }
5. Fallback to topic queue
   → If no workers available, publish to job.*
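Below is a minimal sketch of that selection loop in Go; the Heartbeat fields and helper names are illustrative assumptions, not the scheduler's actual types.

// Least-loaded selection with capability filtering (illustrative sketch).
package scheduler

// Heartbeat carries the metrics each worker reports.
type Heartbeat struct {
	WorkerID        string
	ActiveJobs      int
	MaxParallelJobs int
	CPULoad         float64 // percent, 0-100
	GPUUtilization  float64 // percent, 0-100
	Capabilities    map[string]bool
}

// score: lower is better (score = active_jobs + cpu_load/100 + gpu_utilization/100).
func score(h Heartbeat) float64 {
	return float64(h.ActiveJobs) + h.CPULoad/100 + h.GPUUtilization/100
}

// overloaded flags workers above 90% of their parallel-job budget.
func overloaded(h Heartbeat) bool {
	return float64(h.ActiveJobs) > 0.9*float64(h.MaxParallelJobs)
}

// hasAll reports whether a worker advertises every required capability.
func hasAll(h Heartbeat, requires []string) bool {
	for _, r := range requires {
		if !h.Capabilities[r] {
			return false
		}
	}
	return true
}

// selectWorker returns the least-loaded eligible worker ID, or "" when the
// job should fall back to the topic queue.
func selectWorker(workers []Heartbeat, requires []string, preferred string) string {
	if preferred != "" {
		return preferred // direct dispatch: bypass scoring, route to worker.<id>.jobs
	}
	bestID, bestScore := "", 0.0
	for _, w := range workers {
		if !hasAll(w, requires) || overloaded(w) {
			continue // missing capability or pool_overloaded
		}
		if s := score(w); bestID == "" || s < bestScore {
			bestID, bestScore = w.WorkerID, s
		}
	}
	return bestID
}

Direct dispatch short-circuits scoring; every other job is filtered by capability and overload before the lowest score wins.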
No job executes without approval.
The scheduler calls the Safety Kernel before every dispatch. DENY stops the job. REQUIRE_APPROVAL pauses for human review.
Policy Flow
1. Scheduler receives sys.job.submit
2. Sets job state to PENDING
3. Calls Safety Kernel gRPC Check
4. ALLOW → dispatch immediately
   REQUIRE_APPROVAL → pause and notify
   DENY → reject and DLQ
   ALLOW_WITH_CONSTRAINTS → apply limits and dispatch
5. Decision recorded in audit trail with rule_id and reason
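A minimal sketch of that gate in Go, with a SafetyChecker callback standing in for the Safety Kernel's gRPC Check; the callback and type names are assumptions, not the actual API.

// Pre-dispatch policy gate (illustrative sketch).
package scheduler

import (
	"context"
	"fmt"
)

type Decision string

const (
	Allow                Decision = "ALLOW"
	AllowWithConstraints Decision = "ALLOW_WITH_CONSTRAINTS"
	RequireApproval      Decision = "REQUIRE_APPROVAL"
	Deny                 Decision = "DENY"
)

type CheckResult struct {
	Decision Decision
	RuleID   string
	Reason   string
}

type PolicyJob struct{ ID, Topic string }

// SafetyChecker stands in for the Safety Kernel's Check RPC.
type SafetyChecker func(ctx context.Context, job PolicyJob) (CheckResult, error)

// gateAndDispatch evaluates policy before dispatch and records every decision.
func gateAndDispatch(
	ctx context.Context,
	job PolicyJob,
	check SafetyChecker,
	dispatch func(PolicyJob) error,          // send to the selected worker
	pause func(PolicyJob) error,             // hold the job and notify a reviewer
	reject func(PolicyJob, string) error,    // DLQ the job with a reason
	audit func(jobID, ruleID, reason string),
) error {
	res, err := check(ctx, job)
	if err != nil {
		return fmt.Errorf("safety kernel unavailable, not dispatching: %w", err)
	}
	audit(job.ID, res.RuleID, res.Reason) // every decision lands in the audit trail

	switch res.Decision {
	case Allow, AllowWithConstraints:
		return dispatch(job) // ALLOW_WITH_CONSTRAINTS would apply limits before dispatch
	case RequireApproval:
		return pause(job)
	case Deny:
		return reject(job, res.Reason)
	default:
		return fmt.Errorf("unknown decision %q for job %s", res.Decision, job.ID)
	}
}

The sketch fails closed: if the Check call errors, nothing is dispatched.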
Why This Matters
Temporal doesn't have this. Jobs execute immediately after dispatch. Policy checks, if any, are manual.
n8n doesn't have this. Workflows run with no policy layer. Approvals are bolt-on UI steps.
Cordum makes it foundational. Every job—whether from the API, a workflow step, or a pack—goes through the Safety Kernel before a single line of code executes.
The result: Security teams can approve autonomous workflows because the policy layer is built in, not bolted on.
// Reconciler: marks stale jobs as TIMEOUT
reconciler_interval = 30s
for job in JobStore.index("DISPATCHED", "RUNNING") {
  if (now - job.updated_at > timeout) {
    job.state = TIMEOUT
    create_dlq_entry(job)
  }
}

// Pending Replayer: retries stuck PENDING jobs
pending_replay_interval = 60s
for job in JobStore.index("PENDING") {
  if (now - job.created_at > dispatch_timeout) {
    log("Replaying stuck PENDING job", job.id)
    republish_to_scheduler(job)
  }
}
Failure handling built in.
Reconciler marks stale jobs as TIMEOUT. Pending replayer retries stuck jobs. No jobs fall through the cracks.
Config-driven routing.
Pool mappings and timeouts live in the config service. Update them without restarting the scheduler.
topics:
  job.sre-investigator.collect.k8s:
    - sre-investigator-pool
  job.incident-enricher.triage:
    - incident-pool
  job.default:
    - default-pool
pools:
  sre-investigator-pool:
    requires:
      - kubectl
      - network:egress
  incident-pool:
    requires:
      - network:egress
  default-pool:
    requires: []

# Topic-specific timeouts
topics:
  job.sre-investigator.collect.k8s:
    dispatch_timeout: 30s
    execution_timeout: 300s
  job.default:
    dispatch_timeout: 10s
    execution_timeout: 120s

# Workflow-specific timeouts
workflows:
  sre-investigator.full-investigation:
    step_timeout: 600s
    run_timeout: 3600s

Hot reload: The scheduler reloads pools and timeouts from the config service every SCHEDULER_CONFIG_RELOAD_INTERVAL (default 30s). Update config, wait 30 seconds, and the scheduler picks up the changes. No restart required.
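A minimal sketch of that reload loop in Go; the ConfigClient interface, RoutingConfig shape, and field names are assumptions for illustration.

// Periodic config reload (illustrative sketch).
package scheduler

import (
	"context"
	"log"
	"sync/atomic"
	"time"
)

// RoutingConfig mirrors the pool mappings and timeouts shown above.
type RoutingConfig struct {
	Topics   map[string][]string // topic -> eligible pools
	Pools    map[string][]string // pool -> required capabilities
	Timeouts map[string]TopicTimeouts
}

type TopicTimeouts struct {
	DispatchTimeout  time.Duration
	ExecutionTimeout time.Duration
}

// ConfigClient is a stand-in for the config service API.
type ConfigClient interface {
	Fetch(ctx context.Context) (*RoutingConfig, error)
}

type Scheduler struct {
	configClient ConfigClient
	routing      atomic.Pointer[RoutingConfig] // current snapshot, swapped atomically
}

// reloadLoop refreshes routing config on a fixed interval; no restart needed.
func (s *Scheduler) reloadLoop(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval) // SCHEDULER_CONFIG_RELOAD_INTERVAL, default 30s
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			cfg, err := s.configClient.Fetch(ctx)
			if err != nil {
				log.Printf("config reload failed, keeping previous config: %v", err)
				continue
			}
			s.routing.Store(cfg) // dispatches after this point see the new pools and timeouts
		}
	}
}

Storing the snapshot atomically lets in-flight dispatches finish against the old config while new dispatches pick up the update.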