## The production problem
Worker registries are eventually fresh, not instantly fresh.
Dispatch loops are usually fast. Heartbeat expiry windows are usually slower.
That mismatch creates a stale-worker gap: the scheduler can re-pick quickly, still land on stale entries, and then escalate to generic retry paths.
If you cannot measure this gap, you will tune the wrong thing.
## What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Kubernetes Readiness / Liveness Probes | How endpoint health gates traffic routing and why stale endpoints need explicit health signaling. | No scheduler-level retry-loop design for direct worker subject selection failures. |
| AWS ALB Target Group Health Checks | Load balancer health check lifecycle and target registration/routing behavior. | No per-job dispatch semantics when stale worker identity is already chosen. |
| AWS Target Group Health Thresholds | Fail-open threshold tradeoffs during transient health-check instability. | No mapping from stale endpoint events to message-bus scheduler reason codes and retry budgets. |
## Cordum runtime mechanics
| Boundary | Current behavior | Why it matters |
|---|---|---|
| Inner dispatch loop | `processJob` runs `for dispatchAttempt := range maxDispatchRetries` where `maxDispatchRetries = 3`. | Only three immediate stale-worker re-picks before outer retry path. |
| Stale detection rule | Direct subjects (`worker.<id>.jobs`) are checked with `registry.IsAlive(workerID)`. | Topic-level fanout subjects bypass this direct stale-worker check. |
| Worker TTL window | Memory registry TTL is `30s` by default (`defaultWorkerTTL`). | Stale worker entries can persist longer than a fast three-attempt loop. |
| Error classification | Stale-worker exhaustion is wrapped as `ErrNoWorkers` and treated as retryable. | Operational telemetry may overstate generic no-worker pressure instead of stale-routing pressure. |
| Test coverage | Registry TTL/IsAlive tests exist; no dedicated test targets stale-worker loop exhaustion behavior. | Regression risk remains for this specific dispatch branch. |
## Scheduler code paths

### Inner stale-worker retry loop
```go
// core/controlplane/scheduler/engine.go (excerpt)
const maxDispatchRetries = 3

for dispatchAttempt := range maxDispatchRetries {
    if dispatchAttempt > 0 {
        workers = e.registry.Snapshot()
    }
    subject, err = e.strategy.PickSubject(req, workers)
    if err != nil {
        break
    }
    workerID := extractWorkerFromSubject(subject)
    if workerID == "" || e.registry.IsAlive(workerID) {
        break
    }
    err = fmt.Errorf("%w: pool %q (stale worker %s)", ErrNoWorkers, topic, workerID)
}
```

### Worker TTL and liveness check
```go
// core/controlplane/scheduler/registry_memory.go (excerpt)
const defaultWorkerTTL = 30 * time.Second

func (r *MemoryRegistry) IsAlive(workerID string) bool {
    entry, ok := r.workers[workerID]
    if !ok {
        return false
    }
    return time.Since(entry.lastSeen) <= r.ttl
}
```

### Current coverage and explicit test gap
```go
// Existing tests cover liveness primitives:
//   - TestMemoryRegistry_IsAlive
//   - TestMemoryRegistry_ExpiresStaleWorkers
//
// Missing today:
//   - explicit test that stale-worker selection retries exactly 3 times
//   - explicit assertion of reason-code/attempt behavior after stale loop exhaustion
```
## Validation runbook
Treat stale-worker retries as a measurable reliability pattern, not just occasional log noise.
```sh
# 1) Validate current liveness primitives
go test ./core/controlplane/scheduler -run TestMemoryRegistry_IsAlive -count=1
go test ./core/controlplane/scheduler -run TestMemoryRegistry_ExpiresStaleWorkers -count=1

# 2) Trigger a direct-worker dispatch with labels in staging
cordumctl job submit --topic job.default --prompt "stale-worker probe" --labels '{"preferred_worker_id":"worker-stale"}'

# 3) Inspect scheduler logs for stale loop events
rg "stale worker detected, retrying dispatch" /var/log/cordum/scheduler.log

# 4) Correlate with no_workers retries and worker heartbeat freshness
cordumctl job status <job_id> --json
```

## Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Fixed 3 immediate retries (current) | Low latency and simple behavior for short stale windows. | Can under-retry relative to 30s TTL windows and hide stale-specific root cause. |
| More inner retries with tiny jitter | Better chance to recover before outer backoff and queue churn. | Higher dispatch-path CPU/log volume under sustained stale conditions. |
| Dedicated stale-worker reason + metric | Sharper observability and better capacity tuning decisions. | Extra code path and compatibility work for reason-code consumers. |
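The "more inner retries with tiny jitter" row could look something like the sketch below. The jitter bounds, sleep placement, and helper names are assumptions for illustration, not Cordum code:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// pickWithJitter retries a stale pick with a small randomized pause between
// attempts, giving late heartbeats a chance to land before the outer backoff
// path takes over.
func pickWithJitter(maxRetries int, pick func() (string, bool)) (string, bool) {
	for i := 0; i < maxRetries; i++ {
		if id, alive := pick(); alive {
			return id, true
		}
		// Assumed jitter window of 1-5ms; in practice this should be tuned
		// against observed heartbeat lag percentiles.
		time.Sleep(time.Duration(1+rand.Intn(4)) * time.Millisecond)
	}
	return "", false
}

func main() {
	calls := 0
	pick := func() (string, bool) {
		calls++
		return "worker-a", calls >= 3 // worker's heartbeat lands on the 3rd pick
	}
	id, ok := pickWithJitter(5, pick)
	fmt.Println(id, ok, calls) // → worker-a true 3
}
```

The tradeoff in the table applies directly: each extra jittered attempt holds the dispatch path longer and emits more logs under sustained stale conditions.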
- This analysis focuses on dispatch selection freshness, not policy or approval gating behavior.
- Outer backoff still protects the system, but stale-specific observability is weak today.
- Without dedicated stale-loop tests, future refactors can regress behavior quietly.
## Next step
Implement this next:
1. Add a dedicated unit test for stale-worker loop exhaustion and expected retry behavior.
2. Introduce a `stale_worker` reason-code mapping separate from generic `no_workers`.
3. Add a `scheduler_stale_worker_retry_total` metric with topic and worker labels.
4. Re-evaluate `maxDispatchRetries = 3` against observed heartbeat lag percentiles.
Continue with AI Agent Worker Heartbeat Warm Start and AI Agent Dispatch Rollback Consistency.