## The production problem
Worker registries are eventually fresh, not instantly fresh.
Dispatch loops are usually fast. Heartbeat expiry windows are usually slower.
That mismatch creates a stale-worker gap: the scheduler can re-pick quickly, still land on stale entries, and then escalate to generic retry paths.
If you cannot measure this gap, you will tune the wrong thing.
## What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Kubernetes Readiness / Liveness Probes | How endpoint health gates traffic routing and why stale endpoints need explicit health signaling. | No scheduler-level retry-loop design for direct worker subject selection failures. |
| AWS ALB Target Group Health Checks | Load balancer health check lifecycle and target registration/routing behavior. | No per-job dispatch semantics when stale worker identity is already chosen. |
| AWS Target Group Health Thresholds | Fail-open threshold tradeoffs during transient health-check instability. | No mapping from stale endpoint events to message-bus scheduler reason codes and retry budgets. |
## Cordum runtime mechanics
| Boundary | Current behavior | Why it matters |
|---|---|---|
| Inner dispatch loop | `processJob` runs `for dispatchAttempt := range maxDispatchRetries` where `maxDispatchRetries = 3`. | Only three immediate stale-worker re-picks before outer retry path. |
| Stale detection rule | Direct subjects (`worker.<id>.jobs`) are checked with `registry.IsAlive(workerID)`. | Topic-level fanout subjects bypass this direct stale-worker check. |
| Worker TTL window | Memory registry TTL is `30s` by default (`defaultWorkerTTL`). | Stale worker entries can persist longer than a fast three-attempt loop. |
| Error classification | Stale-worker exhaustion is wrapped as `ErrNoWorkers` and treated as retryable. | Operational telemetry may overstate generic no-worker pressure instead of stale-routing pressure. |
| Test coverage | Registry TTL/IsAlive tests exist; no dedicated test targets stale-worker loop exhaustion behavior. | Regression risk remains for this specific dispatch branch. |
## Scheduler code paths

### Inner stale-worker retry loop
```go
// core/controlplane/scheduler/engine.go (excerpt)
const maxDispatchRetries = 3

for dispatchAttempt := range maxDispatchRetries {
    if dispatchAttempt > 0 {
        workers = e.registry.Snapshot()
    }
    subject, err = e.strategy.PickSubject(req, workers)
    if err != nil {
        break
    }
    workerID := extractWorkerFromSubject(subject)
    if workerID == "" || e.registry.IsAlive(workerID) {
        break
    }
    err = fmt.Errorf("%w: pool %q (stale worker %s)", ErrNoWorkers, topic, workerID)
}
```

### Worker TTL and liveness check
```go
// core/controlplane/scheduler/registry_memory.go (excerpt)
const defaultWorkerTTL = 30 * time.Second

func (r *MemoryRegistry) IsAlive(workerID string) bool {
    entry, ok := r.workers[workerID]
    if !ok {
        return false
    }
    return time.Since(entry.lastSeen) <= r.ttl
}
```

### Current coverage and explicit test gap
```go
// Existing tests cover liveness primitives:
//   - TestMemoryRegistry_IsAlive
//   - TestMemoryRegistry_ExpiresStaleWorkers
//
// Missing today:
//   - explicit test that stale-worker selection retries exactly 3 times
//   - explicit assertion of reason-code/attempt behavior after stale loop exhaustion
```
## Validation runbook
Treat stale-worker retries as a measurable reliability pattern, not just occasional log noise.
```sh
# 1) Validate current liveness primitives
go test ./core/controlplane/scheduler -run TestMemoryRegistry_IsAlive -count=1
go test ./core/controlplane/scheduler -run TestMemoryRegistry_ExpiresStaleWorkers -count=1

# 2) Trigger a direct-worker dispatch with labels in staging
cordumctl job submit --topic job.default --prompt "stale-worker probe" --labels '{"preferred_worker_id":"worker-stale"}'

# 3) Inspect scheduler logs for stale loop events
rg "stale worker detected, retrying dispatch" /var/log/cordum/scheduler.log

# 4) Correlate with no_workers retries and worker heartbeat freshness
cordumctl job status <job_id> --json
```

## Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Fixed 3 immediate retries (current) | Low latency and simple behavior for short stale windows. | Can under-retry relative to 30s TTL windows and hide stale-specific root cause. |
| More inner retries with tiny jitter | Better chance to recover before outer backoff and queue churn. | Higher dispatch-path CPU/log volume under sustained stale conditions. |
| Dedicated stale-worker reason + metric | Sharper observability and better capacity tuning decisions. | Extra code path and compatibility work for reason-code consumers. |
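The "more inner retries with tiny jitter" row could look something like the sketch below. The jitter bounds, sleep placement, and helper names are assumptions for illustration, not Cordum code:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// pickWithJitter retries a stale pick with a small randomized pause between
// attempts, giving late heartbeats a chance to land before the outer backoff
// path takes over.
func pickWithJitter(maxRetries int, pick func() (string, bool)) (string, bool) {
	for i := 0; i < maxRetries; i++ {
		if id, alive := pick(); alive {
			return id, true
		}
		// Assumed jitter window of 1-5ms; in practice this should be tuned
		// against observed heartbeat lag percentiles.
		time.Sleep(time.Duration(1+rand.Intn(4)) * time.Millisecond)
	}
	return "", false
}

func main() {
	calls := 0
	pick := func() (string, bool) {
		calls++
		return "worker-a", calls >= 3 // worker's heartbeat lands on the 3rd pick
	}
	id, ok := pickWithJitter(5, pick)
	fmt.Println(id, ok, calls) // → worker-a true 3
}
```

The tradeoff in the table applies directly: each extra jittered attempt holds the dispatch path longer and emits more logs under sustained stale conditions.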
- This analysis focuses on dispatch selection freshness, not policy or approval gating behavior.
- Outer backoff still protects the system, but stale-specific observability is weak today.
- Without dedicated stale-loop tests, future refactors can regress behavior quietly.
## Next step
Implement this next:
1. Add a dedicated unit test for stale-worker loop exhaustion and expected retry behavior.
2. Introduce a `stale_worker` reason-code mapping separate from generic `no_workers`.
3. Add a `scheduler_stale_worker_retry_total` metric with topic and worker labels.
4. Re-evaluate `maxDispatchRetries = 3` against observed heartbeat lag percentiles.
Continue with AI Agent Worker Heartbeat Warm Start and AI Agent Dispatch Rollback Consistency.