
AI Agent Stale Worker Dispatch Retries

A stale worker ID can outlive a fast retry loop. Then your scheduler says `no_workers` and moves on.

Deep Dive · 10 min read · Apr 2026
TL;DR
- Cordum retries stale direct-worker picks only three times in-process (`maxDispatchRetries = 3`).
- Worker liveness TTL defaults to `30s`, so stale snapshots can outlive a fast three-attempt loop.
- After loop exhaustion, stale selection is surfaced as retryable `ErrNoWorkers`, not a dedicated stale-worker reason code.
- This recovers eventually via outer backoff, but it can blur root cause and add avoidable retry churn.
Failure mode

The scheduler picks a direct-worker subject that is no longer alive, then burns its inner retries before outer backoff kicks in.

Current behavior

Three immediate re-picks with fresh snapshots. If still stale, classify as `no_workers` and retry later.

Operational payoff

Simple control flow and low dispatch latency when stale windows are brief.

Scope

This guide analyzes stale-worker handling in scheduler dispatch loops, not worker-side task idempotency.

The production problem

Worker registries are eventually fresh, not instantly fresh.

Dispatch loops are usually fast. Heartbeat expiry windows are usually slower.

That mismatch creates a stale-worker gap: the scheduler can re-pick quickly, still hit stale entries, and then escalate to generic retry paths.

If you cannot measure this gap, you will tune the wrong thing.

What top results cover and miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Kubernetes Readiness / Liveness Probes | How endpoint health gates traffic routing and why stale endpoints need explicit health signaling. | No scheduler-level retry-loop design for direct worker subject selection failures. |
| AWS ALB Target Group Health Checks | Load balancer health check lifecycle and target registration/routing behavior. | No per-job dispatch semantics when a stale worker identity is already chosen. |
| AWS Target Group Health Thresholds | Fail-open threshold tradeoffs during transient health-check instability. | No mapping from stale endpoint events to message-bus scheduler reason codes and retry budgets. |

Cordum runtime mechanics

| Boundary | Current behavior | Why it matters |
| --- | --- | --- |
| Inner dispatch loop | `processJob` runs `for dispatchAttempt := range maxDispatchRetries`, where `maxDispatchRetries = 3`. | Only three immediate stale-worker re-picks before the outer retry path. |
| Stale detection rule | Direct subjects (`worker.<id>.jobs`) are checked with `registry.IsAlive(workerID)`. | Topic-level fanout subjects bypass this direct stale-worker check. |
| Worker TTL window | Memory registry TTL is `30s` by default (`defaultWorkerTTL`). | Stale worker entries can persist longer than a fast three-attempt loop. |
| Error classification | Stale-worker exhaustion is wrapped as `ErrNoWorkers` and treated as retryable. | Operational telemetry may overstate generic no-worker pressure instead of stale-routing pressure. |
| Test coverage | Registry TTL/`IsAlive` tests exist; no dedicated test targets stale-worker loop exhaustion. | Regression risk remains for this specific dispatch branch. |

Scheduler code paths

Inner stale-worker retry loop

core/controlplane/scheduler/engine.go

```go
// core/controlplane/scheduler/engine.go (excerpt)
const maxDispatchRetries = 3

for dispatchAttempt := range maxDispatchRetries {
	if dispatchAttempt > 0 {
		// Re-snapshot so later attempts see fresher liveness data.
		workers = e.registry.Snapshot()
	}
	subject, err = e.strategy.PickSubject(req, workers)
	if err != nil {
		break
	}

	// Only direct subjects (worker.<id>.jobs) carry a worker ID;
	// topic fanout subjects return "" and skip the liveness check.
	workerID := extractWorkerFromSubject(subject)
	if workerID == "" || e.registry.IsAlive(workerID) {
		break
	}

	err = fmt.Errorf("%w: pool %q (stale worker %s)", ErrNoWorkers, topic, workerID)
}
```

Worker TTL and liveness check

core/controlplane/scheduler/registry_memory.go

```go
// core/controlplane/scheduler/registry_memory.go (excerpt)
const defaultWorkerTTL = 30 * time.Second

// IsAlive reports whether a worker's last heartbeat is within the TTL window.
func (r *MemoryRegistry) IsAlive(workerID string) bool {
	entry, ok := r.workers[workerID]
	if !ok {
		return false
	}
	return time.Since(entry.lastSeen) <= r.ttl
}
```

Current coverage and explicit test gap

core/controlplane/scheduler/*_test.go

```go
// Existing tests cover liveness primitives:
//   - TestMemoryRegistry_IsAlive
//   - TestMemoryRegistry_ExpiresStaleWorkers
//
// Missing today:
//   - an explicit test that stale-worker selection retries exactly 3 times
//   - an explicit assertion of reason-code/attempt behavior after stale loop exhaustion
```

Validation runbook

Treat stale-worker retries as a measurable reliability pattern, not just occasional log noise.

stale-worker-dispatch-runbook.sh

```bash
# 1) Validate current liveness primitives
go test ./core/controlplane/scheduler -run TestMemoryRegistry_IsAlive -count=1
go test ./core/controlplane/scheduler -run TestMemoryRegistry_ExpiresStaleWorkers -count=1

# 2) Trigger a direct-worker dispatch with labels in staging
cordumctl job submit --topic job.default --prompt "stale-worker probe" --labels '{"preferred_worker_id":"worker-stale"}'

# 3) Inspect scheduler logs for stale-loop events
rg "stale worker detected, retrying dispatch" /var/log/cordum/scheduler.log

# 4) Correlate with no_workers retries and worker heartbeat freshness
cordumctl job status <job_id> --json
```

Limitations and tradeoffs

| Approach | Upside | Downside |
| --- | --- | --- |
| Fixed 3 immediate retries (current) | Low latency and simple behavior for short stale windows. | Can under-retry relative to 30s TTL windows and hide the stale-specific root cause. |
| More inner retries with tiny jitter | Better chance to recover before outer backoff and queue churn. | Higher dispatch-path CPU/log volume under sustained stale conditions. |
| Dedicated stale-worker reason + metric | Sharper observability and better capacity-tuning decisions. | Extra code path and compatibility work for reason-code consumers. |
- This analysis focuses on dispatch selection freshness, not policy or approval gating behavior.
- Outer backoff still protects the system, but stale-specific observability is weak today.
- Without dedicated stale-loop tests, future refactors can regress this behavior quietly.
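The middle row of the table, inner retries with tiny jitter, could take roughly this shape. The attempt count, jitter bounds, and function signature are assumptions for illustration, not an agreed design:

```go
package main

import (
	"math/rand"
	"time"
)

// pickWithJitter sketches "more inner retries with tiny jitter":
// between re-picks, sleep a small randomized delay so repeated
// snapshots have a chance to observe heartbeat expiry. The caller
// supplies liveness and pick functions so the sketch stays generic.
func pickWithJitter(isAlive func(string) bool, pick func() string, maxAttempts int) (string, bool) {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if attempt > 0 {
			// 1-5ms of jitter: enough to decorrelate retries, short
			// enough to stay well under outer backoff intervals.
			time.Sleep(time.Millisecond + time.Duration(rand.Intn(4))*time.Millisecond)
		}
		workerID := pick()
		// Empty ID models a fanout subject, which skips the check.
		if workerID == "" || isAlive(workerID) {
			return workerID, true
		}
	}
	return "", false
}
```

Note the downside from the table still applies: under a sustained stale window, every sleep is pure added dispatch latency.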

Next step

Implement this next:

1. Add a dedicated unit test for stale-worker loop exhaustion and expected retry behavior.
2. Introduce `stale_worker` reason-code mapping separate from generic `no_workers`.
3. Add a `scheduler_stale_worker_retry_total` metric with topic and worker labels.
4. Re-evaluate `maxDispatchRetries = 3` against observed heartbeat lag percentiles.

Continue with AI Agent Worker Heartbeat Warm Start and AI Agent Dispatch Rollback Consistency.

Freshness is a budget, not a boolean

If heartbeat freshness and dispatch retries are tuned independently, stale-worker churn will find the gap.