## The production problem
`no_pool_mapping` is often a configuration error.
Configuration errors are usually permanent until an operator changes state.
Permanent errors and transient errors should not share the same retry policy. If they do, queue time gets burned and incident triage slows down.
In Cordum today, this class is retryable. That choice is defensible during rollout races, but expensive during static misconfiguration.
## What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| AWS Builders' Library: Timeouts, Retries, and Backoff with Jitter | Why retries amplify load and why jitter is required to prevent retry storms. | No scheduler-level guidance for config errors like missing topic-to-pool mappings. |
| Azure Architecture: Retry Pattern | Clear split between transient and non-transient faults with bounded retry strategy. | No concrete mapping for queue dispatch failures where fault type changes over rollout time. |
| gRPC Status Codes | `FAILED_PRECONDITION` guidance: do not retry until system state is explicitly fixed. | No direct model for message-bus schedulers that must choose between requeue and DLQ semantics. |
## Cordum runtime mechanics
| Boundary | Current behavior | Why it matters |
|---|---|---|
| Error origin | `LeastLoadedStrategy.PickSubject` returns `ErrNoPoolMapping` for unmapped topic, invalid `preferred_pool`, or unsatisfied `requires`. | Multiple configuration states collapse into one scheduler error class. |
| Retry classification | `isRetryableSchedulingError` includes `ErrNoPoolMapping` with worker-capacity errors. | Mapping misses follow the same retry path as transient `no_workers` and `pool_overloaded` scenarios. |
| Backoff policy | Retry delay uses `base=1s`, `max=30s`, jitter `<500ms`, exponential growth, and attempt cap `50`. | Good storm control. Slower operator feedback for truly permanent misconfiguration. |
| Terminal outcome | At `attempts >= 50`, scheduler sets FAILED and emits DLQ error code `max_scheduling_retries`. | Terminal record can obscure the original `no_pool_mapping` trigger. |
| Docs alignment | Some docs pages still describe `no_pool_mapping` as immediate fail-fast DLQ behavior. | Runbooks may assume behavior that no longer matches runtime. |
## Scheduler code paths
### Where `ErrNoPoolMapping` is emitted
```go
// core/controlplane/scheduler/strategy_least_loaded.go (excerpt)
if poolHint != "" {
	if !containsPool(topicPools, poolHint) {
		return "", fmt.Errorf("%w: preferred pool %q not mapped for topic %q", ErrNoPoolMapping, poolHint, req.Topic)
	}
}
if len(topicPools) == 0 {
	return "", fmt.Errorf("%w: topic %q", ErrNoPoolMapping, req.Topic)
}
eligiblePools := filterEligiblePools(topicPools, jobRequires, routing.Pools)
if len(eligiblePools) == 0 {
	return "", fmt.Errorf("%w: no pool satisfies requires", ErrNoPoolMapping)
}
```

### Why it gets retried
```go
// core/controlplane/scheduler/engine.go + backoff.go (excerpt)
func isRetryableSchedulingError(err error) bool {
	if errors.Is(err, ErrNoWorkers) ||
		errors.Is(err, ErrPoolOverloaded) ||
		errors.Is(err, ErrTenantLimit) ||
		errors.Is(err, ErrNoPoolMapping) {
		return true
	}
	return false
}

const (
	backoffBase          = 1 * time.Second
	backoffMax           = 30 * time.Second
	backoffJitterMax     = 500 * time.Millisecond
	maxSchedulingRetries = 50
)
```

### How it terminates
```go
// core/controlplane/scheduler/engine.go + engine_test.go (excerpt)
if attempts >= maxSchedulingRetries {
	reason := fmt.Sprintf("max scheduling retries exceeded (attempts=%d)", attempts)
	_ = e.setJobState(jobID, JobStateFailed)
	_ = e.emitDLQWithRetry(jobID, topic, pb.JobStatus_JOB_STATUS_FAILED, reason, "max_scheduling_retries")
	return nil
}

// Test: below cap still retries
jobStore.attempts["job-retry"] = maxSchedulingRetries - 1
err := engine.processJob(ctx, req, "trace-retry")
if _, ok := err.(*retryableError); !ok {
	t.Fatalf("expected retryableError")
}
```

## Validation runbook
Use a tiny synthetic topic to test behavior before changing production retry semantics.
```shell
# 1) Reproduce mapping miss on a test topic
JOB_ID=$(cordumctl job submit --topic job.unmapped.demo --prompt "ping")
cordumctl job status "$JOB_ID" --json

# 2) Verify pool topology and mappings
cordumctl pool list
cordumctl pool get default

# 3) Add mapping and confirm new jobs recover
cordumctl pool topic add default job.unmapped.demo
JOB_ID2=$(cordumctl job submit --topic job.unmapped.demo --prompt "ping after mapping")
cordumctl job status "$JOB_ID2" --json

# 4) If original job is already in DLQ, replay it
cordumctl dlq retry "$JOB_ID"
```
## Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Always retry `no_pool_mapping` (current) | Tolerates short config propagation gaps and bootstrap races. | Adds latency and hides permanent misconfiguration behind retry noise. |
| Immediate fail-fast DLQ | Fast operator signal and cleaner root-cause attribution. | Can create noisy DLQ bursts during normal rollout sequencing. |
| Hybrid split policy | Retries only suspected transient mapping races; permanent mapping faults fail fast. | Needs finer error taxonomy and a state signal (config epoch/hash) in scheduler path. |
- This analysis is code-path based. It does not include live fleet percentiles for mapping-race duration.
- If your deployment model applies pool overlays asynchronously, immediate fail-fast may create avoidable DLQ churn.
- If your mapping is static and operator-managed, long retries delay useful alerts.
## Next steps
Implement this next:
1. Split `ErrNoPoolMapping` into finer classes: topic-missing, preferred-pool-missing, requires-unsatisfied.
2. Mark truly permanent classes as non-retryable so the DLQ reason stays `no_pool_mapping` at first failure.
3. Keep a short, configurable grace retry window for rollout races, then fail fast.
4. Align docs (`CORE.md`, `SCHEDULER_POOL_SPEC.md`) with the actual runtime policy and retry budgets.
Continue with AI Agent Retry Intent Propagation and AI Agent Worker Pool Draining.