
AI Agent `no_pool_mapping` Retry Policy

Missing pool mappings look like config bugs, but schedulers still need to survive rollout races.

Deep Dive · 10 min read · Apr 2026
TL;DR
- Cordum currently treats `ErrNoPoolMapping` as retryable, not terminal, in `isRetryableSchedulingError`.
- Retry delay follows exponential backoff with crypto-sourced jitter: base `1s`, max `30s`, jitter `<500ms`, capped at `maxSchedulingRetries = 50` attempts.
- At the retry cap, jobs fail with DLQ error code `max_scheduling_retries`, which can hide the original `no_pool_mapping` root cause.
- This policy helps during short config-propagation races, but it can waste queue time on static misconfiguration.
Failure mode

A topic ships before its pool mapping. Jobs churn through up to 50 scheduling retries, then fail with the generic terminal reason `max_scheduling_retries` instead of the original cause.

Current behavior

Missing mapping is classified as retryable. The scheduler backs off and requeues until the retry cap is reached.

Operational payoff

Short-lived mapping races can self-heal without manual DLQ replay.

Scope

This guide covers scheduler behavior for missing topic-to-pool mappings in Cordum and the operational impact on retries, DLQ reason codes, and runbooks.

The production problem

`no_pool_mapping` is often a configuration error.

Configuration errors are usually permanent until an operator changes state.

Permanent errors and transient errors should not share the same retry policy. If they do, queue time gets burned and incident triage slows down.

In Cordum today, this class is retryable. That choice is defensible during rollout races, but expensive during static misconfiguration.

What top results cover and miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| AWS Builders' Library: Timeouts, Retries, and Backoff with Jitter | Why retries amplify load and why jitter is required to prevent retry storms. | No scheduler-level guidance for config errors like missing topic-to-pool mappings. |
| Azure Architecture: Retry Pattern | Clear split between transient and non-transient faults with a bounded retry strategy. | No concrete mapping for queue-dispatch failures where the fault type changes over rollout time. |
| gRPC Status Codes | `FAILED_PRECONDITION` guidance: do not retry until system state is explicitly fixed. | No direct model for message-bus schedulers that must choose between requeue and DLQ semantics. |

Cordum runtime mechanics

| Boundary | Current behavior | Why it matters |
| --- | --- | --- |
| Error origin | `LeastLoadedStrategy.PickSubject` returns `ErrNoPoolMapping` for an unmapped topic, an invalid `preferred_pool`, or unsatisfied `requires`. | Multiple configuration states collapse into one scheduler error class. |
| Retry classification | `isRetryableSchedulingError` groups `ErrNoPoolMapping` with worker-capacity errors. | Mapping misses follow the same retry path as transient `no_workers` and `pool_overloaded` scenarios. |
| Backoff policy | Retry delay uses `base=1s`, `max=30s`, jitter `<500ms`, exponential growth, and an attempt cap of `50`. | Good storm control, but slower operator feedback for truly permanent misconfiguration. |
| Terminal outcome | At `attempts >= 50`, the scheduler sets FAILED and emits DLQ error code `max_scheduling_retries`. | The terminal record can obscure the original `no_pool_mapping` trigger. |
| Docs alignment | Some docs pages still describe `no_pool_mapping` as immediate fail-fast DLQ behavior. | Runbooks may assume behavior that no longer matches the runtime. |

Scheduler code paths

Where `ErrNoPoolMapping` is emitted

`core/controlplane/scheduler/strategy_least_loaded.go`:

```go
// core/controlplane/scheduler/strategy_least_loaded.go (excerpt)
if poolHint != "" {
  if !containsPool(topicPools, poolHint) {
    return "", fmt.Errorf("%w: preferred pool %q not mapped for topic %q", ErrNoPoolMapping, poolHint, req.Topic)
  }
}
if len(topicPools) == 0 {
  return "", fmt.Errorf("%w: topic %q", ErrNoPoolMapping, req.Topic)
}
eligiblePools := filterEligiblePools(topicPools, jobRequires, routing.Pools)
if len(eligiblePools) == 0 {
  return "", fmt.Errorf("%w: no pool satisfies requires", ErrNoPoolMapping)
}
```

Why it gets retried

`core/controlplane/scheduler/engine.go` + `backoff.go`:

```go
// core/controlplane/scheduler/engine.go + backoff.go (excerpt)
func isRetryableSchedulingError(err error) bool {
  if errors.Is(err, ErrNoWorkers) ||
     errors.Is(err, ErrPoolOverloaded) ||
     errors.Is(err, ErrTenantLimit) ||
     errors.Is(err, ErrNoPoolMapping) {
    return true
  }
  return false
}

const (
  backoffBase          = 1 * time.Second
  backoffMax           = 30 * time.Second
  backoffJitterMax     = 500 * time.Millisecond
  maxSchedulingRetries = 50
)
```
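The excerpt defines the constants but elides the delay computation. A plausible sketch of exponential backoff with crypto-sourced jitter under those constants (`retryDelay` and its exact growth formula are assumptions, not Cordum's actual code):

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
	"time"
)

const (
	backoffBase      = 1 * time.Second
	backoffMax       = 30 * time.Second
	backoffJitterMax = 500 * time.Millisecond
)

// retryDelay returns base * 2^(attempt-1), capped at backoffMax, plus a
// crypto-sourced jitter in [0, backoffJitterMax).
func retryDelay(attempt int) time.Duration {
	d := backoffBase
	for i := 1; i < attempt; i++ {
		d *= 2
		if d >= backoffMax {
			d = backoffMax
			break
		}
	}
	j, err := rand.Int(rand.Reader, big.NewInt(int64(backoffJitterMax)))
	if err != nil {
		return d // fall back to the un-jittered delay
	}
	return d + time.Duration(j.Int64())
}

func main() {
	for _, attempt := range []int{1, 2, 6, 50} {
		fmt.Printf("attempt %2d: %v\n", attempt, retryDelay(attempt))
	}
}
```

With base `1s` and max `30s`, the delay saturates at attempt 6, so attempts 6 through 49 each wait roughly 30 seconds.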

How it terminates

`core/controlplane/scheduler/engine.go` + `engine_test.go`:

```go
// core/controlplane/scheduler/engine.go + engine_test.go (excerpt)
if attempts >= maxSchedulingRetries {
  reason := fmt.Sprintf("max scheduling retries exceeded (attempts=%d)", attempts)
  _ = e.setJobState(jobID, JobStateFailed)
  _ = e.emitDLQWithRetry(jobID, topic, pb.JobStatus_JOB_STATUS_FAILED, reason, "max_scheduling_retries")
  return nil
}

// Test: one attempt below the cap still produces a retryable error
jobStore.attempts["job-retry"] = maxSchedulingRetries - 1
err := engine.processJob(ctx, req, "trace-retry")
if _, ok := err.(*retryableError); !ok {
  t.Fatalf("expected retryableError, got %v", err)
}
```

Validation runbook

Use a tiny synthetic topic to test behavior before changing production retry semantics.

`no-pool-mapping-runbook.sh`:

```bash
# 1) Reproduce the mapping miss on a test topic
JOB_ID=$(cordumctl job submit --topic job.unmapped.demo --prompt "ping")
cordumctl job status "$JOB_ID" --json

# 2) Verify pool topology and mappings
cordumctl pool list
cordumctl pool get default

# 3) Add the mapping and confirm new jobs recover
cordumctl pool topic add default job.unmapped.demo
JOB_ID2=$(cordumctl job submit --topic job.unmapped.demo --prompt "ping after mapping")
cordumctl job status "$JOB_ID2" --json

# 4) If the original job is already in the DLQ, replay it
cordumctl dlq retry "$JOB_ID"
```

Limitations and tradeoffs

| Approach | Upside | Downside |
| --- | --- | --- |
| Always retry `no_pool_mapping` (current) | Tolerates short config-propagation gaps and bootstrap races. | Adds latency and hides permanent misconfiguration behind retry noise. |
| Immediate fail-fast DLQ | Fast operator signal and cleaner root-cause attribution. | Can create noisy DLQ bursts during normal rollout sequencing. |
| Hybrid split policy | Retries only suspected transient mapping races; permanent mapping faults fail fast. | Needs a finer error taxonomy and a state signal (config epoch/hash) in the scheduler path. |
- This analysis is code-path based; it does not include live fleet percentiles for mapping-race duration.
- If your deployment model applies pool overlays asynchronously, immediate fail-fast may create avoidable DLQ churn.
- If your mapping is static and operator-managed, long retries delay useful alerts.

Next step

Implement this next:

  1. Split `ErrNoPoolMapping` into finer classes: topic-missing, preferred-pool-missing, requires-unsatisfied.
  2. Mark truly permanent classes as non-retryable so the DLQ reason stays `no_pool_mapping` at first failure.
  3. Keep a short, configurable grace-retry window for rollout races, then fail fast.
  4. Align docs (`CORE.md`, `SCHEDULER_POOL_SPEC.md`) with the actual runtime policy and retry budgets.

Continue with AI Agent Retry Intent Propagation and AI Agent Worker Pool Draining.

Policy clarity beats retry folklore

Decide which scheduler errors are transient, encode that decision in code and docs, then test it on every release.