## The production problem
`no_pool_mapping` is often a configuration error.
Configuration errors are usually permanent until an operator changes state.
Permanent errors and transient errors should not share the same retry policy. If they do, queue time gets burned and incident triage slows down.
In Cordum today, this class is retryable. That choice is defensible during rollout races, but expensive during static misconfiguration.
## What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| AWS Builders' Library: Timeouts, Retries, and Backoff with Jitter | Why retries amplify load and why jitter is required to prevent retry storms. | No scheduler-level guidance for config errors like missing topic-to-pool mappings. |
| Azure Architecture: Retry Pattern | Clear split between transient and non-transient faults with bounded retry strategy. | No concrete mapping for queue dispatch failures where fault type changes over rollout time. |
| gRPC Status Codes | `FAILED_PRECONDITION` guidance: do not retry until system state is explicitly fixed. | No direct model for message-bus schedulers that must choose between requeue and DLQ semantics. |
## Cordum runtime mechanics
| Boundary | Current behavior | Why it matters |
|---|---|---|
| Error origin | `LeastLoadedStrategy.PickSubject` returns `ErrNoPoolMapping` for unmapped topic, invalid `preferred_pool`, or unsatisfied `requires`. | Multiple configuration states collapse into one scheduler error class. |
| Retry classification | `isRetryableSchedulingError` includes `ErrNoPoolMapping` with worker-capacity errors. | Mapping misses follow the same retry path as transient `no_workers` and `pool_overloaded` scenarios. |
| Backoff policy | Retry delay uses `base=1s`, `max=30s`, jitter `<500ms`, exponential growth, and attempt cap `50`. | Good storm control. Slower operator feedback for truly permanent misconfiguration. |
| Terminal outcome | At `attempts >= 50`, scheduler sets FAILED and emits DLQ error code `max_scheduling_retries`. | Terminal record can obscure the original `no_pool_mapping` trigger. |
| Docs alignment | Some docs pages still describe `no_pool_mapping` as immediate fail-fast DLQ behavior. | Runbooks may assume behavior that no longer matches runtime. |
## Scheduler code paths
### Where `ErrNoPoolMapping` is emitted
```go
// core/controlplane/scheduler/strategy_least_loaded.go (excerpt)
if poolHint != "" {
	if !containsPool(topicPools, poolHint) {
		return "", fmt.Errorf("%w: preferred pool %q not mapped for topic %q", ErrNoPoolMapping, poolHint, req.Topic)
	}
}
if len(topicPools) == 0 {
	return "", fmt.Errorf("%w: topic %q", ErrNoPoolMapping, req.Topic)
}
eligiblePools := filterEligiblePools(topicPools, jobRequires, routing.Pools)
if len(eligiblePools) == 0 {
	return "", fmt.Errorf("%w: no pool satisfies requires", ErrNoPoolMapping)
}
```

### Why it gets retried
```go
// core/controlplane/scheduler/engine.go + backoff.go (excerpt)
func isRetryableSchedulingError(err error) bool {
	if errors.Is(err, ErrNoWorkers) ||
		errors.Is(err, ErrPoolOverloaded) ||
		errors.Is(err, ErrTenantLimit) ||
		errors.Is(err, ErrNoPoolMapping) {
		return true
	}
	return false
}

const (
	backoffBase          = 1 * time.Second
	backoffMax           = 30 * time.Second
	backoffJitterMax     = 500 * time.Millisecond
	maxSchedulingRetries = 50
)
```

### How it terminates
```go
// core/controlplane/scheduler/engine.go + engine_test.go (excerpt)
if attempts >= maxSchedulingRetries {
	reason := fmt.Sprintf("max scheduling retries exceeded (attempts=%d)", attempts)
	_ = e.setJobState(jobID, JobStateFailed)
	_ = e.emitDLQWithRetry(jobID, topic, pb.JobStatus_JOB_STATUS_FAILED, reason, "max_scheduling_retries")
	return nil
}

// Test: below cap still retries
jobStore.attempts["job-retry"] = maxSchedulingRetries - 1
err := engine.processJob(ctx, req, "trace-retry")
if _, ok := err.(*retryableError); !ok {
	t.Fatalf("expected retryableError")
}
```

## Validation runbook
Use a tiny synthetic topic to test behavior before changing production retry semantics.
```shell
# 1) Reproduce mapping miss on a test topic
JOB_ID=$(cordumctl job submit --topic job.unmapped.demo --prompt "ping")
cordumctl job status "$JOB_ID" --json

# 2) Verify pool topology and mappings
cordumctl pool list
cordumctl pool get default

# 3) Add mapping and confirm new jobs recover
cordumctl pool topic add default job.unmapped.demo
JOB_ID2=$(cordumctl job submit --topic job.unmapped.demo --prompt "ping after mapping")
cordumctl job status "$JOB_ID2" --json

# 4) If original job is already in DLQ, replay it
cordumctl dlq retry "$JOB_ID"
```
## Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Always retry `no_pool_mapping` (current) | Tolerates short config propagation gaps and bootstrap races. | Adds latency and hides permanent misconfiguration behind retry noise. |
| Immediate fail-fast DLQ | Fast operator signal and cleaner root-cause attribution. | Can create noisy DLQ bursts during normal rollout sequencing. |
| Hybrid split policy | Retries only suspected transient mapping races; permanent mapping faults fail fast. | Needs finer error taxonomy and a state signal (config epoch/hash) in scheduler path. |
- This analysis is code-path based. It does not include live fleet percentiles for mapping-race duration.
- If your deployment model applies pool overlays asynchronously, immediate fail-fast may create avoidable DLQ churn.
- If your mapping is static and operator-managed, long retries delay useful alerts.
## Next steps
Implement this next:
1. Split `ErrNoPoolMapping` into finer classes: topic-missing, preferred-pool-missing, requires-unsatisfied.
2. Mark truly permanent classes as non-retryable so the DLQ reason stays `no_pool_mapping` at first failure.
3. Keep a short, configurable grace retry window for rollout races, then fail fast.
4. Align docs (`CORE.md`, `SCHEDULER_POOL_SPEC.md`) with the actual runtime policy and retry budgets.
Continue with AI Agent Retry Intent Propagation and AI Agent Worker Pool Draining.