The production problem
Backlog is not the incident. Uncontrolled backlog growth is the incident. Teams usually notice it only after retries stack, workers saturate, and latency SLOs collapse.
In autonomous systems, overload is costlier than in conventional request-serving systems: one overloaded queue can trigger duplicate side effects, delayed approvals, and cascading policy violations from stale retries.
Queue depth alone is insufficient. You need admission control, explicit overload reason codes, and a deterministic drain path.
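One way to make those reason codes explicit at the admission layer is a small decision type. This is a sketch, not the runtime's actual API: `admit`, `AdmissionDecision`, and the parameter names are hypothetical; only the `no_workers` and `pool_overloaded` codes come from the dispatch-routing rules below.

```go
package main

import "fmt"

// OverloadReason is an explicit reason code attached to every
// deferred or shed dispatch, so retries are never blind.
type OverloadReason string

const (
	ReasonNone           OverloadReason = ""
	ReasonNoWorkers      OverloadReason = "no_workers"
	ReasonPoolOverloaded OverloadReason = "pool_overloaded"
)

// AdmissionDecision pairs an admit/deny outcome with its reason,
// giving downstream logic a deterministic drain path.
type AdmissionDecision struct {
	Admit  bool
	Reason OverloadReason
}

// admit is a hypothetical ingress check: deny with a precise
// reason code instead of letting the queue absorb the pressure.
func admit(idleWorkers int, utilization, threshold float64) AdmissionDecision {
	switch {
	case idleWorkers == 0:
		return AdmissionDecision{Admit: false, Reason: ReasonNoWorkers}
	case utilization >= threshold:
		return AdmissionDecision{Admit: false, Reason: ReasonPoolOverloaded}
	default:
		return AdmissionDecision{Admit: true, Reason: ReasonNone}
	}
}

func main() {
	fmt.Println(admit(0, 0.5, 0.9))  // {false no_workers}
	fmt.Println(admit(4, 0.95, 0.9)) // {false pool_overloaded}
	fmt.Println(admit(4, 0.5, 0.9))  // {true }
}
```

The reason code travels with the rejection, so the retry layer can distinguish "no capacity exists" from "capacity exists but is saturated".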
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| RabbitMQ Flow Control | Clear broker behavior under pressure and producer throttling semantics. | No policy-aware queue drain strategy for autonomous side effects. |
| Google Pub/Sub flow control for subscribers | Client-side flow controls to limit outstanding messages/bytes. | No governance model for dispatch shedding in agent control planes. |
| Apache Kafka monitoring | Concrete lag and buffer-wait metrics (`records-lag-max`, buffer pool wait signals). | No run-level decision framework for overloaded autonomous workflows. |
Backpressure model
Stable drain behavior requires layered decisions. Each layer should reduce pressure before the next layer is forced to compensate.
| Layer | Required rule | Failure if missing |
|---|---|---|
| Ingress admission | Throttle or defer before worker pools cross overload threshold. | Queue growth outruns dispatch capacity. |
| Dispatch routing | Detect `no_workers` and `pool_overloaded` as explicit reason codes. | Blind retries amplify load with no new capacity. |
| Retry budget | Use bounded retries with backoff windows and terminal DLQ path. | Infinite retry loops consume resources and hide root cause. |
| Drain governance | Replay only with idempotency and policy checks still active. | Backlog cleanup creates duplicate or unsafe side effects. |
Cordum runtime implications
| Implication | Current behavior | Why it matters |
|---|---|---|
| Overload detection | Worker is overloaded at >=90% parallel-job utilization or >=90% CPU/GPU usage | Dispatch can classify pressure early and avoid unstable worker saturation. |
| Retry boundaries | Max scheduling retries is 50 with exponential 1s-30s backoff | Bounded retries cap damage before terminal DLQ handling. |
| No-worker cooldown | `retryDelayNoWorkers` is 2s for no-capacity branches | Prevents hot-loop retry storms while waiting for capacity recovery. |
| Bus durability semantics | JetStream at-least-once delivery with AckWait 10m and MaxDeliver 100 | Drain logic must be idempotent under redelivery and delayed acks. |
| Redelivery-safe handlers | Handlers use Redis lock + retryable NAK pattern | Queue drain keeps correctness under transient store and lock failures. |
Translation: keep queues moving by design, not by retry noise. Overload should route into predictable states and recoverable operations.
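The redelivery-safe handler row above can be sketched with an in-memory stand-in for the Redis lock. `Outcome`, `memoryLocks`, and `handle` are hypothetical names; a real handler would use the broker's ack/NAK API and a distributed lock with a TTL, but the shape of the decision is the same.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Outcome mirrors a JetStream consumer's choices: ack ends delivery,
// NAK asks the server to redeliver later.
type Outcome string

const (
	Ack Outcome = "ack"
	Nak Outcome = "nak"
)

// memoryLocks stands in for the Redis lock from the runtime table:
// TryLock succeeds once per key, so concurrent redeliveries of the
// same message collapse to a single execution.
type memoryLocks struct {
	mu   sync.Mutex
	held map[string]bool
}

func (l *memoryLocks) TryLock(key string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.held[key] {
		return false
	}
	l.held[key] = true
	return true
}

func (l *memoryLocks) Unlock(key string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.held, key)
}

var errTransient = errors.New("transient store failure")

// handle sketches the redelivery-safe pattern: lock on the message
// ID, ack duplicates without re-running side effects, and NAK (never
// ack) on transient failures so at-least-once delivery can retry.
func handle(locks *memoryLocks, msgID string, execute func() error) Outcome {
	if !locks.TryLock(msgID) {
		// Already executed (or in flight) under the lock, so acking
		// the redelivery is safe: the side effect ran at most once.
		return Ack
	}
	if err := execute(); err != nil {
		locks.Unlock(msgID) // release so a later redelivery can retry
		return Nak
	}
	return Ack
}

func main() {
	locks := &memoryLocks{held: map[string]bool{}}
	fmt.Println(handle(locks, "job-1", func() error { return nil }))          // ack
	fmt.Println(handle(locks, "job-1", func() error { return nil }))          // ack (duplicate, no re-run)
	fmt.Println(handle(locks, "job-2", func() error { return errTransient })) // nak
}
```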
Implementation examples
Drain admission function (Go)
```go
type DrainAction string

const (
	DrainDispatch DrainAction = "dispatch"
	DrainDefer    DrainAction = "defer"
	DrainShed     DrainAction = "shed"
)

// chooseDrainAction applies the overload threshold from the runtime
// table: at >=90% utilization, shed when lag has already passed its
// cap, otherwise defer; below the threshold, dispatch normally.
func chooseDrainAction(activeJobs int, maxParallel int, lag int, maxLag int) DrainAction {
	utilization := float64(activeJobs) / float64(maxParallel)
	if utilization >= 0.90 {
		if lag > maxLag {
			return DrainShed
		}
		return DrainDefer
	}
	return DrainDispatch
}
```

Backpressure thresholds (YAML)
```yaml
backpressure:
  overload_threshold:
    worker_utilization: 0.90
    cpu_percent: 90
    gpu_percent: 90
  retry_budget:
    max_attempts: 50
    backoff_base: 1s
    backoff_max: 30s
  no_worker_delay: 2s
```

Drain decision audit record (JSON)
```json
{
  "topic": "job.remediation.execute",
  "queue_depth": 1840,
  "records_lag_max": 12600,
  "dispatch_decision": "defer",
  "reason_code": "pool_overloaded",
  "next_retry_sec": 2,
  "policy_checked": true,
  "idempotency_required": true
}
```

Limitations and tradeoffs
- Aggressive shedding protects workers but can increase tail latency for non-critical jobs.
- Conservative retry budgets reduce duplicate work but may under-recover transient outages.
- Admission controls require accurate capacity signals; stale metrics cause poor decisions.
- Drain automation still needs manual guardrails for high-risk external actions.
Next step
Run this in one sprint:
1. Define overload signals (`queue_depth`, `records_lag_max`, utilization) per critical topic.
2. Set admission decisions for each topic: dispatch, defer, or shed.
3. Cap retry budgets and map terminal overload failures to explicit reason codes.
4. Simulate one overload game day and verify drain behavior end-to-end.
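Step 4's game day can first be sanity-checked offline by sweeping synthetic load through the admission rule and tallying the decisions. The function is repeated here so the sketch runs standalone; the lag-growth model is made up for illustration.

```go
package main

import "fmt"

// chooseDrainAction mirrors the drain admission function above.
func chooseDrainAction(activeJobs, maxParallel, lag, maxLag int) string {
	utilization := float64(activeJobs) / float64(maxParallel)
	if utilization >= 0.90 {
		if lag > maxLag {
			return "shed"
		}
		return "defer"
	}
	return "dispatch"
}

func main() {
	const maxParallel, maxLag = 10, 10000
	counts := map[string]int{}
	// Ramp active jobs 0..10 while synthetic lag climbs past the cap,
	// then check that all three decision states are actually reachable.
	for active := 0; active <= maxParallel; active++ {
		lag := active * 1100 // synthetic lag growth
		counts[chooseDrainAction(active, maxParallel, lag, maxLag)]++
	}
	fmt.Println(counts) // map[defer:1 dispatch:9 shed:1]
}
```

If the sweep never produces a `defer` or `shed`, the thresholds are miscalibrated before the live drill even starts.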
Continue with *AI Agent Rate Limiting and Overload Control* and *AI Agent Poison Message Handling*.