The production problem
Backlog is not the incident. Uncontrolled backlog growth is the incident. Teams usually notice it only after retries stack, workers saturate, and latency SLOs collapse.
In autonomous systems, overload is costlier than in conventional request-serving systems: one overloaded queue can trigger duplicate side effects, delayed approvals, and cascading policy violations from stale retries.
Queue depth alone is insufficient. You need admission control, explicit overload reason codes, and a deterministic drain path.
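One way to make those reason codes explicit at the admission layer is a small decision type. This is a sketch, not the runtime's actual API: `admit`, `AdmissionDecision`, and the parameter names are hypothetical; only the `no_workers` and `pool_overloaded` codes come from the dispatch-routing rules below.

```go
package main

import "fmt"

// OverloadReason is an explicit reason code attached to every
// deferred or shed dispatch, so retries are never blind.
type OverloadReason string

const (
	ReasonNone           OverloadReason = ""
	ReasonNoWorkers      OverloadReason = "no_workers"
	ReasonPoolOverloaded OverloadReason = "pool_overloaded"
)

// AdmissionDecision pairs an admit/deny outcome with its reason,
// giving downstream logic a deterministic drain path.
type AdmissionDecision struct {
	Admit  bool
	Reason OverloadReason
}

// admit is a hypothetical ingress check: deny with a precise
// reason code instead of letting the queue absorb the pressure.
func admit(idleWorkers int, utilization, threshold float64) AdmissionDecision {
	switch {
	case idleWorkers == 0:
		return AdmissionDecision{Admit: false, Reason: ReasonNoWorkers}
	case utilization >= threshold:
		return AdmissionDecision{Admit: false, Reason: ReasonPoolOverloaded}
	default:
		return AdmissionDecision{Admit: true, Reason: ReasonNone}
	}
}

func main() {
	fmt.Println(admit(0, 0.5, 0.9))  // {false no_workers}
	fmt.Println(admit(4, 0.95, 0.9)) // {false pool_overloaded}
	fmt.Println(admit(4, 0.5, 0.9))  // {true }
}
```

The reason code travels with the rejection, so the retry layer can distinguish "no capacity exists" from "capacity exists but is saturated".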
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| RabbitMQ Flow Control | Clear broker behavior under pressure and producer throttling semantics. | No policy-aware queue drain strategy for autonomous side effects. |
| Google Pub/Sub flow control for subscribers | Client-side flow controls to limit outstanding messages/bytes. | No governance model for dispatch shedding in agent control planes. |
| Apache Kafka monitoring | Concrete lag and buffer-wait metrics (`records-lag-max`, buffer pool wait signals). | No run-level decision framework for overloaded autonomous workflows. |
Backpressure model
Stable drain behavior requires layered decisions. Each layer should reduce pressure before the next layer is forced to compensate.
| Layer | Required rule | Failure if missing |
|---|---|---|
| Ingress admission | Throttle or defer before worker pools cross overload threshold. | Queue growth outruns dispatch capacity. |
| Dispatch routing | Detect `no_workers` and `pool_overloaded` as explicit reason codes. | Blind retries amplify load with no new capacity. |
| Retry budget | Use bounded retries with backoff windows and terminal DLQ path. | Infinite retry loops consume resources and hide root cause. |
| Drain governance | Replay only with idempotency and policy checks still active. | Backlog cleanup creates duplicate or unsafe side effects. |
Cordum runtime implications
| Implication | Current behavior | Why it matters |
|---|---|---|
| Overload detection | Worker is overloaded at >=90% parallel-job utilization or >=90% CPU/GPU usage | Dispatch can classify pressure early and avoid unstable worker saturation. |
| Retry boundaries | Max scheduling retries is 50 with exponential 1s-30s backoff | Bounded retries cap damage before terminal DLQ handling. |
| No-worker cooldown | `retryDelayNoWorkers` is 2s for no-capacity branches | Prevents hot-loop retry storms while waiting for capacity recovery. |
| Bus durability semantics | JetStream at-least-once delivery with AckWait 10m and MaxDeliver 100 | Drain logic must be idempotent under redelivery and delayed acks. |
| Redelivery-safe handlers | Handlers use Redis lock + retryable NAK pattern | Queue drain keeps correctness under transient store and lock failures. |
Translation: keep queues moving by design, not by retry noise. Overload should route into predictable states and recoverable operations.
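The redelivery-safe handler row above can be sketched with an in-memory stand-in for the Redis lock. `Outcome`, `memoryLocks`, and `handle` are hypothetical names; a real handler would use the broker's ack/NAK API and a distributed lock with a TTL, but the shape of the decision is the same.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Outcome mirrors a JetStream consumer's choices: ack ends delivery,
// NAK asks the server to redeliver later.
type Outcome string

const (
	Ack Outcome = "ack"
	Nak Outcome = "nak"
)

// memoryLocks stands in for the Redis lock from the runtime table:
// TryLock succeeds once per key, so concurrent redeliveries of the
// same message collapse to a single execution.
type memoryLocks struct {
	mu   sync.Mutex
	held map[string]bool
}

func (l *memoryLocks) TryLock(key string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.held[key] {
		return false
	}
	l.held[key] = true
	return true
}

func (l *memoryLocks) Unlock(key string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.held, key)
}

var errTransient = errors.New("transient store failure")

// handle sketches the redelivery-safe pattern: lock on the message
// ID, ack duplicates without re-running side effects, and NAK (never
// ack) on transient failures so at-least-once delivery can retry.
func handle(locks *memoryLocks, msgID string, execute func() error) Outcome {
	if !locks.TryLock(msgID) {
		// Already executed (or in flight) under the lock, so acking
		// the redelivery is safe: the side effect ran at most once.
		return Ack
	}
	if err := execute(); err != nil {
		locks.Unlock(msgID) // release so a later redelivery can retry
		return Nak
	}
	return Ack
}

func main() {
	locks := &memoryLocks{held: map[string]bool{}}
	fmt.Println(handle(locks, "job-1", func() error { return nil }))          // ack
	fmt.Println(handle(locks, "job-1", func() error { return nil }))          // ack (duplicate, no re-run)
	fmt.Println(handle(locks, "job-2", func() error { return errTransient })) // nak
}
```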
Implementation examples
Drain admission function (Go)
```go
type DrainAction string

const (
	DrainDispatch DrainAction = "dispatch"
	DrainDefer    DrainAction = "defer"
	DrainShed     DrainAction = "shed"
)

// chooseDrainAction applies the overload threshold from the runtime
// table: at >=90% utilization, shed when lag has already passed its
// cap, otherwise defer; below the threshold, dispatch normally.
func chooseDrainAction(activeJobs int, maxParallel int, lag int, maxLag int) DrainAction {
	utilization := float64(activeJobs) / float64(maxParallel)
	if utilization >= 0.90 {
		if lag > maxLag {
			return DrainShed
		}
		return DrainDefer
	}
	return DrainDispatch
}
```

Backpressure thresholds (YAML)
```yaml
backpressure:
  overload_threshold:
    worker_utilization: 0.90
    cpu_percent: 90
    gpu_percent: 90
  retry_budget:
    max_attempts: 50
    backoff_base: 1s
    backoff_max: 30s
  no_worker_delay: 2s
```

Drain decision audit record (JSON)
```json
{
  "topic": "job.remediation.execute",
  "queue_depth": 1840,
  "records_lag_max": 12600,
  "dispatch_decision": "defer",
  "reason_code": "pool_overloaded",
  "next_retry_sec": 2,
  "policy_checked": true,
  "idempotency_required": true
}
```

Limitations and tradeoffs
- Aggressive shedding protects workers but can increase tail latency for non-critical jobs.
- Conservative retry budgets reduce duplicate work but may under-recover transient outages.
- Admission controls require accurate capacity signals; stale metrics cause poor decisions.
- Drain automation still needs manual guardrails for high-risk external actions.
Next step
Run this in one sprint:
1. Define overload signals (`queue_depth`, `records_lag_max`, utilization) per critical topic.
2. Set admission decisions for each topic: dispatch, defer, or shed.
3. Cap retry budgets and map terminal overload failures to explicit reason codes.
4. Simulate one overload game day and verify drain behavior end-to-end.
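Step 4's game day can first be sanity-checked offline by sweeping synthetic load through the admission rule and tallying the decisions. The function is repeated here so the sketch runs standalone; the lag-growth model is made up for illustration.

```go
package main

import "fmt"

// chooseDrainAction mirrors the drain admission function above.
func chooseDrainAction(activeJobs, maxParallel, lag, maxLag int) string {
	utilization := float64(activeJobs) / float64(maxParallel)
	if utilization >= 0.90 {
		if lag > maxLag {
			return "shed"
		}
		return "defer"
	}
	return "dispatch"
}

func main() {
	const maxParallel, maxLag = 10, 10000
	counts := map[string]int{}
	// Ramp active jobs 0..10 while synthetic lag climbs past the cap,
	// then check that all three decision states are actually reachable.
	for active := 0; active <= maxParallel; active++ {
		lag := active * 1100 // synthetic lag growth
		counts[chooseDrainAction(active, maxParallel, lag, maxLag)]++
	}
	fmt.Println(counts) // map[defer:1 dispatch:9 shed:1]
}
```

If the sweep never produces a `defer` or `shed`, the thresholds are miscalibrated before the live drill even starts.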
Continue with *AI Agent Rate Limiting and Overload Control* and *AI Agent Poison Message Handling*.