The production problem
Your queue backlog grows. CPU looks fine. Workers are alive. Yet dispatch throughput drops hard.
One common cause is `MaxAckPending` saturation. The broker pauses delivery because too many messages are unacked, even though workers could continue if budget were sized correctly.
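The suspension mechanism can be illustrated with a toy model. This sketch is not JetStream internals, just the arithmetic of the budget: `simulateBudget` is a hypothetical helper showing that once unacked messages fill the window, every further arrival waits, regardless of worker health.

```go
package main

import "fmt"

// simulateBudget is a toy model of ack-budget saturation: with `budget`
// unacked slots and no acks arriving, the broker delivers until the budget
// is full, then suspends delivery for the remaining arrivals.
func simulateBudget(budget, arrivals int) (delivered, suspended int) {
	inFlight := 0
	for i := 0; i < arrivals; i++ {
		if inFlight < budget {
			inFlight++ // delivered; message is now pending ack
			delivered++
		} else {
			suspended++ // budget exhausted; delivery pauses until an ack frees a slot
		}
	}
	return delivered, suspended
}

func main() {
	d, s := simulateBudget(4, 10)
	fmt.Println(d, s) // 4 6
}
```

The point of the model: suspended deliveries grow with ingress even while CPU stays idle, which is exactly the symptom described above.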
Oversizing is not free either. Huge in-flight windows increase memory pressure, amplify blast radius during failures, and make redelivery storms harder to reason about.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS JetStream consumers | Consumer controls including `MaxAckPending`, `AckWait`, and delivery behavior. | No scheduler-specific sizing workflow tied to handler latency and control-plane state transitions. |
| RabbitMQ consumer prefetch | In-flight message caps and flow-control effect on consumer throughput. | No JetStream ack-redelivery semantics or mixed retry/poison handling in scheduler workloads. |
| Amazon SQS queue quotas | In-flight message limits and symptoms when capacity is exhausted. | No per-consumer ack-budget tuning strategy for broker+app combined failure paths. |
The docs explain queue controls individually. The gap is the integrated tuning path for an AI agent control plane where ack budget, retry behavior, and scheduling state all interact.
Cordum runtime behavior
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Default budget | `natsMaxAckPending()` returns 2048 when env override is missing or invalid. | Baseline throughput is bounded by 2048 unacked messages per consumer. |
| Env override | `NATS_MAX_ACK_PENDING` accepts positive integers. | Operators can raise/lower pending budget without code changes. |
| Hard cap | Values above `50000` are clamped and warning-logged. | Prevents runaway memory/pressure from unsafe budget values. |
| Ack semantics | JetStream subscriptions use `ManualAck`, `AckExplicit`, `AckWait`, and `MaxAckPending` together. | Pending-ack budget directly controls delivery suspension and backlog growth. |
| Poison protection | `MaxDeliver(maxJSRedeliveries)` with `maxJSRedeliveries=100`. | Redelivery storms are bounded so one bad payload does not block the consumer forever. |
| Queue-group mode | Queue subscriptions still deliver each message to one consumer replica. | Sizing must be per consumer, then multiplied by active consumers. |
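The last row matters for capacity math. As a minimal sketch (the helper name `clusterInFlight` is hypothetical, not from the repo), the worst-case unacked total across a queue group is the per-consumer budget times the number of active replicas:

```go
package main

import "fmt"

// clusterInFlight estimates the worst-case number of unacked messages
// across a queue group: each replica holds its own MaxAckPending budget.
func clusterInFlight(perConsumerBudget, replicas int) int {
	return perConsumerBudget * replicas
}

func main() {
	// Default budget of 2048 with 3 scheduler replicas.
	fmt.Println(clusterInFlight(2048, 3)) // 6144
}
```

Memory and blast-radius planning should use this cluster-wide figure, not the per-consumer one.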
Code-level mechanics
Budget defaults and clamp (Go)
```go
package bus

import (
	"log/slog"
	"os"
	"strconv"
	"strings"
)

const maxAckPendingHardLimit = 50000

func natsMaxAckPending() int {
	const defaultMaxAckPending = 2048
	if raw := strings.TrimSpace(os.Getenv("NATS_MAX_ACK_PENDING")); raw != "" {
		if v, err := strconv.Atoi(raw); err == nil && v > 0 {
			if v > maxAckPendingHardLimit {
				slog.Warn("bus: NATS_MAX_ACK_PENDING exceeds hard limit, clamping",
					"requested", v, "clamped_to", maxAckPendingHardLimit)
				return maxAckPendingHardLimit
			}
			return v
		}
	}
	return defaultMaxAckPending
}
```

JetStream subscription options (Go)
```go
opts := []nats.SubOpt{
	nats.ManualAck(),
	nats.AckExplicit(),
	nats.AckWait(b.ackWait),
	nats.MaxAckPending(natsMaxAckPending()),
	nats.MaxDeliver(maxJSRedeliveries), // maxJSRedeliveries = 100
}
```

`MaxAckPending` does not act alone. Tune it while watching `AckWait` and delivery caps, otherwise you move the bottleneck instead of removing it.
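One way to see the interaction: by Little's law, a consumer with `B` pending-ack slots and p95 handler latency `L` seconds cannot sustain more than roughly `B / L` acks per second. This sketch (the function name is illustrative, not from the repo) makes the ceiling explicit:

```go
package main

import "fmt"

// throughputCeiling applies Little's law: with maxAckPending unacked slots
// and a handler that takes p95Seconds per message, sustained throughput is
// capped at roughly maxAckPending / p95Seconds messages per second.
func throughputCeiling(maxAckPending int, p95Seconds float64) float64 {
	return float64(maxAckPending) / p95Seconds
}

func main() {
	// Default budget of 2048 with a 1.2s p95 handler.
	fmt.Printf("%.0f msg/s\n", throughputCeiling(2048, 1.2)) // ~1707 msg/s
}
```

If measured ingress exceeds this ceiling, raising `MaxAckPending` alone only delays saturation; the handler latency term has to move too.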
Guardrail tests in repo (Go)
```go
func TestNatsMaxAckPending_ClampedAtHardLimit(t *testing.T) {
	t.Setenv("NATS_MAX_ACK_PENDING", "100000")
	v := natsMaxAckPending()
	if v != maxAckPendingHardLimit {
		t.Fatalf("expected %d (clamped), got %d", maxAckPendingHardLimit, v)
	}
}

func TestQueueGroupWithJetStream(t *testing.T) {
	// verifies one delivery per message across queue consumers
}
```

Sizing heuristic
Start from observed reality, not round numbers. Measure per-consumer handler latency and sustained ingress rate, then compute an initial budget.
```text
# Rule of thumb (per consumer):
# required_budget ~= p95_handler_seconds * incoming_msgs_per_second * safety_factor
#
# Example:
#   p95_handler   = 1.2s
#   rate          = 600 msg/s
#   safety_factor = 2
#   required_budget ~= 1.2 * 600 * 2 = 1440
#
# Start with 2048, then tune from production telemetry.
```
After initial sizing, tune in small steps and watch queue delay, redelivery rate, and scheduler memory headroom together.
Operator runbook
```bash
# 1) Check current override
kubectl -n cordum exec deploy/cordum-scheduler -- printenv NATS_MAX_ACK_PENDING

# 2) Inspect active JetStream consumers and pending-ack pressure
nats consumer ls CORDUM_JOBS
nats consumer info CORDUM_JOBS <consumer_name>

# 3) Watch signals
#    - NumAckPending close to MaxAckPending for sustained periods
#    - Rising delivery lag / backlog
#    - Repeated redeliveries for the same payload

# 4) Tune and roll
kubectl -n cordum set env deploy/cordum-scheduler NATS_MAX_ACK_PENDING=5000
kubectl -n cordum rollout status deploy/cordum-scheduler

# 5) Re-check after 15-30 minutes
#    - Ack-pending headroom restored
#    - No large memory regression
#    - Redelivery rate stable
```
Limitations and tradeoffs
- Lower budget improves failure visibility but can throttle healthy workloads.
- Higher budget improves short-term throughput but can hide slow-consumer problems.
- `MaxAckPending` tuning does not fix non-idempotent worker behavior.
- Queue-level controls should be paired with DLQ policy and replay discipline.
If `NumAckPending` stays near max for long windows, do not just raise the ceiling. First verify handler latency, stuck deliveries, and poison-message rate.
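The non-idempotency caveat is worth a sketch, because any ack-budget setting implies redeliveries under failure. This is a hypothetical in-memory guard (`processedOnce` is not from the repo, and a production version would need persistence and expiry): redeliveries of the same message ID become no-ops, so side effects cannot double-apply.

```go
package main

import (
	"fmt"
	"sync"
)

// processedOnce is a toy idempotency guard: it records message IDs so a
// redelivered message skips its side effect instead of applying it twice.
type processedOnce struct {
	mu   sync.Mutex
	seen map[string]bool
}

func (p *processedOnce) handle(msgID string, apply func()) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.seen[msgID] {
		return false // duplicate delivery: skip the side effect
	}
	p.seen[msgID] = true
	apply()
	return true
}

func main() {
	guard := &processedOnce{seen: map[string]bool{}}
	applied := 0
	guard.handle("job-42", func() { applied++ })
	guard.handle("job-42", func() { applied++ }) // redelivery: skipped
	fmt.Println(applied) // 1
}
```

With a guard like this in the worker, raising `MaxAckPending` stops being a correctness risk and becomes a pure throughput/memory tradeoff.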
Next step
Run one sizing pass with production data:
1. Measure p95 handler duration per queue consumer.
2. Estimate the required ack budget with a safety factor.
3. Apply a one-step change and observe for 30 minutes.
4. Keep the smallest value that avoids sustained saturation.
Continue with AckWait and Dedup TTL Alignment and Backpressure and Queue Drain.