The production problem
Your queue backlog grows. CPU looks fine. Workers are alive. Yet dispatch throughput drops hard.
One common cause is `MaxAckPending` saturation. The broker pauses delivery because too many messages are unacked, even though workers could continue if budget were sized correctly.
Oversizing is not free either. Huge in-flight windows increase memory pressure, amplify blast radius during failures, and make redelivery storms harder to reason about.
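The suspension rule is easy to see in isolation. Below is a toy sketch against a local NATS server with JetStream enabled: with `MaxAckPending(1)` and a handler that deliberately never acks, exactly one message arrives and delivery pauses until an ack (or `AckWait` expiry) frees the window. The stream and subject names are placeholders, not Cordum's.

```go
// Toy demonstration of the suspension rule: MaxAckPending(1) plus a
// handler that never acks means one delivery, then the consumer pauses.
package main

import (
	"fmt"
	"log"
	"sync/atomic"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}
	// Placeholder stream/subject for the demo.
	if _, err := js.AddStream(&nats.StreamConfig{Name: "DEMO", Subjects: []string{"demo.>"}}); err != nil {
		log.Fatal(err)
	}

	var delivered int64
	_, err = js.Subscribe("demo.jobs", func(m *nats.Msg) {
		atomic.AddInt64(&delivered, 1) // deliberately never m.Ack()
	}, nats.ManualAck(), nats.AckExplicit(), nats.MaxAckPending(1))
	if err != nil {
		log.Fatal(err)
	}

	for i := 0; i < 5; i++ {
		js.Publish("demo.jobs", []byte("job"))
	}
	time.Sleep(time.Second)
	// Prints 1: the other four messages sit in the stream, undelivered.
	fmt.Println("delivered without acks:", atomic.LoadInt64(&delivered))
}
```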
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS JetStream consumers docs | Defines `MaxAckPending`, default behavior, and the delivery suspension rule when the limit is reached. | Does not map field-level semantics to scheduler-level capacity math or rollout tuning steps. |
| Configuring NATS Server docs | Server and account `max_ack_pending` limits, including JetStream defaults and overrides. | No application-side guidance for reconciling broker ceilings with client/env consumer settings. |
| NATS by Example push consumers (Go) | Concrete push-consumer behavior showing `MaxAckPending` buffering and default pending window. | No production runbook for poison-message pressure, redelivery caps, or multi-replica tuning. |
The docs explain queue controls individually. The gap is the integrated tuning path for an AI agent control plane where ack budget, retry behavior, scheduling state, and broker-side caps all interact.
Cordum runtime behavior
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Default budget | `natsMaxAckPending()` returns 2048 when env override is missing or invalid. | Baseline throughput is bounded by 2048 unacked messages per consumer. |
| Env override | `NATS_MAX_ACK_PENDING` accepts positive integers. | Operators can raise/lower pending budget without code changes. |
| Hard cap | Values above `50000` are clamped and warning-logged. | Prevents runaway memory pressure from unsafe budget values. |
| Ack semantics | JetStream subscriptions use `ManualAck`, `AckExplicit`, `AckWait`, and `MaxAckPending` together. | Pending-ack budget directly controls delivery suspension and backlog growth. |
| Poison protection | `MaxDeliver(maxJSRedeliveries)` with `maxJSRedeliveries=100`. | Redelivery storms are bounded so one bad payload does not block the consumer forever. |
| Queue-group mode | Queue subscriptions still deliver each message to one consumer replica. | Sizing must be per consumer, then multiplied by active consumers. |
| Broker-side ceiling | NATS server/account `max_ack_pending` limits can bound the effective pending-ack budget regardless of app defaults. | Your usable window is the minimum of app value, account limit, and server limit. |
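The last two rows reduce to simple capacity arithmetic: the usable per-consumer window is the minimum of the app value and the broker-side ceilings, and worst-case cluster in-flight scales with active queue-group replicas. A minimal sketch; the helper name and the example limits are illustrative, not Cordum APIs.

```go
// Capacity arithmetic from the table above: effective per-consumer window
// is min(app, account, server), and cluster in-flight multiplies by the
// number of active queue-group replicas. Example limits are illustrative.
package main

import "fmt"

// effectiveBudget treats non-positive limits as "unlimited".
func effectiveBudget(app, account, server int) int {
	min := app
	if account > 0 && account < min {
		min = account
	}
	if server > 0 && server < min {
		min = server
	}
	return min
}

func main() {
	perConsumer := effectiveBudget(5000, 4096, 10000) // app, account, server
	replicas := 3                                     // active queue-group consumers
	fmt.Println("effective per-consumer budget:", perConsumer)         // 4096
	fmt.Println("worst-case cluster in-flight:", perConsumer*replicas) // 12288
}
```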
Code-level mechanics
Budget defaults and clamp (Go)
```go
const maxAckPendingHardLimit = 50000

func natsMaxAckPending() int {
	const defaultMaxAckPending = 2048
	if raw := strings.TrimSpace(os.Getenv("NATS_MAX_ACK_PENDING")); raw != "" {
		if v, err := strconv.Atoi(raw); err == nil && v > 0 {
			if v > maxAckPendingHardLimit {
				slog.Warn("bus: NATS_MAX_ACK_PENDING exceeds hard limit, clamping",
					"requested", v, "clamped_to", maxAckPendingHardLimit)
				return maxAckPendingHardLimit
			}
			return v
		}
	}
	return defaultMaxAckPending
}
```

JetStream subscription options (Go)
```go
opts := []nats.SubOpt{
	nats.ManualAck(),
	nats.AckExplicit(),
	nats.AckWait(b.ackWait),
	nats.MaxAckPending(natsMaxAckPending()),
	nats.MaxDeliver(maxJSRedeliveries), // 100
}
```

`MaxAckPending` does not act alone. Tune it while watching `AckWait` and delivery caps; otherwise you move the bottleneck instead of removing it.
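One concrete interaction to check: with a full pending window, the last delivered message waits roughly `MaxAckPending / drain rate` seconds before a handler reaches it. If `AckWait` is shorter than that, the server redelivers messages the client is still holding, and throughput collapses into redelivery churn. A back-of-envelope sketch with illustrative numbers:

```go
// Back-of-envelope check for the AckWait / MaxAckPending interaction.
// With a full window of 2048 and a sustained drain rate of 50 acks/sec,
// the last message waits ~41s unacked; an AckWait of 30s would redeliver
// it while the client still holds it. All numbers are illustrative.
package main

import (
	"fmt"
	"time"
)

func main() {
	maxAckPending := 2048       // in-flight window size
	drainPerSec := 50.0         // sustained acks per second across handlers
	ackWait := 30 * time.Second // server-side redelivery timeout

	worstWait := time.Duration(float64(maxAckPending) / drainPerSec * float64(time.Second))
	fmt.Printf("worst-case unacked wait: %s (AckWait=%s)\n", worstWait, ackWait)
	if worstWait >= ackWait {
		fmt.Println("risk: spurious redeliveries; raise AckWait or shrink MaxAckPending")
	}
}
```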
Guardrail tests in repo (Go)
```go
func TestNatsMaxAckPending_ClampedAtHardLimit(t *testing.T) {
	t.Setenv("NATS_MAX_ACK_PENDING", "100000")
	v := natsMaxAckPending()
	if v != maxAckPendingHardLimit {
		t.Fatalf("expected %d (clamped), got %d", maxAckPendingHardLimit, v)
	}
}

func TestQueueGroupWithJetStream(t *testing.T) {
	// verifies one delivery per message across queue consumers
}
```

Sizing heuristic
Start from observed reality, not round numbers. Measure per-consumer handler latency and sustained ingress rate, then compute an initial budget and reconcile it with account/server ceilings.
```text
# Rule of thumb (per consumer):
#   required_budget ~= p95_handler_seconds * incoming_msgs_per_second * safety_factor
#
# Example:
#   p95_handler   = 1.2s
#   rate          = 600 msg/s
#   safety_factor = 2
#   required_budget ~= 1.2 * 600 * 2 = 1440
#
# Effective budget is capped by broker limits:
#   effective_budget = min(consumer_max_ack_pending, account_max_ack_pending, server_max_ack_pending)
#
# Start with 2048, verify broker-side caps, then tune from production telemetry.
```
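The same rule of thumb as code, if you want it inside a sizing script. A sketch; the constants mirror the worked example above, so substitute your measured telemetry.

```go
// The sizing rule of thumb above as code: p95 handler latency times
// ingress rate times a safety factor, rounded up. Constants mirror the
// worked example; replace them with production measurements.
package main

import (
	"fmt"
	"math"
)

func requiredBudget(p95Seconds, msgsPerSecond, safetyFactor float64) int {
	return int(math.Ceil(p95Seconds * msgsPerSecond * safetyFactor))
}

func main() {
	budget := requiredBudget(1.2, 600, 2) // -> 1440, matching the worked example
	fmt.Println("required per-consumer budget:", budget)
}
```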
If requested values exceed broker limits, effective budget stays capped and symptoms will look like under-sizing. After initial sizing, tune in small steps and watch queue delay, redelivery rate, and scheduler memory headroom together.
Operator runbook
```bash
# 0) Check broker-side max_ack_pending limits (server/account)
kubectl -n nats get configmap nats-config -o yaml | rg max_ack_pending

# 1) Check current override
kubectl -n cordum exec deploy/cordum-scheduler -- printenv NATS_MAX_ACK_PENDING

# 2) Inspect active JetStream consumers and pending-ack pressure
nats consumer ls CORDUM_JOBS
nats consumer info CORDUM_JOBS <consumer_name>

# 3) Watch signals:
#    - NumAckPending close to MaxAckPending for sustained periods
#    - Rising delivery lag / backlog
#    - Repeated redeliveries for the same payload

# 4) Tune and roll
kubectl -n cordum set env deploy/cordum-scheduler NATS_MAX_ACK_PENDING=5000
kubectl -n cordum rollout status deploy/cordum-scheduler

# 5) Re-check after 15-30 minutes:
#    - Ack pending headroom restored
#    - No large memory regression
#    - Redelivery rate stable
```
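Step 3 can be automated instead of eyeballed. A minimal watcher sketch using the nats.go client; the stream/consumer names and the 80% alert threshold are assumptions, not Cordum defaults.

```go
// Poll-based watcher for runbook step 3: flag sustained pending-ack
// saturation instead of re-running `nats consumer info` by hand.
// Stream/consumer names and the 80% threshold are placeholders.
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	for range time.Tick(10 * time.Second) {
		info, err := js.ConsumerInfo("CORDUM_JOBS", "scheduler") // placeholder names
		if err != nil {
			log.Println("consumer info:", err)
			continue
		}
		limit := info.Config.MaxAckPending
		if limit > 0 && float64(info.NumAckPending) >= 0.8*float64(limit) {
			log.Printf("saturation: NumAckPending=%d / MaxAckPending=%d",
				info.NumAckPending, limit)
		}
	}
}
```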
Limitations and tradeoffs
- Lower budget improves failure visibility but can throttle healthy workloads.
- Higher budget improves short-term throughput but can hide slow-consumer problems.
- `MaxAckPending` tuning does not fix non-idempotent worker behavior.
- Queue-level controls should be paired with DLQ policy and replay discipline (a minimal advisory-listener sketch follows below).
If `NumAckPending` stays near max for long windows, do not just raise the ceiling. First verify handler latency, stuck deliveries, and poison-message rate.
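For the DLQ pairing mentioned above: when a message exhausts `MaxDeliver`, JetStream emits a max-deliveries advisory on `$JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.<stream>.<consumer>`. A minimal listener sketch; the handling shown (logging the stream sequence) is a placeholder for your replay tooling.

```go
// Listen for JetStream max-deliveries advisories so poison messages are
// recorded or re-routed instead of silently dropping out of delivery.
package main

import (
	"encoding/json"
	"log"

	"github.com/nats-io/nats.go"
)

// Subset of the io.nats.jetstream.advisory.v1.max_deliver payload.
type maxDeliverAdvisory struct {
	Stream    string `json:"stream"`
	Consumer  string `json:"consumer"`
	StreamSeq uint64 `json:"stream_seq"`
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	subj := "$JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.CORDUM_JOBS.*"
	_, err = nc.Subscribe(subj, func(m *nats.Msg) {
		var adv maxDeliverAdvisory
		if err := json.Unmarshal(m.Data, &adv); err != nil {
			log.Println("bad advisory:", err)
			return
		}
		// Hand off to your DLQ/replay tooling; here we just log the pointer
		// back into the stream so the payload can be fetched and inspected.
		log.Printf("poison message: stream=%s seq=%d consumer=%s",
			adv.Stream, adv.StreamSeq, adv.Consumer)
	})
	if err != nil {
		log.Fatal(err)
	}
	select {} // block forever; a real service would wire this into its lifecycle
}
```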
Next step
Run one sizing pass with production data:
1. Measure p95 handler duration per queue consumer.
2. Estimate the required ack budget with a safety factor.
3. Apply a single-step change and observe for 30 minutes.
4. Keep the smallest value that avoids sustained saturation.
Continue with *AckWait and Dedup TTL Alignment* and *Backpressure and Queue Drain*.