Deep Dive

AI Agent MaxAckPending Tuning

If pending-ack budget is wrong, delivery pauses before your workers are actually saturated.

10 min read · Mar 2026
TL;DR
  • Low `MaxAckPending` can suspend delivery under load even when workers are healthy.
  • Cordum defaults to the equivalent of `NATS_MAX_ACK_PENDING=2048` and clamps overrides at `50000`.
  • `MaxDeliver=100` prevents poison messages from permanently occupying ack budget.
  • Tuning should be based on observed handler latency and per-consumer throughput, not guesswork.
Default budget

The JetStream consumer budget defaults to 2048 pending acks per consumer in the current Cordum bus code.

Hard clamp

An environment override above `50000` is clamped to the hard limit and logged with a warning.

Failure mode

Poison messages can consume pending-ack slots; the `MaxDeliver` cap limits the blast radius of queue starvation.

Scope

This guide focuses on JetStream consumer flow control in the Cordum bus layer. It does not replace worker idempotency or job-level retry policy design.

The production problem

Your queue backlog grows. CPU looks fine. Workers are alive. Yet dispatch throughput drops hard.

One common cause is `MaxAckPending` saturation. The broker pauses delivery because too many messages are unacked, even though workers could continue if budget were sized correctly.

Oversizing is not free either. Huge in-flight windows increase memory pressure, amplify blast radius during failures, and make redelivery storms harder to reason about.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| NATS JetStream consumers | Consumer controls including `MaxAckPending`, `AckWait`, and delivery behavior. | No scheduler-specific sizing workflow tied to handler latency and control-plane state transitions. |
| RabbitMQ consumer prefetch | In-flight message caps and flow-control effect on consumer throughput. | No JetStream ack-redelivery semantics or mixed retry/poison handling in scheduler workloads. |
| Amazon SQS queue quotas | In-flight message limits and symptoms when capacity is exhausted. | No per-consumer ack-budget tuning strategy for broker+app combined failure paths. |

The docs explain queue controls individually. The gap is the integrated tuning path for an AI agent control plane where ack budget, retry behavior, and scheduling state all interact.

Cordum runtime behavior

| Boundary | Current behavior | Operational impact |
| --- | --- | --- |
| Default budget | `natsMaxAckPending()` returns 2048 when the env override is missing or invalid. | Baseline throughput is bounded by 2048 unacked messages per consumer. |
| Env override | `NATS_MAX_ACK_PENDING` accepts positive integers. | Operators can raise or lower the pending budget without code changes. |
| Hard cap | Values above `50000` are clamped and warning-logged. | Prevents runaway memory pressure from unsafe budget values. |
| Ack semantics | JetStream subscriptions use `ManualAck`, `AckExplicit`, `AckWait`, and `MaxAckPending` together. | Pending-ack budget directly controls delivery suspension and backlog growth. |
| Poison protection | `MaxDeliver(maxJSRedeliveries)` with `maxJSRedeliveries=100`. | Redelivery storms are bounded so one bad payload does not block the consumer forever. |
| Queue-group mode | Queue subscriptions still deliver each message to one consumer replica. | Sizing must be per consumer, then multiplied by active consumers. |

Code-level mechanics

Budget defaults and clamp (Go)

core/infra/bus/nats.go
Go
const maxAckPendingHardLimit = 50000

func natsMaxAckPending() int {
  const defaultMaxAckPending = 2048
  if raw := strings.TrimSpace(os.Getenv("NATS_MAX_ACK_PENDING")); raw != "" {
    if v, err := strconv.Atoi(raw); err == nil && v > 0 {
      if v > maxAckPendingHardLimit {
        slog.Warn("bus: NATS_MAX_ACK_PENDING exceeds hard limit, clamping",
          "requested", v, "clamped_to", maxAckPendingHardLimit)
        return maxAckPendingHardLimit
      }
      return v
    }
  }
  return defaultMaxAckPending
}

JetStream subscription options (Go)

core/infra/bus/nats.go
Go
opts := []nats.SubOpt{
  nats.ManualAck(),
  nats.AckExplicit(),
  nats.AckWait(b.ackWait),
  nats.MaxAckPending(natsMaxAckPending()),
  nats.MaxDeliver(maxJSRedeliveries), // 100
}

`MaxAckPending` does not act alone. It should be tuned while watching `AckWait` and delivery caps; otherwise you move the bottleneck instead of removing it.

Guardrail tests in repo (Go)

core/infra/bus/nats_test.go
Go
func TestNatsMaxAckPending_ClampedAtHardLimit(t *testing.T) {
  t.Setenv("NATS_MAX_ACK_PENDING", "100000")
  v := natsMaxAckPending()
  if v != maxAckPendingHardLimit {
    t.Fatalf("expected %d (clamped), got %d", maxAckPendingHardLimit, v)
  }
}

func TestQueueGroupWithJetStream(t *testing.T) {
  // verifies one delivery per message across queue consumers
}

Sizing heuristic

Start from observed reality, not round numbers. Measure per-consumer handler latency and sustained ingress rate, then compute an initial budget.

maxackpending_sizing.txt
Bash
# Rule of thumb (per consumer):
# required_budget ~= p95_handler_seconds * incoming_msgs_per_second * safety_factor
#
# Example:
# p95_handler = 1.2s
# rate = 600 msg/s
# safety_factor = 2
# required_budget ~= 1.2 * 600 * 2 = 1440
#
# Start with 2048, then tune from production telemetry.

After initial sizing, tune in small steps and watch queue delay, redelivery rate, and scheduler memory headroom together.
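The rule of thumb above can be expressed directly in code. A sketch of the calculation, floored at the bus default and capped at the same hard limit the bus code enforces (the function name is illustrative, not part of the repo):

```go
package main

import "fmt"

const (
	defaultBudget = 2048  // bus default when no override is set
	hardLimit     = 50000 // clamp applied to env overrides
)

// requiredBudget applies the sizing heuristic:
// p95 handler latency (seconds) * ingress rate (msg/s) * safety factor,
// floored at the default budget and capped at the hard limit.
func requiredBudget(p95Seconds, msgsPerSecond, safetyFactor float64) int {
	b := int(p95Seconds * msgsPerSecond * safetyFactor)
	if b < defaultBudget {
		return defaultBudget
	}
	if b > hardLimit {
		return hardLimit
	}
	return b
}

func main() {
	// Worked example from the sizing file: 1.2s * 600 msg/s * 2 = 1440,
	// which is below the 2048 default, so start at the default.
	fmt.Println(requiredBudget(1.2, 600, 2)) // 2048
}
```

The floor encodes the article's advice to start at 2048; the cap mirrors the clamp in `natsMaxAckPending()` so the computed value is always one the bus would actually accept.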

Operator runbook

maxackpending_runbook.sh
Bash
# 1) Check current override
kubectl -n cordum exec deploy/cordum-scheduler -- printenv NATS_MAX_ACK_PENDING

# 2) Inspect active JetStream consumers and pending-ack pressure
nats consumer ls CORDUM_JOBS
nats consumer info CORDUM_JOBS <consumer_name>

# 3) Watch signals
# - NumAckPending close to MaxAckPending for sustained periods
# - Rising delivery lag / backlog
# - Repeated redeliveries for same payload

# 4) Tune and roll
kubectl -n cordum set env deploy/cordum-scheduler NATS_MAX_ACK_PENDING=5000
kubectl -n cordum rollout status deploy/cordum-scheduler

# 5) Re-check after 15-30 minutes
# - Ack pending headroom restored
# - No large memory regression
# - Redelivery rate stable

Limitations and tradeoffs

  • Lower budget improves failure visibility but can throttle healthy workloads.
  • Higher budget improves short-term throughput but can hide slow-consumer problems.
  • MaxAckPending tuning does not fix non-idempotent worker behavior.
  • Queue-level controls should be paired with DLQ policy and replay discipline.

If `NumAckPending` stays near max for long windows, do not just raise the ceiling. First verify handler latency, stuck deliveries, and poison-message rate.

Next step

Run one sizing pass with production data:

  1. Measure p95 handler duration per queue consumer.
  2. Estimate required ack budget with a safety factor.
  3. Apply a one-step change and observe for 30 minutes.
  4. Keep the smallest value that avoids sustained saturation.

Continue with "AckWait and Dedup TTL Alignment" and "Backpressure and Queue Drain".

Queue health first

Throughput tuning is easy to overdo. The goal is stable delivery under failure, not a benchmark screenshot.