The production problem
Your queue backlog grows. CPU looks fine. Workers are alive. Yet dispatch throughput drops hard.
One common cause is `MaxAckPending` saturation. The broker pauses delivery because too many messages are unacked, even though workers could continue if budget were sized correctly.
Oversizing is not free either. Huge in-flight windows increase memory pressure, amplify blast radius during failures, and make redelivery storms harder to reason about.
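The suspension rule is easy to see in isolation. Below is a toy sketch against a local NATS server with JetStream enabled: with `MaxAckPending(1)` and a handler that deliberately never acks, exactly one message arrives and delivery pauses until an ack (or `AckWait` expiry) frees the window. The stream and subject names are placeholders, not Cordum's.

```go
// Toy demonstration of the suspension rule: MaxAckPending(1) plus a
// handler that never acks means one delivery, then the consumer pauses.
package main

import (
	"fmt"
	"log"
	"sync/atomic"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}
	// Placeholder stream/subject for the demo.
	if _, err := js.AddStream(&nats.StreamConfig{Name: "DEMO", Subjects: []string{"demo.>"}}); err != nil {
		log.Fatal(err)
	}

	var delivered int64
	_, err = js.Subscribe("demo.jobs", func(m *nats.Msg) {
		atomic.AddInt64(&delivered, 1) // deliberately never m.Ack()
	}, nats.ManualAck(), nats.AckExplicit(), nats.MaxAckPending(1))
	if err != nil {
		log.Fatal(err)
	}

	for i := 0; i < 5; i++ {
		js.Publish("demo.jobs", []byte("job"))
	}
	time.Sleep(time.Second)
	// Prints 1: the other four messages sit in the stream, undelivered.
	fmt.Println("delivered without acks:", atomic.LoadInt64(&delivered))
}
```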
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS JetStream consumers docs | Defines `MaxAckPending`, default behavior, and the delivery suspension rule when the limit is reached. | Does not map field-level semantics to scheduler-level capacity math or rollout tuning steps. |
| Configuring NATS Server docs | Server and account `max_ack_pending` limits, including JetStream defaults and overrides. | No application-side guidance for reconciling broker ceilings with client/env consumer settings. |
| NATS by Example push consumers (Go) | Concrete push-consumer behavior showing `MaxAckPending` buffering and default pending window. | No production runbook for poison-message pressure, redelivery caps, or multi-replica tuning. |
The docs explain queue controls individually. The gap is the integrated tuning path for an AI agent control plane where ack budget, retry behavior, scheduling state, and broker-side caps all interact.
Cordum runtime behavior
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Default budget | `natsMaxAckPending()` returns 2048 when env override is missing or invalid. | Baseline throughput is bounded by 2048 unacked messages per consumer. |
| Env override | `NATS_MAX_ACK_PENDING` accepts positive integers. | Operators can raise/lower pending budget without code changes. |
| Hard cap | Values above `50000` are clamped and warning-logged. | Prevents runaway memory pressure from unsafe budget values. |
| Ack semantics | JetStream subscriptions use `ManualAck`, `AckExplicit`, `AckWait`, and `MaxAckPending` together. | Pending-ack budget directly controls delivery suspension and backlog growth. |
| Poison protection | `MaxDeliver(maxJSRedeliveries)` with `maxJSRedeliveries=100`. | Redelivery storms are bounded so one bad payload does not block the consumer forever. |
| Queue-group mode | Queue subscriptions still deliver each message to one consumer replica. | Sizing must be per consumer, then multiplied by active consumers. |
| Broker-side ceiling | NATS server/account `max_ack_pending` limits can bound the effective pending-ack budget regardless of app defaults. | Your usable window is the minimum of app value, account limit, and server limit. |
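The last two rows reduce to simple capacity arithmetic: the usable per-consumer window is the minimum of the app value and the broker-side ceilings, and worst-case cluster in-flight scales with active queue-group replicas. A minimal sketch; the helper name and the example limits are illustrative, not Cordum APIs.

```go
// Capacity arithmetic from the table above: effective per-consumer window
// is min(app, account, server), and cluster in-flight multiplies by the
// number of active queue-group replicas. Example limits are illustrative.
package main

import "fmt"

// effectiveBudget treats non-positive limits as "unlimited".
func effectiveBudget(app, account, server int) int {
	min := app
	if account > 0 && account < min {
		min = account
	}
	if server > 0 && server < min {
		min = server
	}
	return min
}

func main() {
	perConsumer := effectiveBudget(5000, 4096, 10000) // app, account, server
	replicas := 3                                     // active queue-group consumers
	fmt.Println("effective per-consumer budget:", perConsumer)         // 4096
	fmt.Println("worst-case cluster in-flight:", perConsumer*replicas) // 12288
}
```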
Code-level mechanics
Budget defaults and clamp (Go)
```go
const maxAckPendingHardLimit = 50000

func natsMaxAckPending() int {
	const defaultMaxAckPending = 2048
	if raw := strings.TrimSpace(os.Getenv("NATS_MAX_ACK_PENDING")); raw != "" {
		if v, err := strconv.Atoi(raw); err == nil && v > 0 {
			if v > maxAckPendingHardLimit {
				slog.Warn("bus: NATS_MAX_ACK_PENDING exceeds hard limit, clamping",
					"requested", v, "clamped_to", maxAckPendingHardLimit)
				return maxAckPendingHardLimit
			}
			return v
		}
	}
	return defaultMaxAckPending
}
```

JetStream subscription options (Go)
```go
opts := []nats.SubOpt{
	nats.ManualAck(),
	nats.AckExplicit(),
	nats.AckWait(b.ackWait),
	nats.MaxAckPending(natsMaxAckPending()),
	nats.MaxDeliver(maxJSRedeliveries), // 100
}
```

`MaxAckPending` does not act alone. Tune it while watching `AckWait` and delivery caps; otherwise you move the bottleneck instead of removing it.
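One concrete interaction to check: with a full pending window, the last delivered message waits roughly `MaxAckPending / drain rate` seconds before a handler reaches it. If `AckWait` is shorter than that, the server redelivers messages the client is still holding, and throughput collapses into redelivery churn. A back-of-envelope sketch with illustrative numbers:

```go
// Back-of-envelope check for the AckWait / MaxAckPending interaction.
// With a full window of 2048 and a sustained drain rate of 50 acks/sec,
// the last message waits ~41s unacked; an AckWait of 30s would redeliver
// it while the client still holds it. All numbers are illustrative.
package main

import (
	"fmt"
	"time"
)

func main() {
	maxAckPending := 2048       // in-flight window size
	drainPerSec := 50.0         // sustained acks per second across handlers
	ackWait := 30 * time.Second // server-side redelivery timeout

	worstWait := time.Duration(float64(maxAckPending) / drainPerSec * float64(time.Second))
	fmt.Printf("worst-case unacked wait: %s (AckWait=%s)\n", worstWait, ackWait)
	if worstWait >= ackWait {
		fmt.Println("risk: spurious redeliveries; raise AckWait or shrink MaxAckPending")
	}
}
```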
Guardrail tests in repo (Go)
```go
func TestNatsMaxAckPending_ClampedAtHardLimit(t *testing.T) {
	t.Setenv("NATS_MAX_ACK_PENDING", "100000")
	v := natsMaxAckPending()
	if v != maxAckPendingHardLimit {
		t.Fatalf("expected %d (clamped), got %d", maxAckPendingHardLimit, v)
	}
}

func TestQueueGroupWithJetStream(t *testing.T) {
	// verifies one delivery per message across queue consumers
}
```

Sizing heuristic
Start from observed reality, not round numbers. Measure per-consumer handler latency and sustained ingress rate, then compute an initial budget and reconcile it with account/server ceilings.
```text
# Rule of thumb (per consumer):
#   required_budget ~= p95_handler_seconds * incoming_msgs_per_second * safety_factor
#
# Example:
#   p95_handler   = 1.2s
#   rate          = 600 msg/s
#   safety_factor = 2
#   required_budget ~= 1.2 * 600 * 2 = 1440
#
# Effective budget is capped by broker limits:
#   effective_budget = min(consumer_max_ack_pending, account_max_ack_pending, server_max_ack_pending)
#
# Start with 2048, verify broker-side caps, then tune from production telemetry.
```
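The same rule of thumb as code, if you want it inside a sizing script. A sketch; the constants mirror the worked example above, so substitute your measured telemetry.

```go
// The sizing rule of thumb above as code: p95 handler latency times
// ingress rate times a safety factor, rounded up. Constants mirror the
// worked example; replace them with production measurements.
package main

import (
	"fmt"
	"math"
)

func requiredBudget(p95Seconds, msgsPerSecond, safetyFactor float64) int {
	return int(math.Ceil(p95Seconds * msgsPerSecond * safetyFactor))
}

func main() {
	budget := requiredBudget(1.2, 600, 2) // -> 1440, matching the worked example
	fmt.Println("required per-consumer budget:", budget)
}
```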
If requested values exceed broker limits, effective budget stays capped and symptoms will look like under-sizing. After initial sizing, tune in small steps and watch queue delay, redelivery rate, and scheduler memory headroom together.
Operator runbook
```bash
# 0) Check broker-side max_ack_pending limits (server/account)
kubectl -n nats get configmap nats-config -o yaml | rg max_ack_pending

# 1) Check current override
kubectl -n cordum exec deploy/cordum-scheduler -- printenv NATS_MAX_ACK_PENDING

# 2) Inspect active JetStream consumers and pending-ack pressure
nats consumer ls CORDUM_JOBS
nats consumer info CORDUM_JOBS <consumer_name>

# 3) Watch signals:
#    - NumAckPending close to MaxAckPending for sustained periods
#    - Rising delivery lag / backlog
#    - Repeated redeliveries for the same payload

# 4) Tune and roll
kubectl -n cordum set env deploy/cordum-scheduler NATS_MAX_ACK_PENDING=5000
kubectl -n cordum rollout status deploy/cordum-scheduler

# 5) Re-check after 15-30 minutes:
#    - Ack pending headroom restored
#    - No large memory regression
#    - Redelivery rate stable
```
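Step 3 can be automated instead of eyeballed. A minimal watcher sketch using the nats.go client; the stream/consumer names and the 80% alert threshold are assumptions, not Cordum defaults.

```go
// Poll-based watcher for runbook step 3: flag sustained pending-ack
// saturation instead of re-running `nats consumer info` by hand.
// Stream/consumer names and the 80% threshold are placeholders.
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	for range time.Tick(10 * time.Second) {
		info, err := js.ConsumerInfo("CORDUM_JOBS", "scheduler") // placeholder names
		if err != nil {
			log.Println("consumer info:", err)
			continue
		}
		limit := info.Config.MaxAckPending
		if limit > 0 && float64(info.NumAckPending) >= 0.8*float64(limit) {
			log.Printf("saturation: NumAckPending=%d / MaxAckPending=%d",
				info.NumAckPending, limit)
		}
	}
}
```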
Limitations and tradeoffs
- Lower budget improves failure visibility but can throttle healthy workloads.
- Higher budget improves short-term throughput but can hide slow-consumer problems.
- `MaxAckPending` tuning does not fix non-idempotent worker behavior.
- Queue-level controls should be paired with DLQ policy and replay discipline (a minimal advisory-listener sketch follows below).
If `NumAckPending` stays near max for long windows, do not just raise the ceiling. First verify handler latency, stuck deliveries, and poison-message rate.
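For the DLQ pairing mentioned above: when a message exhausts `MaxDeliver`, JetStream emits a max-deliveries advisory on `$JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.<stream>.<consumer>`. A minimal listener sketch; the handling shown (logging the stream sequence) is a placeholder for your replay tooling.

```go
// Listen for JetStream max-deliveries advisories so poison messages are
// recorded or re-routed instead of silently dropping out of delivery.
package main

import (
	"encoding/json"
	"log"

	"github.com/nats-io/nats.go"
)

// Subset of the io.nats.jetstream.advisory.v1.max_deliver payload.
type maxDeliverAdvisory struct {
	Stream    string `json:"stream"`
	Consumer  string `json:"consumer"`
	StreamSeq uint64 `json:"stream_seq"`
}

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	subj := "$JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES.CORDUM_JOBS.*"
	_, err = nc.Subscribe(subj, func(m *nats.Msg) {
		var adv maxDeliverAdvisory
		if err := json.Unmarshal(m.Data, &adv); err != nil {
			log.Println("bad advisory:", err)
			return
		}
		// Hand off to your DLQ/replay tooling; here we just log the pointer
		// back into the stream so the payload can be fetched and inspected.
		log.Printf("poison message: stream=%s seq=%d consumer=%s",
			adv.Stream, adv.StreamSeq, adv.Consumer)
	})
	if err != nil {
		log.Fatal(err)
	}
	select {} // block forever; a real service would wire this into its lifecycle
}
```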
Next step
Run one sizing pass with production data:
1. Measure p95 handler duration per queue consumer.
2. Estimate the required ack budget with a safety factor.
3. Apply a single-step change and observe for 30 minutes.
4. Keep the smallest value that avoids sustained saturation.
Continue with *AckWait and Dedup TTL Alignment* and *Backpressure and Queue Drain*.