## The production problem
Load increases. One subscriber falls behind. Nobody notices until downstream state looks inconsistent.
NATS behavior here is intentional. It protects the system first, and the lagging consumer second.
If your application does not expose local slow-consumer signals quickly, incident response starts late.
## What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS docs: Slow Consumers (server/operator) | Server/client slow-consumer behavior, default pending limits (65,536 messages and 65,536*1024 bytes), and tuning directions. | No control-plane-specific guidance for mixed JetStream and core NATS subject classes in one runtime. |
| NATS docs: Slow Consumers (developer events) | Client-side pending-limit APIs and slow-consumer detection pattern examples. | No migration checklist for teams with legacy subscriptions that were created without limits. |
| nats.go package docs | `ErrorHandler`, `Subscription.SetPendingLimits`, `Pending`, `Dropped`, and `StatusChanged` APIs. | No production defaults for AI control-plane workloads or guidance on per-subject budget partitioning. |
## Cordum runtime behavior
Cordum has strong JetStream guardrails in place. Core NATS paths are less opinionated today.
That split is easy to miss because both code paths share the same high-level bus API.
| Boundary | Current behavior | Operational impact |
|---|---|---|
| JetStream subscribe path | Cordum sets `nats.MaxAckPending(natsMaxAckPending())` for durable subjects. | Backpressure ceiling is explicit and configurable in JetStream mode. |
| Core NATS subscribe path | Core subscriptions use `nc.Subscribe` / `nc.QueueSubscribe` with no immediate `SetPendingLimits` call. | Slow-consumer limits stay at library defaults unless changed elsewhere. |
| Connection callbacks | Cordum sets disconnect/reconnect/closed handlers but not `nats.ErrorHandler`. | Async local slow-consumer events are easier to miss in operator telemetry. |
| Operational observability | No built-in metric surface for per-subscription dropped-count snapshots on core NATS paths. | Teams usually discover this by reading logs after user-visible lag. |
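The missing observability surface in the last row is straightforward to close with a periodic sampler. The sketch below is not Cordum code: `pendingReader`, `subStats`, `sampleSub`, and `fakeSub` are illustrative names, and the interface simply mirrors the `Pending()` and `Dropped()` signatures that `*nats.Subscription` already exposes, so the real type satisfies it without an adapter.

```go
package main

import "fmt"

// pendingReader is the subset of *nats.Subscription used by the sampler.
// Pending() (int, int, error) and Dropped() (int, error) match the
// nats.go signatures, so a real subscription plugs in directly.
type pendingReader interface {
	Pending() (int, int, error)
	Dropped() (int, error)
}

// subStats is a point-in-time snapshot suitable for metric gauges
// (pending depth) and a counter (cumulative drops).
type subStats struct {
	PendingMsgs  int
	PendingBytes int
	Dropped      int
}

// sampleSub reads the local queue depth and cumulative drop count.
// Run it on a ticker per subscription and export the fields.
func sampleSub(sub pendingReader) (subStats, error) {
	msgs, bytes, err := sub.Pending()
	if err != nil {
		return subStats{}, err
	}
	dropped, err := sub.Dropped()
	if err != nil {
		return subStats{}, err
	}
	return subStats{PendingMsgs: msgs, PendingBytes: bytes, Dropped: dropped}, nil
}

// fakeSub stands in for a live subscription in this self-contained sketch.
type fakeSub struct{ msgs, bytes, dropped int }

func (f fakeSub) Pending() (int, int, error) { return f.msgs, f.bytes, nil }
func (f fakeSub) Dropped() (int, error)      { return f.dropped, nil }

func main() {
	s, _ := sampleSub(fakeSub{msgs: 120, bytes: 48_000, dropped: 3})
	fmt.Printf("pending_msgs=%d pending_bytes=%d dropped=%d\n",
		s.PendingMsgs, s.PendingBytes, s.Dropped)
}
```

Sampling on a ticker instead of only logging inside the error callback means you see pending depth *trending up* before the first drop, not after.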
Cordum's current connection options and core subscribe paths look roughly like this:

```go
opts := []nats.Option{
	nats.Name("cordum-bus"),
	nats.MaxReconnects(-1),
	nats.ReconnectWait(2 * time.Second),
	nats.DisconnectErrHandler(...),
	nats.ReconnectHandler(...),
	nats.ClosedHandler(...),
}
```

```go
// Core path
if queue == "" {
	sub, err := b.nc.Subscribe(subject, cb)
	...
}
sub, err := b.nc.QueueSubscribe(subject, queue, cb)
```

## Practical patch pattern
Add two pieces together:

1. `nats.ErrorHandler` for async local slow-consumer visibility.
2. Explicit `SetPendingLimits` on core subscriptions.

Start with conservative limits and tune by subject class; a single global number is usually wrong.
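One way to pick a defensible starting point is to derive limits from workload estimates rather than guessing. The helper below is illustrative, not Cordum code; `startingLimits` and all its inputs are assumptions you would replace with your own measured numbers.

```go
package main

import "fmt"

// startingLimits derives a first-pass pending budget from rough
// workload numbers: average message size, peak burst rate, and how
// many seconds of burst the local queue should absorb before drops
// become acceptable. All inputs are estimates, not measurements.
func startingLimits(avgMsgBytes, burstMsgsPerSec, absorbSecs int) (msgLimit, byteLimit int) {
	msgLimit = burstMsgsPerSec * absorbSecs
	// 2x headroom on bytes so a few oversized payloads do not trip
	// the byte limit before the message limit does.
	byteLimit = msgLimit * avgMsgBytes * 2
	return msgLimit, byteLimit
}

func main() {
	// e.g. 2 KiB messages, 4,000 msg/s bursts, absorb 5 seconds locally.
	m, b := startingLimits(2048, 4000, 5)
	fmt.Println(m, b) // 20000 81920000
}
```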
Wired together, the two pieces look like this:

```go
func withSlowConsumerGuardrails(opts *[]nats.Option) {
	*opts = append(*opts, nats.ErrorHandler(func(nc *nats.Conn, sub *nats.Subscription, err error) {
		// sub can be nil for non-subscription async errors, so guard it.
		if sub != nil && errors.Is(err, nats.ErrSlowConsumer) {
			pendingMsgs, pendingBytes, pErr := sub.Pending()
			if pErr == nil {
				slog.Warn("bus: slow consumer",
					"subject", sub.Subject,
					"pending_msgs", pendingMsgs,
					"pending_bytes", pendingBytes)
			} else {
				slog.Warn("bus: slow consumer", "subject", sub.Subject, "pending_err", pErr)
			}
			return
		}
		slog.Warn("bus: async nats error", "err", err)
	}))
}

func setCorePendingLimits(sub *nats.Subscription) {
	// Example starting point, tune per workload.
	const msgLimit = 20000
	const bytesLimit = 64 * 1024 * 1024 // 64 MiB
	if err := sub.SetPendingLimits(msgLimit, bytesLimit); err != nil {
		slog.Warn("bus: set pending limits failed", "subject", sub.Subject, "err", err)
	}
}
```

## Rollout and validation
Do not merge this blind. Reproduce bursts in staging and compare pending and dropped counters before and after.
```shell
# 1) Force burst traffic above normal steady-state rate.

# 2) Compare before/after dropped and pending signals.
kubectl -n cordum logs deploy/cordum-scheduler | rg "slow consumer|pending_msgs|pending_bytes|async nats error"

# Optional: inspect the server-side count.
curl -s http://nats:8222/varz | jq '.slow_consumers'

# 3) Tune per subject class:
#    - low-latency control subjects: tighter pending limits
#    - bursty telemetry subjects: looser byte budget with queue scale-out
```
If dropped counts rise after tightening limits, either scale queue workers or split subject namespaces before relaxing limits.
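Per-subject-class tuning can live in one small lookup so every service applies the same tiers. The prefixes and numbers below are illustrative, not Cordum's real subject namespace; `limits` and `limitsForSubject` are hypothetical names for the sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// limits is a per-subject-class pending budget.
type limits struct {
	Msgs  int
	Bytes int
}

// limitsForSubject maps a subject to a budget tier: tight budgets for
// low-latency control subjects, a looser byte budget for bursty
// telemetry, and a middle default for everything else. The prefixes
// and numbers are placeholders to tune per deployment.
func limitsForSubject(subject string) limits {
	switch {
	case strings.HasPrefix(subject, "control."):
		return limits{Msgs: 5_000, Bytes: 16 * 1024 * 1024}
	case strings.HasPrefix(subject, "telemetry."):
		return limits{Msgs: 50_000, Bytes: 256 * 1024 * 1024}
	default:
		return limits{Msgs: 20_000, Bytes: 64 * 1024 * 1024}
	}
}

func main() {
	for _, s := range []string{"control.sched", "telemetry.gpu", "jobs.done"} {
		fmt.Println(s, limitsForSubject(s))
	}
}
```

Centralizing the tiers also gives you the "internal default matrix by subject type" in code form, so tuning one tier updates every subscriber consistently.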
## Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Keep default core pending limits | No behavior change and no tuning work. | Risk of delayed detection and surprise drops under bursts. |
| Add explicit limits + async error callback | Predictable queue budgets and faster operator signal. | Requires per-subject tuning to avoid over-aggressive drops. |
| Set unlimited pending queues | Fewer immediate drop events in short spikes. | Memory growth risk and worse failure modes under sustained mismatch. |
## Next step
Patch one service first, run burst tests, and publish a small internal default matrix by subject type before rolling the change across all control-plane binaries.