The production problem
A broker outage starts. Control-plane replicas keep trying to publish operational events.
The reconnect buffer fills, and publish calls begin returning errors. If those errors are ignored upstream, state signals vanish and operators only see the symptoms later.
This is not rare edge behavior. It is the default reconnect behavior under a sustained disconnect.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS docs: Buffering Messages During Reconnect Attempts | Reconnect buffer behavior and the risk that app-level send appears successful while delivery fails. | No concrete sizing process for multi-component control planes with different message criticality classes. |
| NATS docs: Automatic Reconnections | Reconnect lifecycle, callback hooks, and high-level reliability caveats. | Does not map buffer boundaries to mixed Core NATS vs JetStream publish paths. |
| nats.go package docs | `ReconnectBufSize` semantics, default 8MB, and overflow behavior via publish errors. | No workload-specific guidance for choosing buffer size versus memory headroom in orchestrated replicas. |
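The sizing gap called out above can be closed with simple arithmetic: estimate how long a given reconnect buffer survives at a sustained publish rate before the overflow error appears. A minimal sketch; the message sizes and publish rates below are illustrative assumptions, not measured Cordum numbers:

```go
package main

import "fmt"

// outageHeadroom returns how many seconds a reconnect buffer of
// bufBytes can absorb publishes of msgBytes each at msgsPerSec
// before publish calls start returning errors.
func outageHeadroom(bufBytes, msgBytes, msgsPerSec int) float64 {
	return float64(bufBytes) / float64(msgBytes*msgsPerSec)
}

func main() {
	// Default 8MB buffer, 1KB packets, 200 publishes/sec per replica.
	fmt.Printf("%.1fs\n", outageHeadroom(8*1024*1024, 1024, 200)) // → 41.0s
	// A 32MB buffer under the same assumed load.
	fmt.Printf("%.1fs\n", outageHeadroom(32*1024*1024, 1024, 200)) // → 163.8s
}
```

Run this per component and per criticality class: a buffer that outlives your longest tolerated outage window for one component may be far too small for a chattier one.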
Cordum runtime behavior
| Boundary | Observed behavior | Operational impact |
|---|---|---|
| Reconnect buffer default | Cordum bus options do not set `ReconnectBufSize`; client default applies. | Queueing limit during disconnect is implicit unless operator changes code. |
| Overflow behavior | When reconnect buffer fills, publish returns an error. | Callers must handle publish failures or lose control signals under outage stress. |
| Durable subjects | With JetStream enabled, durable subjects publish through `b.js.Publish(...)`. | Durable path behavior differs from core publish buffering and should be monitored separately. |
| Core broadcast subjects | Heartbeat/handshake/config-change style subjects intentionally stay on Core NATS. | These paths are exposed to reconnect buffer limits and at-most-once semantics. |
Code-level mechanics
1) Core vs durable publish path
```go
func (b *NatsBus) Publish(subject string, packet *pb.BusPacket) error {
	...
	if b != nil && b.jsEnabled && isDurableSubject(subject) {
		_, err = b.js.Publish(subject, data)
		if err != nil {
			return fmt.Errorf("publish %s: %w", subject, err)
		}
		return nil
	}
	if err := b.nc.Publish(subject, data); err != nil {
		return fmt.Errorf("publish %s: %w", subject, err)
	}
	return nil
}
```
2) Effective default reconnect buffer
```go
// nats.go option docs:
// ReconnectBufSize sets the buffer size of messages kept while busy reconnecting.
// Defaults to 8388608 bytes (8MB).
// Once exhausted, publish operations return an error.
// Cordum currently sets:
opts := []nats.Option{
	nats.MaxReconnects(-1),
	nats.ReconnectWait(2 * time.Second),
	// no ReconnectBufSize override yet
}
```
3) Explicit buffer sizing example
```go
opts := []nats.Option{
	nats.Name("cordum-bus"),
	nats.MaxReconnects(-1),
	nats.ReconnectWait(2 * time.Second),
	nats.ReconnectBufSize(32 * 1024 * 1024), // 32MB example
	nats.ReconnectErrHandler(func(nc *nats.Conn, err error) {
		slog.Warn("bus: reconnect attempt failed", "err", err)
	}),
}
```
Size this against memory budgets and the expected publish burst during outage windows. Bigger is not always safer if it hides persistent disconnects.
Operator runbook
Test reconnect buffer behavior under realistic outage conditions before tuning in production.
```shell
# 1) Measure publish error baseline
kubectl -n cordum logs deploy/cordum-scheduler | rg "publish .*:"

# 2) Simulate short broker outage
kubectl -n cordum rollout restart statefulset/nats

# 3) Watch for publish errors while disconnected
kubectl -n cordum logs deploy/cordum-scheduler | rg "publish .*:"
kubectl -n cordum logs deploy/cordum-gateway | rg "publish .*:"

# 4) If errors spike, trial explicit ReconnectBufSize in staging

# 5) Re-run outage drill and compare:
#    - publish error count
#    - process RSS growth
#    - reconnect recovery time
```
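The before/after comparison in step 5 can be automated by counting publish-error lines in each captured log. A small sketch, assuming the log lines carry the `publish <subject>: <cause>` wrapping that `Publish` produces (the same pattern the `rg` commands grep):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// publishErrRe matches the wrapped errors emitted by Publish
// ("publish <subject>: <cause>"), mirroring the runbook's rg pattern.
var publishErrRe = regexp.MustCompile(`publish \S+:`)

// countPublishErrors counts matching lines in one captured log.
func countPublishErrors(log string) int {
	n := 0
	for _, line := range strings.Split(log, "\n") {
		if publishErrRe.MatchString(line) {
			n++
		}
	}
	return n
}

func main() {
	before := "ok\npublish cordum.heartbeat: nats: outgoing buffer limit hit\n"
	after := "ok\nok\n"
	fmt.Println(countPublishErrors(before), countPublishErrors(after)) // → 1 0
}
```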
Limitations and tradeoffs
| Option | Benefit | Cost |
|---|---|---|
| Keep default 8MB | No extra memory planning required. | May hit publish-error boundary sooner during long disconnect windows. |
| Increase reconnect buffer | Absorbs larger outage bursts before publish failures surface. | Higher process memory footprint and delayed visibility of prolonged outages. |
| Disable reconnect buffering (-1) | Immediate publish failure signal during disconnect. | Higher error volume; requires strict retry policy in calling code. |
Next step
Run one outage drill this week, record publish error counts and memory use, then pick an explicit `ReconnectBufSize` policy for each control-plane component.
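One way to record the per-component outcome is to make the policy explicit in code rather than leaving it to the client default. A sketch with hypothetical component names and sizes, to be replaced by the numbers your drill produces:

```go
package main

import "fmt"

// reconnectBufPolicy records an explicit per-component ReconnectBufSize
// choice so the limit is visible in code review instead of being an
// implicit client default. Names and sizes here are illustrative.
var reconnectBufPolicy = map[string]int{
	"scheduler": 32 * 1024 * 1024, // absorbs bursty operational events
	"gateway":   8 * 1024 * 1024,  // default suffices for low-rate traffic
	"worker":    -1,               // fail fast; callers retry explicitly
}

func main() {
	for component, size := range reconnectBufPolicy {
		fmt.Println(component, size)
	}
}
```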