Deep Dive

AI Agent NATS Reconnect Buffer Sizing

Outages are inevitable. Silently losing messages to buffer exhaustion is optional.

Deep Dive · 10 min read · Mar 2026
TL;DR
  • nats.go defaults the reconnect buffer to 8 MB unless `ReconnectBufSize(...)` is set explicitly.
  • When the reconnect buffer is exhausted, publish operations return errors instead of queueing indefinitely.
  • Cordum does not currently set `ReconnectBufSize`; core subjects depend on this default boundary.
  • Durable subjects use the JetStream publish path, but core broadcast/control subjects still need explicit outage handling.
Hidden boundary

The reconnect buffer is a finite queue. Treat it as a hard limit, not a safety blanket.

Current default

Without an explicit option, nats.go uses an 8MB reconnect buffer for queued publishes.

Mixed paths

Cordum routes durable subjects to JetStream and many broadcast subjects to Core NATS.

Scope

This guide covers reconnect buffer behavior for Core NATS publish paths in Cordum. JetStream stream retention and consumer redelivery are separate concerns.

The production problem

A broker outage starts. Control-plane replicas keep trying to publish operational events.

The reconnect buffer fills. Publish calls begin returning errors. If those errors are ignored upstream, state signals vanish, and operators only see the symptoms later.

This is not rare edge behavior. It is default reconnect physics under sustained disconnect.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| NATS docs: Buffering Messages During Reconnect Attempts | Reconnect buffer behavior and the risk that an app-level send appears successful while delivery fails. | No concrete sizing process for multi-component control planes with different message criticality classes. |
| NATS docs: Automatic Reconnections | Reconnect lifecycle, callback hooks, and high-level reliability caveats. | Does not map buffer boundaries to mixed Core NATS vs JetStream publish paths. |
| nats.go package docs | `ReconnectBufSize` semantics, the 8 MB default, and overflow behavior via publish errors. | No workload-specific guidance for choosing buffer size versus memory headroom in orchestrated replicas. |

Cordum runtime behavior

| Boundary | Observed behavior | Operational impact |
| --- | --- | --- |
| Reconnect buffer default | Cordum bus options do not set `ReconnectBufSize`; the client default applies. | The queueing limit during disconnect is implicit unless an operator changes code. |
| Overflow behavior | When the reconnect buffer fills, publish returns an error. | Callers must handle publish failures or lose control signals under outage stress. |
| Durable subjects | With JetStream enabled, durable subjects publish through `b.js.Publish(...)`. | The durable path's behavior differs from core publish buffering and should be monitored separately. |
| Core broadcast subjects | Heartbeat/handshake/config-change style subjects intentionally stay on Core NATS. | These paths are exposed to reconnect buffer limits and at-most-once semantics. |

Code-level mechanics

1) Core vs durable publish path

core/infra/bus/nats.go
```go
func (b *NatsBus) Publish(subject string, packet *pb.BusPacket) error {
  ...
  if b != nil && b.jsEnabled && isDurableSubject(subject) {
    _, err = b.js.Publish(subject, data)
    if err != nil {
      return fmt.Errorf("publish %s: %w", subject, err)
    }
    return nil
  }
  if err := b.nc.Publish(subject, data); err != nil {
    return fmt.Errorf("publish %s: %w", subject, err)
  }
  return nil
}
```
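The routing above hinges on `isDurableSubject`. Its real implementation lives in the Cordum codebase; as a sketch of the idea only, here is a prefix-based classifier — the `cordum.durable.` prefix is a made-up convention for illustration, not Cordum's actual rule:

```go
package main

import (
	"fmt"
	"strings"
)

// isDurableSubject is a hypothetical classifier: the real rule lives
// in core/infra/bus. This sketch assumes durable subjects share a
// naming prefix, which is one common convention.
func isDurableSubject(subject string) bool {
	return strings.HasPrefix(subject, "cordum.durable.")
}

func main() {
	fmt.Println(isDurableSubject("cordum.durable.tasks")) // true
	fmt.Println(isDurableSubject("cordum.heartbeat"))     // false
}
```

Whatever the real predicate is, the key property is that it is total and cheap: every publish passes through it, and a misclassified subject silently changes delivery semantics.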

2) Effective default reconnect buffer

nats options
```go
// nats.go option docs
// ReconnectBufSize sets the buffer size of messages kept while busy reconnecting.
// Defaults to 8388608 bytes (8MB).
// Once exhausted, publish operations return an error.

// Cordum currently sets:
opts := []nats.Option{
  nats.MaxReconnects(-1),
  nats.ReconnectWait(2 * time.Second),
  // no ReconnectBufSize override yet
}
```
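When the buffer overflows, nats.go returns a sentinel error (`nats.ErrReconnectBufExceeded`) that callers can branch on with `errors.Is`, distinguishing a dropped message from other publish failures. A self-contained sketch using a local stand-in for the sentinel so it runs without the client library:

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in for nats.ErrReconnectBufExceeded so this sketch runs
// without the client library; in real code, compare against the
// sentinel exported by nats.go.
var errReconnectBufExceeded = errors.New("nats: outbound buffer limit exceeded")

// publishWithCheck wraps a publish call and classifies buffer
// exhaustion separately from other failures.
func publishWithCheck(publish func() error) (dropped bool, err error) {
	err = publish()
	if errors.Is(err, errReconnectBufExceeded) {
		// The message was never queued: surface this as a drop,
		// not a transient error, so operators see it.
		return true, err
	}
	return false, err
}

func main() {
	dropped, err := publishWithCheck(func() error { return errReconnectBufExceeded })
	fmt.Println(dropped, err != nil) // true true
}
```

Branching on the sentinel matters because a drop and a marshal error call for different responses: the first is an outage signal, the second a bug.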

3) Explicit buffer sizing example

core/infra/bus/nats.go (example)
```go
opts := []nats.Option{
  nats.Name("cordum-bus"),
  nats.MaxReconnects(-1),
  nats.ReconnectWait(2 * time.Second),
  nats.ReconnectBufSize(32 * 1024 * 1024), // 32MB example
  nats.ReconnectErrHandler(func(nc *nats.Conn, err error) {
    slog.Warn("bus: reconnect attempt failed", "err", err)
  }),
}
```

Size this against memory budgets and expected publish burst during outage windows. Bigger is not always safer if it hides persistent disconnects.
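That sizing can start as back-of-envelope arithmetic: expected publish rate times average payload size times the outage window you want to absorb, plus headroom. The numbers below are placeholders; measure your own workload before committing to a value:

```go
package main

import "fmt"

// reconnectBufBytes estimates a reconnect buffer size from expected
// publish load during an outage window. All inputs are assumptions
// to be replaced with measured values for your workload.
func reconnectBufBytes(msgsPerSec, avgMsgBytes, outageSeconds, headroomPct int) int {
	base := msgsPerSec * avgMsgBytes * outageSeconds
	return base + base*headroomPct/100
}

func main() {
	// e.g. 200 msg/s, 1 KiB average payload, 60 s outage budget, 50% headroom
	fmt.Println(reconnectBufBytes(200, 1024, 60, 50)) // 18432000 (~18 MB)
}
```

Run the estimate per component: a chatty scheduler and a quiet gateway should not inherit the same buffer, and the result also bounds the worst-case memory spike each replica can add during a disconnect.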

Operator runbook

Test reconnect buffer behavior under realistic outage conditions before tuning in production.

staging-runbook.sh
```bash
# 1) Measure publish error baseline
kubectl -n cordum logs deploy/cordum-scheduler | rg "publish .*:"

# 2) Simulate short broker outage
kubectl -n cordum rollout restart statefulset/nats

# 3) Watch for publish errors while disconnected
kubectl -n cordum logs deploy/cordum-scheduler | rg "publish .*:"
kubectl -n cordum logs deploy/cordum-gateway | rg "publish .*:"

# 4) If errors spike, trial explicit ReconnectBufSize in staging
# 5) Re-run outage drill and compare:
#    - publish error count
#    - process RSS growth
#    - reconnect recovery time
```

Limitations and tradeoffs

| Option | Benefit | Cost |
| --- | --- | --- |
| Keep the 8 MB default | No extra memory planning required. | May hit the publish-error boundary sooner during long disconnect windows. |
| Increase the reconnect buffer | Absorbs larger outage bursts before publish failures surface. | Higher process memory footprint and delayed visibility of prolonged outages. |
| Disable reconnect buffering (`ReconnectBufSize(-1)`) | Immediate publish failure signal during disconnect. | Higher error volume; requires a strict retry policy in calling code. |

Next step

Run one outage drill this week, record publish error counts and memory use, then pick an explicit `ReconnectBufSize` policy for each control-plane component.
