The production problem
A broker node flaps. Every control-plane replica drops and starts reconnecting at the same cadence.
The app survives. The broker's recovery window gets hammered by synchronized retries and bursty reconnect-buffer flushes.
This is not a logic bug. It is a retry-shape bug.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS docs: Avoiding the Thundering Herd | Why reconnect storms happen and why randomization matters in clusters. | Does not quantify operational burst behavior for application replicas with identical reconnect intervals. |
| NATS docs: Pausing Between Reconnect Attempts | Reconnect wait configuration and multi-language examples. | No control-plane tuning guidance for balancing reconnect speed against outage blast radius. |
| nats.go package docs | `ReconnectJitter`, `ReconnectWait`, `ReconnectBufSize`, and related reconnect options. | No end-to-end rollout playbook for scheduler, gateway, and workflow replicas sharing one broker tier. |
Cordum runtime behavior
| Boundary | Observed behavior | Operational impact |
|---|---|---|
| Reconnect attempt count | Cordum sets `nats.MaxReconnects(-1)`. | Clients keep retrying indefinitely after disconnection. |
| Reconnect interval | Cordum sets `nats.ReconnectWait(2 * time.Second)`. | All replicas retry on a tight fixed cadence unless jitter is also configured. |
| Jitter option | `nats.ReconnectJitter(...)` is not currently set in bus options. | Retry attempts can align in waves under outage conditions. |
| Buffer behavior | nats.go reconnect buffer defaults to 8MB if not overridden. | During disconnect, publish pressure can accumulate and later flush in bursts. |
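The buffer row can be made concrete with a back-of-envelope backlog estimate. A minimal sketch, assuming an illustrative publish rate, message size, and outage length (the 8 MB figure is nats.go's documented default reconnect buffer; the other numbers are hypothetical, not Cordum measurements):

```go
package main

import "fmt"

func main() {
	const defaultReconnectBufSize = 8 * 1024 * 1024 // nats.go default: 8 MB

	// Illustrative assumptions only.
	const (
		publishRate = 500  // messages per second, one replica
		avgMsgSize  = 2048 // bytes per message
		outageSecs  = 10   // disconnect duration in seconds
	)

	// Bytes buffered while disconnected, then flushed on reconnect.
	backlog := publishRate * avgMsgSize * outageSecs
	fmt.Printf("backlog during outage: %d bytes (%.1f MB)\n",
		backlog, float64(backlog)/(1<<20))
	fmt.Printf("fits in default buffer: %v\n", backlog <= defaultReconnectBufSize)
}
```

Under these assumptions a 10-second outage overruns the default buffer, so sizing `nats.ReconnectBufSize` deserves the same scrutiny as the retry cadence.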
Code-level mechanics
1) Current reconnect shape in Cordum

```go
opts := []nats.Option{
	nats.Name("cordum-bus"),
	nats.MaxReconnects(-1),
	nats.ReconnectWait(2 * time.Second),
	nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
		slog.Info("bus: disconnected from nats", "err", err)
	}),
	nats.ReconnectHandler(func(nc *nats.Conn) {
		slog.Info("bus: reconnected to nats", "url", nc.ConnectedUrl())
	}),
}
```

2) Jittered reconnect option
```go
opts := []nats.Option{
	nats.Name("cordum-bus"),
	nats.MaxReconnects(-1),
	nats.ReconnectWait(2 * time.Second),
	// First arg: jitter for plain connections; second: jitter for TLS.
	nats.ReconnectJitter(500*time.Millisecond, 2*time.Second),
	nats.ReconnectErrHandler(func(nc *nats.Conn, err error) {
		slog.Warn("bus: reconnect attempt failed", "err", err)
	}),
}
```

Keep the wait short enough for recovery objectives, then add jitter to break phase alignment across replicas.
3) Back-of-envelope outage math
```go
// Simple burst estimate
// replicas = 150
// wait = 2s
// no jitter -> 150 attempts can align at once every 2s
// average rate = 75 attempts/s, but peak burst is the real problem
// with jitter window ~500ms, attempts spread across that window
// peak drops because retries are no longer phase-aligned
```
Operator runbook
Roll this in stages. Change one component first, compare reconnect noise, then extend to gateway and workflow-engine.
```sh
# 1) Baseline reconnect storm signals
kubectl -n cordum logs deploy/cordum-scheduler | rg "disconnected from nats|reconnected to nats"

# 2) Simulate broker turbulence in staging
kubectl -n cordum rollout restart statefulset/nats
kubectl -n cordum get pods -w | rg "cordum-scheduler|cordum-gateway|workflow"

# 3) Roll jitter change to one deployment first
kubectl -n cordum rollout restart deploy/cordum-scheduler

# 4) Compare metrics before/after
# - reconnect errors per minute
# - broker auth/connect CPU during recovery window
# - time to stable connected state
```
Limitations and tradeoffs
| Choice | Benefit | Cost |
|---|---|---|
| No jitter, fixed wait | Predictable schedule and simpler reasoning in single-replica setups. | Higher retry phase alignment and broker pressure during outages. |
| Moderate jitter | Lower peak retry bursts and smoother broker recovery curve. | Slightly less deterministic reconnect timing per replica. |
| Large jitter window | Strong burst smoothing under massive replica count. | Longer tail to full reconnection across all replicas. |
Next step
Add reconnect jitter in staging this sprint, run one broker flap drill, and decide a cluster-wide default based on measured reconnect burst and recovery time.