
AI Agent NATS Reconnect Jitter

Fixed reconnect cadences are clean in code and noisy in outages.

Deep Dive · 10 min read · Mar 2026
TL;DR
- Cordum currently uses a fixed reconnect cadence: `ReconnectWait(2s)` with `MaxReconnects(-1)` and no jitter option.
- A fixed cadence across many replicas can create synchronized reconnect bursts during NATS outages.
- nats.go supports `ReconnectJitter(jitter, jitterForTLS)`, and its defaults are small (100ms / 1s).
- Add explicit jitter, and monitor the reconnect error rate before and after rollout.
Burst risk

If 150 replicas reconnect every 2 seconds, retry pressure arrives in waves, not a smooth flow.

Current setting

Cordum sets fixed reconnect wait and infinite reconnect attempts in bus initialization.

Storm control

Jitter spreads attempts over time so broker recovery has a wider breathing window.

Scope

This guide focuses on reconnect storm control in Core NATS client behavior for Cordum components, not JetStream consumer redelivery policy.

The production problem

A broker node flaps. Every control-plane replica drops and starts reconnecting at the same cadence.

The app survives. The broker's recovery window gets hammered by synchronized retries and reconnect buffer flushes.

This is not a logic bug. It is a retry-shape bug.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| NATS docs: Avoiding the Thundering Herd | Why reconnect storms happen and why randomization matters in clusters. | Does not quantify operational burst behavior for application replicas with identical reconnect intervals. |
| NATS docs: Pausing Between Reconnect Attempts | Reconnect wait configuration and multi-language examples. | No control-plane tuning guidance for balancing reconnect speed against outage blast radius. |
| nats.go package docs | `ReconnectJitter`, `ReconnectWait`, `ReconnectBufSize`, and related reconnect options. | No end-to-end rollout playbook for scheduler, gateway, and workflow replicas sharing one broker tier. |

Cordum runtime behavior

| Boundary | Observed behavior | Operational impact |
| --- | --- | --- |
| Reconnect attempt count | Cordum sets `nats.MaxReconnects(-1)`. | Clients keep retrying indefinitely after disconnection. |
| Reconnect interval | Cordum sets `nats.ReconnectWait(2 * time.Second)`. | All replicas retry on a tight fixed cadence unless jitter is also configured. |
| Jitter option | `nats.ReconnectJitter(...)` is not currently set in bus options. | Retry attempts can align in waves under outage conditions. |
| Buffer behavior | The nats.go reconnect buffer defaults to 8MB if not overridden. | During disconnect, publish pressure can accumulate and later flush in bursts. |
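If buffered publish flushes are a concern, the reconnect buffer can be tightened alongside jitter. A sketch of the option (the 2MB size here is illustrative, not a recommendation):

```go
opts := []nats.Option{
  nats.Name("cordum-bus"),
  // Cap the reconnect buffer below the 8MB default so a long outage
  // accumulates less publish pressure to flush on reconnect.
  nats.ReconnectBufSize(2 * 1024 * 1024),
  // Alternatively, pass -1 to disable buffering entirely so publishes
  // fail fast while disconnected instead of queueing.
}
```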

Code-level mechanics

1) Current reconnect shape in Cordum

core/infra/bus/nats.go
```go
opts := []nats.Option{
  nats.Name("cordum-bus"),
  nats.MaxReconnects(-1),
  nats.ReconnectWait(2 * time.Second),
  nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
    slog.Info("bus: disconnected from nats", "err", err)
  }),
  nats.ReconnectHandler(func(nc *nats.Conn) {
    slog.Info("bus: reconnected to nats", "url", nc.ConnectedUrl())
  }),
}
```

2) Jittered reconnect option

core/infra/bus/nats.go (example)
```go
opts := []nats.Option{
  nats.Name("cordum-bus"),
  nats.MaxReconnects(-1),
  nats.ReconnectWait(2 * time.Second),
  nats.ReconnectJitter(500*time.Millisecond, 2*time.Second),
  nats.ReconnectErrHandler(func(nc *nats.Conn, err error) {
    slog.Warn("bus: reconnect attempt failed", "err", err)
  }),
}
```

Keep the wait short enough for recovery objectives, then add jitter to break phase alignment across replicas.

3) Back-of-envelope outage math

burst-estimate.go
```go
// Simple burst estimate: 150 replicas retrying on a fixed 2s cadence.
package main

import "fmt"

func main() {
  replicas := 150.0
  wait := 2.0 // seconds between attempts, no jitter

  // No jitter: all 150 attempts can align at once every 2s.
  fmt.Printf("average rate: %.0f attempts/s\n", replicas/wait)
  // The average hides the problem: the peak burst is what hurts.
  fmt.Printf("peak burst:   %.0f aligned attempts\n", replicas)

  // With a ~500ms jitter window, attempts spread across that window,
  // so the peak drops because retries are no longer phase-aligned.
}
```

Operator runbook

Roll this in stages. Change one component first, compare reconnect noise, then extend to gateway and workflow-engine.

staging-runbook.sh
```bash
# 1) Baseline reconnect storm signals
kubectl -n cordum logs deploy/cordum-scheduler | rg "disconnected from nats|reconnected to nats"

# 2) Simulate broker turbulence in staging
kubectl -n cordum rollout restart statefulset/nats
kubectl -n cordum get pods -w | rg "cordum-scheduler|cordum-gateway|workflow"

# 3) Roll jitter change to one deployment first
kubectl -n cordum rollout restart deploy/cordum-scheduler

# 4) Compare metrics before/after
# - reconnect errors per minute
# - broker auth/connect CPU during recovery window
# - time to stable connected state
```
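Comparing "reconnect errors per minute" is easier with a counter wired into the connection handlers. A minimal sketch; the `reconnectStats` type and the idea of calling it from `nats.ReconnectErrHandler` / `nats.ReconnectHandler` are assumptions about how you would wire it into bus init:

```go
// Minimal counters for comparing reconnect noise before and after the
// jitter rollout. In bus init you would call onReconnectErr from
// nats.ReconnectErrHandler and onReconnected from nats.ReconnectHandler,
// then log or scrape the values periodically.
package main

import (
  "fmt"
  "sync/atomic"
)

type reconnectStats struct {
  errors     atomic.Int64
  reconnects atomic.Int64
}

func (s *reconnectStats) onReconnectErr() { s.errors.Add(1) }
func (s *reconnectStats) onReconnected()  { s.reconnects.Add(1) }

func main() {
  var stats reconnectStats
  // Simulate two failed attempts followed by a successful reconnect.
  stats.onReconnectErr()
  stats.onReconnectErr()
  stats.onReconnected()
  fmt.Printf("reconnect_errors=%d reconnects=%d\n",
    stats.errors.Load(), stats.reconnects.Load())
  // prints: reconnect_errors=2 reconnects=1
}
```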

Limitations and tradeoffs

| Choice | Benefit | Cost |
| --- | --- | --- |
| No jitter, fixed wait | Predictable schedule and simpler reasoning in single-replica setups. | Higher retry phase alignment and broker pressure during outages. |
| Moderate jitter | Lower peak retry bursts and a smoother broker recovery curve. | Slightly less deterministic reconnect timing per replica. |
| Large jitter window | Strong burst smoothing under massive replica counts. | Longer tail to full reconnection across all replicas. |

Next step

Add reconnect jitter in staging this sprint, run one broker flap drill, and decide a cluster-wide default based on measured reconnect burst and recovery time.
