The production problem
A broker node flaps. Every control-plane replica drops and starts reconnecting at the same cadence.
The app survives. The broker's recovery window gets hammered by synchronized retries and bursty reconnect-buffer flushes.
This is not a logic bug. It is a retry-shape bug.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS docs: Avoiding the Thundering Herd | Why reconnect storms happen and why randomization matters in clusters. | Does not quantify operational burst behavior for application replicas with identical reconnect intervals. |
| NATS docs: Pausing Between Reconnect Attempts | Reconnect wait configuration and multi-language examples. | No control-plane tuning guidance for balancing reconnect speed against outage blast radius. |
| nats.go package docs | `ReconnectJitter`, `ReconnectWait`, `ReconnectBufSize`, and related reconnect options. | No end-to-end rollout playbook for scheduler, gateway, and workflow replicas sharing one broker tier. |
Cordum runtime behavior
| Boundary | Observed behavior | Operational impact |
|---|---|---|
| Reconnect attempt count | Cordum sets `nats.MaxReconnects(-1)`. | Clients keep retrying indefinitely after disconnection. |
| Reconnect interval | Cordum sets `nats.ReconnectWait(2 * time.Second)`. | All replicas retry on a tight fixed cadence unless jitter is also configured. |
| Jitter option | `nats.ReconnectJitter(...)` is not currently set in bus options. | Retry attempts can align in waves under outage conditions. |
| Buffer behavior | nats.go reconnect buffer defaults to 8MB if not overridden. | During disconnect, publish pressure can accumulate and later flush in bursts. |
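The buffer row can be made concrete with a back-of-envelope backlog estimate. A minimal sketch, assuming an illustrative publish rate, message size, and outage length (the 8 MB figure is nats.go's documented default reconnect buffer; the other numbers are hypothetical, not Cordum measurements):

```go
package main

import "fmt"

func main() {
	const defaultReconnectBufSize = 8 * 1024 * 1024 // nats.go default: 8 MB

	// Illustrative assumptions only.
	const (
		publishRate = 500  // messages per second, one replica
		avgMsgSize  = 2048 // bytes per message
		outageSecs  = 10   // disconnect duration in seconds
	)

	// Bytes buffered while disconnected, then flushed on reconnect.
	backlog := publishRate * avgMsgSize * outageSecs
	fmt.Printf("backlog during outage: %d bytes (%.1f MB)\n",
		backlog, float64(backlog)/(1<<20))
	fmt.Printf("fits in default buffer: %v\n", backlog <= defaultReconnectBufSize)
}
```

Under these assumptions a 10-second outage overruns the default buffer, so sizing `nats.ReconnectBufSize` deserves the same scrutiny as the retry cadence.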
Code-level mechanics
1) Current reconnect shape in Cordum

```go
opts := []nats.Option{
	nats.Name("cordum-bus"),
	nats.MaxReconnects(-1),
	nats.ReconnectWait(2 * time.Second),
	nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
		slog.Info("bus: disconnected from nats", "err", err)
	}),
	nats.ReconnectHandler(func(nc *nats.Conn) {
		slog.Info("bus: reconnected to nats", "url", nc.ConnectedUrl())
	}),
}
```

2) Jittered reconnect option
```go
opts := []nats.Option{
	nats.Name("cordum-bus"),
	nats.MaxReconnects(-1),
	nats.ReconnectWait(2 * time.Second),
	// First arg: jitter for plain connections; second: jitter for TLS.
	nats.ReconnectJitter(500*time.Millisecond, 2*time.Second),
	nats.ReconnectErrHandler(func(nc *nats.Conn, err error) {
		slog.Warn("bus: reconnect attempt failed", "err", err)
	}),
}
```

Keep the wait short enough for recovery objectives, then add jitter to break phase alignment across replicas.
3) Back-of-envelope outage math
```go
// Simple burst estimate
// replicas = 150
// wait = 2s
// no jitter -> 150 attempts can align at once every 2s
// average rate = 75 attempts/s, but peak burst is the real problem
// with jitter window ~500ms, attempts spread across that window
// peak drops because retries are no longer phase-aligned
```
Operator runbook
Roll this in stages. Change one component first, compare reconnect noise, then extend to gateway and workflow-engine.
```sh
# 1) Baseline reconnect storm signals
kubectl -n cordum logs deploy/cordum-scheduler | rg "disconnected from nats|reconnected to nats"

# 2) Simulate broker turbulence in staging
kubectl -n cordum rollout restart statefulset/nats
kubectl -n cordum get pods -w | rg "cordum-scheduler|cordum-gateway|workflow"

# 3) Roll jitter change to one deployment first
kubectl -n cordum rollout restart deploy/cordum-scheduler

# 4) Compare metrics before/after
# - reconnect errors per minute
# - broker auth/connect CPU during recovery window
# - time to stable connected state
```
Limitations and tradeoffs
| Choice | Benefit | Cost |
|---|---|---|
| No jitter, fixed wait | Predictable schedule and simpler reasoning in single-replica setups. | Higher retry phase alignment and broker pressure during outages. |
| Moderate jitter | Lower peak retry bursts and smoother broker recovery curve. | Slightly less deterministic reconnect timing per replica. |
| Large jitter window | Strong burst smoothing under massive replica count. | Longer tail to full reconnection across all replicas. |
Next step
Add reconnect jitter in staging this sprint, run one broker flap drill, and decide a cluster-wide default based on measured reconnect burst and recovery time.