The production problem
The team runs a maintenance restart. NATS comes up slower than the scheduler and the gateway, so both services crash on boot and then restart in a loop.
Operators are confused because they configured infinite reconnect and expected the process to wait and recover by itself.
The catch is simple: reconnect settings only apply after a connection has been established at least once. A failed first dial at boot takes a different code path.
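To make the boot-time path concrete, here is a minimal sketch (the URL is illustrative, not a Cordum default): even with MaxReconnects(-1) set, a first dial against an unreachable server returns an error directly from nats.Connect, and the reconnect machinery never runs.
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Reconnect options only govern a connection that has been established at
	// least once. If no server is reachable on the first dial, Connect returns
	// an error immediately; the reconnect loop never starts.
	nc, err := nats.Connect("nats://nats:4222",
		nats.MaxReconnects(-1),
		nats.ReconnectWait(2*time.Second),
	)
	if err != nil {
		log.Fatalf("initial connect failed: %v", err) // the first-boot failure path
	}
	defer nc.Close()
}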
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS docs: Set the Number of Reconnect Attempts | Per-server reconnect attempt semantics and server-pool removal behavior. | Does not define what process should do when first dial fails during boot. |
| nats.go package docs (Advanced Usage) | `RetryOnFailedConnect(true)` keeps the connection in a reconnecting state instead of failing immediately. | No startup SLO guidance on when to keep waiting versus when to exit with an error. |
| NATS tutorial: Advanced Connect and Custom Dialer in Go | Context-aware connection loop with cancellation and dial deadlines (sketched after this table). | No operational contract for Kubernetes probes and restart budget during a cold-start outage. |
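For readers who have not seen that tutorial pattern, the sketch below shows the general shape of a context-aware connect loop; the function name, retry interval, and error wrapping are this article's assumptions, not code from the tutorial or from Cordum.
import (
	"context"
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

// connectCtx keeps dialing until the first connect succeeds or the caller's
// context is cancelled. Cancellation is the only exit besides success, which
// is exactly why the next sections put an explicit startup budget on top.
func connectCtx(ctx context.Context, url string, opts ...nats.Option) (*nats.Conn, error) {
	for {
		nc, err := nats.Connect(url, opts...)
		if err == nil {
			return nc, nil
		}
		select {
		case <-ctx.Done():
			return nil, fmt.Errorf("connect nats %s: %w (last dial error: %v)", url, ctx.Err(), err)
		case <-time.After(2 * time.Second):
			// wait, then dial again
		}
	}
}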
Cordum runtime behavior
| Boundary | Observed behavior | Operational impact |
|---|---|---|
| Client options in bus init | Cordum sets `nats.MaxReconnects(-1)` and `nats.ReconnectWait(2 * time.Second)`. | Established connections recover indefinitely with a fixed wait cadence. |
| Initial dial path | `NewNatsBus` calls `nats.Connect(url, opts...)` and returns an error on failure. | If NATS is down at boot, the caller gets an immediate startup error. |
| RetryOnFailedConnect semantics | When enabled, `Connect` can return a reconnecting connection instead of immediate dial error. | Process can stay alive while disconnected, so readiness gating becomes mandatory. |
| Scheduler startup | `cmd/cordum-scheduler/main.go` exits with `os.Exit(1)` when NATS connect fails. | Kubernetes or systemd restarts process until NATS is reachable. |
| Gateway and workflow-engine startup | Both return startup error when `NewNatsBus` fails. | System availability follows dependency ordering and restart policy quality. |
| RetryOnFailedConnect | Not currently set in Cordum bus options. | Initial failure handling stays fail-fast rather than in-process retry. |
Startup budget and probes
The missing piece in most reconnect guides is startup policy math. If you choose in-process retry, define a deadline up front and align probes to that number.
const (
connectTimeout = 10 * time.Second
servers = 3
passes = 3
margin = 15 * time.Second
)
// Example: 10s * 3 servers * 3 passes + 15s = 105s
startupBudget := time.Duration(servers*passes)*connectTimeout + margin
deadline := time.Now().Add(startupBudget)
for !nc.IsConnected() {
if time.Now().After(deadline) {
return fmt.Errorf("nats not connected within startup budget (%s)", startupBudget)
}
time.Sleep(250 * time.Millisecond)
}

| Strategy | Startup probe behavior | Readiness behavior | Liveness behavior |
|---|---|---|---|
| Fail-fast on first dial error | Keep startupProbe strict. Process exits quickly if NATS is unavailable. | Can remain simple, because dependency failure is expressed by process exit. | Standard liveness check is fine; restarts come from startup failure. |
| RetryOnFailedConnect with deadline | Set the startupProbe window to your explicit connect budget (probe math sketched after this table). | Return not-ready until `nc.IsConnected()` is true. | Do not fail liveness only because NATS is down, or you will restart-loop again. |
| Retry forever with no deadline | Pods can sit in startup or ready=false for a long time. | Prevents traffic if implemented correctly, but hides outage age. | Risk of zombie pods that never become useful. |
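If you go with the second row, the arithmetic that ties the startup budget to the startupProbe is worth writing down once. The sketch below assumes a 10-second probe period; the numbers are illustrative, not Cordum defaults.
package main

import (
	"fmt"
	"math"
	"time"
)

func main() {
	// Translate the 105s startup budget into startupProbe settings:
	// failureThreshold * periodSeconds must cover the whole budget.
	budget := 105 * time.Second
	periodSeconds := 10.0
	failureThreshold := int(math.Ceil(budget.Seconds() / periodSeconds))
	fmt.Printf("startupProbe: periodSeconds=%.0f failureThreshold=%d (allows %.0fs)\n",
		periodSeconds, failureThreshold, periodSeconds*float64(failureThreshold))
	// Output: startupProbe: periodSeconds=10 failureThreshold=11 (allows 110s)
}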
Code-level mechanics
1) Bus options include reconnect settings
opts := []nats.Option{
nats.Name("cordum-bus"),
nats.MaxReconnects(-1),
nats.ReconnectWait(2 * time.Second),
nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
slog.Info("bus: disconnected from nats", "err", err)
}),
nats.ReconnectHandler(func(nc *nats.Conn) {
slog.Info("bus: reconnected to nats", "url", nc.ConnectedUrl())
}),
nats.ClosedHandler(func(nc *nats.Conn) {
slog.Info("bus: connection closed")
}),
}
nc, err := nats.Connect(url, opts...)
if err != nil {
return nil, fmt.Errorf("connect nats %s: %w", url, err)
}

2) Startup exits if initial connect fails
natsBus, err := bus.NewNatsBus(cfg.NatsURL)
if err != nil {
slog.Error("failed to connect to NATS", "error", err)
os.Exit(1)
}
3) Optional patch: retry on failed initial connect
opts := []nats.Option{
nats.Name("cordum-bus"),
nats.Timeout(10 * time.Second),
nats.MaxReconnects(-1),
nats.ReconnectWait(2 * time.Second),
nats.RetryOnFailedConnect(true),
nats.ReconnectJitter(250*time.Millisecond, 1*time.Second),
}
nc, err := nats.Connect(url, opts...)
if err != nil {
return nil, fmt.Errorf("connect nats %s: %w", url, err)
}

This patch changes the failure mode; it does not guarantee correctness by itself. If you adopt it, you also need clear liveness and readiness semantics so traffic does not hit a service that is still reconnecting.
4) Gate readiness on connection state
func readinessHandler(nc *nats.Conn) http.HandlerFunc {
return func(w http.ResponseWriter, _ *http.Request) {
if nc == nil || !nc.IsConnected() {
http.Error(w, "nats disconnected", http.StatusServiceUnavailable)
return
}
w.WriteHeader(http.StatusOK)
_, _ = w.Write([]byte("ok"))
}
}

Pair this with the startup budget above. Startup controls how long you wait before failing. Readiness controls whether traffic is allowed while reconnect is still in progress.
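One way to wire both probes is sketched below. It reuses readinessHandler from the block above; the port, paths, and function name are assumptions for illustration, not existing Cordum endpoints. The key point from the strategy table is that the liveness endpoint reports only process health and deliberately ignores NATS state.
import (
	"log/slog"
	"net/http"

	"github.com/nats-io/nats.go"
)

// startHealthServer serves readiness gated on the NATS connection and a
// liveness endpoint that ignores broker state, so an outage cannot turn
// into another restart loop. Illustrative wiring, not Cordum code.
func startHealthServer(nc *nats.Conn) {
	mux := http.NewServeMux()
	mux.HandleFunc("/readyz", readinessHandler(nc)) // not ready while disconnected
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK) // alive as long as the process serves HTTP
	})
	go func() {
		if err := http.ListenAndServe(":8081", mux); err != nil && err != http.ErrServerClosed {
			slog.Error("health server stopped", "error", err)
		}
	}()
}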
Operator runbook
Validate behavior in staging before changing production defaults. Measure restart count and time-to-recover under controlled NATS outages.
# 1) Confirm current behavior in staging
kubectl -n cordum scale statefulset nats --replicas=0
kubectl -n cordum rollout restart deploy/cordum-scheduler
kubectl -n cordum get pods -w | rg cordum-scheduler
# Expected today: CrashLoopBackOff until NATS returns.
# 2) Restore NATS and verify recovery
kubectl -n cordum scale statefulset nats --replicas=3
kubectl -n cordum rollout status deploy/cordum-scheduler
# 3) If you switch to RetryOnFailedConnect(true), enforce a startup budget first
# Example budget: 105s (10s timeout * 3 servers * 3 passes + 15s margin)
# 4) Verify process behavior and probes
kubectl -n cordum logs deploy/cordum-scheduler | rg "disconnected|reconnected|connect nats"
kubectl -n cordum describe pod -l app=cordum-scheduler | rg "Startup|Readiness|Liveness"
# 5) Track restart and recovery SLO
kubectl -n cordum get pods -o json | jq '.items[] | {name: .metadata.name, restarts: .status.containerStatuses[0].restartCount}'

Limitations and tradeoffs
| Approach | Benefits | Costs |
|---|---|---|
| Fail-fast + restart policy | Simple mental model. Fast crash signal when dependency is unavailable. | Noisy restart loops during dependency outages. Can hide real regressions in alert noise. |
| RetryOnFailedConnect + startup deadline | Fewer restarts while preserving deterministic failure after budget expiry. | Needs explicit timeout math and probe tuning per service. |
| RetryOnFailedConnect with no deadline | Avoids crash loops during long broker maintenance windows. | Can hide outages behind long-running but disconnected pods. |
| Init wait for NATS endpoint (sketched after this table) | Keeps app code unchanged. Clear startup dependency ordering. | Adds startup latency and extra lifecycle wiring. Still needs outage handling after boot. |
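If you prefer the last row, the wait can live outside the application entirely. The sketch below is a tiny init-style binary that blocks until the NATS client port accepts TCP connections; the address, budget, and intervals are assumptions, not Cordum defaults.
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	// Block until the NATS endpoint accepts TCP connections, or give up after
	// a fixed budget so the init step still fails deterministically.
	addr := "nats.cordum.svc:4222"
	deadline := time.Now().Add(2 * time.Minute)
	for {
		conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err == nil {
			conn.Close()
			fmt.Println("nats endpoint reachable:", addr)
			return
		}
		if time.Now().After(deadline) {
			fmt.Fprintln(os.Stderr, "gave up waiting for", addr, ":", err)
			os.Exit(1)
		}
		time.Sleep(2 * time.Second)
	}
}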
Next step
Decide your startup policy explicitly this week, write it into your runbook, and test it by forcing a NATS outage during a controlled rollout.