The production problem
The team runs a maintenance restart. NATS comes up more slowly than the scheduler and gateway. Both services crash on boot, then restart in a loop.
Operators are confused: they configured infinite reconnect and expected each process to wait and recover on its own.
The catch is simple. Reconnect settings only apply after a connection has been established. A failed first-boot dial takes a different code path.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS docs: Pausing Between Reconnect Attempts | Reconnect interval tuning and thundering herd control. | Does not map reconnect knobs to application startup behavior when first connect fails. |
| NATS docs: Buffering During Reconnect Attempts | Client-side reconnect buffer semantics and delivery caveats. | Focuses on reconnecting clients, not services that never reached connected state. |
| nats.go package docs | Options such as `MaxReconnects`, `ReconnectWait`, `ReconnectBufSize`, and `RetryOnFailedConnect`. | No control-plane guidance on choosing fail-fast startup versus initial retry in multi-service deployments. |
Cordum runtime behavior
| Boundary | Observed behavior | Operational impact |
|---|---|---|
| Client options in bus init | Cordum sets `nats.MaxReconnects(-1)` and `nats.ReconnectWait(2 * time.Second)`. | Established connections recover indefinitely with a fixed wait cadence. |
| Initial dial path | `NewNatsBus` calls `nats.Connect(url, opts...)` and returns error on failure. | If NATS is down at boot, caller gets immediate startup error. |
| Scheduler startup | `cmd/cordum-scheduler/main.go` exits with `os.Exit(1)` when NATS connect fails. | Kubernetes or systemd restarts process until NATS is reachable. |
| Gateway and workflow-engine startup | Both return startup error when `NewNatsBus` fails. | System availability follows dependency ordering and restart policy quality. |
| RetryOnFailedConnect | Not currently set in Cordum bus options. | Initial failure handling stays fail-fast rather than in-process retry. |
Code-level mechanics
1) Bus options include reconnect settings
```go
opts := []nats.Option{
	nats.Name("cordum-bus"),
	nats.MaxReconnects(-1),
	nats.ReconnectWait(2 * time.Second),
	nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
		slog.Info("bus: disconnected from nats", "err", err)
	}),
	nats.ReconnectHandler(func(nc *nats.Conn) {
		slog.Info("bus: reconnected to nats", "url", nc.ConnectedUrl())
	}),
	nats.ClosedHandler(func(nc *nats.Conn) {
		slog.Info("bus: connection closed")
	}),
}
nc, err := nats.Connect(url, opts...)
if err != nil {
	return nil, fmt.Errorf("connect nats %s: %w", url, err)
}
```
2) Startup exits if initial connect fails
```go
natsBus, err := bus.NewNatsBus(cfg.NatsURL)
if err != nil {
	slog.Error("failed to connect to NATS", "error", err)
	os.Exit(1)
}
```
3) Optional patch: retry on failed initial connect
```go
opts := []nats.Option{
	nats.Name("cordum-bus"),
	nats.MaxReconnects(-1),
	nats.ReconnectWait(2 * time.Second),
	nats.RetryOnFailedConnect(true),
	nats.ReconnectJitter(250*time.Millisecond, 1*time.Second),
}
nc, err := nats.Connect(url, opts...)
if err != nil {
	return nil, fmt.Errorf("connect nats %s: %w", url, err)
}
```
Note that with `RetryOnFailedConnect(true)`, `nats.Connect` returns a connection in reconnecting state instead of an error when the server is unreachable, so the error branch now only catches configuration problems. This patch changes the failure mode, not correctness by itself. If you adopt it, you also need clear liveness and readiness semantics so traffic does not reach a service that is still reconnecting.
Operator runbook
Validate behavior in staging before changing production defaults. Measure restart count and time-to-recover under controlled NATS outages.
```sh
# 1) Confirm current behavior in staging
kubectl -n cordum scale statefulset nats --replicas=0
kubectl -n cordum rollout restart deploy/cordum-scheduler
kubectl -n cordum get pods -w | rg cordum-scheduler
# Expected today: CrashLoopBackOff until NATS returns.

# 2) Restore NATS and verify recovery
kubectl -n cordum scale statefulset nats --replicas=3
kubectl -n cordum rollout status deploy/cordum-scheduler

# 3) If you switch to RetryOnFailedConnect(true), verify the process no longer restarts
kubectl -n cordum logs deploy/cordum-scheduler | rg "disconnected|reconnected|connect nats"

# 4) Track restart SLO
kubectl -n cordum get pods -o json | jq '.items[] | {name: .metadata.name, restarts: .status.containerStatuses[0].restartCount}'
```
Limitations and tradeoffs
| Approach | Benefits | Costs |
|---|---|---|
| Fail-fast + restart policy | Simple mental model. Fast crash signal when dependency is unavailable. | Noisy restart loops during dependency outages. Can hide real regressions in alert noise. |
| RetryOnFailedConnect at startup | Fewer process restarts. Better continuity during short broker outages. | Service may stay up but not ready. Requires strong readiness gating and timeout policy. |
| Init wait for NATS endpoint | Keeps app code unchanged. Clear startup dependency ordering. | Adds startup latency and extra lifecycle wiring. Still needs outage handling after boot. |
Next step
Decide your startup policy explicitly this week, write it into your runbook, and test it by forcing a NATS outage during a controlled rollout.