Deep Dive

AI Agent NATS Cold-Start Reconnect

Infinite reconnect settings do not guarantee successful first boot.

Deep Dive · 10 min read · Apr 2026
TL;DR
  • `MaxReconnects(-1)` and `ReconnectWait(2s)` only help after an established connection drops.
  • If the first dial fails, `nats.Connect(...)` returns an error unless `RetryOnFailedConnect(true)` is set.
  • Cordum services currently exit startup on a NATS connect error, so broker downtime at boot becomes restart-loop behavior.
  • `RetryOnFailedConnect(true)` without a startup deadline can keep the process alive but disconnected for too long.
  • Pick one strategy on purpose: fail-fast with supervisor restarts, or in-process retry during initial connect.
Cold-start gap

Reconnect settings and startup behavior are different code paths. Many teams conflate them and get surprised when a cold boot fails.

Current Cordum path

A NATS connection failure during process boot returns an error, and startup exits.

Deterministic fix

Keep fail-fast semantics or switch to retry-on-failed-connect, but in either case add a startup budget and probe rules.

Scope

This guide covers startup and reconnect behavior for Cordum services using the Go NATS client. It does not cover broker-side clustering or account design.

The production problem

The team runs a maintenance restart. NATS comes up more slowly than the scheduler and the gateway. Services crash on boot, then restart in a loop.

Operators are confused because they configured infinite reconnect. They expected the process to wait and recover by itself.

The catch is simple: reconnect settings only apply after a connection has been established. A first-boot dial failure takes a different code path.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| NATS docs: Set the Number of Reconnect Attempts | Per-server reconnect attempt semantics and server-pool removal behavior. | Does not define what the process should do when the first dial fails during boot. |
| nats.go package docs (Advanced Usage) | `RetryOnFailedConnect(true)` keeps the connection in a reconnecting state instead of failing immediately. | No startup SLO guidance for when to keep waiting versus exit with an error. |
| NATS tutorial: Advanced Connect and Custom Dialer in Go | Context-aware connection loop with cancellation and dial deadlines. | No operational contract for Kubernetes probes and restart budget during a cold-start outage. |

Cordum runtime behavior

| Boundary | Observed behavior | Operational impact |
| --- | --- | --- |
| Client options in bus init | Cordum sets `nats.MaxReconnects(-1)` and `nats.ReconnectWait(2 * time.Second)`. | Established connections recover indefinitely with a fixed wait cadence. |
| Initial dial path | `NewNatsBus` calls `nats.Connect(url, opts...)` and returns an error on failure. | If NATS is down at boot, the caller gets an immediate startup error. |
| RetryOnFailedConnect semantics | When enabled, `Connect` can return a reconnecting connection instead of an immediate dial error. | Process can stay alive while disconnected, so readiness gating becomes mandatory. |
| Scheduler startup | `cmd/cordum-scheduler/main.go` exits with `os.Exit(1)` when the NATS connect fails. | Kubernetes or systemd restarts the process until NATS is reachable. |
| Gateway and workflow-engine startup | Both return a startup error when `NewNatsBus` fails. | System availability follows dependency ordering and restart-policy quality. |
| RetryOnFailedConnect | Not currently set in Cordum bus options. | Initial failure handling stays fail-fast rather than in-process retry. |

Startup budget and probes

The missing piece in most reconnect guides is startup policy math. If you choose in-process retry, define a deadline up front and align probes to that number.

startup-budget.go
go
const (
  connectTimeout = 10 * time.Second // nats.Timeout per dial attempt
  servers        = 3                // servers in the connection pool
  passes         = 3                // full passes over the pool to tolerate
  margin         = 15 * time.Second // slack for DNS, TLS, and scheduling
)

// Example: 10s * 3 servers * 3 passes + 15s = 105s
startupBudget := time.Duration(servers*passes)*connectTimeout + margin
deadline := time.Now().Add(startupBudget)

// nc is the *nats.Conn returned by Connect with RetryOnFailedConnect(true).
for !nc.IsConnected() {
  if time.Now().After(deadline) {
    return fmt.Errorf("nats not connected within startup budget (%s)", startupBudget)
  }
  time.Sleep(250 * time.Millisecond)
}

| Strategy | Startup probe behavior | Readiness behavior | Liveness behavior |
| --- | --- | --- | --- |
| Fail-fast on first dial error | Keep startupProbe strict. Process exits quickly if NATS is unavailable. | Can remain simple, because dependency failure is expressed by process exit. | Standard liveness check is fine; restarts come from startup failure. |
| RetryOnFailedConnect with deadline | Set the startupProbe window to your explicit connect budget. | Return not-ready until `nc.IsConnected()` is true. | Do not fail liveness only because NATS is down, or you will restart-loop again. |
| Retry forever with no deadline | Pods can sit in startup or ready=false for a long time. | Prevents traffic if implemented correctly, but hides outage age. | Risk of zombie pods that never become useful. |

Code-level mechanics

1) Bus options include reconnect settings

core/infra/bus/nats.go
go
opts := []nats.Option{
  nats.Name("cordum-bus"),
  nats.MaxReconnects(-1),
  nats.ReconnectWait(2 * time.Second),
  nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
    slog.Info("bus: disconnected from nats", "err", err)
  }),
  nats.ReconnectHandler(func(nc *nats.Conn) {
    slog.Info("bus: reconnected to nats", "url", nc.ConnectedUrl())
  }),
  nats.ClosedHandler(func(nc *nats.Conn) {
    slog.Info("bus: connection closed")
  }),
}

nc, err := nats.Connect(url, opts...)
if err != nil {
  return nil, fmt.Errorf("connect nats %s: %w", url, err)
}

2) Startup exits if initial connect fails

cmd/cordum-scheduler/main.go
go
natsBus, err := bus.NewNatsBus(cfg.NatsURL)
if err != nil {
  slog.Error("failed to connect to NATS", "error", err)
  os.Exit(1)
}

3) Optional patch: retry on failed initial connect

core/infra/bus/nats.go (example)
go
opts := []nats.Option{
  nats.Name("cordum-bus"),
  nats.Timeout(10 * time.Second),
  nats.MaxReconnects(-1),
  nats.ReconnectWait(2 * time.Second),
  nats.RetryOnFailedConnect(true),
  nats.ReconnectJitter(250*time.Millisecond, 1*time.Second),
}

nc, err := nats.Connect(url, opts...)
if err != nil {
  return nil, fmt.Errorf("connect nats %s: %w", url, err)
}

This patch changes the failure mode, not correctness by itself. If you adopt it, you also need clear liveness and readiness semantics so traffic does not reach a service that is still reconnecting.

4) Gate readiness on connection state

readiness.go
go
func readinessHandler(nc *nats.Conn) http.HandlerFunc {
  return func(w http.ResponseWriter, _ *http.Request) {
    if nc == nil || !nc.IsConnected() {
      http.Error(w, "nats disconnected", http.StatusServiceUnavailable)
      return
    }
    w.WriteHeader(http.StatusOK)
    _, _ = w.Write([]byte("ok"))
  }
}

Pair this with the startup budget above. Startup controls how long you wait before failing. Readiness controls whether traffic is allowed while reconnect is still in progress.

Operator runbook

Validate behavior in staging before changing production defaults. Measure restart count and time-to-recover under controlled NATS outages.

staging-runbook.sh
bash
# 1) Confirm current behavior in staging
kubectl -n cordum scale statefulset nats --replicas=0
kubectl -n cordum rollout restart deploy/cordum-scheduler
kubectl -n cordum get pods -w | rg cordum-scheduler

# Expected today: CrashLoopBackOff until NATS returns.

# 2) Restore NATS and verify recovery
kubectl -n cordum scale statefulset nats --replicas=3
kubectl -n cordum rollout status deploy/cordum-scheduler

# 3) If you switch to RetryOnFailedConnect(true), enforce a startup budget first
# Example budget: 105s (10s timeout * 3 servers * 3 passes + 15s margin)

# 4) Verify process behavior and probes
kubectl -n cordum logs deploy/cordum-scheduler | rg "disconnected|reconnected|connect nats"
kubectl -n cordum describe pod -l app=cordum-scheduler | rg "Startup|Readiness|Liveness"

# 5) Track restart and recovery SLO
kubectl -n cordum get pods -o json | jq '.items[] | {name: .metadata.name, restarts: .status.containerStatuses[0].restartCount}'

Limitations and tradeoffs

| Approach | Benefits | Costs |
| --- | --- | --- |
| Fail-fast + restart policy | Simple mental model. Fast crash signal when a dependency is unavailable. | Noisy restart loops during dependency outages. Can hide real regressions in alert noise. |
| RetryOnFailedConnect + startup deadline | Fewer restarts while preserving deterministic failure after budget expiry. | Needs explicit timeout math and probe tuning per service. |
| RetryOnFailedConnect with no deadline | Avoids crash loops during long broker maintenance windows. | Can hide outages behind long-running but disconnected pods. |
| Init wait for NATS endpoint | Keeps app code unchanged. Clear startup dependency ordering. | Adds startup latency and extra lifecycle wiring. Still needs outage handling after boot. |

Next step

Decide your startup policy explicitly this week, write it into your runbook, and test it by forcing a NATS outage during a controlled rollout.
