Deep Dive

AI Agent NATS Cold-Start Reconnect

Infinite reconnect settings do not guarantee successful first boot.

10 min read · Mar 2026
TL;DR
- `MaxReconnects(-1)` and `ReconnectWait(2s)` only help after an established connection drops.
- If the first dial fails, `nats.Connect(...)` returns an error unless `RetryOnFailedConnect(true)` is set.
- Cordum services currently exit startup on a NATS connect error, so broker downtime at boot becomes restart-loop behavior.
- Pick one strategy on purpose: fail-fast with supervisor restarts, or in-process retry during the initial connect.
Cold-start gap

Reconnect settings and startup behavior are different code paths. Many teams conflate them and are surprised when a service fails to boot.

Current Cordum path

NATS connection failure during process boot returns an error and startup exits.

Deterministic fix

You can keep fail-fast semantics or switch to retry-on-failed-connect. Both are valid with clear guardrails.

Scope

This guide covers startup and reconnect behavior for Cordum services using the Go NATS client. It does not cover broker-side clustering or account design.

The production problem

A team runs a maintenance restart. NATS comes up more slowly than the scheduler and gateway. The services crash on boot, then restart in a loop.

Operators are confused because they configured infinite reconnect. They expected the process to wait and recover by itself.

The catch is simple: reconnect settings only apply once a connection has existed. A first-boot dial failure takes a different code path.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| NATS docs: Pausing Between Reconnect Attempts | Reconnect interval tuning and thundering-herd control. | Does not map reconnect knobs to application startup behavior when the first connect fails. |
| NATS docs: Buffering During Reconnect Attempts | Client-side reconnect buffer semantics and delivery caveats. | Focuses on reconnecting clients, not services that never reached a connected state. |
| nats.go package docs | Options such as `MaxReconnects`, `ReconnectWait`, `ReconnectBufSize`, and `RetryOnFailedConnect`. | No control-plane guidance on choosing fail-fast startup versus initial retry in multi-service deployments. |

Cordum runtime behavior

| Boundary | Observed behavior | Operational impact |
| --- | --- | --- |
| Client options in bus init | Cordum sets `nats.MaxReconnects(-1)` and `nats.ReconnectWait(2 * time.Second)`. | Established connections recover indefinitely with a fixed wait cadence. |
| Initial dial path | `NewNatsBus` calls `nats.Connect(url, opts...)` and returns an error on failure. | If NATS is down at boot, the caller gets an immediate startup error. |
| Scheduler startup | `cmd/cordum-scheduler/main.go` exits with `os.Exit(1)` when the NATS connect fails. | Kubernetes or systemd restarts the process until NATS is reachable. |
| Gateway and workflow-engine startup | Both return a startup error when `NewNatsBus` fails. | System availability follows dependency ordering and restart-policy quality. |
| `RetryOnFailedConnect` | Not currently set in Cordum bus options. | Initial failure handling stays fail-fast rather than in-process retry. |

Code-level mechanics

1) Bus options include reconnect settings

core/infra/bus/nats.go
```go
opts := []nats.Option{
  nats.Name("cordum-bus"),
  nats.MaxReconnects(-1),
  nats.ReconnectWait(2 * time.Second),
  nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
    slog.Info("bus: disconnected from nats", "err", err)
  }),
  nats.ReconnectHandler(func(nc *nats.Conn) {
    slog.Info("bus: reconnected to nats", "url", nc.ConnectedUrl())
  }),
  nats.ClosedHandler(func(nc *nats.Conn) {
    slog.Info("bus: connection closed")
  }),
}

nc, err := nats.Connect(url, opts...)
if err != nil {
  return nil, fmt.Errorf("connect nats %s: %w", url, err)
}
```

2) Startup exits if initial connect fails

cmd/cordum-scheduler/main.go
```go
natsBus, err := bus.NewNatsBus(cfg.NatsURL)
if err != nil {
  slog.Error("failed to connect to NATS", "error", err)
  os.Exit(1)
}
```

3) Optional patch: retry on failed initial connect

core/infra/bus/nats.go (example)
```go
opts := []nats.Option{
  nats.Name("cordum-bus"),
  nats.MaxReconnects(-1),
  nats.ReconnectWait(2 * time.Second),
  nats.RetryOnFailedConnect(true),
  nats.ReconnectJitter(250*time.Millisecond, 1*time.Second),
}

// With RetryOnFailedConnect(true), Connect returns a usable handle even if
// the broker is down: the connection starts in reconnecting state. An error
// here signals misconfiguration (bad URL or options), not broker downtime.
nc, err := nats.Connect(url, opts...)
if err != nil {
  return nil, fmt.Errorf("connect nats %s: %w", url, err)
}
```

This patch changes the failure mode, not correctness by itself. If you adopt it, you also need clear liveness and readiness semantics so traffic does not reach a service that is still waiting for its first connection.

Operator runbook

Validate behavior in staging before changing production defaults. Measure restart count and time-to-recover under controlled NATS outages.

staging-runbook.sh
```bash
# 1) Confirm current behavior in staging
kubectl -n cordum scale statefulset nats --replicas=0
kubectl -n cordum rollout restart deploy/cordum-scheduler
kubectl -n cordum get pods -w | rg cordum-scheduler

# Expected today: CrashLoopBackOff until NATS returns.

# 2) Restore NATS and verify recovery
kubectl -n cordum scale statefulset nats --replicas=3
kubectl -n cordum rollout status deploy/cordum-scheduler

# 3) If you switch to RetryOnFailedConnect(true), verify the process no longer restarts
kubectl -n cordum logs deploy/cordum-scheduler | rg "disconnected|reconnected|connect nats"

# 4) Track restart SLO
kubectl -n cordum get pods -o json \
  | jq '.items[] | {name: .metadata.name, restarts: .status.containerStatuses[0].restartCount}'
```

Limitations and tradeoffs

| Approach | Benefits | Costs |
| --- | --- | --- |
| Fail-fast + restart policy | Simple mental model. Fast crash signal when a dependency is unavailable. | Noisy restart loops during dependency outages. Can hide real regressions in alert noise. |
| `RetryOnFailedConnect` at startup | Fewer process restarts. Better continuity during short broker outages. | Service may stay up but not ready. Requires strong readiness gating and a timeout policy. |
| Init wait for NATS endpoint | Keeps app code unchanged. Clear startup dependency ordering. | Adds startup latency and extra lifecycle wiring. Still needs outage handling after boot. |

Next step

Decide your startup policy explicitly this week, write it into your runbook, and test it by forcing a NATS outage during a controlled rollout.
