The production problem
The team runs a maintenance restart. NATS comes up slower than the scheduler and the gateway, so both services crash on boot and then restart in a loop.
Operators are confused because they configured infinite reconnect and expected the process to wait and recover by itself.
The catch is simple: reconnect settings only apply after a connection has been established at least once. A failed first dial at boot takes a different code path.
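To make the boot-time path concrete, here is a minimal sketch (the URL is illustrative, not a Cordum default): even with MaxReconnects(-1) set, a first dial against an unreachable server returns an error directly from nats.Connect, and the reconnect machinery never runs.
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Reconnect options only govern a connection that has been established at
	// least once. If no server is reachable on the first dial, Connect returns
	// an error immediately; the reconnect loop never starts.
	nc, err := nats.Connect("nats://nats:4222",
		nats.MaxReconnects(-1),
		nats.ReconnectWait(2*time.Second),
	)
	if err != nil {
		log.Fatalf("initial connect failed: %v", err) // the first-boot failure path
	}
	defer nc.Close()
}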
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS docs: Set the Number of Reconnect Attempts | Per-server reconnect attempt semantics and server-pool removal behavior. | Does not define what process should do when first dial fails during boot. |
| nats.go package docs (Advanced Usage) | `RetryOnFailedConnect(true)` keeps the connection in a reconnecting state instead of failing immediately. | No startup SLO guidance on when to keep waiting versus when to exit with an error. |
| NATS tutorial: Advanced Connect and Custom Dialer in Go | Context-aware connection loop with cancellation and dial deadlines (sketched after this table). | No operational contract for Kubernetes probes and restart budget during a cold-start outage. |
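For readers who have not seen that tutorial pattern, the sketch below shows the general shape of a context-aware connect loop; the function name, retry interval, and error wrapping are this article's assumptions, not code from the tutorial or from Cordum.
import (
	"context"
	"fmt"
	"time"

	"github.com/nats-io/nats.go"
)

// connectCtx keeps dialing until the first connect succeeds or the caller's
// context is cancelled. Cancellation is the only exit besides success, which
// is exactly why the next sections put an explicit startup budget on top.
func connectCtx(ctx context.Context, url string, opts ...nats.Option) (*nats.Conn, error) {
	for {
		nc, err := nats.Connect(url, opts...)
		if err == nil {
			return nc, nil
		}
		select {
		case <-ctx.Done():
			return nil, fmt.Errorf("connect nats %s: %w (last dial error: %v)", url, ctx.Err(), err)
		case <-time.After(2 * time.Second):
			// wait, then dial again
		}
	}
}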
Cordum runtime behavior
| Boundary | Observed behavior | Operational impact |
|---|---|---|
| Client options in bus init | Cordum sets `nats.MaxReconnects(-1)` and `nats.ReconnectWait(2 * time.Second)`. | Established connections recover indefinitely with a fixed wait cadence. |
| Initial dial path | `NewNatsBus` calls `nats.Connect(url, opts...)` and returns an error on failure. | If NATS is down at boot, the caller gets an immediate startup error. |
| RetryOnFailedConnect semantics | When enabled, `Connect` can return a reconnecting connection instead of immediate dial error. | Process can stay alive while disconnected, so readiness gating becomes mandatory. |
| Scheduler startup | `cmd/cordum-scheduler/main.go` exits with `os.Exit(1)` when NATS connect fails. | Kubernetes or systemd restarts process until NATS is reachable. |
| Gateway and workflow-engine startup | Both return startup error when `NewNatsBus` fails. | System availability follows dependency ordering and restart policy quality. |
| RetryOnFailedConnect | Not currently set in Cordum bus options. | Initial failure handling stays fail-fast rather than in-process retry. |
Startup budget and probes
The missing piece in most reconnect guides is startup policy math. If you choose in-process retry, define a deadline up front and align probes to that number.
const (
connectTimeout = 10 * time.Second
servers = 3
passes = 3
margin = 15 * time.Second
)
// Example: 10s * 3 servers * 3 passes + 15s = 105s
startupBudget := time.Duration(servers*passes)*connectTimeout + margin
deadline := time.Now().Add(startupBudget)
for !nc.IsConnected() {
if time.Now().After(deadline) {
return fmt.Errorf("nats not connected within startup budget (%s)", startupBudget)
}
time.Sleep(250 * time.Millisecond)
}

| Strategy | Startup probe behavior | Readiness behavior | Liveness behavior |
|---|---|---|---|
| Fail-fast on first dial error | Keep startupProbe strict. Process exits quickly if NATS is unavailable. | Can remain simple, because dependency failure is expressed by process exit. | Standard liveness check is fine; restarts come from startup failure. |
| RetryOnFailedConnect with deadline | Set the startupProbe window to your explicit connect budget (probe math sketched after this table). | Return not-ready until `nc.IsConnected()` is true. | Do not fail liveness only because NATS is down, or you will restart-loop again. |
| Retry forever with no deadline | Pods can sit in startup or ready=false for a long time. | Prevents traffic if implemented correctly, but hides outage age. | Risk of zombie pods that never become useful. |
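If you go with the second row, the arithmetic that ties the startup budget to the startupProbe is worth writing down once. The sketch below assumes a 10-second probe period; the numbers are illustrative, not Cordum defaults.
package main

import (
	"fmt"
	"math"
	"time"
)

func main() {
	// Translate the 105s startup budget into startupProbe settings:
	// failureThreshold * periodSeconds must cover the whole budget.
	budget := 105 * time.Second
	periodSeconds := 10.0
	failureThreshold := int(math.Ceil(budget.Seconds() / periodSeconds))
	fmt.Printf("startupProbe: periodSeconds=%.0f failureThreshold=%d (allows %.0fs)\n",
		periodSeconds, failureThreshold, periodSeconds*float64(failureThreshold))
	// Output: startupProbe: periodSeconds=10 failureThreshold=11 (allows 110s)
}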
Code-level mechanics
1) Bus options include reconnect settings
opts := []nats.Option{
nats.Name("cordum-bus"),
nats.MaxReconnects(-1),
nats.ReconnectWait(2 * time.Second),
nats.DisconnectErrHandler(func(nc *nats.Conn, err error) {
slog.Info("bus: disconnected from nats", "err", err)
}),
nats.ReconnectHandler(func(nc *nats.Conn) {
slog.Info("bus: reconnected to nats", "url", nc.ConnectedUrl())
}),
nats.ClosedHandler(func(nc *nats.Conn) {
slog.Info("bus: connection closed")
}),
}
nc, err := nats.Connect(url, opts...)
if err != nil {
return nil, fmt.Errorf("connect nats %s: %w", url, err)
}

2) Startup exits if initial connect fails
natsBus, err := bus.NewNatsBus(cfg.NatsURL)
if err != nil {
slog.Error("failed to connect to NATS", "error", err)
os.Exit(1)
}
3) Optional patch: retry on failed initial connect
opts := []nats.Option{
nats.Name("cordum-bus"),
nats.Timeout(10 * time.Second),
nats.MaxReconnects(-1),
nats.ReconnectWait(2 * time.Second),
nats.RetryOnFailedConnect(true),
nats.ReconnectJitter(250*time.Millisecond, 1*time.Second),
}
nc, err := nats.Connect(url, opts...)
if err != nil {
return nil, fmt.Errorf("connect nats %s: %w", url, err)
}

This patch changes the failure mode; it does not guarantee correctness by itself. If you adopt it, you also need clear liveness and readiness semantics so traffic does not hit a service that is still reconnecting.
4) Gate readiness on connection state
func readinessHandler(nc *nats.Conn) http.HandlerFunc {
return func(w http.ResponseWriter, _ *http.Request) {
if nc == nil || !nc.IsConnected() {
http.Error(w, "nats disconnected", http.StatusServiceUnavailable)
return
}
w.WriteHeader(http.StatusOK)
_, _ = w.Write([]byte("ok"))
}
}

Pair this with the startup budget above. Startup controls how long you wait before failing. Readiness controls whether traffic is allowed while reconnect is still in progress.
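One way to wire both probes is sketched below. It reuses readinessHandler from the block above; the port, paths, and function name are assumptions for illustration, not existing Cordum endpoints. The key point from the strategy table is that the liveness endpoint reports only process health and deliberately ignores NATS state.
import (
	"log/slog"
	"net/http"

	"github.com/nats-io/nats.go"
)

// startHealthServer serves readiness gated on the NATS connection and a
// liveness endpoint that ignores broker state, so an outage cannot turn
// into another restart loop. Illustrative wiring, not Cordum code.
func startHealthServer(nc *nats.Conn) {
	mux := http.NewServeMux()
	mux.HandleFunc("/readyz", readinessHandler(nc)) // not ready while disconnected
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK) // alive as long as the process serves HTTP
	})
	go func() {
		if err := http.ListenAndServe(":8081", mux); err != nil && err != http.ErrServerClosed {
			slog.Error("health server stopped", "error", err)
		}
	}()
}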
Operator runbook
Validate behavior in staging before changing production defaults. Measure restart count and time-to-recover under controlled NATS outages.
# 1) Confirm current behavior in staging
kubectl -n cordum scale statefulset nats --replicas=0
kubectl -n cordum rollout restart deploy/cordum-scheduler
kubectl -n cordum get pods -w | rg cordum-scheduler
# Expected today: CrashLoopBackOff until NATS returns.
# 2) Restore NATS and verify recovery
kubectl -n cordum scale statefulset nats --replicas=3
kubectl -n cordum rollout status deploy/cordum-scheduler
# 3) If you switch to RetryOnFailedConnect(true), enforce a startup budget first
# Example budget: 105s (10s timeout * 3 servers * 3 passes + 15s margin)
# 4) Verify process behavior and probes
kubectl -n cordum logs deploy/cordum-scheduler | rg "disconnected|reconnected|connect nats"
kubectl -n cordum describe pod -l app=cordum-scheduler | rg "Startup|Readiness|Liveness"
# 5) Track restart and recovery SLO
kubectl -n cordum get pods -o json | jq '.items[] | {name: .metadata.name, restarts: .status.containerStatuses[0].restartCount}'

Limitations and tradeoffs
| Approach | Benefits | Costs |
|---|---|---|
| Fail-fast + restart policy | Simple mental model. Fast crash signal when dependency is unavailable. | Noisy restart loops during dependency outages. Can hide real regressions in alert noise. |
| RetryOnFailedConnect + startup deadline | Fewer restarts while preserving deterministic failure after budget expiry. | Needs explicit timeout math and probe tuning per service. |
| RetryOnFailedConnect with no deadline | Avoids crash loops during long broker maintenance windows. | Can hide outages behind long-running but disconnected pods. |
| Init wait for NATS endpoint (sketched after this table) | Keeps app code unchanged. Clear startup dependency ordering. | Adds startup latency and extra lifecycle wiring. Still needs outage handling after boot. |
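If you prefer the last row, the wait can live outside the application entirely. The sketch below is a tiny init-style binary that blocks until the NATS client port accepts TCP connections; the address, budget, and intervals are assumptions, not Cordum defaults.
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	// Block until the NATS endpoint accepts TCP connections, or give up after
	// a fixed budget so the init step still fails deterministically.
	addr := "nats.cordum.svc:4222"
	deadline := time.Now().Add(2 * time.Minute)
	for {
		conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err == nil {
			conn.Close()
			fmt.Println("nats endpoint reachable:", addr)
			return
		}
		if time.Now().After(deadline) {
			fmt.Fprintln(os.Stderr, "gave up waiting for", addr, ":", err)
			os.Exit(1)
		}
		time.Sleep(2 * time.Second)
	}
}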
Next step
Decide your startup policy explicitly this week, write it into your runbook, and test it by forcing a NATS outage during a controlled rollout.