The production problem
Teams rotate broker certificates and send `SIGHUP` to `nats-server`. Everything looks clean on paper.
Then control-plane pods start reconnecting with stale client certs and auth failures. Nothing says pager fatigue like certificate expiry at 03:00.
The failure mode is simple: server cert lifecycle and client cert lifecycle are separate jobs. Many runbooks treat them as one.
What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS docs: Enabling TLS | Server TLS fields (`cert_file`, `key_file`, `ca_file`, `min_version`, client verify settings). | No end-to-end rollout sequence for rotating mTLS client certs across long-lived control-plane processes. |
| NATS docs: Signals | `--signal reload` (`SIGHUP`) reloads server configuration. | Does not solve client-side cert material already loaded in running applications. |
| nats.go package docs | `ClientCert`, `RootCAs`, and callback-based `ClientTLSConfig` for cert/CA material. | No production migration playbook from file-loaded static certs to callback-based reload behavior. |
Cordum runtime reality
In `core/infra/bus/nats.go`, Cordum builds options at startup, creates TLS config from environment, and then dials NATS.
That is deterministic and easy to reason about. It also means cert/key changes on disk are not automatically applied to already-running processes.
| Area | Current behavior | Operational impact |
|---|---|---|
| Connection setup | Cordum builds NATS options once in `NewNatsBus` and calls `nats.Connect(url, opts...)`. | TLS config shape is decided at process startup. |
| TLS material loading | `natsTLSConfigFromEnv()` reads `NATS_TLS_CA`, `NATS_TLS_CERT`, `NATS_TLS_KEY` and calls `tls.LoadX509KeyPair`. | Certificate/key are loaded from file during startup path, not by file watchers. |
| Reconnect behavior | Cordum uses `MaxReconnects(-1)` with `ReconnectWait(2 * time.Second)` and reconnect/disconnect callbacks. | If certificate auth fails, reconnect attempts continue indefinitely; the loop only ends when the process restarts or valid TLS material appears. |
| Default timing | NATS docs default TLS handshake timeout is 2 seconds; Cordum reconnect wait is 2 seconds. | Each failed attempt can burn roughly 4 seconds before the next cycle under cert mismatch conditions. |
```go
opts := []nats.Option{
	nats.MaxReconnects(-1),
	nats.ReconnectWait(2 * time.Second),
}
if strings.HasPrefix(url, "tls://") {
	tlsConfig, err := natsTLSConfigFromEnv()
	if err != nil {
		return nil, fmt.Errorf("nats tls config: %w", err)
	}
	if tlsConfig != nil {
		opts = append(opts, nats.Secure(tlsConfig))
	}
}
nc, err := nats.Connect(url, opts...)
```
```go
func natsTLSConfigFromEnv() (*tls.Config, error) {
	certPath := strings.TrimSpace(os.Getenv("NATS_TLS_CERT"))
	keyPath := strings.TrimSpace(os.Getenv("NATS_TLS_KEY"))
	cfg := &tls.Config{MinVersion: tls.VersionTLS12}
	if certPath != "" || keyPath != "" {
		if certPath == "" || keyPath == "" {
			return nil, fmt.Errorf("nats tls cert/key must be set together")
		}
		cert, err := tls.LoadX509KeyPair(certPath, keyPath)
		if err != nil {
			return nil, fmt.Errorf("nats tls keypair: %w", err)
		}
		cfg.Certificates = []tls.Certificate{cert}
	}
	return cfg, nil
}
```

Rotation rollout that holds up
Use overlap windows for old and new client cert trust. Rotate workload pods in small waves. Validate reconnect/auth logs before each wave.
Keep the math visible. With 2-second reconnect wait and 2-second handshake timeout defaults, your reconnect cycle budget is not infinite.
```bash
# Rotation budget inputs
RECONNECT_WAIT_SEC=2
TLS_HANDSHAKE_TIMEOUT_SEC=2
PODS=24
MAX_UNAVAILABLE=3

# Worst-case reconnect cycle per pod (simple model)
CYCLE_SEC=$((RECONNECT_WAIT_SEC + TLS_HANDSHAKE_TIMEOUT_SEC))

# Estimated per-wave exposure window
WAVE_WINDOW_SEC=$((CYCLE_SEC * 2))
echo "cycle=${CYCLE_SEC}s wave_window=${WAVE_WINDOW_SEC}s"

# Rollout in small waves
kubectl -n cordum rollout restart deploy/cordum-scheduler
kubectl -n cordum rollout status deploy/cordum-scheduler --timeout=10m

# Validate reconnect and auth errors between waves
kubectl -n cordum logs deploy/cordum-scheduler | rg "disconnected from nats|reconnected to nats|tls|certificate|auth"
```

If your SLA budget is tight, reduce `maxUnavailable`, shrink each wave, and finish rotation faster by parallelizing across independent control-plane roles.
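The same budget math works as a Go preflight check if you prefer it in code. This is a sketch of the rough worst-case model from the table above, not a guarantee; the inputs (24 pods, `maxUnavailable` of 3, 2-second timings) are illustrative:

```go
package main

import "fmt"

// rotationBudget mirrors the shell math above: a simple worst-case
// exposure model for a staged restart, under illustrative assumptions.
func rotationBudget(pods, maxUnavailable, reconnectWaitSec, handshakeTimeoutSec int) (waves, totalSec int) {
	cycle := reconnectWaitSec + handshakeTimeoutSec // one failed connect cycle
	waveWindow := cycle * 2                         // per-wave exposure estimate

	// Ceiling division: how many restart waves the rollout needs.
	waves = (pods + maxUnavailable - 1) / maxUnavailable
	return waves, waves * waveWindow
}

func main() {
	waves, total := rotationBudget(24, 3, 2, 2)
	fmt.Printf("waves=%d worst_case_exposure=%ds\n", waves, total) // waves=8 worst_case_exposure=64s
}
```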
Optional hot-reload design
`nats.go` exposes `ClientTLSConfig` with certificate and root-CA callbacks. That can support a no-restart model if your TLS material source is reloadable.
This is more code and more failure modes. Only adopt it after proving reconnect semantics in staging under forced certificate swaps.
```go
// Optional design for no-restart cert rotation.
// Keep current static approach unless you can validate this behavior in staging.
type TLSMaterial interface {
	Cert() (tls.Certificate, error)
	RootCAs() (*x509.CertPool, error)
}

func NatsOptionsFromMaterial(m TLSMaterial) []nats.Option {
	return []nats.Option{
		nats.ClientTLSConfig(
			func() (tls.Certificate, error) { return m.Cert() },
			func() (*x509.CertPool, error) { return m.RootCAs() },
		),
	}
}

// Rotation note:
// replace m's backing cert/CA atomically,
// then force a reconnect test to verify the new material is used.
```

Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Overlap + staged restarts (current Cordum-friendly) | Low code risk, predictable rollout, easy rollback by redeploying previous secret version. | Requires restart orchestration and cert overlap window discipline. |
| Callback-based client TLS material | Potentially reduces restart dependency for cert changes. | More implementation complexity and stricter staging validation requirements. |
| Big-bang replacement | Operationally simple on paper. | Highest blast radius. Usually becomes a pager event. |
If you are on static startup loading today, that is fine. Just design the operational process around it instead of assuming runtime magic exists.
Next step
Run one controlled rotation drill in staging this week: new cert issuance, staggered scheduler restarts, reconnect/auth log checks, and old-cert revocation only after full convergence.
Then automate the same sequence in your release pipeline so certificate rotation is a routine change, not an incident template.