Deep Dive

AI Agent NATS Client Certificate Rotation

Server reload helps the broker. It does not rotate certs already loaded by your running control-plane clients.

11 min read · Mar 2026
TL;DR
- NATS docs explain TLS setup and server config reload, but not the full client-rotation playbook for long-lived Go control planes.
- Cordum currently loads the NATS TLS keypair from disk at startup and does not implement in-process certificate hot reload.
- Safe rotation today is overlap + staged restarts + reconnect observability, not a single cert swap event.
- If you need no-restart client cert rotation, use callback-driven TLS material in `nats.go` and test reconnect behavior before production rollout.
Real outage mode

Expired client certs break reconnect loops first. The broker is often healthy.

Current baseline

Cordum loads cert and key once during NATS client setup.

Safe path

Rotate with overlap windows and staged restart math, not big-bang replacement.

Scope

This guide focuses on NATS client-certificate rotation in Cordum-style control planes. It does not cover full CA lifecycle tooling design.

The production problem

Teams rotate broker certificates and send `SIGHUP` to `nats-server`. Everything looks clean on paper.

Then control-plane pods start reconnecting with stale client certs and auth failures. Nothing says pager fatigue like certificate expiry at 03:00.

The failure mode is simple: server cert lifecycle and client cert lifecycle are separate jobs. Many runbooks treat them as one.

What top results cover and miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| NATS docs: Enabling TLS | Server TLS fields (`cert_file`, `key_file`, `ca_file`, `min_version`, client verify settings). | No end-to-end rollout sequence for rotating mTLS client certs across long-lived control-plane processes. |
| NATS docs: Signals | `--signal reload` (`SIGHUP`) reloads server configuration. | Does not solve client-side cert material already loaded in running applications. |
| nats.go package docs | `ClientCert`, `RootCAs`, and callback-based `ClientTLSConfig` for cert/CA material. | No production migration playbook from file-loaded static certs to callback-based reload behavior. |

Cordum runtime reality

In `core/infra/bus/nats.go`, Cordum builds options at startup, creates TLS config from environment, and then dials NATS.

That is deterministic and easy to reason about. It also means cert/key changes on disk are not automatically applied to already-running processes.

| Area | Current behavior | Operational impact |
| --- | --- | --- |
| Connection setup | Cordum builds NATS options once in `NewNatsBus` and calls `nats.Connect(url, opts...)`. | TLS config shape is decided at process startup. |
| TLS material loading | `natsTLSConfigFromEnv()` reads `NATS_TLS_CA`, `NATS_TLS_CERT`, `NATS_TLS_KEY` and calls `tls.LoadX509KeyPair`. | Certificate/key are loaded from file during the startup path, not by file watchers. |
| Reconnect behavior | Cordum uses `MaxReconnects(-1)` with `ReconnectWait(2 * time.Second)` and reconnect/disconnect callbacks. | Reconnect storms can continue forever under invalid certificate auth until the process is restarted or TLS material changes. |
| Default timing | NATS docs default TLS handshake timeout is 2 seconds; Cordum reconnect wait is 2 seconds. | Each failed attempt can burn roughly 4 seconds before the next cycle under cert mismatch conditions. |
Cordum connection setup

```go
opts := []nats.Option{
	nats.MaxReconnects(-1),
	nats.ReconnectWait(2 * time.Second),
}

if strings.HasPrefix(url, "tls://") {
	tlsConfig, err := natsTLSConfigFromEnv()
	if err != nil {
		return nil, fmt.Errorf("nats tls config: %w", err)
	}
	if tlsConfig != nil {
		opts = append(opts, nats.Secure(tlsConfig))
	}
}

nc, err := nats.Connect(url, opts...)
```
Cordum TLS material loading

```go
func natsTLSConfigFromEnv() (*tls.Config, error) {
	certPath := strings.TrimSpace(os.Getenv("NATS_TLS_CERT"))
	keyPath := strings.TrimSpace(os.Getenv("NATS_TLS_KEY"))

	cfg := &tls.Config{MinVersion: tls.VersionTLS12}

	if certPath != "" || keyPath != "" {
		if certPath == "" || keyPath == "" {
			return nil, fmt.Errorf("nats tls cert/key must be set together")
		}
		cert, err := tls.LoadX509KeyPair(certPath, keyPath)
		if err != nil {
			return nil, fmt.Errorf("nats tls keypair: %w", err)
		}
		cfg.Certificates = []tls.Certificate{cert}
	}

	return cfg, nil
}
```

Rotation rollout that holds up

Use overlap windows for old and new client cert trust. Rotate workload pods in small waves. Validate reconnect/auth logs before each wave.

Keep the math visible. With 2-second reconnect wait and 2-second handshake timeout defaults, your reconnect cycle budget is not infinite.

Simple rollout budget script

```bash
# Rotation budget inputs
RECONNECT_WAIT_SEC=2
TLS_HANDSHAKE_TIMEOUT_SEC=2
PODS=24
MAX_UNAVAILABLE=3

# Worst-case reconnect cycle per pod (simple model)
CYCLE_SEC=$((RECONNECT_WAIT_SEC + TLS_HANDSHAKE_TIMEOUT_SEC))
# Estimated per-wave exposure window
WAVE_WINDOW_SEC=$((CYCLE_SEC * 2))

echo "cycle=${CYCLE_SEC}s wave_window=${WAVE_WINDOW_SEC}s"

# Rollout in small waves
kubectl -n cordum rollout restart deploy/cordum-scheduler
kubectl -n cordum rollout status deploy/cordum-scheduler --timeout=10m

# Validate reconnect and auth errors between waves
kubectl -n cordum logs deploy/cordum-scheduler | rg "disconnected from nats|reconnected to nats|tls|certificate|auth"
```

If your SLA budget is tight, reduce `maxUnavailable` so each wave exposes fewer pods, then recover wall-clock time by running waves in parallel across independent control-plane roles.
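The wave math can be made explicit. This Go sketch applies the same simple model as the budget script (`PODS`, `MAX_UNAVAILABLE`, a 4-second reconnect cycle, two cycles of exposure per wave); the inputs are planning assumptions, not measurements:

```go
package main

import "fmt"

// waves returns how many restart waves a rollout needs and a worst-case
// duration estimate: ceil(pods / maxUnavailable) waves, each budgeted at
// two reconnect cycles of exposure.
func waves(pods, maxUnavailable, cycleSec int) (n, worstSec int) {
	n = (pods + maxUnavailable - 1) / maxUnavailable // ceiling division
	worstSec = n * cycleSec * 2                      // two cycles per wave
	return n, worstSec
}

func main() {
	// Same inputs as the budget script: 24 pods, 3 unavailable, 4s cycle.
	n, worst := waves(24, 3, 4)
	fmt.Printf("waves=%d worst_case=%ds\n", n, worst)
}
```

With 24 pods and `maxUnavailable=3` this gives 8 waves and a 64-second worst-case exposure window, which is the kind of number you want written down before the change window, not discovered during it.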

Optional hot-reload design

`nats.go` exposes `ClientTLSConfig` with certificate and root-CA callbacks. That can support a no-restart model if your TLS material source is reloadable.

This is more code and more failure modes. Only adopt it after proving reconnect semantics in staging under forced certificate swaps.

Callback-driven TLS material pattern

```go
// Optional design for no-restart cert rotation.
// Keep the current static approach unless you can validate this behavior in staging.

import (
	"crypto/tls"
	"crypto/x509"

	"github.com/nats-io/nats.go"
)

type TLSMaterial interface {
	Cert() (tls.Certificate, error)
	RootCAs() (*x509.CertPool, error)
}

func NatsOptionsFromMaterial(m TLSMaterial) []nats.Option {
	return []nats.Option{
		nats.ClientTLSConfig(
			func() (tls.Certificate, error) { return m.Cert() },
			func() (*x509.CertPool, error) { return m.RootCAs() },
		),
	}
}

// Rotation note:
// replace m's backing cert/CA atomically,
// then force a reconnect test to verify the new material is used.
```
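One possible backing for the `TLSMaterial` interface above is a loader that re-reads the files on every callback invocation. A sketch only; `FileTLSMaterial` is a hypothetical name, and the env vars match the ones Cordum already reads:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

// FileTLSMaterial re-reads cert, key, and CA files on each call. Because
// nats.go invokes the ClientTLSConfig callbacks on every (re)connect,
// replacing the files on disk (e.g. a Kubernetes secret remount) plus a
// forced reconnect is enough to pick up new material without a restart.
type FileTLSMaterial struct {
	CertPath, KeyPath, CAPath string
}

func NewFileTLSMaterial(certPath, keyPath, caPath string) *FileTLSMaterial {
	return &FileTLSMaterial{CertPath: certPath, KeyPath: keyPath, CAPath: caPath}
}

func (m *FileTLSMaterial) Cert() (tls.Certificate, error) {
	return tls.LoadX509KeyPair(m.CertPath, m.KeyPath)
}

func (m *FileTLSMaterial) RootCAs() (*x509.CertPool, error) {
	pemBytes, err := os.ReadFile(m.CAPath)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pemBytes) {
		return nil, fmt.Errorf("no CA certificates parsed from %s", m.CAPath)
	}
	return pool, nil
}

func main() {
	m := NewFileTLSMaterial(
		os.Getenv("NATS_TLS_CERT"),
		os.Getenv("NATS_TLS_KEY"),
		os.Getenv("NATS_TLS_CA"),
	)
	if _, err := m.Cert(); err != nil {
		fmt.Println("cert not loadable yet:", err)
	}
}
```

Note the tradeoff: reading disk on every reconnect keeps the code free of watchers and locks, but it means a half-written cert file during rotation fails the handshake, which is exactly why the files must be replaced atomically.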

Limitations and tradeoffs

| Approach | Upside | Downside |
| --- | --- | --- |
| Overlap + staged restarts (current Cordum-friendly) | Low code risk, predictable rollout, easy rollback by redeploying the previous secret version. | Requires restart orchestration and cert overlap window discipline. |
| Callback-based client TLS material | Potentially removes the restart dependency for cert changes. | More implementation complexity and stricter staging validation requirements. |
| Big-bang replacement | Operationally simple on paper. | Highest blast radius. Usually becomes a pager event. |

If you are on static startup loading today, that is fine. Just design the operational process around it instead of assuming runtime magic exists.

Next step

Run one controlled rotation drill in staging this week: new cert issuance, staggered scheduler restarts, reconnect/auth log checks, and old-cert revocation only after full convergence.

Then automate the same sequence in your release pipeline so certificate rotation is a routine change, not an incident template.
