AI Agent NATS Client Certificate Rotation

Server reload helps the broker. It does not rotate certs already loaded by your running control-plane clients.

Deep Dive · 11 min read · Apr 2026
TL;DR
  • NATS docs explain TLS setup and server reload, but not the full client-rotation playbook for long-lived Go control planes.
  • NATS server `ca_file` accepts CA bundles, so you can run an old+new trust overlap window during migration.
  • Cordum currently loads NATS TLS keypair from disk at startup and does not implement in-process certificate hot reload.
  • Safe rotation order is: broaden server trust, restart clients in waves, then remove old trust.
  • If you need no-restart client cert rotation, use callback-driven TLS material in `nats.go` and test reconnect behavior before production rollout.
Real outage mode

Expired client certs break reconnect loops first. The broker is often healthy.

Current baseline

Cordum loads cert and key once during NATS client setup.

Safe path

Rotate with overlap windows and staged restart math, not big-bang replacement.

Scope

This guide focuses on NATS client-certificate rotation in Cordum-style control planes. It does not cover full CA lifecycle tooling design.

The production problem

Teams rotate broker certificates and send `SIGHUP` to `nats-server`. Everything looks clean on paper.

Then control-plane pods start reconnecting with stale client certs and auth failures. Nothing says pager fatigue like certificate expiry at 03:00.

The failure mode is simple: server cert lifecycle and client cert lifecycle are separate jobs. Many runbooks treat them as one.

What top results cover and miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| NATS docs: Enabling TLS | Server TLS fields (`cert_file`, `key_file`, `ca_file`, `min_version`, client verify settings), including bundled CA files. | No strict rollout order for dual-CA trust windows with long-lived control-plane clients. |
| NATS docs: Signals | `--signal reload` (`SIGHUP`) reloads server configuration. | Does not solve client-side cert material already loaded in running applications. |
| nats.go package docs | `ClientCert`, `RootCAs`, and callback-based `ClientTLSConfig` for cert/CA material. | No production migration playbook from file-loaded static certs to callback-based reload behavior. |

Cordum runtime reality

In `core/infra/bus/nats.go`, Cordum builds options at startup, creates TLS config from environment, and then dials NATS.

That is deterministic and easy to reason about. It also means cert/key changes on disk are not automatically applied to already-running processes.

| Area | Current behavior | Operational impact |
| --- | --- | --- |
| Connection setup | Cordum builds NATS options once in `NewNatsBus` and calls `nats.Connect(url, opts...)`. | TLS config shape is decided at process startup. |
| TLS material loading | `natsTLSConfigFromEnv()` reads `NATS_TLS_CA`, `NATS_TLS_CERT`, `NATS_TLS_KEY` and calls `tls.LoadX509KeyPair`. | Certificate and key are loaded from file on the startup path, not by file watchers. |
| Trust anchor updates | NATS server `ca_file` can include multiple CAs, and `--signal reload` refreshes server config. | Use this to run a temporary old+new trust overlap while clients rotate. |
| Reconnect behavior | Cordum uses `MaxReconnects(-1)` with `ReconnectWait(2 * time.Second)` and reconnect/disconnect callbacks. | With an invalid certificate, reconnect storms continue indefinitely until the process is restarted or its TLS material changes. |
| Default timing | NATS docs list a default TLS handshake timeout of 2 seconds; Cordum's reconnect wait is 2 seconds. | A failed attempt that runs to the handshake timeout can burn roughly 4 seconds (2 s timeout plus 2 s wait) before the next cycle. |
Cordum connection setup
go
opts := []nats.Option{
  nats.MaxReconnects(-1),
  nats.ReconnectWait(2 * time.Second),
}

if strings.HasPrefix(url, "tls://") {
  tlsConfig, err := natsTLSConfigFromEnv()
  if err != nil {
    return nil, fmt.Errorf("nats tls config: %w", err)
  }
  if tlsConfig != nil {
    opts = append(opts, nats.Secure(tlsConfig))
  }
}

nc, err := nats.Connect(url, opts...)
Cordum TLS material loading
go
func natsTLSConfigFromEnv() (*tls.Config, error) {
  certPath := strings.TrimSpace(os.Getenv("NATS_TLS_CERT"))
  keyPath := strings.TrimSpace(os.Getenv("NATS_TLS_KEY"))

  cfg := &tls.Config{MinVersion: tls.VersionTLS12}

  if certPath != "" || keyPath != "" {
    if certPath == "" || keyPath == "" {
      return nil, fmt.Errorf("nats tls cert/key must be set together")
    }
    cert, err := tls.LoadX509KeyPair(certPath, keyPath)
    if err != nil {
      return nil, fmt.Errorf("nats tls keypair: %w", err)
    }
    cfg.Certificates = []tls.Certificate{cert}
  }

  return cfg, nil
}

Rotation rollout that holds up

Use overlap windows for old and new client cert trust. Rotate workload pods in small waves. Validate reconnect/auth logs before each wave.

The order is the key detail many guides skip: broaden trust first, rotate clients second, remove old trust last.

Dual-CA trust bundle pattern
yaml
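# Build the dual-CA bundle for the temporary overlap first (shell command):
#   cat old-client-ca.pem new-client-ca.pem > client-ca-bundle.pem

# nats-server config: trust both client CAs and require client certificates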
tls: {
  cert_file: "./server-cert.pem"
  key_file: "./server-key.pem"
  ca_file: "./client-ca-bundle.pem"
  verify: true
}
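
Before reloading the server, it is worth confirming that both old and new client certificates actually validate against the bundle. The sketch below does that with the Go standard library. It generates throwaway CAs and client certificates so it can run on its own; `verifyClientCert`, `demoCert`, and the certificate names are hypothetical helpers, and the check approximates rather than reproduces what `nats-server` does with `verify: true`.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// verifyClientCert checks that leaf chains to a CA in bundlePEM and is valid
// for TLS client authentication.
func verifyClientCert(bundlePEM []byte, leaf *x509.Certificate) error {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(bundlePEM) {
		return errors.New("bundle contains no usable CA certificates")
	}
	_, err := leaf.Verify(x509.VerifyOptions{
		Roots:     pool,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	})
	return err
}

// demoCert is a throwaway certificate used only to exercise verifyClientCert.
type demoCert struct {
	cert     *x509.Certificate
	key      *ecdsa.PrivateKey
	pemBytes []byte
}

// newCert creates a certificate from tmpl; a nil parent means self-signed.
func newCert(tmpl *x509.Certificate, parent *demoCert) (*demoCert, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl.SerialNumber = big.NewInt(time.Now().UnixNano())
	tmpl.NotBefore = time.Now().Add(-time.Hour)
	tmpl.NotAfter = time.Now().Add(24 * time.Hour)
	signerCert, signerKey := tmpl, key
	if parent != nil {
		signerCert, signerKey = parent.cert, parent.key
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, signerCert, &key.PublicKey, signerKey)
	if err != nil {
		return nil, err
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		return nil, err
	}
	pemBytes := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	return &demoCert{cert: cert, key: key, pemBytes: pemBytes}, nil
}

func newCA(name string) (*demoCert, error) {
	return newCert(&x509.Certificate{
		Subject:               pkix.Name{CommonName: name},
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}, nil)
}

func newClient(name string, ca *demoCert) (*demoCert, error) {
	return newCert(&x509.Certificate{
		Subject:     pkix.Name{CommonName: name},
		KeyUsage:    x509.KeyUsageDigitalSignature,
		ExtKeyUsage: []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}, ca)
}

func main() {
	must := func(c *demoCert, err error) *demoCert {
		if err != nil {
			panic(err)
		}
		return c
	}
	caOld, caNew := must(newCA("old-client-ca")), must(newCA("new-client-ca"))
	// Concatenating PEM blocks mirrors what `cat old new > bundle` produces.
	bundle := append(append([]byte{}, caOld.pemBytes...), caNew.pemBytes...)
	clients := []*demoCert{must(newClient("scheduler-old", caOld)), must(newClient("scheduler-new", caNew))}
	for _, c := range clients {
		fmt.Printf("%s against bundle: %v\n", c.cert.Subject.CommonName, verifyClientCert(bundle, c.cert))
	}
}
```

Running the same check against the new-CA-only file before step 5 of the rollout tells you which clients would break when old trust is removed.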

Keep rollout windows bounded. If auth errors rise between waves, stop and fix trust material before continuing.

Simple rollout budget script
bash
# 1) Broaden server trust first (old + new client CA), then reload server
cat old-client-ca.pem new-client-ca.pem > /etc/nats/client-ca-bundle.pem
nats-server --signal reload=/var/run/nats/nats-server.pid

# 2) Publish new client cert secret for Cordum workloads
kubectl -n cordum apply -f scheduler-nats-mtls-secret.yaml

# 3) Restart workloads in small waves
kubectl -n cordum rollout restart deploy/cordum-scheduler
kubectl -n cordum rollout status deploy/cordum-scheduler --timeout=10m

# 4) Validate reconnect/auth logs after each wave
#    (kubectl logs deploy/... reads a single pod; repeat per pod for full coverage)
kubectl -n cordum logs deploy/cordum-scheduler | rg "disconnected from nats|reconnected to nats|tls|certificate|auth"

# 5) After full convergence, remove old CA and reload server again
cat new-client-ca.pem > /etc/nats/client-ca-bundle.pem
nats-server --signal reload=/var/run/nats/nats-server.pid

If your SLA budget is tight, reduce `maxUnavailable` and shrink each wave to limit blast radius, then win back the lost time by rotating independent control-plane roles in parallel.

Optional hot-reload design

`nats.go` exposes `ClientTLSConfig` with certificate and root-CA callbacks. That can support a no-restart model if your TLS material source is reloadable.

This is more code and more failure modes. Only adopt it after proving reconnect semantics in staging under forced certificate swaps.

Callback-driven TLS material pattern
go
// Optional design for no-restart cert rotation.
// Keep current static approach unless you can validate this behavior in staging.

type TLSMaterial interface {
  Cert() (tls.Certificate, error)
  RootCAs() (*x509.CertPool, error)
}

func NatsOptionsFromMaterial(m TLSMaterial) []nats.Option {
  return []nats.Option{
    nats.ClientTLSConfig(
      func() (tls.Certificate, error) { return m.Cert() },
      func() (*x509.CertPool, error) { return m.RootCAs() },
    ),
  }
}

// Rotation note:
// replace m's backing cert/CA atomically,
// then force a reconnect test to verify the new material is used.

Limitations and tradeoffs

| Approach | Upside | Downside |
| --- | --- | --- |
| Dual-CA overlap + staged restarts (current Cordum-friendly) | Low code risk, predictable rollout, easy rollback by redeploying previous secret version. | Requires restart orchestration and cert overlap window discipline. |
| Callback-based client TLS material | Potentially reduces restart dependency for cert changes. | More implementation complexity and stricter staging validation requirements. |
| Big-bang replacement | Operationally simple on paper. | Highest blast radius. Usually becomes a pager event. |

If you are on static startup loading today, that is fine. Just design the operational process around it instead of assuming runtime magic exists.

Next step

Run one controlled rotation drill in staging this week: new cert issuance, staggered scheduler restarts, reconnect/auth log checks, and old-cert revocation only after full convergence.

Then automate the same sequence in your release pipeline so certificate rotation is a routine change, not an incident template.
