Deep Dive

AI Agent Safety Kernel Certificate Rotation

No restart is convenient. No reconnect plan is risky.

Deep Dive · 11 min read · Mar 2026
TL;DR
  • Cert rotation is not just secret replacement. You need explicit reload and reconnect behavior.
  • Cordum Safety Kernel uses a certificate reloader with a 30-second watch loop by default.
  • New TLS handshakes use the latest certificate; existing connections are not force-terminated by reload.
  • Client CA material is loaded at dial time, so CA trust-bundle rotations need reconnection or restart planning.
Hot reload

Safety Kernel wires `tls.Config.GetCertificate` to a live cert reloader.

30s watch cadence

Reloader polls cert/key file modification times on a ticker; default interval is 30 seconds.

Connection boundary

Reload does not drop active channels. Plan reconnect windows for full certificate turnover.

Scope

This guide focuses on Safety Kernel gRPC server certificate rotation and client reconnect boundaries. It assumes TLS is already enforced in production.

The production problem

Cert expiry incidents are boring until they happen at 2 AM. Then they become very educational.

Rotating a Safety Kernel cert is high impact because the scheduler and gateway depend on that channel for pre-dispatch decisions. A broken handshake path can degrade governance availability fast.

The failure mode is usually not cryptography. It is rollout sequencing, reconnect assumptions, and missing rollback criteria.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| gRPC Authentication Guide | TLS/mTLS channel authentication fundamentals and credential API model. | No service-specific certificate reload and reconnect runbook for long-lived control-plane channels. |
| Kubernetes kubelet certificate rotation | Automatic renewal flow and reconnect behavior as cert expiry approaches. | No app-level `GetCertificate` callback strategy for gRPC listeners with custom keypairs. |
| SPIRE use cases | Short-lived, automatically rotated workload certificates for mTLS. | No direct guidance on mixed static CA files plus app-managed cert reload logic. |

The gap is application-level rotation choreography: watcher cadence, handshake cutover timing, and client reconnect policy.

Rotation model

| Phase | Objective | Failure mode |
| --- | --- | --- |
| Pre-rotation validation | Confirm current cert CN/SAN, expiry horizon, and chain trust | Rotate blindly and discover trust mismatch only after production reconnect |
| Write new cert+key atomically | Update disk artifacts in one deploy step | Split writes produce temporary invalid keypair and reload errors |
| Wait for reload window | Allow watcher tick to pick up file modifications (<=30s default) | Immediate validation before watch tick creates false incident signal |
| Force controlled reconnects | Move existing channels onto new server cert | Long-lived channels continue old handshake state longer than expected |
| Rollback readiness | Keep previous keypair available until post-rotation checks pass | No rollback path during trust-chain or hostname validation failures |
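
The pre-rotation validation phase lends itself to a small script. The following is a minimal sketch under stated assumptions, not Cordum tooling: `validateLeaf`, its parameters, and the 14-day horizon are hypothetical names and values chosen for illustration. It uses only the Go standard library to check that the new cert and key match, that the leaf chains to the CA bundle clients trust, that it covers the hostname clients dial, and that it is not about to expire.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"os"
	"time"
)

// validateLeaf is a hypothetical pre-rotation check (not part of Cordum).
// It confirms that cert and key form a valid pair, that the leaf chains to a
// CA in caPEM and covers serverName, and that it stays valid for minDays.
func validateLeaf(certPEM, keyPEM, caPEM []byte, serverName string, minDays int) error {
	pair, err := tls.X509KeyPair(certPEM, keyPEM)
	if err != nil {
		return fmt.Errorf("keypair does not parse or match: %w", err)
	}
	leaf, err := x509.ParseCertificate(pair.Certificate[0])
	if err != nil {
		return fmt.Errorf("parse leaf: %w", err)
	}

	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM(caPEM) {
		return errors.New("CA bundle contains no usable certificates")
	}
	// Any extra certificates in the cert file are treated as intermediates.
	intermediates := x509.NewCertPool()
	for _, der := range pair.Certificate[1:] {
		if c, err := x509.ParseCertificate(der); err == nil {
			intermediates.AddCert(c)
		}
	}

	now := time.Now()
	if _, err := leaf.Verify(x509.VerifyOptions{
		Roots:         roots,
		Intermediates: intermediates,
		DNSName:       serverName, // SAN check against the name clients dial
		CurrentTime:   now,
		KeyUsages:     []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}); err != nil {
		return fmt.Errorf("chain or hostname validation: %w", err)
	}

	horizon := time.Duration(minDays) * 24 * time.Hour
	if remaining := leaf.NotAfter.Sub(now); remaining < horizon {
		return fmt.Errorf("leaf expires in %s, want at least %s", remaining.Round(time.Hour), horizon)
	}
	return nil
}

func main() {
	if len(os.Args) != 5 {
		fmt.Println("usage: precheck <cert.pem> <key.pem> <ca.pem> <server-name>")
		return
	}
	var files [3][]byte
	for i := range files {
		b, err := os.ReadFile(os.Args[i+1])
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		files[i] = b
	}
	// 14 days is an illustrative expiry horizon, not a Cordum default.
	if err := validateLeaf(files[0], files[1], files[2], os.Args[4], 14); err != nil {
		fmt.Fprintln(os.Stderr, "pre-rotation check failed:", err)
		os.Exit(1)
	}
	fmt.Println("pre-rotation check passed")
}
```

Running a check like this against the exact CA bundle the scheduler and gateway load is what surfaces a trust mismatch before production clients reconnect, rather than after.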

Cordum runtime behavior

| Boundary | Current behavior | Operational impact |
| --- | --- | --- |
| Initial load gate | `NewCertReloader` loads cert/key at startup and fails fast on invalid keypair. | Prevents boot with broken TLS material. |
| Reload trigger | `WatchLoop` polls cert/key file mtimes; interval defaults to `30s` when non-positive. | Predictable reload latency bound after secret update. |
| Handshake behavior | `tls.Config.GetCertificate` returns latest in-memory cert from reloader. | New inbound connections pick up rotated cert without process restart. |
| Existing connections | Reloader swaps cert for future handshakes; established channels are not forcibly reset. | You need reconnect strategy to complete migration quickly. |
| Client trust material | Scheduler/gateway TLS CA file is read when creating transport credentials. | CA bundle changes require redial path, restart, or controlled reconnection. |
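
The client trust material boundary is easier to see in code. The sketch below is an assumption about the general dial-time pattern, not Cordum's scheduler or gateway source; `loadClientTLS` and its arguments are hypothetical. It shows why a CA file edited on disk does not reach a client that has already built its credentials.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

// loadClientTLS is a hypothetical helper showing the dial-time pattern:
// the CA file is read once and parsed into an in-memory pool.
func loadClientTLS(caPath, serverName string) (*tls.Config, error) {
	caPEM, err := os.ReadFile(caPath)
	if err != nil {
		return nil, fmt.Errorf("read CA bundle: %w", err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("no certificates parsed from %s", caPath)
	}
	// The pool is a snapshot. Later edits to caPath do not reach this config;
	// the client has to rebuild its credentials and redial to trust a new CA.
	return &tls.Config{
		RootCAs:    pool,
		ServerName: serverName,
		MinVersion: tls.VersionTLS12,
	}, nil
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: clienttls <ca.pem> <server-name>")
		return
	}
	cfg, err := loadClientTLS(os.Args[1], os.Args[2])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// In a gRPC client this config would be wrapped with credentials.NewTLS
	// and passed as a dial option when the channel is created.
	fmt.Printf("client TLS config ready for %s\n", cfg.ServerName)
}
```

Because the pool is fixed at creation time, a common approach to CA rotation is to ship clients a bundle containing both the old and the new CA first, recycle the clients, and only then switch the server leaf to one issued by the new CA.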

Implementation examples

Reloader watch loop (Go)

tlsreload_watch_loop.go
Go
// WatchLoop polls the cert and key file mtimes every interval and reloads the
// keypair when either changes. It blocks until ctx is cancelled.
func (r *CertReloader) WatchLoop(ctx context.Context, interval time.Duration) {
  if interval <= 0 {
    interval = 30 * time.Second // default watch cadence
  }
  lastCertMod, lastKeyMod := r.modTimes()
  ticker := time.NewTicker(interval)
  defer ticker.Stop()

  for {
    select {
    case <-ctx.Done():
      return
    case <-ticker.C:
      certMod, keyMod := r.modTimes()
      if certMod.Equal(lastCertMod) && keyMod.Equal(lastKeyMod) {
        continue // nothing changed on disk
      }
      if err := r.reload(); err != nil {
        // lastCertMod/lastKeyMod stay unchanged, so the next tick retries.
        slog.Error("tls cert reload failed", "label", r.label, "error", err)
        continue
      }
      lastCertMod, lastKeyMod = certMod, keyMod
      slog.Info("tls cert reloaded", "label", r.label)
    }
  }
}
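
The loop above calls `r.modTimes()` and `r.reload()`, which are not shown in this article. The following is a minimal sketch of how such a reloader could be structured, assuming an atomically swapped in-memory certificate. It reuses the names from the excerpt, but the field layout and method bodies are guesses for illustration, not Cordum's source.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"os"
	"sync/atomic"
	"time"
)

// CertReloader here is an illustrative reconstruction, not Cordum's type.
// atomic.Pointer requires Go 1.19 or newer.
type CertReloader struct {
	certPath, keyPath, label string
	current                  atomic.Pointer[tls.Certificate]
}

// NewCertReloader loads the keypair once and fails fast if it is invalid.
func NewCertReloader(certPath, keyPath, label string) (*CertReloader, error) {
	r := &CertReloader{certPath: certPath, keyPath: keyPath, label: label}
	if err := r.reload(); err != nil {
		return nil, err
	}
	return r, nil
}

// reload parses both files and swaps the in-memory cert only on success, so a
// half-written keypair leaves the previous certificate in service.
func (r *CertReloader) reload() error {
	cert, err := tls.LoadX509KeyPair(r.certPath, r.keyPath)
	if err != nil {
		return fmt.Errorf("%s: load keypair: %w", r.label, err)
	}
	r.current.Store(&cert)
	return nil
}

// modTimes returns the files' modification times (zero values on stat errors).
func (r *CertReloader) modTimes() (certMod, keyMod time.Time) {
	if fi, err := os.Stat(r.certPath); err == nil {
		certMod = fi.ModTime()
	}
	if fi, err := os.Stat(r.keyPath); err == nil {
		keyMod = fi.ModTime()
	}
	return certMod, keyMod
}

// GetCertificate matches the tls.Config.GetCertificate callback signature and
// is invoked once per incoming handshake.
func (r *CertReloader) GetCertificate(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	return r.current.Load(), nil
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: reloader <cert.pem> <key.pem>")
		return
	}
	r, err := NewCertReloader(os.Args[1], os.Args[2], "demo")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	certMod, keyMod := r.modTimes()
	fmt.Println("loaded keypair; mtimes:", certMod, keyMod)
}
```

In this sketch the swap happens only after both files parse as a matching pair, which is what makes a split write survivable: the failed reload is logged, the old certificate keeps serving, and the next tick tries again.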

Safety Kernel integration (Go)

safety_kernel_tls_reloader.go
Go
reloader, err := tlsreload.NewCertReloader(certPath, keyPath, "safety-kernel")
if err != nil {
  return fmt.Errorf("safety kernel tls keypair: %w", err)
}
go reloader.WatchLoop(context.Background(), 30*time.Second)

tlsCfg := &tls.Config{
  GetCertificate: reloader.GetCertificate,
  MinVersion:     tls.VersionTLS12,
}
serverCreds := grpc.Creds(credentials.NewTLS(tlsCfg))

Rotation runbook

safety_kernel_cert_rotation.sh
Bash
# 1) Stage new cert and key files
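kubectl -n cordum create secret tls safety-kernel-tls \
  --cert=server-new.crt --key=server-new.key \
  --dry-run=client -o yaml | kubectl apply -f -

# 2) Confirm reload log appears (watch loop is 30s by default; a mounted Secret
#    also takes time to be refreshed on disk by the kubelet before the loop sees it)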
kubectl logs -n cordum deploy/cordum-safety-kernel | grep "tls cert reloaded"

# 3) Force controlled reconnect from clients (example: rollout restart)
kubectl rollout restart deploy/cordum-scheduler -n cordum
kubectl rollout restart deploy/cordum-gateway -n cordum

# 4) Verify handshake uses new cert fingerprint before removing old material
# (use your normal TLS probe / openssl check against SAFETY_KERNEL_ADDR)
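
For step 4, any TLS probe that reports the served leaf certificate will do. As one option, here is a minimal Go sketch; `probeLeaf` and `fingerprint` are hypothetical helpers, and the probe deliberately skips verification because it only inspects what the server presents. If the Safety Kernel requires client certificates, a probe like this would also need to present one.

```go
package main

import (
	"crypto/sha256"
	"crypto/tls"
	"encoding/hex"
	"fmt"
	"net"
	"os"
	"time"
)

// fingerprint returns the hex SHA-256 digest of a DER-encoded certificate.
func fingerprint(der []byte) string {
	sum := sha256.Sum256(der)
	return hex.EncodeToString(sum[:])
}

// probeLeaf opens a fresh TLS connection and reports the leaf it was served.
func probeLeaf(addr string, timeout time.Duration) (fp string, notAfter time.Time, err error) {
	dialer := &net.Dialer{Timeout: timeout}
	conn, err := tls.DialWithDialer(dialer, "tcp", addr, &tls.Config{
		// Inspection only: verification is skipped so the probe can read
		// whatever certificate is served. Never reuse this config for traffic.
		InsecureSkipVerify: true,
		MinVersion:         tls.VersionTLS12,
		NextProtos:         []string{"h2"}, // gRPC runs over HTTP/2
	})
	if err != nil {
		return "", time.Time{}, fmt.Errorf("handshake with %s: %w", addr, err)
	}
	defer conn.Close()

	certs := conn.ConnectionState().PeerCertificates
	if len(certs) == 0 {
		return "", time.Time{}, fmt.Errorf("%s presented no certificate", addr)
	}
	return fingerprint(certs[0].Raw), certs[0].NotAfter, nil
}

func main() {
	addr := os.Getenv("SAFETY_KERNEL_ADDR")
	if len(os.Args) > 1 {
		addr = os.Args[1]
	}
	if addr == "" {
		fmt.Println("usage: probe <host:port> (or set SAFETY_KERNEL_ADDR)")
		return
	}
	fp, notAfter, err := probeLeaf(addr, 5*time.Second)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("sha256=%s not_after=%s\n", fp, notAfter.Format(time.RFC3339))
}
```

The printed digest can be compared with the SHA-256 fingerprint of `server-new.crt`, for example from `openssl x509 -noout -fingerprint -sha256`, which prints the same digest in uppercase with colon separators. Each run opens a new connection, so it reflects what new handshakes receive, not what existing scheduler or gateway channels negotiated.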

Limitations and tradeoffs

  • Polling reload (30s default) is simple and robust, but not instant.
  • Hot reload avoids full server restart, but does not migrate active client channels automatically.
  • Fast cert turnover can pressure operational debugging if fingerprints and expiry are not logged.
  • CA rotation is a separate concern from leaf cert rotation and needs explicit client lifecycle handling.

If you rotate cert files but never recycle long-lived clients, you can pass superficial checks while old channels keep running on stale handshake state.

Next step

Run one full rotation rehearsal this week:

  1. Rotate Safety Kernel cert/key in staging and confirm `tls cert reloaded` log appears.
  2. Measure cutover delay from secret update to first successful new-handshake probe.
  3. Trigger controlled scheduler/gateway reconnect and verify new cert fingerprint in traffic checks.
  4. Execute rollback once to verify old keypair reactivation path and alert noise profile.
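
The cutover measurement in step 2 can be scripted as a polling loop. This is a sketch under assumptions: `waitForFingerprint` is a hypothetical helper, and `probe` stands for any function that returns the fingerprint currently served to new handshakes (for example a TLS dial against `SAFETY_KERNEL_ADDR`). The `main` function uses a fake probe so the example runs on its own.

```go
package main

import (
	"fmt"
	"time"
)

// waitForFingerprint calls probe until it returns want or the timeout passes,
// and reports how long the cutover took. Start it right after the secret update.
func waitForFingerprint(probe func() (string, error), want string, interval, timeout time.Duration) (time.Duration, error) {
	start := time.Now()
	deadline := start.Add(timeout)
	for {
		got, err := probe()
		if err == nil && got == want {
			return time.Since(start), nil
		}
		// Stop if the next probe would land after the deadline.
		if time.Now().Add(interval).After(deadline) {
			return time.Since(start), fmt.Errorf(
				"fingerprint %q not seen within %s (last: %q, err: %v)", want, timeout, got, err)
		}
		time.Sleep(interval)
	}
}

func main() {
	// Fake probe for demonstration: serves "old" twice, then "new".
	calls := 0
	probe := func() (string, error) {
		calls++
		if calls < 3 {
			return "old", nil
		}
		return "new", nil
	}
	elapsed, err := waitForFingerprint(probe, "new", 10*time.Millisecond, time.Second)
	if err != nil {
		fmt.Println("cutover not observed:", err)
		return
	}
	fmt.Printf("cutover observed after %d probes in %s\n", calls, elapsed.Round(time.Millisecond))
}
```

In a rehearsal, a probe interval of a few seconds and a timeout comfortably above the 30-second watch cadence would be reasonable starting values; the measured delay then becomes the baseline for deciding when a missing reload is a real incident rather than an early check.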

Continue with Safety Kernel TLS Hardening and Policy URL SSRF Hardening.

Rotate like you expect an incident

Good rotations are boring. Boring is the goal.