Name: Cordum
Author: Cordum

The production problem

Cert expiry incidents are boring until they happen at 2 AM. Then they become very educational.

Rotating a Safety Kernel cert is high-impact because the scheduler and gateway depend on that channel for pre-dispatch decisions. A broken handshake path can quickly degrade governance availability.

The failure mode is usually not cryptography. It is rollout sequencing, reconnect assumptions, and missing rollback criteria.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| gRPC Authentication Guide | TLS/mTLS channel authentication fundamentals and credential API model. | No service-specific certificate reload and reconnect runbook for long-lived control-plane channels. |
| Kubernetes kubelet certificate rotation | Automatic renewal flow and reconnect behavior as cert expiry approaches. | No app-level `GetCertificate` callback strategy for gRPC listeners with custom keypairs. |
| SPIRE use cases | Short-lived, automatically rotated workload certificates for mTLS. | No direct guidance on mixed static CA files plus app-managed cert reload logic. |

The gap is application-level rotation choreography: watcher cadence, handshake cutover timing, and client reconnect policy.

Rotation model

| Phase | Objective | Failure mode |
| --- | --- | --- |
| Pre-rotation validation | Confirm current cert CN/SAN, expiry horizon, and chain trust | Rotating blindly and discovering a trust mismatch only after production reconnect |
| Write new cert+key atomically | Update disk artifacts in one deploy step | Split writes produce a temporarily invalid keypair and reload errors |
| Wait for reload window | Allow the watcher tick to pick up file modifications (within 30s by default) | Validating before the next watch tick creates a false incident signal |
| Force controlled reconnects | Move existing channels onto the new server cert | Long-lived channels keep their old handshake state longer than expected |
| Rollback readiness | Keep the previous keypair available until post-rotation checks pass | No rollback path during trust-chain or hostname validation failures |

Cordum runtime behavior

| Boundary | Current behavior | Operational impact |
| --- | --- | --- |
| Initial load gate | `NewCertReloader` loads cert/key at startup and fails fast on invalid keypair. | Prevents boot with broken TLS material. |
| Reload trigger | `WatchLoop` polls cert/key file mtimes; interval defaults to `30s` when non-positive. | Reload latency is bounded by the poll interval once the new files are on disk. |
| Handshake behavior | `tls.Config.GetCertificate` returns latest in-memory cert from reloader. | New inbound connections pick up rotated cert without process restart. |
| Existing connections | Reloader swaps cert for future handshakes; established channels are not forcibly reset. | You need a reconnect strategy to complete migration quickly. |
| Client trust material | Scheduler/gateway TLS CA file is read when creating transport credentials. | CA bundle changes require a redial path, a restart, or a controlled reconnection. |

Implementation examples

Reloader watch loop (Go)

tlsreload_watch_loop.go

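// WatchLoop polls the cert and key file mtimes and reloads the keypair when
// either one changes. It blocks until ctx is cancelled, so run it in its own goroutine.
func (r *CertReloader) WatchLoop(ctx context.Context, interval time.Duration) {
  // A non-positive interval falls back to the 30s default.
  if interval <= 0 {
    interval = 30 * time.Second
  }
  lastCertMod, lastKeyMod := r.modTimes()
  ticker := time.NewTicker(interval)
  defer ticker.Stop()

  for {
    select {
    case <-ctx.Done():
      return
    case <-ticker.C:
      certMod, keyMod := r.modTimes()
      if certMod.Equal(lastCertMod) && keyMod.Equal(lastKeyMod) {
        continue // nothing changed on disk
      }
      if err := r.reload(); err != nil {
        // The stored mtimes are left untouched, so the reload is retried on the next tick.
        slog.Error("tls cert reload failed", "label", r.label, "error", err)
        continue
      }
      lastCertMod, lastKeyMod = certMod, keyMod
      slog.Info("tls cert reloaded", "label", r.label)
    }
  }
}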

Safety Kernel integration (Go)

safety_kernel_tls_reloader.go

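// Load the keypair once at startup; an invalid pair aborts boot.
reloader, err := tlsreload.NewCertReloader(certPath, keyPath, "safety-kernel")
if err != nil {
  return fmt.Errorf("safety kernel tls keypair: %w", err)
}
// context.Background() is never cancelled, so the watcher runs for the life of the process.
go reloader.WatchLoop(context.Background(), 30*time.Second)

tlsCfg := &tls.Config{
  // Consulted on each handshake, so new connections get the latest reloaded cert.
  GetCertificate: reloader.GetCertificate,
  MinVersion:     tls.VersionTLS12,
}
serverCreds := grpc.Creds(credentials.NewTLS(tlsCfg))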

Rotation runbook

safety_kernel_cert_rotation.sh

# 1) Stage new cert and key files
kubectl -n cordum create secret tls safety-kernel-tls --cert=server-new.crt --key=server-new.key \
  --dry-run=client -o yaml | kubectl apply -f -

# 2) Confirm reload log appears (watch loop is 30s by default; the kubelet also
#    needs time to sync the updated Secret into the mounted volume)
kubectl logs -n cordum deploy/cordum-safety-kernel --since=5m | grep "tls cert reloaded"

# 3) Force controlled reconnect from clients (example: rollout restart)
kubectl rollout restart deploy/cordum-scheduler -n cordum
kubectl rollout restart deploy/cordum-gateway -n cordum

# 4) Verify handshake uses new cert fingerprint before removing old material
# (use your normal TLS probe / openssl check against SAFETY_KERNEL_ADDR)

Limitations and tradeoffs

- Polling reload (30s default) is simple and robust, but not instant.
- Hot reload avoids full server restart, but does not migrate active client channels automatically.
- Frequent cert turnover complicates operational debugging if fingerprints and expiry dates are not logged.
- CA rotation is a separate concern from leaf cert rotation and needs explicit client lifecycle handling.

If you rotate cert files but never recycle long-lived clients, you can pass superficial checks while old channels keep running on stale handshake state.

Next step

Run one full rotation rehearsal this week:

1. Rotate Safety Kernel cert/key in staging and confirm `tls cert reloaded` log appears.
2. Measure cutover delay from secret update to first successful new-handshake probe.
3. Trigger controlled scheduler/gateway reconnect and verify new cert fingerprint in traffic checks.
4. Execute rollback once to verify old keypair reactivation path and alert noise profile.

Continue with Safety Kernel TLS Hardening and Policy URL SSRF Hardening.

AI Agent Safety Kernel Certificate Rotation