The production problem
Safety checks sit on the hot path before dispatch. If that path can downgrade to plaintext transport, you have a control-plane blind spot.
An attacker who can intercept or reroute traffic between scheduler and Safety Kernel does not need a fancy model jailbreak. They need a network foothold and a weak transport policy.
This is why transport behavior has to be explicit at both ends: server listener posture and client credential selection.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| gRPC Authentication Guide | TLS and mTLS foundations for gRPC channels and auth primitives. | No environment-driven downgrade matrix for mixed-mode production control planes. |
| Istio PeerAuthentication Reference | STRICT vs PERMISSIVE mTLS policy posture in service meshes. | No app-level fallback behavior for direct gRPC clients outside mesh policy enforcement. |
| SPIRE Use Cases | Runtime workload identity with short-lived, automatically rotated mTLS credentials. | No direct mapping to Safety Kernel request paths and scheduler fail-mode interactions. |
The operational gap is downgrade handling: what exactly happens when certs are missing, CA files are absent, or insecure flags appear in production configs.
Downgrade risk model
| State | Risk | Guardrail |
|---|---|---|
| Server has no cert/key in production | Kernel starts plaintext or misconfigured | Fail startup when `SAFETY_KERNEL_TLS_CERT` is not set in production |
| Client has no CA and TLS required | Silent insecure dial to kernel | Return explicit error: `safety_kernel_tls_ca required` |
| Operator sets insecure flag in production | Bypass intended transport guarantees | Production mode still requires a CA path and TLS; `SAFETY_KERNEL_INSECURE` is honored only outside production |
| Outdated TLS protocol floor | Weaker transport properties | Use default production TLS 1.3 floor or enforce `CORDUM_TLS_MIN_VERSION=1.3` |
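The client-side rows of this table reduce to a small pure function, which makes the downgrade rules easy to unit test. The following is a sketch only: `clientInputs`, `decideClientTransport`, and the outcome names are hypothetical and not part of Cordum's API. The insecure flag is deliberately not an input, because in the client logic described here it cannot relax a production or required-TLS decision.

```go
package main

import "errors"

// transportOutcome is a hypothetical label for what the client ends up doing.
type transportOutcome string

const (
	outcomeTLS      transportOutcome = "tls"
	outcomeInsecure transportOutcome = "insecure"
)

// errCARequired mirrors the error text named in the risk table.
var errCARequired = errors.New("safety_kernel_tls_ca required")

// clientInputs captures the settings that drive the decision (illustrative names).
type clientInputs struct {
	Production  bool   // deployment runs in production mode
	TLSRequired bool   // SAFETY_KERNEL_TLS_REQUIRED=true
	CAPath      string // SAFETY_KERNEL_TLS_CA
}

// decideClientTransport returns the transport the client should use,
// or an error when TLS is mandatory but no CA is configured.
func decideClientTransport(in clientInputs) (transportOutcome, error) {
	if in.CAPath != "" {
		return outcomeTLS, nil
	}
	if in.Production || in.TLSRequired {
		return "", errCARequired // fail closed instead of dialing plaintext
	}
	return outcomeInsecure, nil // local/dev only
}
```

Keeping the decision free of I/O means every state in the table can be covered by a table-driven test without touching environment variables or the network.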
Cordum runtime behavior
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Server keypair requirement | If one of `SAFETY_KERNEL_TLS_CERT`/`SAFETY_KERNEL_TLS_KEY` is missing, startup fails. | Prevents half-configured TLS from entering runtime. |
| Server production posture | Production mode rejects startup without TLS certificate configuration. | No plaintext Safety Kernel server in production by default. |
| Cert reload | TLS keypair reloader watch loop runs every 30 seconds. | Supports cert rotation without full redeploy. |
| Client TLS requirement | Client requires CA when in production or when `SAFETY_KERNEL_TLS_REQUIRED=true`. | Blocks accidental insecure dials when strict mode is expected. |
| Client insecure fallback | Insecure transport allowed only when TLS is not required and environment allows it. | Keeps local/dev workflow possible while preserving production safety defaults. |
| TLS protocol floor | `CORDUM_TLS_MIN_VERSION` controls minimum; production default resolves to TLS 1.3. | Avoids stale protocol baselines in production control planes. |
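The cert reload row describes a polling reloader. As a rough illustration of that pattern (not Cordum's `tlsreload` package, whose internals are not shown here), a reloader can cache the parsed keypair behind a lock and re-read the files on a timer. The names `certReloader`, `newCertReloader`, and `watchLoop` are hypothetical.

```go
package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"sync"
	"time"
)

// certReloader is a hypothetical stand-in for a keypair reloader.
type certReloader struct {
	certPath, keyPath string

	mu   sync.RWMutex
	cert *tls.Certificate
}

// newCertReloader loads the keypair once so a bad path fails at startup.
func newCertReloader(certPath, keyPath string) (*certReloader, error) {
	r := &certReloader{certPath: certPath, keyPath: keyPath}
	if err := r.reload(); err != nil {
		return nil, err
	}
	return r, nil
}

// reload parses the files and swaps the cached certificate on success.
func (r *certReloader) reload() error {
	pair, err := tls.LoadX509KeyPair(r.certPath, r.keyPath)
	if err != nil {
		return fmt.Errorf("load keypair: %w", err)
	}
	r.mu.Lock()
	r.cert = &pair
	r.mu.Unlock()
	return nil
}

// GetCertificate matches the tls.Config.GetCertificate callback signature.
func (r *certReloader) GetCertificate(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return r.cert, nil
}

// watchLoop re-reads the files on every tick until ctx is cancelled.
// A failed reload keeps the last good certificate in place.
func (r *certReloader) watchLoop(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_ = r.reload() // real code should log or count this error
		}
	}
}
```

Wiring would be `tls.Config{GetCertificate: r.GetCertificate}` plus `go r.watchLoop(ctx, 30*time.Second)`. Because a failed reload keeps the previous certificate, a bad rotation shows up as an expiring cert rather than an immediate outage, so reload errors need to reach logs or metrics.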
Implementation examples
Client transport selection (Go)

    // safetyTransportCredentials picks the client transport for Safety Kernel dials.
    func safetyTransportCredentials() (credentials.TransportCredentials, error) {
        caPath := strings.TrimSpace(os.Getenv("SAFETY_KERNEL_TLS_CA"))
        requireTLS := env.IsProduction() || env.Bool("SAFETY_KERNEL_TLS_REQUIRED")
        insecureAllowed := env.Bool("SAFETY_KERNEL_INSECURE")
        if caPath == "" {
            // No CA configured: fail closed whenever TLS is mandatory.
            if requireTLS {
                return nil, fmt.Errorf("safety_kernel_tls_ca required")
            }
            // Plaintext is only reachable outside production.
            if insecureAllowed || !env.IsProduction() {
                return insecure.NewCredentials(), nil
            }
            return nil, fmt.Errorf("safety kernel tls required")
        }
        // load CA and build tls.Config{RootCAs, MinVersion}
        return credentials.NewTLS(cfg), nil
    }

Safety Kernel server TLS gate (Go)

    // Default to plaintext; upgraded below when a keypair is configured.
    serverCreds := grpc.Creds(insecure.NewCredentials())
    cert := strings.TrimSpace(os.Getenv("SAFETY_KERNEL_TLS_CERT"))
    key := strings.TrimSpace(os.Getenv("SAFETY_KERNEL_TLS_KEY"))
    if cert != "" || key != "" {
        // Reject a half-configured keypair.
        if cert == "" || key == "" {
            return fmt.Errorf("safety kernel tls requires both SAFETY_KERNEL_TLS_CERT and SAFETY_KERNEL_TLS_KEY")
        }
        reloader, err := tlsreload.NewCertReloader(cert, key, "safety-kernel")
        if err != nil {
            return fmt.Errorf("safety kernel tls reloader: %w", err)
        }
        // Re-check the keypair on disk every 30 seconds.
        go reloader.WatchLoop(context.Background(), 30*time.Second)
        serverCreds = grpc.Creds(credentials.NewTLS(&tls.Config{GetCertificate: reloader.GetCertificate}))
    }
    // Production never starts without a certificate.
    if env.IsProduction() && cert == "" {
        return fmt.Errorf("safety kernel tls required in production")
    }

Baseline production env

    # Safety Kernel server
    export SAFETY_KERNEL_TLS_CERT=/etc/cordum/tls/server.crt
    export SAFETY_KERNEL_TLS_KEY=/etc/cordum/tls/server.key

    # Scheduler/Gateway clients
    export SAFETY_KERNEL_TLS_CA=/etc/cordum/tls/ca.crt
    export SAFETY_KERNEL_TLS_REQUIRED=true
    export SAFETY_KERNEL_INSECURE=false

    # Global protocol floor
    export CORDUM_TLS_MIN_VERSION=1.3

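With `SAFETY_KERNEL_TLS_CA` and `CORDUM_TLS_MIN_VERSION` set as in the baseline, the step the client example leaves as a comment (load the CA, build the `tls.Config`) might look roughly like this. It is a sketch under stated assumptions: `minTLSVersion` and `clientTLSConfig` are hypothetical helpers, only the values `1.2` and `1.3` are accepted, and the non-production default of TLS 1.2 is a guess rather than documented Cordum behavior.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
	"strings"
)

// minTLSVersion maps a CORDUM_TLS_MIN_VERSION-style value to a crypto/tls constant.
// Assumption: an empty value means TLS 1.3 in production and TLS 1.2 elsewhere.
func minTLSVersion(raw string, production bool) (uint16, error) {
	switch strings.TrimSpace(raw) {
	case "1.3":
		return tls.VersionTLS13, nil
	case "1.2":
		return tls.VersionTLS12, nil
	case "":
		if production {
			return tls.VersionTLS13, nil
		}
		return tls.VersionTLS12, nil
	default:
		return 0, fmt.Errorf("unsupported TLS minimum version %q", raw)
	}
}

// clientTLSConfig loads a PEM CA bundle and returns a verifying tls.Config.
func clientTLSConfig(caPath string, minVersion uint16) (*tls.Config, error) {
	pemBytes, err := os.ReadFile(caPath)
	if err != nil {
		return nil, fmt.Errorf("read CA bundle: %w", err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pemBytes) {
		// An unreadable bundle must not fall back to system roots or plaintext.
		return nil, fmt.Errorf("no certificates found in %s", caPath)
	}
	return &tls.Config{RootCAs: pool, MinVersion: minVersion}, nil
}
```

The returned config can be passed to `credentials.NewTLS` from `google.golang.org/grpc/credentials`. Both helpers return an error rather than a default config, so a typo in the CA path or version string stops the dial instead of weakening it.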
Limitations and tradeoffs
- Strict TLS requirements improve integrity but raise rollout coupling across services.
- CA path mistakes fail fast, which is safer, but can impact availability during bad deploys.
- Cert rotation watchers reduce restart pressure, but still need tested key distribution paths.
- Mesh-level mTLS helps, yet app-level TLS checks remain necessary for defense in depth.
`SAFETY_KERNEL_INSECURE=true` is a controlled exception for non-production testing, not a production shortcut.
Next step
Run this transport hardening sequence this week:
1. Enforce server cert/key on all Safety Kernel pods and verify startup blocks missing keypairs.
2. Set `SAFETY_KERNEL_TLS_REQUIRED=true` and remove insecure exceptions from production values.
3. Validate `SAFETY_KERNEL_TLS_CA` distribution to scheduler and gateway before rollout.
4. Execute one cert-rotation drill and confirm connections remain healthy across the 30s watch window.
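Steps 1 to 3 can be checked mechanically before rollout. The sketch below lints a merged set of production values; `preflightProductionEnv` is a hypothetical helper, it assumes boolean settings are spelled `true` (case-insensitive), and it treats server and client variables as one set even though they land on different pods.

```go
package main

import (
	"fmt"
	"strings"
)

// preflightProductionEnv returns one finding per violated rule.
// lookup is any function that returns a setting's value, for example os.Getenv.
func preflightProductionEnv(lookup func(string) string) []string {
	get := func(key string) string { return strings.TrimSpace(lookup(key)) }
	var findings []string

	// Steps 1 and 3: keypair and CA paths must be present.
	for _, key := range []string{
		"SAFETY_KERNEL_TLS_CERT",
		"SAFETY_KERNEL_TLS_KEY",
		"SAFETY_KERNEL_TLS_CA",
	} {
		if get(key) == "" {
			findings = append(findings, fmt.Sprintf("%s is not set", key))
		}
	}

	// Step 2: TLS must be required and the insecure exception removed.
	if !strings.EqualFold(get("SAFETY_KERNEL_TLS_REQUIRED"), "true") {
		findings = append(findings, "SAFETY_KERNEL_TLS_REQUIRED is not true")
	}
	if strings.EqualFold(get("SAFETY_KERNEL_INSECURE"), "true") {
		findings = append(findings, "SAFETY_KERNEL_INSECURE must not be true in production")
	}
	return findings
}
```

Run in CI against rendered deployment values, a non-empty result can block the rollout. It only proves the settings are present, not that the files exist or chain correctly, so the rotation drill in step 4 is still needed.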
Continue with Policy URL SSRF Hardening and LLM Safety Kernel.