Deep Dive

AI Agent NATS TLS Enforcement

Most transport incidents begin as a config drift story, not a cryptography story.

Deep Dive · 10 min read · Apr 2026
TL;DR
  • In production, Cordum rejects any NATS URL that does not use `tls://` (for example `nats://`) unless `CORDUM_NATS_ALLOW_PLAINTEXT=true` is explicitly set.
  • TLS environment variables are applied only for `tls://` URLs, so TLS files on disk cannot mask a plaintext URL.
  • Auth is layered separately (`NATS_USERNAME`/`NATS_PASSWORD`, `NATS_TOKEN`, `NATS_NKEY`), and Cordum logs a warning when none is configured in production.
  • The current override has no built-in expiry. Once enabled, it can silently become permanent unless you add policy controls.
  • The plaintext override is useful for break-glass recovery, but it is a risk surface that must be monitored.
Secure default

Production mode blocks plaintext NATS transport by default.

Explicit override

`CORDUM_NATS_ALLOW_PLAINTEXT` exists, but it is an explicit opt-out and should be temporary.

Auth layering

Transport encryption and broker auth are separate controls. You want both.

Scope

This guide covers transport-level security between the scheduler and the NATS broker. It does not cover policy engine TLS or worker-to-external-service encryption.

The production problem

A plaintext broker endpoint can sneak into production during emergency edits, copied manifests, or rushed migrations.

Without an explicit startup gate, the system often keeps running and you only discover the drift during an audit or incident review.

The fix is simple: fail fast when transport security is misconfigured, then force a conscious override when you truly need break-glass behavior.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| NATS docs: Enabling TLS | Server-side TLS map (`cert_file`, `key_file`, `ca_file`, `min_version`, timeout, SAN pitfalls). | No platform-specific production guardrails that block plaintext by default at app startup. |
| NATS docs: Encrypting Connections with TLS | Client-side TLS connect patterns (`tls://`, client cert, root CAs, mTLS setup). | No opinionated guidance for break-glass plaintext overrides in production apps. |
| nats.go TLS examples | Practical client options (`RootCAs`, `ClientCert`, `Secure(config)`) and TLS scheme behavior. | No app-level policy for override expiry or auditability. |

Broker docs explain how to set TLS. They usually do not provide an opinionated application boot guard that rejects insecure transport by default.

Cordum runtime behavior

| Boundary | Current behavior | Operational impact |
| --- | --- | --- |
| Production gate | If environment is production and URL is not `tls://`, startup returns error unless override is enabled. | Plaintext broker drift is blocked before scheduler begins processing traffic. |
| Override knob | `CORDUM_NATS_ALLOW_PLAINTEXT=true` bypasses the production TLS gate. | Supports emergency operation at the cost of transport-security downgrade. |
| Override lifetime | Current override is boolean-only; there is no built-in expiry timestamp. | A temporary exception can persist indefinitely unless external policy enforces rollback. |
| TLS env application | TLS env variables are evaluated only when URL uses `tls://` scheme. | Avoids false confidence where TLS files exist but plaintext URL is still used. |
| Auth policy | Auth options are applied in priority order: user/pass, token, then nkey seed. | Transport and identity controls are independently configurable. |
| Production auth warning | If production starts without broker auth, bus logs a warning. | Signals insecure identity posture even when transport encryption is present. |
| Test coverage | Tests cover plaintext rejection, override path, dev-mode allowance, and auth option selection. | Regression risk is lower during refactors of connection boot logic. |
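
The auth priority order in the table can be illustrated with a small, standard-library-only sketch. It is not Cordum's `natsApplyAuth`: the `authMode` type, `selectAuthMode`, and `envFrom` are hypothetical names, and treating user/pass as configured only when both variables are set is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// authMode names the broker auth mechanism chosen from the environment.
// The type and its values are hypothetical labels for this sketch.
type authMode string

const (
	authNone     authMode = "none"
	authUserPass authMode = "userpass"
	authToken    authMode = "token"
	authNKey     authMode = "nkey"
)

// selectAuthMode applies the priority order from the table:
// user/pass first, then token, then nkey seed. getenv is injected so
// the function can be exercised without touching the real environment.
// Assumption: user/pass counts only when both variables are non-empty.
func selectAuthMode(getenv func(string) string) authMode {
	switch {
	case getenv("NATS_USERNAME") != "" && getenv("NATS_PASSWORD") != "":
		return authUserPass
	case getenv("NATS_TOKEN") != "":
		return authToken
	case getenv("NATS_NKEY") != "":
		return authNKey
	default:
		return authNone
	}
}

// envFrom adapts a map to the getenv signature for examples and tests.
func envFrom(m map[string]string) func(string) string {
	return func(k string) string { return m[k] }
}

func main() {
	mode := selectAuthMode(os.Getenv)
	if mode == authNone {
		fmt.Println("warning: NATS authentication not configured")
		return
	}
	fmt.Println("auth mode:", mode)
}
```

Returning a mode rather than mutating connection options keeps the selection logic testable on its own; a real implementation would translate the mode into the matching nats.go option.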

Override lease policy

TLS enforcement is only as strong as your override discipline. Treat plaintext override as a lease, not a toggle.

| State | Required rule | Expected outcome |
| --- | --- | --- |
| Default production posture | `NATS_URL` must use `tls://` and `CORDUM_NATS_ALLOW_PLAINTEXT` must be unset. | Process fails fast on insecure transport config. |
| Emergency override active | Override requires owner + explicit expiry timestamp in deployment metadata. | Temporary exception is auditable and auto-expired by policy. |
| Expired override | If current time > override expiry, startup must fail even when override is `true`. | Prevents forgotten downgrade from becoming steady-state. |

Optional hardening patch: expiry-based override

core/infra/bus/nats.go (example)
Go
if production && !strings.HasPrefix(url, "tls://") {
  if !parseBoolEnv("CORDUM_NATS_ALLOW_PLAINTEXT") {
    return nil, fmt.Errorf("nats TLS required in production")
  }

  // Optional hardening: require an expiry timestamp for break-glass mode.
  untilRaw := strings.TrimSpace(os.Getenv("CORDUM_NATS_ALLOW_PLAINTEXT_UNTIL"))
  until, err := time.Parse(time.RFC3339, untilRaw)
  if err != nil || time.Now().After(until) {
    return nil, fmt.Errorf("plaintext override expired or invalid: CORDUM_NATS_ALLOW_PLAINTEXT_UNTIL=%q", untilRaw)
  }
}
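
The lease rules (owner plus unexpired timestamp) can also be expressed as a pure function with an injected clock, which makes the expiry boundary easy to test. This is a sketch under assumptions: `plaintextLease`, `validAt`, and `mustTime` are hypothetical names, and where the owner value comes from (an environment variable or a deployment annotation) is left open.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// plaintextLease is a hypothetical container for break-glass metadata.
type plaintextLease struct {
	Owner string // who approved the override
	Until string // RFC 3339 expiry, e.g. "2026-04-10T18:00:00Z"
}

// validAt returns nil only while the lease has an owner and a
// well-formed expiry that now has not passed. now is injected so
// tests do not depend on the wall clock.
func (l plaintextLease) validAt(now time.Time) error {
	if strings.TrimSpace(l.Owner) == "" {
		return errors.New("plaintext override has no owner")
	}
	until, err := time.Parse(time.RFC3339, strings.TrimSpace(l.Until))
	if err != nil {
		return fmt.Errorf("plaintext override expiry is invalid: %w", err)
	}
	if now.After(until) {
		return fmt.Errorf("plaintext override expired at %s", until.Format(time.RFC3339))
	}
	return nil
}

// mustTime parses an RFC 3339 timestamp and panics on malformed input.
// It exists only to keep the usage example short.
func mustTime(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	lease := plaintextLease{Owner: "oncall-platform", Until: "2026-04-10T18:00:00Z"}
	fmt.Println(lease.validAt(mustTime("2026-04-10T12:00:00Z"))) // before expiry
	fmt.Println(lease.validAt(mustTime("2026-04-11T00:00:00Z"))) // after expiry
}
```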

Code-level mechanics

Production transport gate (Go)

core/infra/bus/nats.go
Go
// Enforce TLS in production: reject nats:// unless explicitly allowed.
if production && !strings.HasPrefix(url, "tls://") {
  if !parseBoolEnv("CORDUM_NATS_ALLOW_PLAINTEXT") {
    return nil, fmt.Errorf("nats TLS required in production: use tls:// scheme or set CORDUM_NATS_ALLOW_PLAINTEXT=true")
  }
  slog.Warn("bus: plaintext NATS allowed in production via override", "url", url)
}
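
One edge case worth considering: nats.go accepts a comma-separated list of server URLs, and a prefix check only inspects the first entry. The sketch below is a stricter variant that checks every entry. It is an illustration, not Cordum's implementation; `parseBool`, `allTLS`, and `checkTransport` are hypothetical names, and the truthy spellings accepted by the real `parseBoolEnv` helper are not shown in the source.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// parseBool treats the spellings accepted by strconv.ParseBool
// ("1", "t", "true", "TRUE", ...) as true and everything else,
// including empty or malformed input, as false.
func parseBool(raw string) bool {
	v, err := strconv.ParseBool(strings.TrimSpace(raw))
	return err == nil && v
}

// allTLS reports whether every server in a comma-separated NATS URL
// list uses the tls:// scheme. An empty list is treated as not secure.
func allTLS(urls string) bool {
	seen := false
	for _, u := range strings.Split(urls, ",") {
		u = strings.TrimSpace(u)
		if u == "" {
			continue
		}
		if !strings.HasPrefix(u, "tls://") {
			return false
		}
		seen = true
	}
	return seen
}

// checkTransport mirrors the production gate for a full server list:
// outside production, or with the override set, anything is allowed.
func checkTransport(urls string, production bool, allowRaw string) error {
	if !production || allTLS(urls) || parseBool(allowRaw) {
		return nil
	}
	return errors.New("nats TLS required in production: every server URL must use tls://")
}

func main() {
	// The second server is plaintext, so the gate rejects the list.
	err := checkTransport("tls://nats-0:4222, nats://nats-1:4222", true, "")
	fmt.Println(err)
}
```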

TLS + auth layering (Go)

core/infra/bus/nats.go
Go
if strings.HasPrefix(url, "tls://") {
  tlsConfig, err := natsTLSConfigFromEnv()
  if err != nil { return nil, err }
  if tlsConfig != nil {
    opts = append(opts, nats.Secure(tlsConfig))
  }
}

authConfigured := natsApplyAuth(&opts)
if production && !authConfigured {
  slog.Warn("bus: NATS authentication not configured in production")
}
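
The source does not show `natsTLSConfigFromEnv`, so the following is only a sketch of what such a helper might look like using the Go standard library. It assumes the `NATS_TLS_CA`, `NATS_TLS_CERT`, and `NATS_TLS_KEY` variables hold file paths and that TLS 1.2 is an acceptable floor; the function name `tlsConfigFromEnv` and the injected `getenv` parameter are hypothetical.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"os"
)

// tlsConfigFromEnv builds a client TLS config from the NATS_TLS_*
// variables. It returns (nil, nil) when none are set so the caller can
// fall back to library defaults. getenv is injected for testability.
func tlsConfigFromEnv(getenv func(string) string) (*tls.Config, error) {
	caPath, certPath, keyPath := getenv("NATS_TLS_CA"), getenv("NATS_TLS_CERT"), getenv("NATS_TLS_KEY")
	serverName := getenv("NATS_TLS_SERVER_NAME")
	if caPath == "" && certPath == "" && keyPath == "" && serverName == "" {
		return nil, nil
	}
	cfg := &tls.Config{MinVersion: tls.VersionTLS12, ServerName: serverName}

	if caPath != "" {
		caPEM, err := os.ReadFile(caPath)
		if err != nil {
			return nil, fmt.Errorf("read NATS_TLS_CA: %w", err)
		}
		pool := x509.NewCertPool()
		if !pool.AppendCertsFromPEM(caPEM) {
			return nil, errors.New("NATS_TLS_CA contains no usable certificates")
		}
		cfg.RootCAs = pool
	}

	// A client certificate needs both halves; one without the other
	// is treated as a misconfiguration rather than silently ignored.
	if (certPath == "") != (keyPath == "") {
		return nil, errors.New("NATS_TLS_CERT and NATS_TLS_KEY must be set together")
	}
	if certPath != "" {
		pair, err := tls.LoadX509KeyPair(certPath, keyPath)
		if err != nil {
			return nil, fmt.Errorf("load client key pair: %w", err)
		}
		cfg.Certificates = []tls.Certificate{pair}
	}
	return cfg, nil
}

func main() {
	cfg, err := tlsConfigFromEnv(os.Getenv)
	fmt.Println("custom TLS config:", cfg != nil, "error:", err)
}
```

A non-nil result would be handed to `nats.Secure`, as in the snippet above. Failing on a half-configured client certificate keeps an mTLS deployment from quietly connecting without its client identity.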

Regression tests for guard behavior (Go)

core/infra/bus/nats_test.go
Go
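import (
  "strings"
  "testing"
)

func TestNewNatsBus_PlaintextRejectedInProduction(t *testing.T) {
  t.Setenv("CORDUM_ENV", "production")
  t.Setenv("CORDUM_NATS_ALLOW_PLAINTEXT", "")
  _, err := NewNatsBus("nats://localhost:14222")
  // Expect the TLS enforcement error before any connection attempt.
  if err == nil || !strings.Contains(err.Error(), "TLS required") {
    t.Fatalf("expected TLS enforcement error, got %v", err)
  }
}

func TestNewNatsBus_PlaintextAllowedWithOverride(t *testing.T) {
  t.Setenv("CORDUM_ENV", "production")
  t.Setenv("CORDUM_NATS_ALLOW_PLAINTEXT", "true")
  _, err := NewNatsBus("nats://localhost:14222")
  // A connection error is acceptable here (no broker may be listening);
  // only the TLS enforcement error counts as a failure.
  if err != nil && strings.Contains(err.Error(), "TLS required") {
    t.Fatalf("override should bypass TLS enforcement, got %v", err)
  }
}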

Operator runbook

nats_tls_enforcement_runbook.sh
Bash
# 1) Confirm transport URL uses tls:// in production
kubectl -n cordum exec deploy/cordum-scheduler -- printenv NATS_URL

# 2) Verify plaintext override is not enabled
kubectl -n cordum exec deploy/cordum-scheduler -- printenv CORDUM_NATS_ALLOW_PLAINTEXT

# 3) If override is enabled, verify expiry and owner metadata
#    (the expiry variable exists only if the optional hardening patch is applied)
kubectl -n cordum exec deploy/cordum-scheduler -- printenv CORDUM_NATS_ALLOW_PLAINTEXT_UNTIL
kubectl -n cordum get deploy cordum-scheduler -o jsonpath="{.metadata.annotations.security\.cordum\.io/plaintext-owner}"

# 4) Verify TLS material variables
kubectl -n cordum exec deploy/cordum-scheduler -- printenv NATS_TLS_CA NATS_TLS_CERT NATS_TLS_KEY NATS_TLS_SERVER_NAME

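# 5) Verify auth layer vars (at least one auth mode); report set/unset only so secrets are not printed
kubectl -n cordum exec deploy/cordum-scheduler -- sh -c 'for v in NATS_USERNAME NATS_PASSWORD NATS_TOKEN NATS_NKEY; do if printenv "$v" >/dev/null; then echo "$v=set"; else echo "$v=unset"; fi; done'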

# 6) Pre-deploy CI check (run from the repository root)
go test ./core/infra/bus -run "TestNewNatsBus_(PlaintextRejectedInProduction|PlaintextAllowedWithOverride|PlaintextAllowedInDev|TLSURLNotBlockedByEnforcement)"

Limitations and tradeoffs

| Decision | Benefit | Cost |
| --- | --- | --- |
| Keep strict startup TLS gate | Prevents silent plaintext drift in production. | Mis-rotated certs can block rollout until fixed. |
| Allow break-glass plaintext override | Supports emergency continuity during TLS outages. | Immediate confidentiality downgrade if not tightly controlled. |
| Require override expiry + owner | Improves auditability and rollback discipline. | Needs extra policy wiring in deploy and review workflows. |

If you ever enable plaintext override in production, treat it as an incident with explicit owner, start time, and rollback deadline.

Next step

Run a transport hardening check this week:

  1. Verify all production NATS URLs use `tls://`.
  2. Confirm plaintext override is unset.
  3. Confirm at least one auth mode is configured.
  4. Add a CI test gate that fails on insecure production transport config.
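
The checklist above can be approximated by a small lint function that a CI job runs against the production environment values. This is a minimal sketch, not an existing Cordum tool: `lintTransportEnv` is a hypothetical name, the example hostname is made up, and how the environment map is loaded (rendered manifest, Helm values, and so on) is left to the pipeline.

```go
package main

import (
	"fmt"
	"strings"
)

// lintTransportEnv returns one finding per violated checklist item for
// a production environment described as key/value pairs.
func lintTransportEnv(env map[string]string) []string {
	var findings []string
	if !strings.HasPrefix(strings.TrimSpace(env["NATS_URL"]), "tls://") {
		findings = append(findings, "NATS_URL does not use tls://")
	}
	if strings.TrimSpace(env["CORDUM_NATS_ALLOW_PLAINTEXT"]) != "" {
		findings = append(findings, "CORDUM_NATS_ALLOW_PLAINTEXT is set")
	}
	hasUserPass := env["NATS_USERNAME"] != "" && env["NATS_PASSWORD"] != ""
	if !hasUserPass && env["NATS_TOKEN"] == "" && env["NATS_NKEY"] == "" {
		findings = append(findings, "no NATS auth mode configured")
	}
	return findings
}

func main() {
	// Example input only; a real check would load the rendered
	// production configuration instead of a literal map.
	env := map[string]string{
		"NATS_URL":   "nats://nats.example.internal:4222",
		"NATS_TOKEN": "example-token",
	}
	findings := lintTransportEnv(env)
	for _, f := range findings {
		fmt.Println("FAIL:", f)
	}
	// A CI wrapper would exit non-zero when findings is non-empty.
	fmt.Println("findings:", len(findings))
}
```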

Continue with JetStream Broadcast Semantics and MaxAckPending Tuning.

Transport security first

Encryption bugs are often configuration bugs. Fail fast and make insecure states expensive.