
AI Agent NATS TLS Enforcement

Most transport incidents begin as a config drift story, not a cryptography story.

Deep Dive · 10 min read · Mar 2026
TL;DR
  • In production, Cordum rejects any NATS URL that does not use `tls://` (for example `nats://`) unless `CORDUM_NATS_ALLOW_PLAINTEXT=true` is explicitly set.
  • TLS environment variables are applied only for `tls://` URLs, which reduces accidental mixed-mode assumptions.
  • Auth is layered separately (`NATS_USERNAME`/`NATS_PASSWORD`, `NATS_TOKEN`, `NATS_NKEY`), and Cordum logs a warning when none is configured in production.
  • The plaintext override is useful for break-glass recovery, but it is a risk surface that must be monitored.
Secure default

Production mode blocks plaintext NATS transport by default.

Explicit override

`CORDUM_NATS_ALLOW_PLAINTEXT` exists, but it is an explicit opt-out and should be temporary.

Auth layering

Transport encryption and broker auth are separate controls. You want both.

Scope

This guide covers transport-level security between the scheduler and the NATS broker. It does not cover policy engine TLS or worker-to-external-service encryption.

The production problem

A plaintext broker endpoint can sneak into production during emergency edits, copied manifests, or rushed migrations.

Without an explicit startup gate, the system often keeps running and you only discover the drift during an audit or incident review.

The fix is simple: fail fast when transport security is misconfigured, then force a conscious override when you truly need break-glass behavior.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| NATS TLS documentation | How to configure TLS certificates and secure NATS server/client transport. | No platform-specific production guardrails that block plaintext by default at app startup. |
| RabbitMQ TLS support | Broker-side TLS enablement and certificate configuration for AMQP clients. | No app-level environment enforcement gate like `reject non-TLS URL in production`. |
| Kafka SSL client configuration | Client SSL properties and truststore/keystore setup for encrypted transport. | No direct equivalent of runtime startup refusal when a plaintext endpoint is configured. |

Broker docs explain how to set TLS. They usually do not provide an opinionated application boot guard that rejects insecure transport by default.

Cordum runtime behavior

| Boundary | Current behavior | Operational impact |
| --- | --- | --- |
| Production gate | If the environment is production and the URL is not `tls://`, startup returns an error unless the override is enabled. | Plaintext broker drift is blocked before the scheduler begins processing traffic. |
| Override knob | `CORDUM_NATS_ALLOW_PLAINTEXT=true` bypasses the production TLS gate. | Supports emergency operation at the cost of a transport-security downgrade. |
| TLS env application | TLS env variables are evaluated only when the URL uses the `tls://` scheme. | Avoids false confidence where TLS files exist but a plaintext URL is still used. |
| Auth policy | Auth options are applied in priority order: user/pass, token, then nkey seed. | Transport and identity controls are independently configurable. |
| Production auth warning | If production starts without broker auth, the bus logs a warning. | Signals insecure identity posture even when transport encryption is present. |
| Test coverage | Tests cover plaintext rejection, override path, dev-mode allowance, and auth option selection. | Regression risk is lower during refactors of connection boot logic. |
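
The auth priority order in the table (user/pass, then token, then nkey seed) can be sketched as a pure selection function. This is a minimal illustration, not Cordum's `natsApplyAuth`: the `authMode` type, the `selectAuthModeSketch` name, and the rule that username and password must both be set are assumptions. Only the environment variable names come from this article.

```go
package main

// authMode names the broker auth mechanism that would be applied.
// Hypothetical type, introduced only for this sketch.
type authMode string

const (
	authNone     authMode = "none"
	authUserPass authMode = "userpass"
	authToken    authMode = "token"
	authNkey     authMode = "nkey"
)

// selectAuthModeSketch returns the first configured auth mode in priority
// order: user/pass, then token, then nkey seed. The getenv parameter is
// injected so the function can be tested without touching the process
// environment; pass os.Getenv in real code.
func selectAuthModeSketch(getenv func(string) string) authMode {
	switch {
	// Assumption: a username without a password does not count as configured.
	case getenv("NATS_USERNAME") != "" && getenv("NATS_PASSWORD") != "":
		return authUserPass
	case getenv("NATS_TOKEN") != "":
		return authToken
	case getenv("NATS_NKEY") != "":
		return authNkey
	default:
		return authNone
	}
}
```

A caller could treat `authNone` as the trigger for the production warning described in the table, which keeps the identity check independent of the transport check.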

Code-level mechanics

Production transport gate (Go)

core/infra/bus/nats.go
Go
// Enforce TLS in production: reject nats:// unless explicitly allowed.
if production && !strings.HasPrefix(url, "tls://") {
  if !parseBoolEnv("CORDUM_NATS_ALLOW_PLAINTEXT") {
    return nil, fmt.Errorf("nats TLS required in production: use tls:// scheme or set CORDUM_NATS_ALLOW_PLAINTEXT=true")
  }
  slog.Warn("bus: plaintext NATS allowed in production via override", "url", url)
}

TLS + auth layering (Go)

core/infra/bus/nats.go
Go
if strings.HasPrefix(url, "tls://") {
  tlsConfig, err := natsTLSConfigFromEnv()
  if err != nil {
    return nil, err
  }
  if tlsConfig != nil {
    opts = append(opts, nats.Secure(tlsConfig))
  }
}

authConfigured := natsApplyAuth(&opts)
if production && !authConfigured {
  slog.Warn("bus: NATS authentication not configured in production")
}

Regression tests for guard behavior (Go)

core/infra/bus/nats_test.go
Go
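// Excerpt: the test file needs the "strings" and "testing" imports.
func TestNewNatsBus_PlaintextRejectedInProduction(t *testing.T) {
  t.Setenv("CORDUM_ENV", "production")
  t.Setenv("CORDUM_NATS_ALLOW_PLAINTEXT", "")
  _, err := NewNatsBus("nats://localhost:14222")
  // The gate should return the TLS enforcement error.
  if err == nil || !strings.Contains(err.Error(), "TLS required") {
    t.Fatalf("expected TLS enforcement error, got %v", err)
  }
}

func TestNewNatsBus_PlaintextAllowedWithOverride(t *testing.T) {
  t.Setenv("CORDUM_ENV", "production")
  t.Setenv("CORDUM_NATS_ALLOW_PLAINTEXT", "true")
  _, err := NewNatsBus("nats://localhost:14222")
  // A connection error is acceptable here (no broker may be listening);
  // only the TLS enforcement error counts as a failure.
  if err != nil && strings.Contains(err.Error(), "TLS required") {
    t.Fatalf("override did not bypass TLS enforcement: %v", err)
  }
}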

Operator runbook

nats_tls_enforcement_runbook.sh
Bash
# 1) Confirm transport URL uses tls:// in production
kubectl -n cordum exec deploy/cordum-scheduler -- printenv NATS_URL

# 2) Verify plaintext override is not enabled
kubectl -n cordum exec deploy/cordum-scheduler -- printenv CORDUM_NATS_ALLOW_PLAINTEXT

# 3) Verify TLS material variables
kubectl -n cordum exec deploy/cordum-scheduler -- printenv NATS_TLS_CA NATS_TLS_CERT NATS_TLS_KEY NATS_TLS_SERVER_NAME

# 4) Verify auth layer vars (at least one auth mode)
kubectl -n cordum exec deploy/cordum-scheduler -- printenv NATS_USERNAME NATS_TOKEN NATS_NKEY

# 5) Pre-deploy CI check (run from the repository root)
go test ./core/infra/bus -run "TestNewNatsBus_PlaintextRejectedInProduction|TestNewNatsBus_PlaintextAllowedWithOverride"

Limitations and tradeoffs

  • Break-glass plaintext override improves operability but weakens transport security immediately.
  • TLS without broker auth protects confidentiality, not identity.
  • Auth without TLS protects identity, not payload privacy on the wire.
  • Strict startup gating can block deploys when cert material is mis-rotated.

If you ever enable plaintext override in production, treat it as an incident with explicit owner, start time, and rollback deadline.

Next step

Run a transport hardening check this week:

  1. Verify all production NATS URLs use `tls://`.
  2. Confirm the plaintext override is unset.
  3. Confirm at least one auth mode is configured.
  4. Add a CI test gate that fails on insecure production transport config.

Continue with JetStream Broadcast Semantics and MaxAckPending Tuning.

Transport security first

Encryption bugs are often configuration bugs. Fail fast and make insecure states expensive.