The production problem
A plaintext broker endpoint can sneak into production during emergency edits, copied manifests, or rushed migrations.
Without an explicit startup gate, the system often keeps running and you only discover the drift during an audit or incident review.
The fix is simple: fail fast when transport security is misconfigured, then force a conscious override when you truly need break-glass behavior.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS docs: Enabling TLS | Server-side TLS map (`cert_file`, `key_file`, `ca_file`, `min_version`, timeout, SAN pitfalls). | No platform-specific production guardrails that block plaintext by default at app startup. |
| NATS docs: Encrypting Connections with TLS | Client-side TLS connect patterns (`tls://`, client cert, root CAs, mTLS setup). | No opinionated guidance for break-glass plaintext overrides in production apps. |
| nats.go TLS examples | Practical client options (`RootCAs`, `ClientCert`, `Secure(config)`) and TLS scheme behavior. | No app-level policy for override expiry or auditability. |
Broker docs explain how to configure TLS. They usually do not provide an opinionated application boot guard that rejects insecure transport by default.
Cordum runtime behavior
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Production gate | If the environment is production and the URL does not use the `tls://` scheme, startup returns an error unless the override is enabled. | Plaintext broker drift is blocked before the scheduler begins processing traffic. |
| Override knob | `CORDUM_NATS_ALLOW_PLAINTEXT=true` bypasses the production TLS gate. | Supports emergency operation at the cost of transport-security downgrade. |
| Override lifetime | Current override is boolean-only; there is no built-in expiry timestamp. | A temporary exception can persist indefinitely unless external policy enforces rollback. |
| TLS env application | TLS env variables are evaluated only when URL uses `tls://` scheme. | Avoids false confidence where TLS files exist but plaintext URL is still used. |
| Auth policy | Auth options are applied in priority order: user/pass, token, then nkey seed. | Transport and identity controls are independently configurable. |
| Production auth warning | If production starts without broker auth, the bus logs a warning but still starts. | Signals insecure identity posture even when transport encryption is present. |
| Test coverage | Tests cover plaintext rejection, override path, dev-mode allowance, and auth option selection. | Regression risk is lower during refactors of connection boot logic. |
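The TLS-scheme and auth-priority rows above can be illustrated with a minimal, stdlib-only sketch. It is not Cordum's actual `natsApplyAuth` or `natsTLSConfigFromEnv`; the function names `selectAuthMode` and `tlsEnvApplies` and the `NATS_PASSWORD` variable are assumptions made for illustration, and the environment lookup is injected so the logic can be exercised without touching the real environment.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// authMode names the credential type that would be handed to the NATS client.
type authMode string

const (
	authNone     authMode = "none"
	authUserPass authMode = "user/pass"
	authToken    authMode = "token"
	authNKey     authMode = "nkey"
)

// selectAuthMode (hypothetical helper) follows the documented priority:
// user/pass first, then token, then nkey seed.
// NATS_PASSWORD is an assumed variable name; it does not appear in the article.
func selectAuthMode(getenv func(string) string) authMode {
	switch {
	case getenv("NATS_USERNAME") != "" && getenv("NATS_PASSWORD") != "":
		return authUserPass
	case getenv("NATS_TOKEN") != "":
		return authToken
	case getenv("NATS_NKEY") != "":
		return authNKey
	default:
		return authNone
	}
}

// tlsEnvApplies (hypothetical helper) reports whether TLS env variables
// should be read at all: only when the URL uses the tls:// scheme.
func tlsEnvApplies(url string) bool {
	return strings.HasPrefix(url, "tls://")
}

func main() {
	url := os.Getenv("NATS_URL")
	fmt.Printf("tls env evaluated: %v, auth mode: %s\n",
		tlsEnvApplies(url), selectAuthMode(os.Getenv))
}
```

Keeping the two decisions in separate functions reflects the table's point that transport and identity controls are configured independently: a `tls://` URL says nothing about which auth mode is selected, and vice versa.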
Override lease policy
TLS enforcement is only as strong as your override discipline. Treat plaintext override as a lease, not a toggle.
| State | Required rule | Expected outcome |
|---|---|---|
| Default production posture | `NATS_URL` must use `tls://` and `CORDUM_NATS_ALLOW_PLAINTEXT` must be unset. | Process fails fast on insecure transport config. |
| Emergency override active | Override requires owner + explicit expiry timestamp in deployment metadata. | Temporary exception is auditable and auto-expired by policy. |
| Expired override | If current time > override expiry, startup must fail even when override is `true`. | Prevents forgotten downgrade from becoming steady-state. |
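One way to make the three lease states testable is to separate the lease check from the clock. The sketch below is an illustration under stated assumptions, not Cordum code: the `overrideLease` type, the `checkLease` and `mustParse` helpers, and the idea of passing the owner alongside the expiry are all hypothetical. Passing `now` as a parameter lets a test pin the time instead of depending on the wall clock.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// overrideLease is a hypothetical record of a break-glass plaintext override.
type overrideLease struct {
	Enabled bool
	Owner   string // who is accountable for rolling the override back
	Until   string // expiry as an RFC 3339 timestamp
}

// checkLease returns nil only when the override is enabled, owned, and unexpired.
func checkLease(l overrideLease, now time.Time) error {
	if !l.Enabled {
		return errors.New("plaintext override is not enabled")
	}
	if strings.TrimSpace(l.Owner) == "" {
		return errors.New("plaintext override has no owner")
	}
	until, err := time.Parse(time.RFC3339, strings.TrimSpace(l.Until))
	if err != nil {
		return fmt.Errorf("plaintext override expiry is invalid: %w", err)
	}
	if now.After(until) {
		return fmt.Errorf("plaintext override expired at %s", until.Format(time.RFC3339))
	}
	return nil
}

// mustParse converts an RFC 3339 string to a time and panics on bad input.
// It exists only to keep the examples short.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	lease := overrideLease{Enabled: true, Owner: "oncall-sre", Until: "2025-01-01T18:00:00Z"}

	// Inside the lease window: the override is honored.
	fmt.Println(checkLease(lease, mustParse("2025-01-01T12:00:00Z")))

	// After the deadline: startup should fail even though Enabled is true.
	fmt.Println(checkLease(lease, mustParse("2025-01-02T00:00:00Z")))
}
```

An empty or malformed expiry is treated the same as an expired one, so an override without a deadline cannot be honored by accident.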
Optional hardening patch: expiry-based override
if production && !strings.HasPrefix(url, "tls://") {
    if !parseBoolEnv("CORDUM_NATS_ALLOW_PLAINTEXT") {
        return nil, fmt.Errorf("nats TLS required in production")
    }
    // Optional hardening: require an expiry timestamp for break-glass mode.
    untilRaw := strings.TrimSpace(os.Getenv("CORDUM_NATS_ALLOW_PLAINTEXT_UNTIL"))
    until, err := time.Parse(time.RFC3339, untilRaw)
    if err != nil || time.Now().After(until) {
        return nil, fmt.Errorf("plaintext override expired or invalid: CORDUM_NATS_ALLOW_PLAINTEXT_UNTIL=%q", untilRaw)
    }
}
Code-level mechanics
Production transport gate (Go)
// Enforce TLS in production: reject nats:// unless explicitly allowed.
if production && !strings.HasPrefix(url, "tls://") {
    if !parseBoolEnv("CORDUM_NATS_ALLOW_PLAINTEXT") {
        return nil, fmt.Errorf("nats TLS required in production: use tls:// scheme or set CORDUM_NATS_ALLOW_PLAINTEXT=true")
    }
    slog.Warn("bus: plaintext NATS allowed in production via override", "url", url)
}
TLS + auth layering (Go)
// TLS settings from the environment are applied only for tls:// URLs.
if strings.HasPrefix(url, "tls://") {
    tlsConfig, err := natsTLSConfigFromEnv()
    if err != nil {
        return nil, err
    }
    if tlsConfig != nil {
        opts = append(opts, nats.Secure(tlsConfig))
    }
}
// Auth is layered independently of transport security.
authConfigured := natsApplyAuth(&opts)
if production && !authConfigured {
    slog.Warn("bus: NATS authentication not configured in production")
}
Regression tests for guard behavior (Go)
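func TestNewNatsBus_PlaintextRejectedInProduction(t *testing.T) {
    t.Setenv("CORDUM_ENV", "production")
    t.Setenv("CORDUM_NATS_ALLOW_PLAINTEXT", "")
    _, err := NewNatsBus("nats://localhost:14222")
    // Expect the TLS enforcement error.
    if err == nil || !strings.Contains(err.Error(), "TLS required") {
        t.Fatalf("expected TLS enforcement error, got %v", err)
    }
}
func TestNewNatsBus_PlaintextAllowedWithOverride(t *testing.T) {
    t.Setenv("CORDUM_ENV", "production")
    t.Setenv("CORDUM_NATS_ALLOW_PLAINTEXT", "true")
    _, err := NewNatsBus("nats://localhost:14222")
    // A connection error is acceptable here (no broker may be listening);
    // only the TLS enforcement error counts as a failure.
    if err != nil && strings.Contains(err.Error(), "TLS required") {
        t.Fatalf("override should bypass TLS enforcement, got %v", err)
    }
}
Operator runbook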
# 1) Confirm transport URL uses tls:// in production
kubectl -n cordum exec deploy/cordum-scheduler -- printenv NATS_URL
# 2) Verify plaintext override is not enabled
kubectl -n cordum exec deploy/cordum-scheduler -- printenv CORDUM_NATS_ALLOW_PLAINTEXT
# 3) If override is enabled, verify expiry and owner metadata
kubectl -n cordum exec deploy/cordum-scheduler -- printenv CORDUM_NATS_ALLOW_PLAINTEXT_UNTIL
kubectl -n cordum get deploy cordum-scheduler -o jsonpath="{.metadata.annotations.security\.cordum\.io/plaintext-owner}"
# 4) Verify TLS material variables
kubectl -n cordum exec deploy/cordum-scheduler -- printenv NATS_TLS_CA NATS_TLS_CERT NATS_TLS_KEY NATS_TLS_SERVER_NAME
# 5) Verify auth layer vars (at least one auth mode); the output contains credentials, so do not paste it into tickets or chat
kubectl -n cordum exec deploy/cordum-scheduler -- printenv NATS_USERNAME NATS_TOKEN NATS_NKEY
# 6) Pre-deploy CI check
cd D:/Cordum/cordum  # adjust to the path of your local repository checkout
go test ./core/infra/bus -run "TestNewNatsBus_(PlaintextRejectedInProduction|PlaintextAllowedWithOverride|PlaintextAllowedInDev|TLSURLNotBlockedByEnforcement)"
Limitations and tradeoffs
| Decision | Benefit | Cost |
|---|---|---|
| Keep strict startup TLS gate | Prevents silent plaintext drift in production. | Mis-rotated certs can block rollout until fixed. |
| Allow break-glass plaintext override | Supports emergency continuity during TLS outages. | Immediate confidentiality downgrade if not tightly controlled. |
| Require override expiry + owner | Improves auditability and rollback discipline. | Needs extra policy wiring in deploy and review workflows. |
If you ever enable plaintext override in production, treat it as an incident with explicit owner, start time, and rollback deadline.
Next step
Run a transport hardening check this week:
1. Verify all production NATS URLs use `tls://`.
2. Confirm plaintext override is unset.
3. Confirm at least one auth mode is configured.
4. Add a CI test gate that fails on insecure production transport config.
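The checklist above can be sketched as a small lint program that a CI job runs against the environment a production deploy would receive. This is a minimal sketch, not part of Cordum: `lintTransport` is a hypothetical name, and using `strconv.ParseBool` for the override flag is an assumption about which values count as "enabled" (Cordum's own `parseBoolEnv` may accept a different set).

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// lintTransport returns one finding per violated checklist item.
// getenv is injected so the check can run against a fake environment in tests.
func lintTransport(getenv func(string) string) []string {
	var findings []string

	if !strings.HasPrefix(getenv("NATS_URL"), "tls://") {
		findings = append(findings, "NATS_URL does not use the tls:// scheme")
	}

	// ParseBool returns false for empty or unrecognized values (assumed semantics).
	if on, _ := strconv.ParseBool(strings.TrimSpace(getenv("CORDUM_NATS_ALLOW_PLAINTEXT"))); on {
		findings = append(findings, "CORDUM_NATS_ALLOW_PLAINTEXT is enabled")
	}

	hasAuth := false
	for _, name := range []string{"NATS_USERNAME", "NATS_TOKEN", "NATS_NKEY"} {
		if getenv(name) != "" {
			hasAuth = true
		}
	}
	if !hasAuth {
		findings = append(findings, "no NATS auth mode is configured")
	}
	return findings
}

func main() {
	// Only gate production configs; CORDUM_ENV matches the variable set in the regression tests.
	if os.Getenv("CORDUM_ENV") != "production" {
		fmt.Println("transport lint skipped: not a production environment")
		return
	}
	findings := lintTransport(os.Getenv)
	for _, f := range findings {
		fmt.Fprintln(os.Stderr, "FAIL:", f)
	}
	if len(findings) > 0 {
		os.Exit(1) // non-zero exit fails the CI job
	}
	fmt.Println("transport lint passed")
}
```

Because the check reports every finding rather than stopping at the first, a single CI run shows the full gap between the current config and the hardened posture.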
Continue with JetStream Broadcast Semantics and MaxAckPending Tuning.