The production problem
A plaintext broker endpoint can sneak into production during emergency edits, copied manifests, or rushed migrations.
Without an explicit startup gate, the system often keeps running and you only discover the drift during an audit or incident review.
The fix is simple: fail fast when transport security is misconfigured, then force a conscious override when you truly need break-glass behavior.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS TLS documentation | How to configure TLS certificates and secure NATS server/client transport. | No platform-specific production guardrails that block plaintext by default at app startup. |
| RabbitMQ TLS support | Broker-side TLS enablement and certificate configuration for AMQP clients. | No app-level environment enforcement gate like `reject non-TLS URL in production`. |
| Kafka SSL client configuration | Client SSL properties and truststore/keystore setup for encrypted transport. | No direct equivalent of a runtime startup refusal when a plaintext endpoint is configured. |
Broker docs explain how to configure TLS. They usually do not provide an opinionated application boot guard that rejects insecure transport by default.
Cordum runtime behavior
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Production gate | If the environment is production and the URL does not start with `tls://`, startup returns an error unless the override is enabled. | Plaintext broker drift is blocked before the scheduler begins processing traffic. |
| Override knob | `CORDUM_NATS_ALLOW_PLAINTEXT=true` bypasses the production TLS gate. | Supports emergency operation at the cost of transport-security downgrade. |
| TLS env application | TLS env variables are evaluated only when URL uses `tls://` scheme. | Avoids false confidence where TLS files exist but plaintext URL is still used. |
| Auth policy | Auth options are applied in priority order: user/pass, token, then nkey seed. | Transport and identity controls are independently configurable. |
| Production auth warning | If production starts without broker auth, bus logs a warning. | Signals insecure identity posture even when transport encryption is present. |
| Test coverage | Tests cover plaintext rejection, override path, dev-mode allowance, and auth option selection. | Regression risk is lower during refactors of connection boot logic. |
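The auth policy row describes a fixed priority order. A minimal sketch of that selection is shown here, using only the standard library so it can run on its own. The function name `selectAuthMode` and the `authMode` type are hypothetical, and the variable names come from the runbook in this article. The real code presumably also reads a password for user/pass auth, but that variable name is not given here, so the sketch keys on the username alone.

```go
package main

import (
	"fmt"
	"os"
)

// authMode names which credential type was selected (hypothetical type).
type authMode string

const (
	authNone     authMode = "none"
	authUserPass authMode = "user/pass"
	authToken    authMode = "token"
	authNKey     authMode = "nkey"
)

// selectAuthMode returns the first configured mode in priority order:
// user/pass, then token, then nkey seed.
// getenv is injected so the logic can be tested without touching the process env.
func selectAuthMode(getenv func(string) string) authMode {
	switch {
	case getenv("NATS_USERNAME") != "":
		return authUserPass
	case getenv("NATS_TOKEN") != "":
		return authToken
	case getenv("NATS_NKEY") != "":
		return authNKey
	default:
		return authNone
	}
}

func main() {
	mode := selectAuthMode(os.Getenv)
	if mode == authNone {
		// Mirrors the production auth warning described in the table.
		fmt.Println("warning: no broker auth configured")
		return
	}
	fmt.Println("auth mode:", mode)
}
```

Because the first match wins, leaving a stale username set alongside a new token would keep user/pass in effect, which is worth checking during credential migrations.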
Code-level mechanics
Production transport gate (Go)
    // Enforce TLS in production: reject nats:// unless explicitly allowed.
    if production && !strings.HasPrefix(url, "tls://") {
        if !parseBoolEnv("CORDUM_NATS_ALLOW_PLAINTEXT") {
            return nil, fmt.Errorf("nats TLS required in production: use tls:// scheme or set CORDUM_NATS_ALLOW_PLAINTEXT=true")
        }
        slog.Warn("bus: plaintext NATS allowed in production via override", "url", url)
    }

TLS + auth layering (Go)
    if strings.HasPrefix(url, "tls://") {
        tlsConfig, err := natsTLSConfigFromEnv()
        if err != nil {
            return nil, err
        }
        if tlsConfig != nil {
            opts = append(opts, nats.Secure(tlsConfig))
        }
    }
    authConfigured := natsApplyAuth(&opts)
    if production && !authConfigured {
        slog.Warn("bus: NATS authentication not configured in production")
    }

Regression tests for guard behavior (Go)
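    // Assumes the test file imports "strings" and "testing".
    func TestNewNatsBus_PlaintextRejectedInProduction(t *testing.T) {
        t.Setenv("CORDUM_ENV", "production")
        t.Setenv("CORDUM_NATS_ALLOW_PLAINTEXT", "")
        _, err := NewNatsBus("nats://localhost:14222")
        // Expect the TLS enforcement error.
        if err == nil {
            t.Fatal("expected TLS enforcement error, got nil")
        }
    }

    func TestNewNatsBus_PlaintextAllowedWithOverride(t *testing.T) {
        t.Setenv("CORDUM_ENV", "production")
        t.Setenv("CORDUM_NATS_ALLOW_PLAINTEXT", "true")
        _, err := NewNatsBus("nats://localhost:14222")
        // The connection itself may still fail; only the TLS gate must not trigger.
        if err != nil && strings.Contains(err.Error(), "TLS required in production") {
            t.Fatalf("override did not bypass TLS gate: %v", err)
        }
    }

Operator runbook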
    # 1) Confirm transport URL uses tls:// in production
    kubectl -n cordum exec deploy/cordum-scheduler -- printenv NATS_URL

    # 2) Verify plaintext override is not enabled
    kubectl -n cordum exec deploy/cordum-scheduler -- printenv CORDUM_NATS_ALLOW_PLAINTEXT

    # 3) Verify TLS material variables
    kubectl -n cordum exec deploy/cordum-scheduler -- printenv NATS_TLS_CA NATS_TLS_CERT NATS_TLS_KEY NATS_TLS_SERVER_NAME

    # 4) Verify auth layer vars (at least one auth mode)
    kubectl -n cordum exec deploy/cordum-scheduler -- printenv NATS_USERNAME NATS_TOKEN NATS_NKEY

    # 5) Pre-deploy CI check
    cd D:/Cordum/cordum
    go test ./core/infra/bus -run "TestNewNatsBus_PlaintextRejectedInProduction|TestNewNatsBus_PlaintextAllowedWithOverride"
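Step 3 only shows that the TLS variables are set, not that the material behind them loads. The sketch below shows one way such variables could be turned into a `*tls.Config` using only the standard library. It assumes each variable holds a file path, which this article does not confirm, and the function name `tlsConfigFromEnv` is hypothetical. It returns `nil` when nothing is configured, matching the `tlsConfig != nil` check in the layering snippet, but it is not Cordum's actual `natsTLSConfigFromEnv`.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"os"
)

// tlsConfigFromEnv builds a TLS config from the variables named in the runbook.
// It returns (nil, nil) when none of them is set.
// Assumption: CA, cert, and key variables contain file paths to PEM files.
func tlsConfigFromEnv(getenv func(string) string) (*tls.Config, error) {
	ca, cert, key := getenv("NATS_TLS_CA"), getenv("NATS_TLS_CERT"), getenv("NATS_TLS_KEY")
	serverName := getenv("NATS_TLS_SERVER_NAME")
	if ca == "" && cert == "" && key == "" && serverName == "" {
		return nil, nil
	}

	cfg := &tls.Config{MinVersion: tls.VersionTLS12, ServerName: serverName}

	if ca != "" {
		pem, err := os.ReadFile(ca)
		if err != nil {
			return nil, fmt.Errorf("read CA file: %w", err)
		}
		pool := x509.NewCertPool()
		if !pool.AppendCertsFromPEM(pem) {
			return nil, errors.New("CA file contains no valid PEM certificates")
		}
		cfg.RootCAs = pool
	}

	// A client certificate without its key (or the reverse) is a config error.
	if (cert == "") != (key == "") {
		return nil, errors.New("NATS_TLS_CERT and NATS_TLS_KEY must be set together")
	}
	if cert != "" {
		pair, err := tls.LoadX509KeyPair(cert, key)
		if err != nil {
			return nil, fmt.Errorf("load client key pair: %w", err)
		}
		cfg.Certificates = []tls.Certificate{pair}
	}
	return cfg, nil
}

func main() {
	cfg, err := tlsConfigFromEnv(os.Getenv)
	switch {
	case err != nil:
		fmt.Println("TLS material invalid:", err)
	case cfg == nil:
		fmt.Println("no TLS variables set")
	default:
		fmt.Println("TLS material loaded")
	}
}
```

Failing on a half-configured key pair or an unreadable CA file at startup surfaces mis-rotated certificates before the first connection attempt rather than during it.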
Limitations and tradeoffs
- Break-glass plaintext override improves operability but weakens transport security immediately.
- TLS without broker auth protects confidentiality, not identity.
- Auth without TLS protects identity, not payload privacy on the wire.
- Strict startup gating can block deploys when cert material is mis-rotated.
If you ever enable plaintext override in production, treat it as an incident with explicit owner, start time, and rollback deadline.
Next step
Run a transport hardening check this week:
1. Verify all production NATS URLs use `tls://`.
2. Confirm plaintext override is unset.
3. Confirm at least one auth mode is configured.
4. Add CI test gate that fails on insecure production transport config.
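The first three checks can be expressed as one function that a CI job runs against the rendered production configuration. The sketch below is one possible shape, not part of Cordum. The function name `checkProdTransport` and the idea of passing the config as a map are assumptions; the variable names are the ones used in the runbook.

```go
package main

import (
	"fmt"
	"strings"
)

// checkProdTransport returns one finding per violated checklist item.
// env holds the effective production configuration, for example values
// parsed from a rendered manifest. An empty result means the config passes.
func checkProdTransport(env map[string]string) []string {
	var findings []string

	// 1. Production URL must use the tls:// scheme.
	if !strings.HasPrefix(env["NATS_URL"], "tls://") {
		findings = append(findings, "NATS_URL does not use tls://")
	}

	// 2. The override must be unset; any non-empty value is flagged.
	if env["CORDUM_NATS_ALLOW_PLAINTEXT"] != "" {
		findings = append(findings, "CORDUM_NATS_ALLOW_PLAINTEXT is set")
	}

	// 3. At least one auth mode must be configured.
	if env["NATS_USERNAME"] == "" && env["NATS_TOKEN"] == "" && env["NATS_NKEY"] == "" {
		findings = append(findings, "no broker auth mode configured")
	}
	return findings
}

func main() {
	// Sample input only; a CI job would load real values instead.
	sample := map[string]string{
		"NATS_URL":                    "nats://nats.internal:4222",
		"CORDUM_NATS_ALLOW_PLAINTEXT": "true",
	}
	for _, f := range checkProdTransport(sample) {
		fmt.Println("FAIL:", f)
	}
}
```

In a pipeline, the caller would exit non-zero or fail a Go test when the returned slice is non-empty, so an insecure config stops the deploy instead of reaching the startup gate.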
Continue with JetStream Broadcast Semantics and MaxAckPending Tuning.