The production problem
A team rotates auth mode from token to NKey. The old token env var stays in the deployment. The NKey is present but never used, because token mode outranks NKey in the selection order.
Nothing crashes. Connection still works. Security assumptions are now wrong and hard to detect from outside.
This is exactly why precedence rules must be documented, tested, and enforced by config policy.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS auth introduction | Token, user/password, and nkey auth concepts in NATS. | No application-level precedence policy when multiple credential modes are simultaneously configured. |
| NATS NKey auth docs | Challenge-response model and key-handling benefits of NKey mode. | No guidance for mixed env scenarios where NKey is configured but overshadowed by higher-priority mode. |
| Kafka security overview | Authentication and encryption modes can be mixed at deployment level. | No direct equivalent of single-process env precedence in lightweight broker client wrappers. |
Docs explain mechanisms well. They rarely explain what happens when multiple mechanisms are configured at once in a single process.
Cordum runtime behavior
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Selection order | `natsApplyAuth` checks user/pass, then token, then nkey seed. | Auth mode is deterministic and easy to reason about in code review. |
| User/pass requirement | Both username and password must be set, otherwise this mode is skipped. | A lone username or password is never sent as half-configured user/pass auth; selection falls through to token, then NKey. |
| Token fallback | Token mode activates only when user/pass mode is not selected. | If both are set, token is ignored by design. |
| NKey fallback | NKey mode activates only when both higher-priority modes are absent. | NKey can be unintentionally shadowed by leftover token or user/pass values. |
| Invalid NKey seed | Bad seed logs error and returns not-configured state for NKey path. | Connection can proceed without intended auth unless other safeguards exist. |
| Production warning | If no auth mode is configured in production, Cordum logs a warning. | Visibility exists, but warning alone does not block startup. |
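The selection order in the table can be mirrored in a small diagnostic that reports which mode wins and which configured modes it shadows. This is a minimal sketch, assuming the same four environment variables; `resolveAuthMode` and its `lookup` parameter are hypothetical names for illustration and are not part of Cordum.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// resolveAuthMode mirrors the user/pass > token > nkey precedence and returns
// the mode that would be selected plus any configured modes it shadows.
// Hypothetical diagnostic helper: lookup is injected so the function can be
// exercised without touching the real environment.
func resolveAuthMode(lookup func(string) string) (selected string, shadowed []string) {
	get := func(k string) string { return strings.TrimSpace(lookup(k)) }

	var configured []string // collected in precedence order
	if get("NATS_USERNAME") != "" && get("NATS_PASSWORD") != "" {
		configured = append(configured, "userpass")
	}
	if get("NATS_TOKEN") != "" {
		configured = append(configured, "token")
	}
	if get("NATS_NKEY") != "" {
		configured = append(configured, "nkey")
	}
	if len(configured) == 0 {
		return "none", nil
	}
	// The first configured mode wins; everything after it is ignored.
	return configured[0], configured[1:]
}

func main() {
	selected, shadowed := resolveAuthMode(os.Getenv)
	fmt.Printf("selected=%s shadowed=%v\n", selected, shadowed)
}
```

Logging the shadowed list at startup would make a leftover token visible even though the connection itself succeeds.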
Code-level mechanics
Auth selection function (Go)
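    // Imports assumed by this excerpt.
    import (
        "log/slog"
        "os"
        "strings"

        "github.com/nats-io/nats.go"
    )

    // natsApplyAuth appends at most one auth option to opts and reports
    // whether an auth mode was configured.
    // Precedence: user/pass, then token, then NKey.
    func natsApplyAuth(opts *[]nats.Option) bool {
        username := strings.TrimSpace(os.Getenv("NATS_USERNAME"))
        password := strings.TrimSpace(os.Getenv("NATS_PASSWORD"))
        // Both values are required; a partial pair skips this mode.
        if username != "" && password != "" {
            *opts = append(*opts, nats.UserInfo(username, password))
            return true
        }
        token := strings.TrimSpace(os.Getenv("NATS_TOKEN"))
        if token != "" {
            *opts = append(*opts, nats.Token(token))
            return true
        }
        nkey := strings.TrimSpace(os.Getenv("NATS_NKEY"))
        if nkey != "" {
            // Note: in nats.go, NkeyOptionFromSeed takes the path of a seed file.
            opt, err := nats.NkeyOptionFromSeed(nkey)
            if err != nil {
                // A bad seed is logged and reported as "not configured".
                slog.Error("bus: invalid NATS_NKEY seed", "err", err)
                return false
            }
            *opts = append(*opts, opt)
            return true
        }
        return false
    }

Priority and edge-case tests (Go)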
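    func TestNatsApplyAuth_PriorityOrder(t *testing.T) {
        // Username/password takes priority over token.
        t.Setenv("NATS_USERNAME", "alice")
        t.Setenv("NATS_PASSWORD", "secret")
        t.Setenv("NATS_TOKEN", "also-set")
        t.Setenv("NATS_NKEY", "")
        var opts []nats.Option
        configured := natsApplyAuth(&opts)
        // Expect exactly one option: user/password.
        if !configured || len(opts) != 1 {
            t.Fatalf("configured=%v, options=%d; want true and 1", configured, len(opts))
        }
    }

    func TestNatsApplyAuth_UsernameWithoutPassword(t *testing.T) {
        // A username without a password should not configure auth.
        t.Setenv("NATS_USERNAME", "alice")
        t.Setenv("NATS_PASSWORD", "")
        t.Setenv("NATS_TOKEN", "")
        t.Setenv("NATS_NKEY", "")
        var opts []nats.Option
        if natsApplyAuth(&opts) || len(opts) != 0 {
            t.Fatalf("partial user/pass must not configure auth; got %d options", len(opts))
        }
    }

Operator runbook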
    # 1) Print active env values
    kubectl -n cordum exec deploy/cordum-scheduler -- printenv NATS_USERNAME NATS_PASSWORD NATS_TOKEN NATS_NKEY

    # 2) Enforce single-mode policy
    # choose exactly one:
    # - user/pass
    # - token
    # - nkey

    # 3) Remove stale vars from old mode
    kubectl -n cordum set env deploy/cordum-scheduler NATS_TOKEN-
    kubectl -n cordum set env deploy/cordum-scheduler NATS_NKEY-

    # 4) Validate precedence behavior in CI
    cd D:/Cordum/cordum
    go test ./core/infra/bus -run "TestNatsApplyAuth_PriorityOrder|TestNatsApplyAuth_UsernameWithoutPassword"

    # 5) Post-deploy check
    # confirm broker auth failures are zero and connection handshake succeeds
Limitations and tradeoffs
- Deterministic precedence improves predictability but can hide stale lower-priority secrets.
- Warning-only behavior for missing auth in production favors availability over strict enforcement.
- NKey security benefits are lost if user/pass or token accidentally remains configured.
- Changing auth mode without cleanup can produce partial rollouts with inconsistent identities.
Treat auth mode as a single source of truth. Never leave fallback credentials in environment after migration.
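One way to enforce a single source of truth is a fail-closed check that runs before the connection is opened. The sketch below is an assumption about how such a policy could look, not Cordum's current behavior: `validateSingleAuthMode` is a hypothetical helper, and treating "no mode configured" as an error is a stricter choice than the warning described above.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// validateSingleAuthMode returns nil only when exactly one auth mode is fully
// configured. Hypothetical policy check; lookup is injected for testability.
func validateSingleAuthMode(lookup func(string) string) error {
	get := func(k string) string { return strings.TrimSpace(lookup(k)) }
	user, pass := get("NATS_USERNAME"), get("NATS_PASSWORD")

	// A half-set user/pass pair is treated as a misconfiguration here,
	// rather than being skipped silently.
	if (user == "") != (pass == "") {
		return errors.New("NATS_USERNAME and NATS_PASSWORD must be set together")
	}

	var modes []string
	if user != "" {
		modes = append(modes, "userpass")
	}
	if get("NATS_TOKEN") != "" {
		modes = append(modes, "token")
	}
	if get("NATS_NKEY") != "" {
		modes = append(modes, "nkey")
	}

	switch len(modes) {
	case 0:
		return errors.New("no NATS auth mode configured")
	case 1:
		return nil
	default:
		return fmt.Errorf("multiple NATS auth modes configured: %s", strings.Join(modes, ", "))
	}
}

func main() {
	if err := validateSingleAuthMode(os.Getenv); err != nil {
		fmt.Fprintln(os.Stderr, "auth policy violation:", err)
		os.Exit(1)
	}
	fmt.Println("auth policy ok")
}
```

Running a check like this at startup, or as a pre-deploy step, turns a stale credential from a silent precedence decision into a visible failure.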
Next step
Do one auth posture cleanup pass:
1. Pick one intended auth mode per environment.
2. Remove all env vars for unused modes.
3. Add CI assertions for precedence-sensitive env combinations.
4. Re-run connection smoke tests after every credential rotation.
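The CI assertions in step 3 can be kept as a table of env combinations and the mode each one is expected to select. The following is a sketch only: `selectMode` is a hypothetical stand-in that restates the documented precedence, and in a real test suite the table would drive the actual selection function instead.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// selectMode is a stand-in for the precedence logic under test
// (user/pass, then token, then nkey). Hypothetical, for illustration.
func selectMode(env map[string]string) string {
	get := func(k string) string { return strings.TrimSpace(env[k]) }
	switch {
	case get("NATS_USERNAME") != "" && get("NATS_PASSWORD") != "":
		return "userpass"
	case get("NATS_TOKEN") != "":
		return "token"
	case get("NATS_NKEY") != "":
		return "nkey"
	}
	return "none"
}

type precedenceCase struct {
	name string
	env  map[string]string
	want string
}

// Each case pins one precedence-sensitive combination.
var precedenceCases = []precedenceCase{
	{name: "userpass beats token", env: map[string]string{"NATS_USERNAME": "alice", "NATS_PASSWORD": "secret", "NATS_TOKEN": "t"}, want: "userpass"},
	{name: "stale token shadows nkey", env: map[string]string{"NATS_TOKEN": "t", "NATS_NKEY": "seed"}, want: "token"},
	{name: "username only falls through to nkey", env: map[string]string{"NATS_USERNAME": "alice", "NATS_NKEY": "seed"}, want: "nkey"},
	{name: "whitespace-only token is ignored", env: map[string]string{"NATS_TOKEN": "  ", "NATS_NKEY": "seed"}, want: "nkey"},
	{name: "nothing set", env: map[string]string{}, want: "none"},
}

// runPrecedenceCases returns one message per failing case.
func runPrecedenceCases() []string {
	var failures []string
	for _, c := range precedenceCases {
		if got := selectMode(c.env); got != c.want {
			failures = append(failures, fmt.Sprintf("%s: got %s, want %s", c.name, got, c.want))
		}
	}
	return failures
}

func main() {
	if failures := runPrecedenceCases(); len(failures) > 0 {
		fmt.Println(strings.Join(failures, "\n"))
		os.Exit(1)
	}
	fmt.Println("all precedence cases pass")
}
```

Adding a row whenever a new credential mode or env var is introduced keeps the precedence contract explicit in review.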
Continue with NATS TLS Enforcement and JetStream Broadcast Semantics.