The production problem
A team rotates its auth mode from token to NKey. The old token env var stays in the deployment, so the token still wins the precedence check and the NKey is present but never used.
Nothing crashes. The connection still works. The security assumptions are now wrong, and the mismatch is hard to detect from outside.
This is exactly why precedence rules must be documented, tested, and enforced by config policy.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS docs: Tokens | Server-side token configuration and client token connection pattern. | No guidance for mixed envs where token is set but shadowed by user/password in app code. |
| NATS docs: NKeys | Challenge-response auth flow and key-handling model for NKey clients. | No guidance for mixed env scenarios where NKey is configured but silently shadowed. |
| nats.go package docs | Client options (`UserInfo`, `Token`, `NkeyOptionFromSeed`) and mutual-exclusion errors in callbacks. | No production playbook for env-collision detection and CI enforcement in wrapper libraries. |
Docs explain mechanisms well. They rarely explain what happens when multiple mechanisms are configured at once in a single process.
Cordum runtime behavior
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Selection order | `natsApplyAuth` checks user/pass, then token, then nkey seed. | Auth mode is deterministic and easy to reason about in code review. |
| User/pass requirement | Both username and password must be set; otherwise this mode is skipped. | Partial credentials are never sent as a half-configured login; selection falls through to token, then NKey. |
| Token fallback | Token mode activates only when user/pass mode is not selected. | If both are set, token is ignored by design. |
| NKey fallback | NKey mode activates only when both higher-priority modes are absent. | NKey can be unintentionally shadowed by leftover token or user/pass values. |
| Invalid NKey seed | A bad seed logs an error and `natsApplyAuth` returns `false`, the same result as no auth being configured. | Connection can proceed without the intended auth unless other safeguards exist. |
| Production warning | If no auth mode is configured in production, Cordum logs a warning. | Visibility exists, but warning alone does not block startup. |
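The selection order in the table can be mirrored by a small pure function that reports which mode would win, so the effective mode can be logged at startup. This is a minimal sketch, not Cordum's code: `effectiveMode` and its `getenv` parameter are hypothetical names, and the function only inspects which variables are set without validating the NKey seed.

```go
package main

import (
	"fmt"
	"strings"
)

// effectiveMode reports which auth mode would be selected for a given
// environment, following the order in the table: user/pass, token, NKey.
// Hypothetical helper; it does not validate the NKey seed.
func effectiveMode(getenv func(string) string) string {
	get := func(key string) string { return strings.TrimSpace(getenv(key)) }
	switch {
	case get("NATS_USERNAME") != "" && get("NATS_PASSWORD") != "":
		return "userpass"
	case get("NATS_TOKEN") != "":
		return "token"
	case get("NATS_NKEY") != "":
		return "nkey"
	default:
		return "none"
	}
}

func main() {
	// A leftover token shadows the NKey seed (placeholder values).
	env := map[string]string{"NATS_TOKEN": "old-token", "NATS_NKEY": "seed-placeholder"}
	fmt.Println(effectiveMode(func(k string) string { return env[k] })) // prints "token"
}
```

In a real service the lookup would likely be `os.Getenv`; injecting it keeps the function testable without touching the process environment.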
Env collision matrix
The easiest way to remove auth ambiguity is to make collisions impossible. This matrix shows the effective mode for common env combinations in current Cordum code.
| Configured env set | Effective mode | Risk |
|---|---|---|
| NATS_USERNAME+NATS_PASSWORD only | User/Pass | Expected path when user/pass is your selected mode. |
| NATS_TOKEN only | Token | Simple rollout, but token can be accidentally left behind during migrations. |
| NATS_NKEY only | NKey | Strong mode, but invalid seed falls back to unauthenticated in current code path. |
| User/Pass + Token | User/Pass | Token is ignored. Operators may believe token rotation is active when it is not. |
| Token + NKey | Token | NKey is shadowed by token. |
| User/Pass + Token + NKey | User/Pass | Highest collision risk. Migration intent is often unclear in incident review. |
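The collision rows of the matrix can also be detected at runtime by listing every configured mode in precedence order: the first entry is the effective mode and anything after it is shadowed. The sketch below assumes the same variable names as the matrix; `configuredModes` is a hypothetical helper, not part of Cordum.

```go
package main

import (
	"fmt"
	"strings"
)

// configuredModes lists every auth mode whose variables are set, in the
// precedence order described in the matrix. The first entry is the effective
// mode; any further entries are shadowed. Hypothetical helper.
func configuredModes(getenv func(string) string) []string {
	set := func(key string) bool { return strings.TrimSpace(getenv(key)) != "" }
	var modes []string
	if set("NATS_USERNAME") && set("NATS_PASSWORD") {
		modes = append(modes, "userpass")
	}
	if set("NATS_TOKEN") {
		modes = append(modes, "token")
	}
	if set("NATS_NKEY") {
		modes = append(modes, "nkey")
	}
	return modes
}

func main() {
	// Placeholder values for the "User/Pass + Token + NKey" row.
	env := map[string]string{
		"NATS_USERNAME": "alice",
		"NATS_PASSWORD": "secret",
		"NATS_TOKEN":    "old-token",
		"NATS_NKEY":     "seed-placeholder",
	}
	modes := configuredModes(func(k string) string { return env[k] })
	if len(modes) > 1 {
		// prints: effective=userpass shadowed=token,nkey
		fmt.Printf("effective=%s shadowed=%s\n", modes[0], strings.Join(modes[1:], ","))
	}
}
```

Logging the shadowed list at startup is one way to surface stale credentials that the connection itself never reveals.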
CI guard: fail if not exactly one auth mode
modes=0
if [ -n "${NATS_USERNAME:-}" ] && [ -n "${NATS_PASSWORD:-}" ]; then
  modes=$((modes+1))
fi
if [ -n "${NATS_TOKEN:-}" ]; then
  modes=$((modes+1))
fi
if [ -n "${NATS_NKEY:-}" ]; then
  modes=$((modes+1))
fi
if [ "$modes" -ne 1 ]; then
  echo "NATS auth policy violation: expected exactly one mode, got $modes"
  exit 1
fi
Code-level mechanics
Auth selection function (Go)
import (
	"log/slog"
	"os"
	"strings"

	"github.com/nats-io/nats.go"
)

// natsApplyAuth appends at most one auth option to opts, checking user/pass,
// then token, then NKey seed. It returns true when an option was added.
func natsApplyAuth(opts *[]nats.Option) bool {
	username := strings.TrimSpace(os.Getenv("NATS_USERNAME"))
	password := strings.TrimSpace(os.Getenv("NATS_PASSWORD"))
	if username != "" && password != "" {
		*opts = append(*opts, nats.UserInfo(username, password))
		return true
	}
	token := strings.TrimSpace(os.Getenv("NATS_TOKEN"))
	if token != "" {
		*opts = append(*opts, nats.Token(token))
		return true
	}
	nkey := strings.TrimSpace(os.Getenv("NATS_NKEY"))
	if nkey != "" {
		opt, err := nats.NkeyOptionFromSeed(nkey)
		if err != nil {
			// An invalid seed leaves opts unchanged; the caller sees "not configured".
			slog.Error("bus: invalid NATS_NKEY seed", "err", err)
			return false
		}
		*opts = append(*opts, opt)
		return true
	}
	return false
}
Priority and edge-case tests (Go)
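func TestNatsApplyAuth_PriorityOrder(t *testing.T) {
	// Username/password takes priority over token.
	t.Setenv("NATS_USERNAME", "alice")
	t.Setenv("NATS_PASSWORD", "secret")
	t.Setenv("NATS_TOKEN", "also-set")
	t.Setenv("NATS_NKEY", "")
	var opts []nats.Option
	configured := natsApplyAuth(&opts)
	// Expect exactly one option: user/password.
	if !configured || len(opts) != 1 {
		t.Fatalf("configured=%v, options=%d; want true, 1", configured, len(opts))
	}
}
func TestNatsApplyAuth_UsernameWithoutPassword(t *testing.T) {
	// A username without a password should not configure auth.
	t.Setenv("NATS_USERNAME", "alice")
	t.Setenv("NATS_PASSWORD", "")
	t.Setenv("NATS_TOKEN", "")
	t.Setenv("NATS_NKEY", "")
	var opts []nats.Option
	if natsApplyAuth(&opts) || len(opts) != 0 {
		t.Fatalf("partial user/pass must not configure auth; got %d options", len(opts))
	}
}
Operator runbook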
# 1) Print active env values
kubectl -n cordum exec deploy/cordum-scheduler -- printenv NATS_USERNAME NATS_PASSWORD NATS_TOKEN NATS_NKEY
# 2) Enforce single-mode policy
# choose exactly one:
# - user/pass
# - token
# - nkey
# 3) Remove stale vars from old mode
kubectl -n cordum set env deploy/cordum-scheduler NATS_TOKEN-
kubectl -n cordum set env deploy/cordum-scheduler NATS_NKEY-
# 4) Add CI collision guard (fail if modes != 1)
./ci-nats-auth-lint.sh
# 5) Validate precedence behavior in CI
cd D:/Cordum/cordum
go test ./core/infra/bus -run "TestNatsApplyAuth_(PriorityOrder|Token|UserInfo|UsernameWithoutPassword|NKeyInvalidSeed)"
# 6) Post-deploy check
# confirm broker auth failures are zero and connection handshake succeeds
Limitations and tradeoffs
| Decision | Benefit | Cost |
|---|---|---|
| Keep deterministic precedence | Predictable behavior in code and incident timelines. | Stale lower-priority secrets can remain unnoticed. |
| Warning-only when no auth in production | Startup stays available during temporary misconfiguration. | Security posture depends on operators noticing warnings quickly. |
| Single-mode CI lint enforcement | Prevents mixed-mode drift before deploy. | Requires pipeline wiring and env template discipline. |
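Where the opposite tradeoff to warning-only is preferred, the two soft outcomes (no auth mode in production, and an NKey seed that was set but rejected) can be turned into startup errors. The following is a sketch under assumptions, not Cordum's behavior: `authOutcome`, `enforce`, and the two error values are hypothetical, and how the service decides it is running in production is left to the caller.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors for a fail-closed startup check.
var (
	errNoAuth      = errors.New("bus: no NATS auth mode configured in production")
	errInvalidNKey = errors.New("bus: NATS_NKEY is set but the seed is invalid")
)

// authOutcome describes what an auth-selection step observed.
type authOutcome struct {
	Configured  bool // an auth option was applied
	InvalidNKey bool // NATS_NKEY was set but its seed was rejected
}

// enforce returns an error for outcomes that would otherwise only be logged.
// A rejected seed is treated as fatal in every environment, because it means
// the intended auth mode is not in effect.
func enforce(o authOutcome, production bool) error {
	switch {
	case o.InvalidNKey:
		return errInvalidNKey
	case production && !o.Configured:
		return errNoAuth
	}
	return nil
}

func main() {
	fmt.Println(enforce(authOutcome{}, true))                  // no auth in production
	fmt.Println(enforce(authOutcome{InvalidNKey: true}, false)) // rejected seed
	fmt.Println(enforce(authOutcome{Configured: true}, true))   // <nil>
}
```

The cost is the one named in the table: a temporary misconfiguration now blocks startup instead of producing a warning.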
Treat auth mode as a single source of truth. Never leave fallback credentials in the environment after a migration.
Next step
Do one auth posture cleanup pass:
1. Pick one intended auth mode per environment.
2. Remove all env vars for unused modes.
3. Add CI assertions for precedence-sensitive env combinations.
4. Re-run connection smoke tests after every credential rotation.
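For the CI assertion step, the same exactly-one-mode rule can be applied to env templates before they reach a cluster. The sketch below assumes templates are stored as dotenv-style `KEY=VALUE` text, which may not match your setup; `lintEnvTemplate` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// lintEnvTemplate parses KEY=VALUE lines and returns an error unless exactly
// one NATS auth mode is configured. Hypothetical helper; blank lines and
// lines starting with '#' are skipped, and quoted values are not unquoted.
func lintEnvTemplate(content string) error {
	vars := map[string]string{}
	for _, raw := range strings.Split(content, "\n") {
		line := strings.TrimSpace(raw)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		key, value, ok := strings.Cut(line, "=")
		if !ok {
			continue // not a KEY=VALUE line
		}
		vars[strings.TrimSpace(key)] = strings.TrimSpace(value)
	}
	modes := 0
	if vars["NATS_USERNAME"] != "" && vars["NATS_PASSWORD"] != "" {
		modes++
	}
	if vars["NATS_TOKEN"] != "" {
		modes++
	}
	if vars["NATS_NKEY"] != "" {
		modes++
	}
	if modes != 1 {
		return fmt.Errorf("NATS auth policy violation: expected exactly one mode, got %d", modes)
	}
	return nil
}

func main() {
	// Placeholder template with a leftover token next to the new NKey seed.
	template := "# staging\nNATS_TOKEN=old-token\nNATS_NKEY=seed-placeholder\n"
	fmt.Println(lintEnvTemplate(template))
}
```

Because values are not unquoted, a line such as `NATS_TOKEN=""` would count as set; a stricter parser would be needed if templates use quoting.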
Continue with NATS TLS Enforcement and JetStream Broadcast Semantics.