The production problem
A common crash path looks harmless: the message is processed, the process dies before it can ack, and the message is redelivered later. If your dedup state expired before redelivery, you run the job again.
This is not a broker bug. It is a timing mismatch between the redelivery window and your dedup key lifetime.
In AI agent control planes, duplicate execution can create duplicate side effects, noisy incidents, and hard audit questions.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS JetStream consumer docs | AckWait, MaxAckPending, MaxDeliver, and at-least-once delivery behavior. | No guidance on app-level Redis dedup TTL alignment when AckWait is overridden. |
| NATS JetStream model deep dive | Message dedup via `Nats-Msg-Id`, ack models, redelivery behavior, and duplicate window defaults. | No Redis-side processed-key TTL alignment strategy for consumer crash windows. |
| NATS blog: infinite message deduplication | How publish deduplication and stream discard policy shape ingestion guarantees. | Focuses on publish-path dedup; does not address post-consume replay after process-before-ack crashes. |
Public docs explain ack deadlines well. They usually stop short of one practical question: does your dedup persistence window actually match your redelivery window?
Cordum runtime behavior
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Default AckWait | `defaultAckWait = 10m` and JetStream subscriptions use `nats.AckWait(b.ackWait)`. | Redelivery window is controlled by AckWait when BackOff is not configured. |
| AckWait override | `NATS_JS_ACK_WAIT` can override AckWait via `time.ParseDuration`. | Effective redelivery window may diverge from built-in dedup TTL assumptions. |
| Publish dedup window | `computeMsgID()` feeds `nats.MsgId(...)` and stream `Duplicates` window is configured as `2m`. | Prevents duplicate writes in a short window, but does not dedupe consumer-side crash replays. |
| Processed key TTL | `processedKeyTTL = 10m` for `cordum:bus:processed:<stream>:<seq>`. | If AckWait is higher than 10m, dedup key can expire before redelivery occurs. |
| Inflight key TTL | `inflightKeyTTL = 2m` on `cordum:bus:inflight:<stream>:<seq>`. | Purely operational signal; not used for correctness decisions. |
| Retry fallback | Redis dedup check/set failures trigger `NakWithDelay(2s)`. | Prevents blind ack when dedup state cannot be safely evaluated. |
| Delivery cap | `MaxDeliver(maxJSRedeliveries)` with `maxJSRedeliveries=100`. | Poison messages are eventually terminated instead of blocking forever. |
Code-level mechanics
Ack and dedup constants (Go)
```go
const (
	defaultAckWait = 10 * time.Minute
	// processed key: cordum:bus:processed:<stream>:<seq>
	processedKeyTTL = 10 * time.Minute
	// inflight key: cordum:bus:inflight:<stream>:<seq>
	inflightKeyTTL = 2 * time.Minute
)
```
AckWait override and consumer options (Go)
```go
ackWait := defaultAckWait
if v := strings.TrimSpace(os.Getenv("NATS_JS_ACK_WAIT")); v != "" {
	if d, err := time.ParseDuration(v); err == nil && d > 0 {
		ackWait = d
	}
}
b.ackWait = ackWait

opts := []nats.SubOpt{
	nats.ManualAck(),
	nats.AckExplicit(),
	nats.AckWait(b.ackWait),
	nats.MaxAckPending(natsMaxAckPending()),
	nats.MaxDeliver(maxJSRedeliveries),
}
```

This is the key mismatch zone: `AckWait` is runtime-configurable, while the processed-key TTL remains a fixed constant in the current code.
Publish dedup boundary (Go)
```go
msgID := computeMsgID(subject, packet)
if msgID != "" {
	_, err = b.js.Publish(subject, data, nats.MsgId(msgID))
}

_, err := js.AddStream(&nats.StreamConfig{
	Name:       streamJobs,
	Subjects:   []string{"job.>", "worker.*.jobs"},
	Duplicates: 2 * time.Minute,
})
```

`Nats-Msg-Id` dedup lives on the publish path with a stream duplicate window. It does not cover consumer replay after a process-before-ack failure.
Dedup flow around ack (Go)
```go
// 1) Check processed key (crash-safe guard)
exists, err := b.redis.Exists(ctx, processedKey(stream, seq)).Result()
if err != nil {
	_ = msg.NakWithDelay(2 * time.Second)
	return
}
if exists > 0 {
	_ = msg.Ack()
	return
}

// 2) Process handler
action, delay := processBusMsg(msg.Data, handler, numDelivered)

// 3) On success, set processed key BEFORE Ack
if setErr := b.redis.Set(ctx, processedKey(stream, seq), "1", processedKeyTTL).Err(); setErr != nil {
	_ = msg.NakWithDelay(2 * time.Second)
	return
}
_ = msg.Ack()
```

Writing processed state before ack is correct for crash safety. The risk is not ordering; it is key lifetime versus redelivery timing.
Operator runbook
```sh
# 1) Check effective AckWait (empty means default 10m)
kubectl -n cordum exec deploy/cordum-scheduler -- printenv NATS_JS_ACK_WAIT

# 2) Confirm stream duplicate window (publish-path context only)
nats stream info CORDUM_JOBS | rg -i "Duplicate|Duplicates"

# 3) Sample processed key TTLs
for k in $(redis-cli --scan --pattern 'cordum:bus:processed:CORDUM_JOBS:*' | head -n 20); do
  echo "$k $(redis-cli TTL "$k")"
done

# 4) Safety invariant
#    effective_ack_wait_seconds <= processed_key_ttl_seconds

# 5) If AckWait is customized above 10m, run a crash drill:
#    - publish one long-running job
#    - kill the scheduler after handler success but before the ack window closes
#    - verify duplicate processing does not occur on redelivery

# 6) Alerting suggestions
#    - high redelivery count per subject
#    - spikes in dedup-skipped acks
#    - processed-key TTL histogram drifting below configured AckWait
```
Limitations and tradeoffs
- Keeping dedup TTL high reduces duplicates but increases Redis key retention.
- A lower AckWait speeds recovery but raises redelivery pressure for slow handlers.
- Publish dedup (`Nats-Msg-Id`) and consumer replay dedup solve different failure paths.
- Redis outages fall back to JetStream-only behavior, so duplicate exposure increases.
- Queue-level dedup does not replace business-level idempotency in worker code.
Treat this as an invariant: effective AckWait <= dedup TTL. If this drifts, crash recovery can replay already-completed work.
Next step
Run one controlled crash drill this week and record three numbers:
1. Effective AckWait in each environment.
2. Observed processed-key TTL distribution in Redis.
3. Duplicate execution count after crash-before-ack simulation.
Continue with *Exactly-Once Myth* and *Timeouts, Retries, and Backoff*.