## The production problem
A common crash path looks harmless: the message is processed, the process dies before the ack, and the message is redelivered later. If your dedup state expired before redelivery, you run the job again.
This is not a broker bug. It is a timing mismatch between the redelivery window and your dedup key lifetime.
In AI agent control planes, duplicate execution can create duplicate side effects, noisy incidents, and hard audit questions.
## What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| NATS JetStream consumer docs | AckWait, MaxAckPending, MaxDeliver, and at-least-once delivery behavior. | No guidance on app-level Redis dedup TTL alignment when AckWait is overridden. |
| AWS SQS visibility timeout guidance | Timeout too short causes duplicates; timeout too long delays retries. | No stream-sequence dedup key design for crash-before-ack windows. |
| Google Pub/Sub lease management | Ack deadline extension and duplicate-vs-redelivery tradeoff. | No fixed TTL dedup key discussion tied to overrideable ack deadline settings. |
Public docs explain ack deadlines well. They usually stop short of one practical question: does your dedup persistence window actually match your redelivery window?
## Cordum runtime behavior
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Default AckWait | `defaultAckWait = 10m` and JetStream subscriptions use `nats.AckWait(b.ackWait)`. | Redelivery window is controlled by AckWait when BackOff is not configured. |
| AckWait override | `NATS_JS_ACK_WAIT` can override AckWait via `time.ParseDuration`. | Effective redelivery window may diverge from built-in dedup TTL assumptions. |
| Processed key TTL | `processedKeyTTL = 10m` for `cordum:bus:processed:<stream>:<seq>`. | If AckWait is higher than 10m, dedup key can expire before redelivery occurs. |
| Inflight key TTL | `inflightKeyTTL = 2m` on `cordum:bus:inflight:<stream>:<seq>`. | Purely operational signal; not used for correctness decisions. |
| Retry fallback | Redis dedup check/set failures trigger `NakWithDelay(2s)`. | Prevents blind ack when dedup state cannot be safely evaluated. |
| Delivery cap | `MaxDeliver(maxJSRedeliveries)` with `maxJSRedeliveries=100`. | Poison messages are eventually terminated instead of blocking forever. |
## Code-level mechanics

### Ack and dedup constants (Go)
```go
const (
	defaultAckWait = 10 * time.Minute

	// processed key: cordum:bus:processed:<stream>:<seq>
	processedKeyTTL = 10 * time.Minute

	// inflight key: cordum:bus:inflight:<stream>:<seq>
	inflightKeyTTL = 2 * time.Minute
)
```
### AckWait override and consumer options (Go)

```go
ackWait := defaultAckWait
if v := strings.TrimSpace(os.Getenv("NATS_JS_ACK_WAIT")); v != "" {
	if d, err := time.ParseDuration(v); err == nil && d > 0 {
		ackWait = d
	}
}
b.ackWait = ackWait

opts := []nats.SubOpt{
	nats.ManualAck(),
	nats.AckExplicit(),
	nats.AckWait(b.ackWait),
	nats.MaxAckPending(natsMaxAckPending()),
	nats.MaxDeliver(maxJSRedeliveries),
}
```

This is the key mismatch zone: `AckWait` is runtime-configurable, while the processed-key TTL remains a fixed constant in the current code.
### Dedup flow around ack (Go)

```go
// 1) Check processed key (crash-safe guard)
exists, err := b.redis.Exists(ctx, processedKey(stream, seq)).Result()
if err != nil {
	_ = msg.NakWithDelay(2 * time.Second)
	return
}
if exists > 0 {
	_ = msg.Ack()
	return
}

// 2) Process handler
action, delay := processBusMsg(msg.Data, handler, numDelivered)

// 3) On success, set processed key BEFORE Ack
if setErr := b.redis.Set(ctx, processedKey(stream, seq), "1", processedKeyTTL).Err(); setErr != nil {
	_ = msg.NakWithDelay(2 * time.Second)
	return
}
_ = msg.Ack()
```

Writing processed state before the ack is correct for crash safety. The risk is not the ordering; the risk is key lifetime versus redelivery timing.
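The `processedKey` helper itself is not shown in the excerpt. Given the key format listed in the runtime-behavior table, a minimal sketch might look like this (the exact signature, including the `uint64` sequence type, is an assumption):

```go
package main

import "fmt"

// processedKey builds the dedup key for one delivered message, following the
// cordum:bus:processed:<stream>:<seq> format described above.
func processedKey(stream string, seq uint64) string {
	return fmt.Sprintf("cordum:bus:processed:%s:%d", stream, seq)
}

func main() {
	fmt.Println(processedKey("CORDUM_JOBS", 42))
}
```

Note that the existence check and the later `Set` are separated by handler execution, so the pair is not atomic; JetStream's per-consumer redelivery model makes a race on the same sequence unlikely, but business-level idempotency in the worker remains the backstop.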
## Operator runbook

```sh
# 1) Check effective AckWait (empty means default 10m)
kubectl -n cordum exec deploy/cordum-scheduler -- printenv NATS_JS_ACK_WAIT

# 2) Sample processed key TTLs
for k in $(redis-cli --scan --pattern 'cordum:bus:processed:CORDUM_JOBS:*' | head -n 20); do
  echo "$k $(redis-cli TTL "$k")"
done

# 3) Safety invariant
#    effective_ack_wait_seconds <= processed_key_ttl_seconds

# 4) If AckWait is customized above 10m, run a crash drill:
#    - publish one long-running job
#    - kill the scheduler after handler success but before the ack window closes
#    - verify duplicate processing does not occur on redelivery

# 5) Alerting suggestions
#    - high redelivery count per subject
#    - spikes in dedup-skipped acks
#    - processed-key TTL histogram drifting below configured AckWait
```
## Limitations and tradeoffs

- Keeping dedup TTL high reduces duplicates but increases Redis key retention.
- Lower AckWait speeds recovery but raises redelivery pressure for slow handlers.
- Redis outages fall back to JetStream-only behavior, so duplicate exposure increases.
- Queue-level dedup does not replace business-level idempotency in worker code.
Treat this as an invariant: `effective AckWait <= dedup TTL`. If this drifts, crash recovery can replay already-completed work.
## Next step

Run one controlled crash drill this week and record three numbers:

1. Effective AckWait in each environment.
2. Observed processed-key TTL distribution in Redis.
3. Duplicate execution count after crash-before-ack simulation.
Continue with *Exactly-Once Myth* and *Timeouts, Retries, and Backoff*.