Deep Dive

AI Agent AckWait and Dedup TTL Alignment

A queue can be healthy and still duplicate work after crashes if ack windows and dedup windows drift apart.

10 min read · Apr 2026
TL;DR
  • JetStream is at-least-once. Redelivery after a missed ack is expected behavior.
  • Cordum uses Redis processed keys (`cordum:bus:processed:<stream>:<seq>`) to reduce duplicate handling after crashes.
  • Current code keeps `processedKeyTTL=10m` while `NATS_JS_ACK_WAIT` can override AckWait, which creates a mismatch risk when AckWait is set above 10m.
  • JetStream `Nats-Msg-Id` dedup controls duplicate publish writes, not post-consume crash replays.
  • If AckWait exceeds the dedup TTL, a crash-before-ack path can redeliver after the dedup key has already expired.
Crash window covered

The processed key is written before Ack, so a crash after processing is usually deduped on redelivery.

Config mismatch risk

AckWait is configurable via env, while the processed-key TTL is fixed at 10m in the current implementation.

Operational check

Treat `effective AckWait <= dedup TTL` as a hard safety invariant in production.

Scope

This guide focuses on scheduler bus-consumer reliability in JetStream mode. It does not cover worker handler idempotency design.

The production problem

A common crash path looks harmless: message is processed, process dies before ack, message redelivers later. If your dedup state expired before redelivery, you run the job again.

This is not a broker bug. It is a timing mismatch between the redelivery window and your dedup key lifetime.

In AI agent control planes, duplicate execution can create duplicate side effects, noisy incidents, and hard audit questions.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| NATS JetStream consumer docs | AckWait, MaxAckPending, MaxDeliver, and at-least-once delivery behavior. | No guidance on app-level Redis dedup TTL alignment when AckWait is overridden. |
| NATS JetStream model deep dive | Message dedup via `Nats-Msg-Id`, ack models, redelivery behavior, and duplicate window defaults. | No Redis-side processed-key TTL alignment strategy for consumer crash windows. |
| NATS blog: infinite message deduplication | How publish deduplication and stream discard policy shape ingestion guarantees. | Focuses on publish-path dedup; does not address post-consume replay after process-before-ack crashes. |

Public docs explain ack deadlines well. They usually stop short of one practical question: does your dedup persistence window actually match your redelivery window?

Cordum runtime behavior

| Boundary | Current behavior | Operational impact |
| --- | --- | --- |
| Default AckWait | `defaultAckWait = 10m` and JetStream subscriptions use `nats.AckWait(b.ackWait)`. | Redelivery window is controlled by AckWait when BackOff is not configured. |
| AckWait override | `NATS_JS_ACK_WAIT` can override AckWait via `time.ParseDuration`. | Effective redelivery window may diverge from built-in dedup TTL assumptions. |
| Publish dedup window | `computeMsgID()` feeds `nats.MsgId(...)` and the stream `Duplicates` window is configured as `2m`. | Prevents duplicate writes in a short window, but does not dedupe consumer-side crash replays. |
| Processed key TTL | `processedKeyTTL = 10m` for `cordum:bus:processed:<stream>:<seq>`. | If AckWait is higher than 10m, the dedup key can expire before redelivery occurs. |
| Inflight key TTL | `inflightKeyTTL = 2m` on `cordum:bus:inflight:<stream>:<seq>`. | Purely operational signal; not used for correctness decisions. |
| Retry fallback | Redis dedup check/set failures trigger `NakWithDelay(2s)`. | Prevents blind ack when dedup state cannot be safely evaluated. |
| Delivery cap | `MaxDeliver(maxJSRedeliveries)` with `maxJSRedeliveries=100`. | Poison messages are eventually terminated instead of blocking forever. |

Code-level mechanics

Ack and dedup constants (Go)

core/infra/bus/nats.go
Go
const (
  defaultAckWait = 10 * time.Minute

  // processed key: cordum:bus:processed:<stream>:<seq>
  processedKeyTTL = 10 * time.Minute

  // inflight key: cordum:bus:inflight:<stream>:<seq>
  inflightKeyTTL = 2 * time.Minute
)

AckWait override and consumer options (Go)

core/infra/bus/nats.go
Go
ackWait := defaultAckWait
if v := strings.TrimSpace(os.Getenv("NATS_JS_ACK_WAIT")); v != "" {
  if d, err := time.ParseDuration(v); err == nil && d > 0 {
    ackWait = d
  }
}
b.ackWait = ackWait

opts := []nats.SubOpt{
  nats.ManualAck(),
  nats.AckExplicit(),
  nats.AckWait(b.ackWait),
  nats.MaxAckPending(natsMaxAckPending()),
  nats.MaxDeliver(maxJSRedeliveries),
}

This is the key mismatch zone. `AckWait` is runtime-configurable, while processed-key TTL remains a fixed constant in current code.

Publish dedup boundary (Go)

core/infra/bus/nats.go
Go
// Publish path: attach a deterministic per-message ID when available.
msgID := computeMsgID(subject, packet)
if msgID != "" {
  _, err = b.js.Publish(subject, data, nats.MsgId(msgID))
}

// Stream configuration: the Duplicates window bounds publish-side dedup.
_, err := js.AddStream(&nats.StreamConfig{
  Name:       streamJobs,
  Subjects:   []string{"job.>", "worker.*.jobs"},
  Duplicates: 2 * time.Minute,
})

`Nats-Msg-Id` dedup lives on the publish path with a stream duplicate window. It does not cover consumer replay after process-before-ack failure.
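The actual `computeMsgID()` implementation is not shown here; the property that matters is determinism per logical message, so a retried publish collides with the original inside the 2m `Duplicates` window. A plausible sketch using a content hash (the function body is an assumption, not Cordum's code):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// computeMsgIDSketch derives a stable ID from subject plus payload,
// so retried publishes of the same logical message produce the same
// Nats-Msg-Id and are dropped by the stream's duplicate window.
func computeMsgIDSketch(subject string, payload []byte) string {
	h := sha256.New()
	h.Write([]byte(subject))
	h.Write([]byte{0}) // separator so subject/payload boundaries stay unambiguous
	h.Write(payload)
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := computeMsgIDSketch("job.run", []byte(`{"id":42}`))
	b := computeMsgIDSketch("job.run", []byte(`{"id":42}`))
	fmt.Println(a == b) // true: same logical message, same ID
}
```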

Dedup flow around ack (Go)

core/infra/bus/nats.go
Go
// 1) Check processed key (crash-safe guard)
exists, err := b.redis.Exists(ctx, processedKey(stream, seq)).Result()
if err != nil {
  _ = msg.NakWithDelay(2 * time.Second)
  return
}
if exists > 0 {
  _ = msg.Ack()
  return
}

// 2) Process handler
action, delay := processBusMsg(msg.Data, handler, numDelivered)

// 3) On success, set processed key BEFORE Ack
if setErr := b.redis.Set(ctx, processedKey(stream, seq), "1", processedKeyTTL).Err(); setErr != nil {
  _ = msg.NakWithDelay(2 * time.Second)
  return
}
_ = msg.Ack()

Writing processed state before ack is correct for crash safety. The risk is not ordering. The risk is key lifetime versus redelivery timing.
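The key shapes used by that flow can be captured in small helpers. A sketch that mirrors the documented `cordum:bus:processed:<stream>:<seq>` and `cordum:bus:inflight:<stream>:<seq>` layouts (helper names are illustrative):

```go
package main

import "fmt"

// processedKey mirrors the documented dedup key layout:
// cordum:bus:processed:<stream>:<seq>
func processedKey(stream string, seq uint64) string {
	return fmt.Sprintf("cordum:bus:processed:%s:%d", stream, seq)
}

// inflightKey mirrors the operational-signal key layout:
// cordum:bus:inflight:<stream>:<seq>
func inflightKey(stream string, seq uint64) string {
	return fmt.Sprintf("cordum:bus:inflight:%s:%d", stream, seq)
}

func main() {
	fmt.Println(processedKey("CORDUM_JOBS", 1042)) // cordum:bus:processed:CORDUM_JOBS:1042
	fmt.Println(inflightKey("CORDUM_JOBS", 1042))  // cordum:bus:inflight:CORDUM_JOBS:1042
}
```

A further hardening option, not shown in the quoted code, is to claim the key with Redis `SET key value NX EX ttl` so a concurrent redelivery cannot pass the `Exists` check and the `Set` in parallel.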

Operator runbook

ackwait_dedup_alignment.sh
Bash
# 1) Check effective AckWait (empty means default 10m)
kubectl -n cordum exec deploy/cordum-scheduler -- printenv NATS_JS_ACK_WAIT

# 2) Confirm stream duplicate window (publish-path context only)
nats stream info CORDUM_JOBS | rg -i "Duplicate|Duplicates"

# 3) Sample processed key TTLs
for k in $(redis-cli --scan --pattern 'cordum:bus:processed:CORDUM_JOBS:*' | head -n 20); do
  echo "$k $(redis-cli TTL "$k")"
done

# 4) Safety invariant
# effective_ack_wait_seconds <= processed_key_ttl_seconds

# 5) If AckWait is customized above 10m, run crash drill:
#    - publish one long-running job
#    - kill scheduler after handler success but before ack window closes
#    - verify duplicate processing does not occur on redelivery

# 6) Alerting suggestions
# - high redelivery count per subject
# - spikes in dedup-skipped acks
# - processed-key TTL histogram drifting below configured AckWait

Limitations and tradeoffs

  • Keeping dedup TTL high reduces duplicates but increases Redis key retention.
  • Lower AckWait speeds recovery but raises redelivery pressure for slow handlers.
  • Publish dedup (`Nats-Msg-Id`) and consumer replay dedup solve different failure paths.
  • Redis outages fall back to JetStream-only behavior, so duplicate exposure increases.
  • Queue-level dedup does not replace business-level idempotency in worker code.

Treat this as an invariant: effective AckWait <= dedup TTL. If this drifts, crash recovery can replay already-completed work.

Next step

Run one controlled crash drill this week and record three numbers:

  1. Effective AckWait in each environment.
  2. Observed processed-key TTL distribution in Redis.
  3. Duplicate execution count after crash-before-ack simulation.

Continue with Exactly-Once Myth and Timeouts, Retries, and Backoff.

Control-plane reliability

The expensive incident is not queue downtime. It is running the same side effect twice and learning about it from a customer.