Skip to content
Deep Dive

AI Agent AckWait and Dedup TTL Alignment

A queue can be healthy and still duplicate work after crashes if ack windows and dedup windows drift apart.

Deep Dive10 min readMar 2026
TL;DR
  • -JetStream is at-least-once. Redelivery after missing ack is expected behavior.
  • -Cordum uses Redis processed keys (`cordum:bus:processed:<stream>:<seq>`) to reduce duplicate handling after crashes.
  • -Current code keeps `processedKeyTTL=10m` while `NATS_JS_ACK_WAIT` can be overridden, which creates a mismatch risk when AckWait is set above 10m.
  • -If AckWait exceeds dedup TTL, a crash-before-ack path can redeliver after the dedup key already expired.
Crash window covered

Processed key is written before Ack, so crash-after-process is usually deduped on redelivery.

Config mismatch risk

AckWait is configurable via env, processed key TTL is fixed to 10m in current implementation.

Operational check

Treat `effective AckWait <= dedup TTL` as a hard safety invariant in production.

Scope

This guide focuses on scheduler bus-consumer reliability in JetStream mode. It does not cover worker handler idempotency design.

The production problem

A common crash path looks harmless: message is processed, process dies before ack, message redelivers later. If your dedup state expired before redelivery, you run the job again.

This is not a broker bug. It is a timing mismatch between the redelivery window and your dedup key lifetime.

In AI agent control planes, duplicate execution can create duplicate side effects, noisy incidents, and hard audit questions.

What top results miss

SourceStrong coverageMissing piece
NATS JetStream consumer docsAckWait, MaxAckPending, MaxDeliver, and at-least-once delivery behavior.No guidance on app-level Redis dedup TTL alignment when AckWait is overridden.
AWS SQS visibility timeout guidanceTimeout too short causes duplicates; timeout too long delays retries.No stream-sequence dedup key design for crash-before-ack windows.
Google Pub/Sub lease managementAck deadline extension and duplicate-vs-redelivery tradeoff.No fixed TTL dedup key discussion tied to overrideable ack deadline settings.

Public docs explain ack deadlines well. They usually stop short of one practical question: does your dedup persistence window actually match your redelivery window?

Cordum runtime behavior

BoundaryCurrent behaviorOperational impact
Default AckWait`defaultAckWait = 10m` and JetStream subscriptions use `nats.AckWait(b.ackWait)`.Redelivery window is controlled by AckWait when BackOff is not configured.
AckWait override`NATS_JS_ACK_WAIT` can override AckWait via `time.ParseDuration`.Effective redelivery window may diverge from built-in dedup TTL assumptions.
Processed key TTL`processedKeyTTL = 10m` for `cordum:bus:processed:<stream>:<seq>`.If AckWait is higher than 10m, dedup key can expire before redelivery occurs.
Inflight key TTL`inflightKeyTTL = 2m` on `cordum:bus:inflight:<stream>:<seq>`.Purely operational signal; not used for correctness decisions.
Retry fallbackRedis dedup check/set failures trigger `NakWithDelay(2s)`.Prevents blind ack when dedup state cannot be safely evaluated.
Delivery cap`MaxDeliver(maxJSRedeliveries)` with `maxJSRedeliveries=100`.Poison messages are eventually terminated instead of blocking forever.

Code-level mechanics

Ack and dedup constants (Go)

core/infra/bus/nats.go
Go
const (
  defaultAckWait = 10 * time.Minute

  // processed key: cordum:bus:processed:<stream>:<seq>
  processedKeyTTL = 10 * time.Minute

  // inflight key: cordum:bus:inflight:<stream>:<seq>
  inflightKeyTTL = 2 * time.Minute
)

AckWait override and consumer options (Go)

core/infra/bus/nats.go
Go
ackWait := defaultAckWait
if v := strings.TrimSpace(os.Getenv("NATS_JS_ACK_WAIT")); v != "" {
  if d, err := time.ParseDuration(v); err == nil && d > 0 {
    ackWait = d
  }
}
b.ackWait = ackWait

opts := []nats.SubOpt{
  nats.ManualAck(),
  nats.AckExplicit(),
  nats.AckWait(b.ackWait),
  nats.MaxAckPending(natsMaxAckPending()),
  nats.MaxDeliver(maxJSRedeliveries),
}

This is the key mismatch zone. `AckWait` is runtime-configurable, while processed-key TTL remains a fixed constant in current code.

Dedup flow around ack (Go)

core/infra/bus/nats.go
Go
// 1) Check processed key (crash-safe guard)
exists, err := b.redis.Exists(ctx, processedKey(stream, seq)).Result()
if err != nil {
  _ = msg.NakWithDelay(2 * time.Second)
  return
}
if exists > 0 {
  _ = msg.Ack()
  return
}

// 2) Process handler
action, delay := processBusMsg(msg.Data, handler, numDelivered)

// 3) On success, set processed key BEFORE Ack
if setErr := b.redis.Set(ctx, processedKey(stream, seq), "1", processedKeyTTL).Err(); setErr != nil {
  _ = msg.NakWithDelay(2 * time.Second)
  return
}
_ = msg.Ack()

Writing processed state before ack is correct for crash safety. The risk is not ordering. The risk is key lifetime versus redelivery timing.

Operator runbook

ackwait_dedup_alignment.sh
Bash
# 1) Check effective AckWait (empty means default 10m)
kubectl -n cordum exec deploy/cordum-scheduler -- printenv NATS_JS_ACK_WAIT

# 2) Sample processed key TTLs
for k in $(redis-cli --scan --pattern 'cordum:bus:processed:CORDUM_JOBS:*' | head -n 20); do
  echo "$k $(redis-cli TTL "$k")"
done

# 3) Safety invariant
# effective_ack_wait_seconds <= processed_key_ttl_seconds

# 4) If AckWait is customized above 10m, run crash drill:
#    - publish one long-running job
#    - kill scheduler after handler success but before ack window closes
#    - verify duplicate processing does not occur on redelivery

# 5) Alerting suggestions
# - high redelivery count per subject
# - spikes in dedup-skipped acks
# - processed-key TTL histogram drifting below configured AckWait

Limitations and tradeoffs

  • - Keeping dedup TTL high reduces duplicates but increases Redis key retention.
  • - Lower AckWait speeds recovery but raises redelivery pressure for slow handlers.
  • - Redis outages fall back to JetStream-only behavior, so duplicate exposure increases.
  • - Queue-level dedup does not replace business-level idempotency in worker code.

Treat this as an invariant: effective AckWait <= dedup TTL. If this drifts, crash recovery can replay already-completed work.

Next step

Run one controlled crash drill this week and record three numbers:

  1. 1. Effective AckWait in each environment.
  2. 2. Observed processed-key TTL distribution in Redis.
  3. 3. Duplicate execution count after crash-before-ack simulation.

Continue with Exactly-Once Myth and Timeouts, Retries, and Backoff.

Control-plane reliability

The expensive incident is not queue downtime. It is running the same side effect twice and learning about it from a customer.