AI Agent NATS JetStream Poison Message Termination

`AckTerm` stops redelivery. The order around it decides whether failure evidence survives a crash.

Deep Dive · 10 min read · Mar 2026
TL;DR
  • JetStream supports `AckTerm` (`+TERM`) to stop redelivery without marking successful processing.
  • Cordum terminates poison messages after max deliveries, but writes DLQ evidence first.
  • Term-before-DLQ creates a crash window where bad messages disappear without trace.
  • DLQ-before-Term closes that window and keeps replay/forensics possible.
Failure window

A single process crash between `Term()` and the DLQ write can erase the only copy of the failure context.

Code guardrail

Cordum already uses DLQ-before-Term in the poison-pill path.

Operational payoff

Triage and replay stay possible even during partial failures.

Scope

This guide focuses on poison-message termination ordering in JetStream consumer handlers, not full retry policy design for every job type.

The production problem

A poison payload hits max deliveries. The handler calls `Term()` and moves on. Then the process crashes before writing DLQ evidence.

Queue pressure goes down, but your forensic trail disappears.

That is the wrong trade. You want both queue health and evidence durability.

What top results cover and miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| NATS docs: JetStream Consumers | Ack semantics, redelivery, and max-delivery behavior for consumers. | No explicit crash-window discussion for DLQ ordering around terminal ack paths. |
| NATS docs: JetStream Model Deep Dive | `AckTerm` (`+TERM`) protocol meaning and acknowledgment models. | Does not provide a production-safe DLQ-first sequencing pattern. |
| NATS docs: JetStream Overview | At-least-once reliability model and acknowledgment caveats. | No code-level poison-message handling template with durable forensic trace guarantees. |

Cordum runtime mechanics

Cordum explicitly handles this failure window by writing to the DLQ first and terminating only after the DLQ write succeeds.

| Boundary | Current behavior | Operational impact |
| --- | --- | --- |
| Delivery threshold | Cordum inspects delivery metadata and terminates when the delivery count reaches the configured max. | Poison pills stop blocking queue progress. |
| Ordering | Cordum attempts the DLQ write before `msg.Term()`. | If the DLQ write fails, the message is NAKed with a delay and retried. |
| Handler-level corruption | Unmarshal failures beyond the threshold also route to the termination path. | Corrupt payloads do not loop forever. |
| Observability | Warnings log delivery counts and poison termination events. | Operators can detect repeat offenders and scope replay. |
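
The delivery-threshold and corrupt-payload rows can be read as one decision. The sketch below is a minimal illustration, not Cordum's code: `classify`, the `job` type, and the action names are hypothetical, and it assumes the delivery count has already been read from the message's delivery metadata.

```go
package main

import "encoding/json"

// action is a hypothetical label for what the handler should do next.
type action int

const (
	actionProcess   action = iota // payload decoded; hand it to business logic
	actionRetry                   // decode failed below the threshold; NAK and wait for redelivery
	actionTerminate               // poison: write DLQ evidence, then Term
)

// job is a placeholder payload shape used only for this example.
type job struct {
	ID string `json:"id"`
}

// classify decides the next step from the delivery count and the payload.
// numDelivered is assumed to come from JetStream delivery metadata.
func classify(data []byte, numDelivered, maxDeliveries uint64) (job, action) {
	// Delivery threshold: once the configured max is reached, stop retrying
	// regardless of whether the payload would decode.
	if numDelivered >= maxDeliveries {
		return job{}, actionTerminate
	}
	var j job
	if err := json.Unmarshal(data, &j); err != nil {
		// Corrupt payloads are retried until the threshold check above trips.
		return job{}, actionRetry
	}
	return j, actionProcess
}
```

Keeping the decision in a pure function like this makes the threshold behavior easy to unit-test without a running NATS server.
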
Cordum poison-pill path
```go
// core/infra/bus/nats.go (excerpt)
if numDelivered >= uint64(maxJSRedeliveries) {
  slog.Warn("bus: terminating poison message", ...)

  // DLQ write BEFORE Term: prevents data loss if we crash between Term and DLQ write.
  if b.OnMessageTerminated != nil {
    if dlqErr := b.OnMessageTerminated(subject, msg.Data, numDelivered); dlqErr != nil {
      slog.Error("bus: dlq write failed, nak-ing for retry", ...)
      _ = msg.NakWithDelay(5 * time.Second)
      return
    }
  }

  if termErr := msg.Term(); termErr != nil {
    slog.Error("bus: term failed", ...)
  }
  return
}
```
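
To exercise this ordering without a NATS server, the same flow can be written against a small interface. This is a stdlib-only sketch under stated assumptions: `poisonMsg`, `dlqWriter`, `terminatePoison`, `fakeMsg`, and the outcome names are hypothetical, and a real client message would have to satisfy or be adapted to the interface.

```go
package main

import (
	"errors"
	"time"
)

// poisonMsg is a hypothetical minimal view of a JetStream message.
type poisonMsg interface {
	Term() error
	NakWithDelay(delay time.Duration) error
}

// dlqWriter is a hypothetical callback that persists failure evidence.
type dlqWriter func(subject string, data []byte, numDelivered uint64) error

type outcome string

const (
	outcomeNotPoison  outcome = "not-poison"
	outcomeNaked      outcome = "naked-for-retry"
	outcomeTerminated outcome = "terminated"
	outcomeTermFailed outcome = "term-failed"
)

// dlqRetryDelay mirrors the 5-second NAK delay in the excerpt above.
const dlqRetryDelay = 5 * time.Second

// terminatePoison writes DLQ evidence first and calls Term only after that
// write succeeds. A failed DLQ write leaves the message redeliverable.
func terminatePoison(msg poisonMsg, subject string, data []byte,
	numDelivered, maxDeliveries uint64, writeDLQ dlqWriter) outcome {
	if numDelivered < maxDeliveries {
		return outcomeNotPoison
	}
	if writeDLQ != nil {
		if err := writeDLQ(subject, data, numDelivered); err != nil {
			_ = msg.NakWithDelay(dlqRetryDelay)
			return outcomeNaked
		}
	}
	if err := msg.Term(); err != nil {
		// Evidence is already durable; the message may simply be redelivered.
		return outcomeTermFailed
	}
	return outcomeTerminated
}

// fakeMsg records which ack calls were made, in order, for use in tests.
type fakeMsg struct{ calls []string }

func (f *fakeMsg) Term() error { f.calls = append(f.calls, "term"); return nil }

func (f *fakeMsg) NakWithDelay(time.Duration) error {
	f.calls = append(f.calls, "nak")
	return nil
}

// errDLQDown simulates an unavailable DLQ backend.
var errDLQDown = errors.New("dlq unavailable")
```

Passing a `dlqWriter` that returns `errDLQDown` should leave `fakeMsg.calls` containing only `"nak"`, which is the property that keeps a failed DLQ write from turning into a silent drop.
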

Why ordering matters

`AckTerm` is terminal for redelivery. After that, you cannot rely on JetStream to hand you the message again.

If evidence persistence happens after terminal ack, one crash can convert a known poison event into a silent hole.

Risky ordering example
```go
// Risky ordering (do not use)
_ = msg.Term()
_ = writeToDLQ(data) // if process crashes before this line, failure evidence is gone
```
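
The difference between the two orderings can be checked with a tiny model that enumerates every crash point. This is an illustrative sketch, not a test against a real broker: `step`, `state`, `runUntilCrash`, and `anyCrashLosesEvidence` are hypothetical names, and the model assumes both the DLQ record and the terminal ack are durable once their step completes.

```go
package main

// step is one durable side effect in the poison path.
type step int

const (
	stepDLQ  step = iota // persist failure evidence
	stepTerm             // send the terminal ack
)

// state is what survives a worker crash: the DLQ and the broker's ack state
// both live outside the worker process.
type state struct {
	inDLQ      bool
	terminated bool
}

// runUntilCrash applies the first crashAfter steps of order and then stops,
// modelling a process that dies between side effects.
func runUntilCrash(order []step, crashAfter int) state {
	var s state
	for i, st := range order {
		if i >= crashAfter {
			break
		}
		switch st {
		case stepDLQ:
			s.inDLQ = true
		case stepTerm:
			s.terminated = true
		}
	}
	return s
}

// evidenceLost is true when the message will never be redelivered
// and no DLQ record exists.
func evidenceLost(s state) bool { return s.terminated && !s.inDLQ }

// anyCrashLosesEvidence tries every crash point, including "no crash".
func anyCrashLosesEvidence(order []step) bool {
	for crashAfter := 0; crashAfter <= len(order); crashAfter++ {
		if evidenceLost(runUntilCrash(order, crashAfter)) {
			return true
		}
	}
	return false
}
```

With `[]step{stepTerm, stepDLQ}` the crash after the first step leaves a terminated message with no DLQ record. With `[]step{stepDLQ, stepTerm}` the worst case is a DLQ record for a message that is still redeliverable.
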

Validation runbook

Validate this in staging with forced restarts. A correctness guarantee that is not tested under crash timing is just optimism.

Crash-window validation steps
```bash
# 1) Inject known poison payload into staging stream
# 2) Confirm delivery count reaches max threshold
# 3) Validate DLQ entry exists BEFORE termination
# 4) Force process restart during termination path to test crash resilience
# 5) Verify replay workflow can consume DLQ evidence
```
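
Step 5 only works if the DLQ entry carries enough context to replay from. The sketch below shows one possible envelope and a round-trip check; it is an assumption for illustration, not Cordum's DLQ schema. The `dlqEntry` type, its field names, and both helper functions are hypothetical.

```go
package main

import "encoding/json"

// dlqEntry is a hypothetical evidence envelope. The original payload is kept
// byte-for-byte (encoding/json stores []byte as base64) so corrupt messages
// survive the round trip unchanged.
type dlqEntry struct {
	Subject        string `json:"subject"`
	Payload        []byte `json:"payload"`
	NumDelivered   uint64 `json:"num_delivered"`
	RecordedAtUnix int64  `json:"recorded_at_unix"`
}

// encodeDLQEntry serializes an entry for storage in the DLQ.
func encodeDLQEntry(e dlqEntry) ([]byte, error) {
	return json.Marshal(e)
}

// decodeDLQEntry is what a replay tool would call to recover the evidence.
func decodeDLQEntry(raw []byte) (dlqEntry, error) {
	var e dlqEntry
	if err := json.Unmarshal(raw, &e); err != nil {
		return dlqEntry{}, err
	}
	return e, nil
}
```

A staging check can encode an entry for the injected poison payload, decode it again, and compare the payload bytes, which confirms the replay path sees exactly what the consumer saw.
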

Limitations and tradeoffs

| Approach | Upside | Downside |
| --- | --- | --- |
| DLQ-first then Term (Cordum pattern) | Best forensic integrity and replay support. | Extra write in hot path and more failure handling branches. |
| Term-first then DLQ | Slightly shorter code path in success case. | Crash window can lose poison message evidence. |
| Never Term, only NAK | No terminal drops. | Queue starvation and retry storms for permanently bad payloads. |
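
One of the extra branches in the DLQ-first row is worth spelling out. If the worker crashes after the DLQ write but before `Term()`, the message can be redelivered and the DLQ write attempted again. The sketch below makes that write idempotent by keying on stream name and stream sequence. It is an assumption for illustration only: `dlqStore` is a hypothetical in-memory stand-in for whatever backs the DLQ, and the source does not say how Cordum handles duplicates.

```go
package main

import (
	"strconv"
	"sync"
)

// dlqStore is a hypothetical in-memory DLQ used to illustrate idempotent writes.
type dlqStore struct {
	mu      sync.Mutex
	entries map[string][]byte
}

func newDLQStore() *dlqStore {
	return &dlqStore{entries: make(map[string][]byte)}
}

// dlqKey identifies a message by its stream and stream sequence, which stay
// the same across redeliveries of that message.
func dlqKey(stream string, streamSeq uint64) string {
	return stream + "/" + strconv.FormatUint(streamSeq, 10)
}

// Put stores the payload once per key. It reports true when a new entry was
// written and false when the key already existed.
func (s *dlqStore) Put(stream string, streamSeq uint64, payload []byte) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	k := dlqKey(stream, streamSeq)
	if _, exists := s.entries[k]; exists {
		return false
	}
	// Copy the payload so later mutation by the caller cannot alter evidence.
	s.entries[k] = append([]byte(nil), payload...)
	return true
}

// Len returns the number of distinct entries.
func (s *dlqStore) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.entries)
}
```

A duplicate `Put` returning `false` can still be treated as success by the handler, so the retry proceeds straight to `Term()`.
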

Next step

Add a poison-message integration test that kills the worker between the DLQ write and the `Term()` call, then asserts that no evidence is lost. Run it before promoting the rollout.
