
AI Agent Poison Message Handling

Detect, quarantine, and replay failures without creating duplicate side effects.

Guide · 11 min read · Mar 2026
TL;DR
  • Poison messages are normal in production; silent retries are the expensive failure mode.
  • Separate transient failures from terminal failures with explicit reason codes.
  • Replay should pass policy checks and idempotency rules, not bypass them.
  • Use concrete thresholds: retry budgets, backoff windows, and DLQ escalation criteria.
Deterministic triage

Treat failures as classes with fixed handling paths, not ad-hoc retries.

Replay governance

Force replay through policy and idempotency checks before side effects.

Operational limits

Bound retries with measurable budgets to avoid infinite poison loops.

Scope

This guide targets autonomous systems where queue messages can trigger state-changing actions in external systems, not just in-memory processing.

The production problem

A poison message is not just a malformed payload. It is any message that keeps failing and burns throughput while making no progress.

In autonomous agent systems, that failure can be expensive. A bad replay loop can open duplicate tickets, send repeated tool calls, and flood on-call with noise instead of signal.

Most incidents are not caused by one failure. They come from unbounded retries, weak classification, and replay paths that skip policy controls.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Amazon SQS dead-letter queues | Clear redrive policy guidance (`maxReceiveCount`) and retention caveats for standard vs FIFO. | No governance model for autonomous tool execution replay. |
| RabbitMQ Dead Letter Exchanges | Precise dead-letter triggers and policy-vs-argument configuration tradeoffs. | No policy-gated replay path for agent side effects outside the broker. |
| Google Pub/Sub dead-letter topics | Concrete delivery-attempt limits (5-100) and subscription-level dead lettering controls. | No run-level idempotency strategy for autonomous workflows after redelivery. |

Poison message taxonomy

Do not route all failures to one bucket. Classify by recovery probability and side-effect risk, then bind each class to one action.

| Failure class | Primary signal | Action | Risk if wrong |
| --- | --- | --- | --- |
| Transient infra failure | Timeouts, temporary dependency outage, lock contention | Retry with jittered backoff and strict max-attempt budget | Retry storm if no cap |
| Schema or payload failure | Decode error, missing required fields, malformed context pointer | Fail fast to DLQ with reason code and payload fingerprint | Infinite failures if retried blindly |
| Policy denial | Safety deny, missing approval, blocked capability/risk tag | Do not auto-retry; require policy change or explicit approval | Unsafe bypass if replay ignores policy |
| Poison side effect | Repeated external 4xx/semantic conflict despite retries | Quarantine and require idempotency/correction before replay | Duplicate tickets, PRs, or infrastructure mutations |

Cordum runtime implications

| Implication | Current behavior | Why it matters |
| --- | --- | --- |
| Dispatch retry ceiling | 50 scheduling attempts with exponential backoff from 1s to 30s | Retry loops are bounded before terminal DLQ to prevent unbounded churn. |
| Bus-level redelivery | JetStream at-least-once with AckWait 10m and MaxDeliver 100 | Consumer code must assume duplicate delivery and remain replay-safe. |
| DLQ-first termination | DLQ write callback runs before message termination; on write error, NAK with 5s delay | Prevents message loss in crash windows between termination and persistence. |
| Replay API workflow | DLQ retry endpoint rehydrates context into a new job id and re-dispatches | Replay becomes explicit, auditable, and policy-controllable. |

Practical baseline: keep retry budgets finite, publish terminal failures to DLQ with reason codes, and force replay through policy + idempotency checks.

Implementation examples

Failure classifier (Go)

classifier.go
Go
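// Action is the single handling path bound to a failure class.
type Action string

const (
  ActionRetry      Action = "retry"         // transient: try again within budget
  ActionToDLQ      Action = "dlq"           // terminal: quarantine with a reason code
  ActionNeedReview Action = "manual_review" // policy: needs approval or a policy change
)

// classifyFailure maps a reason code and attempt count to exactly one action.
func classifyFailure(code string, attempts, maxAttempts int) Action {
  switch code {
  // Transient infrastructure failures: retry until the budget is spent.
  case "timeout", "dependency_unavailable", "store_lock_busy":
    if attempts < maxAttempts {
      return ActionRetry
    }
    return ActionToDLQ

  // Failures a retry cannot fix: fail fast.
  case "schema_invalid", "payload_unmarshal_failed", "no_pool_mapping":
    return ActionToDLQ

  // Policy denials: never auto-retry.
  case "safety_denied", "approval_required":
    return ActionNeedReview

  // Unknown codes go to the DLQ rather than being retried blindly.
  default:
    return ActionToDLQ
  }
}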

Replay governance policy (YAML)

replay-controls.yaml
YAML
replay_controls:
  max_retries_per_message: 3
  require_policy_check: true
  require_idempotency_key: true
  auto_retry:
    allowed_reason_codes:
      - timeout
      - dependency_unavailable
      - store_lock_busy
    denied_reason_codes:
      - safety_denied
      - schema_invalid
      - payload_unmarshal_failed
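
One way to read the policy above is as a gate that every auto-retry must pass. The sketch below is an interpretation, not a documented loader or enforcement engine: `ReplayControls`, `ReplayRequest`, and `allowAutoRetry` are hypothetical names, and the order of checks (deny list, allow list, budget, idempotency key, policy) is an assumption.

```go
package main

import "fmt"

// ReplayControls mirrors the fields of the YAML policy.
type ReplayControls struct {
	MaxRetriesPerMessage  int
	RequirePolicyCheck    bool
	RequireIdempotencyKey bool
	AllowedReasonCodes    map[string]bool
	DeniedReasonCodes     map[string]bool
}

// ReplayRequest describes one auto-retry candidate (hypothetical shape).
type ReplayRequest struct {
	ReasonCode     string
	RetriesSoFar   int
	IdempotencyKey string
	PolicyApproved bool // outcome of a policy check run by the caller
}

// allowAutoRetry returns nil when the request may be retried, or an error
// naming the control that blocked it. Codes on neither list are blocked.
func allowAutoRetry(c ReplayControls, r ReplayRequest) error {
	switch {
	case c.DeniedReasonCodes[r.ReasonCode]:
		return fmt.Errorf("reason code %q is denied for auto-retry", r.ReasonCode)
	case !c.AllowedReasonCodes[r.ReasonCode]:
		return fmt.Errorf("reason code %q is not on the allow list", r.ReasonCode)
	case r.RetriesSoFar >= c.MaxRetriesPerMessage:
		return fmt.Errorf("retry budget of %d is spent", c.MaxRetriesPerMessage)
	case c.RequireIdempotencyKey && r.IdempotencyKey == "":
		return fmt.Errorf("idempotency key is required")
	case c.RequirePolicyCheck && !r.PolicyApproved:
		return fmt.Errorf("policy check did not approve this replay")
	}
	return nil
}
```

Returning the blocking reason, rather than a bare boolean, keeps denied replays auditable.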

DLQ operations (cURL)

dlq-ops.sh
Bash
# List DLQ entries
curl -sS http://localhost:8081/api/v1/dlq \
  -H "X-API-Key: ${API_KEY}" \
  -H "X-Tenant-ID: default"

# Retry one entry (creates a new job id); replace JOB_ID with the entry's job id
curl -sS -X POST http://localhost:8081/api/v1/dlq/JOB_ID/retry \
  -H "X-API-Key: ${API_KEY}" \
  -H "X-Tenant-ID: default"

Limitations and tradeoffs

  • Aggressive fail-fast classification can move recoverable messages to DLQ too early.
  • Long retry windows reduce DLQ noise but increase queue latency under partial outages.
  • Manual replay reviews reduce blast radius but add operational overhead.
  • Replay safety depends on idempotent external systems, not just broker semantics.

Next step

Run this in one sprint:

  1. Define 4-6 terminal reason codes and map each to retry, DLQ, or manual review.
  2. Enforce max-attempt budgets per class instead of one global retry policy.
  3. Add a replay checklist: policy check, idempotency key, side-effect simulation.
  4. Track DLQ depth, replay success rate, and duplicate-detected rate as release gates.
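
The release gate in the last step can be as simple as comparing three numbers against limits. This is a sketch under assumptions: the type and function names are hypothetical, both rates are computed per attempted replay, and the thresholds are supplied by the caller rather than recommended here.

```go
package main

import "fmt"

// ReplayStats holds counters collected over one release window.
type ReplayStats struct {
	DLQDepth           int
	ReplaysAttempted   int
	ReplaysSucceeded   int
	DuplicatesDetected int
}

// GateThresholds are team-chosen limits; rates are fractions from 0 to 1.
type GateThresholds struct {
	MaxDLQDepth          int
	MinReplaySuccessRate float64
	MaxDuplicateRate     float64
}

// checkReleaseGate returns one message per violated threshold. An empty
// result means the gate passes. Rate checks are skipped when no replays
// were attempted, because the rates are undefined in that case.
func checkReleaseGate(s ReplayStats, t GateThresholds) []string {
	var failures []string
	if s.DLQDepth > t.MaxDLQDepth {
		failures = append(failures, fmt.Sprintf("DLQ depth %d exceeds limit %d", s.DLQDepth, t.MaxDLQDepth))
	}
	if s.ReplaysAttempted > 0 {
		attempted := float64(s.ReplaysAttempted)
		if rate := float64(s.ReplaysSucceeded) / attempted; rate < t.MinReplaySuccessRate {
			failures = append(failures, fmt.Sprintf("replay success rate %.2f is below %.2f", rate, t.MinReplaySuccessRate))
		}
		if rate := float64(s.DuplicatesDetected) / attempted; rate > t.MaxDuplicateRate {
			failures = append(failures, fmt.Sprintf("duplicate-detected rate %.2f exceeds %.2f", rate, t.MaxDuplicateRate))
		}
	}
	return failures
}
```

Returning every violated threshold, instead of stopping at the first, gives the release owner the full picture in one run.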

Continue with AI Agent DLQ and Replay Patterns and AI Agent Idempotency Keys.

Poison messages are a control-plane problem

Broker-level DLQ is necessary. Governance-aware replay is what keeps autonomous execution safe.