
AI Agent DLQ and Replay Patterns for Production

Dead-letter queues should close incidents, not collect dust.

Guide · 11 min read · Mar 2026
TL;DR
  • A DLQ without a replay policy is a failure archive, not a recovery system.
  • Replay must run through fresh policy checks with idempotency keys, not blind redrive.
  • DLQ quality depends on reason codes and evidence fields you can query in minutes.
  • Actionable DLQ: store reason codes that drive deterministic replay decisions.
  • Safe replay: run policy evaluation again before re-dispatch.
  • Fast triage: separate transient, poison, and governance failures.

Scope

This guide covers DLQ handling for autonomous agent jobs that execute real actions across external tools and internal systems.

The production problem

Most teams treat the DLQ as a final bucket. Messages go in, dashboards go red, and operators manually click retry until the queue moves.

That approach fails for autonomous agents. Replaying a side-effecting action without policy review and idempotency can repeat damage, not fix it.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Amazon SQS dead-letter queues | Clear redrive configuration and retention guidance (`maxReceiveCount`, retention windows) | No agent-specific replay governance model for side-effecting autonomous actions |
| Azure Service Bus dead lettering | Practical enablement paths (Portal, CLI, PowerShell, ARM/Bicep) | Limited runtime triage patterns for safe replay and incident response |
| RabbitMQ Dead Letter Exchanges | Strong low-level DLX routing details, cycle risks, and policy controls | No control-plane guidance for replay approvals and cross-system audit evidence |

Failure triage model

| Failure class | Signal | Replay policy |
| --- | --- | --- |
| Transient infrastructure | `timeout`, `dependency_unavailable`, `no_workers` | Auto replay with exponential backoff and the same idempotency key |
| Poison payload | `schema_invalid`, `deserialization_error` | Do not replay until the payload is fixed or transformed |
| Policy/governance | `denied`, `approval_required`, `policy_snapshot_mismatch` | Re-evaluate against the latest policy; require human approval if risk is high |
| Side-effect uncertainty | `unknown_commit_state`, `partial_external_write` | Run a compensating check first, then replay only with manual override |
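The triage table above can be expressed as a pure classification function, which keeps replay decisions deterministic and testable. A minimal sketch in Go; the `FailureClass` type and function names are illustrative, not Cordum APIs, though the error-code strings mirror the table.

```go
package main

import "fmt"

// FailureClass mirrors the four rows of the triage table.
type FailureClass int

const (
	Transient FailureClass = iota
	Poison
	Governance
	SideEffectUncertain
	Unknown
)

// classify maps a DLQ reason code onto a failure class. Unrecognized
// codes fall through to Unknown and should be triaged by a human.
func classify(errorCode string) FailureClass {
	switch errorCode {
	case "timeout", "dependency_unavailable", "no_workers":
		return Transient
	case "schema_invalid", "deserialization_error":
		return Poison
	case "denied", "approval_required", "policy_snapshot_mismatch":
		return Governance
	case "unknown_commit_state", "partial_external_write":
		return SideEffectUncertain
	default:
		return Unknown
	}
}

func main() {
	fmt.Println(classify("timeout") == Transient)      // true
	fmt.Println(classify("denied") == Governance)      // true
	fmt.Println(classify("weird_new_code") == Unknown) // true
}
```

Because the function is total over its input, a new producer error code can never silently pick up an auto-replay policy; it lands in `Unknown` until someone assigns it a class.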

Cordum DLQ runtime behavior

Cordum stores structured DLQ entries and prioritizes DLQ persistence before message termination. That design avoids the worst-case failure mode: lost message with no forensic trail.

| Control | Current behavior | Why it matters |
| --- | --- | --- |
| DLQ emission | Scheduler emits a DLQ entry for terminal failures except `FAILED_RETRYABLE` | Avoids polluting the DLQ with failures still eligible for retry |
| Retry boundary | 50 scheduling attempts with 1s–30s exponential backoff, then DLQ | Provides a deterministic cutoff before human/operator triage |
| DLQ-first termination | Write the DLQ entry before `msg.Term()`; on write failure, `NakWithDelay(5s)` | Prevents silent message loss when termination succeeds but persistence fails |
| Storage model | Redis keys `dlq:entry:<job_id>` plus sorted index `dlq:index` | Supports fast querying and replay pipelines by status/reason/time |
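The DLQ-first ordering in the table reduces to one invariant: never terminate a message whose forensic record is not durably written. A minimal sketch in Go, assuming illustrative `Store` and `Msg` interfaces with in-memory fakes; Cordum's actual scheduler and JetStream types are not shown here.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Store persists a DLQ entry; Msg is a minimal stand-in for a queue
// message (e.g. a NATS JetStream message). Both are illustrative.
type Store interface{ WriteDLQ(jobID, entry string) error }
type Msg interface {
	Term() error
	NakWithDelay(d time.Duration) error
}

// terminateWithDLQ enforces DLQ-first termination: persist the entry,
// then terminate. If the write fails, NAK with a delay so the message
// survives for another persistence attempt instead of vanishing.
func terminateWithDLQ(store Store, msg Msg, jobID, entry string) error {
	if err := store.WriteDLQ(jobID, entry); err != nil {
		return msg.NakWithDelay(5 * time.Second)
	}
	return msg.Term()
}

// In-memory fakes, just to demonstrate both paths.
type memStore struct {
	entries map[string]string
	fail    bool
}

func (m *memStore) WriteDLQ(jobID, entry string) error {
	if m.fail {
		return errors.New("store unavailable")
	}
	m.entries[jobID] = entry
	return nil
}

type fakeMsg struct{ terminated, naked bool }

func (f *fakeMsg) Term() error                       { f.terminated = true; return nil }
func (f *fakeMsg) NakWithDelay(time.Duration) error  { f.naked = true; return nil }

func main() {
	ok := &fakeMsg{}
	terminateWithDLQ(&memStore{entries: map[string]string{}}, ok, "job_74c2", `{"error_code":"timeout"}`)
	fmt.Println(ok.terminated, ok.naked) // true false

	failed := &fakeMsg{}
	terminateWithDLQ(&memStore{entries: map[string]string{}, fail: true}, failed, "job_74c2", "{}")
	fmt.Println(failed.terminated, failed.naked) // false true
}
```

The design choice is deliberate: a duplicate delivery (from the NAK path) is recoverable with idempotency keys, while a terminated message with no DLQ record is not recoverable at all.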

Implementation examples

DLQ triage policy (YAML)

dlq-triage.yaml
YAML
version: v1
dlq_triage:
  - match:
      error_code: timeout
    action: auto_replay
    backoff: exponential
    max_replays: 3

  - match:
      error_code: schema_invalid
    action: quarantine

  - match:
      error_code: denied
    action: require_human
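A triage policy like the YAML above is typically applied by first-match evaluation over ordered rules. A minimal sketch in Go; the `Rule` struct is a hand-rolled stand-in for the parsed `dlq_triage` entries, not a Cordum type, and the fail-safe default is an assumption.

```go
package main

import "fmt"

// Rule is a stand-in for one parsed dlq_triage entry.
type Rule struct {
	ErrorCode string // match.error_code
	Action    string // auto_replay | quarantine | require_human
}

// decide walks the rules in declaration order and returns the first
// matching action. Codes with no rule default to require_human:
// fail safe rather than silently auto-replaying.
func decide(rules []Rule, errorCode string) string {
	for _, r := range rules {
		if r.ErrorCode == errorCode {
			return r.Action
		}
	}
	return "require_human"
}

func main() {
	rules := []Rule{
		{"timeout", "auto_replay"},
		{"schema_invalid", "quarantine"},
		{"denied", "require_human"},
	}
	fmt.Println(decide(rules, "timeout"))                // auto_replay
	fmt.Println(decide(rules, "schema_invalid"))         // quarantine
	fmt.Println(decide(rules, "partial_external_write")) // require_human
}
```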

Replay worker skeleton (Go)

replay_worker.go
Go
func ReplayDLQEntry(ctx context.Context, entry DLQEntry) error {
  // Skip work that already committed: replay must be idempotent.
  if wasAlreadyProcessed(entry.IdempotencyKey) {
    return nil
  }

  // Re-evaluate against the current policy, not the snapshot that was
  // in force when the job originally failed.
  decision, err := policyClient.Evaluate(ctx, entry.JobRequest)
  if err != nil {
    return err
  }
  if decision == "deny" {
    return fmt.Errorf("replay denied by policy")
  }

  // Re-dispatch with the original idempotency key so downstream
  // systems can deduplicate.
  return dispatch(entry.JobRequest)
}
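The `wasAlreadyProcessed` check in the skeleton needs an atomic set-if-absent underneath it. A minimal in-memory sketch of that contract; in production this would typically be Redis `SET NX` with a TTL, and the `IdempotencyGuard` name is illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

// IdempotencyGuard records idempotency keys that have already been
// dispatched. An in-memory map is enough to show the contract; a real
// deployment needs a shared, durable store.
type IdempotencyGuard struct {
	mu   sync.Mutex
	seen map[string]bool
}

func NewIdempotencyGuard() *IdempotencyGuard {
	return &IdempotencyGuard{seen: map[string]bool{}}
}

// MarkIfNew returns true exactly once per key: the caller that gets
// true performs the dispatch, every later caller skips.
func (g *IdempotencyGuard) MarkIfNew(key string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.seen[key] {
		return false
	}
	g.seen[key] = true
	return true
}

func main() {
	g := NewIdempotencyGuard()
	fmt.Println(g.MarkIfNew("run_2f91:step_3")) // true: first replay dispatches
	fmt.Println(g.MarkIfNew("run_2f91:step_3")) // false: duplicate is skipped
}
```

Keeping the original idempotency key across replays is what makes the guard useful: a re-dispatched job and its first attempt collapse to a single logical execution downstream.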

DLQ entry shape for triage (JSON)

dlq-entry.json
JSON
{
  "job_id": "job_74c2",
  "topic": "tool.github.pr.create",
  "last_state": "FAILED",
  "error_code": "dependency_unavailable",
  "attempts": 50,
  "idempotency_key": "run_2f91:step_3",
  "policy_snapshot": "sha256:ab91...",
  "replay_status": "pending_review"
}

Limitations and tradeoffs

  • Aggressive auto-replay reduces operator load but can hide systemic defects longer.
  • Strict replay approvals improve safety but increase mean time to recovery for low-risk failures.
  • DLQ reason-code quality depends on producer discipline; vague error messages destroy triage speed.
  • Replay-policy drift between teams can reintroduce manual, inconsistent incident handling.

Next step

Run this in one sprint:

  1. Define a DLQ reason-code taxonomy with replay actions per code.
  2. Require policy evaluation and idempotency checks in the replay pipeline.
  3. Add an alert on DLQ growth rate, not only absolute depth.
  4. Rehearse one incident where replay is intentionally denied by policy.
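Step 3, alerting on growth rate rather than depth, reduces to comparing entry counts across adjacent time windows; with a timestamp-scored sorted index like `dlq:index`, both counts are cheap range queries. A minimal sketch; the ratio threshold and window semantics are illustrative assumptions.

```go
package main

import "fmt"

// growthAlert fires when DLQ entries added in the current window
// exceed the previous window by more than ratio, regardless of the
// queue's absolute depth. A steadily draining but deep DLQ stays
// quiet; a suddenly accelerating shallow one does not.
func growthAlert(prevWindow, currWindow int, ratio float64) bool {
	if prevWindow == 0 {
		// Any entries after a quiet window are worth a look.
		return currWindow > 0
	}
	return float64(currWindow) > float64(prevWindow)*ratio
}

func main() {
	fmt.Println(growthAlert(10, 12, 2.0)) // false: normal drift
	fmt.Println(growthAlert(10, 25, 2.0)) // true: 2.5x growth
	fmt.Println(growthAlert(0, 3, 2.0))   // true: quiet window broken
}
```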

Continue with “AI Agent Rollback and Compensation” and “AI Agent Circuit Breaker Pattern”.

Replay with intent

If replay means “try again and hope,” your DLQ is writing tomorrow’s incident report today.