
AI Agent DLQ and Replay Patterns for Production

A dead-letter queue is only useful if it feeds safe, repeatable recovery.

Guide · 11 min read · Apr 2026
TL;DR
  • DLQ replay without fresh policy checks can duplicate side effects.
  • Cordum caps scheduling retries at 50 (~25 minutes), then emits DLQ metadata with reason codes.
  • Fast incident recovery needs a triage taxonomy, idempotency keys, and a human gate for uncertain commit state.
Actionable DLQ: reason codes and evidence fields drive deterministic replay decisions.

Safe replay: replay runs through policy evaluation again before dispatch.

Fast triage: separate transient, poison, governance, and uncertain side-effect failures.

Scope

This guide focuses on autonomous agent jobs that can create side effects in external systems. If replay is wrong, production state can diverge from intent.

The production problem

A generic DLQ setup assumes retrying is cheap. Autonomous agents break that assumption. A replayed tool call can create a second ticket, send a second email, or push a second config change.

Example: 1,000 nightly jobs with a 2% failure rate leaves 20 jobs in the DLQ. Blind replay of those 20 is manageable when the failures are transient. It is dangerous when even 2-3 of them are in an unknown commit state.

The goal is not “empty DLQ quickly.” The goal is “recover safely with an audit trail.”

What top results miss

| Source | Strong coverage | Missing piece |
|---|---|---|
| Amazon SQS DLQ redrive (Developer Guide) | Operational redrive controls: destination queue choice, velocity control, task limits, and queue permissions. | No policy re-evaluation model for autonomous agents before replaying side-effecting actions. |
| AWS Compute Blog: replay with backoff | Practical replay counter and exponential backoff with jitter, plus a final human-operated queue. | No cross-system idempotency contract or governance gate tied to replay risk class. |
| RabbitMQ at-least-once dead lettering | Critical durability tradeoff between at-most-once and at-least-once dead lettering. | No control-plane workflow for approval-required replay when commit state is uncertain. |

Failure triage model

Treat replay as a policy decision, not a queue operation.

| Failure class | Signal | Replay policy |
|---|---|---|
| Transient infrastructure | `timeout`, `dependency_unavailable`, `no_workers` | Auto replay with capped exponential backoff using the same idempotency key |
| Poison payload | `schema_invalid`, `parse_error`, `tool_contract_mismatch` | Do not replay until the payload or adapter mapping is fixed |
| Policy and governance | `denied`, `approval_required`, `policy_snapshot_mismatch` | Re-evaluate against the latest policy and require explicit approval for high risk |
| Unknown commit state | `partial_external_write`, `unknown_commit_state` | Run a compensating read/check first; default to manual decision |
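Because the table is a pure mapping from reason code to action, the triage decision can be a lookup rather than log archaeology. A minimal Go sketch (the reason-code strings and action names are assumptions mirroring the table above; align them with what your runtime actually emits):

```go
package main

import "fmt"

// ReplayAction is the triage outcome for a DLQ entry.
type ReplayAction string

const (
  AutoReplay      ReplayAction = "auto_replay"
  Quarantine      ReplayAction = "quarantine"
  RequireApproval ReplayAction = "require_approval"
  ManualReview    ReplayAction = "manual_review"
)

// triage maps reason codes to default replay actions, following the
// failure classes in the table above.
var triage = map[string]ReplayAction{
  // Transient infrastructure: safe to auto-replay with the same key.
  "timeout":                AutoReplay,
  "dependency_unavailable": AutoReplay,
  "no_workers":             AutoReplay,
  // Poison payloads: block replay until payload or adapter is fixed.
  "schema_invalid":         Quarantine,
  "parse_error":            Quarantine,
  "tool_contract_mismatch": Quarantine,
  // Policy and governance: re-evaluate and gate on approval.
  "denied":                   RequireApproval,
  "approval_required":        RequireApproval,
  "policy_snapshot_mismatch": RequireApproval,
  // Uncertain side effects: humans decide.
  "partial_external_write": ManualReview,
  "unknown_commit_state":   ManualReview,
}

// Classify returns the replay action for a reason code, defaulting to
// manual review so unknown or vague codes never auto-replay.
func Classify(reasonCode string) ReplayAction {
  if action, ok := triage[reasonCode]; ok {
    return action
  }
  return ManualReview
}

func main() {
  fmt.Println(Classify("timeout"))        // auto_replay
  fmt.Println(Classify("denied"))         // require_approval
  fmt.Println(Classify("weird_new_code")) // manual_review
}
```

Defaulting unknown codes to manual review is the conservative choice: it makes a missing taxonomy entry visible instead of silently replaying.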

Cordum runtime evidence

These controls are verified against current source and docs, not inferred from marketing copy.

| Control | Current behavior | Evidence | Why it matters |
|---|---|---|---|
| Retry ceiling before DLQ | Scheduler caps scheduling attempts at 50 with 1s-30s backoff (~25 min). | core/controlplane/scheduler/engine.go | Puts a deterministic boundary on auto-retry loops. |
| Retry exhaustion reason code | After the ceiling, state moves to FAILED and the DLQ emit uses reason `max_scheduling_retries`. | core/controlplane/scheduler/engine.go | Makes replay routing rule-based instead of log scraping. |
| DLQ persistence ordering | Bus writes DLQ before `msg.Term()`. DLQ write failure triggers `NakWithDelay(5s)`. | core/infra/bus/nats.go | Reduces message-loss risk between termination and DLQ persistence. |
| DLQ storage layout | Redis stores `dlq:entry:<jobID>` plus `dlq:index`; default entry TTL is 30 days. | docs/redis-operations.md | Enables fast triage queries while bounding storage growth. |

Known caveat (important)

`TestDLQEmitFailureDoesNotBlockStateTransition` documents a bug path: terminal state can be written before DLQ emit succeeds. If emit fails and redelivery sees terminal state, DLQ entry can be missed. Plan replay runbooks accordingly.

Implementation examples

DLQ triage policy (YAML)

dlq-triage.yaml
YAML
version: v1
dlq_rules:
  - match:
      reason_code: max_scheduling_retries
    action: auto_replay
    max_replays: 2
    backoff: exponential_jitter

  - match:
      reason_code: schema_invalid
    action: quarantine
    owner: integration-team

  - match:
      reason_code: denied
    action: require_approval
    approver_group: platform-governance

  - match:
      reason_code: unknown_commit_state
    action: manual_review

Replay worker skeleton (Go)

replay_worker.go
Go
// ReplayDLQ re-dispatches a DLQ entry only after an idempotency check
// and a fresh policy evaluation. policyClient, dispatcher, and
// AlreadyCommitted are placeholders for your own implementations.
func ReplayDLQ(ctx context.Context, entry DLQEntry) error {
  // Skip entries whose side effects already committed.
  if AlreadyCommitted(entry.IdempotencyKey) {
    return nil
  }

  // Re-evaluate policy at replay time, not at original enqueue time.
  decision, err := policyClient.Evaluate(ctx, entry.Request)
  if err != nil {
    return fmt.Errorf("policy evaluate: %w", err)
  }
  if decision == "deny" {
    return fmt.Errorf("replay denied by policy")
  }

  // Preserve the original idempotency key to avoid duplicate side effects.
  entry.Request.IdempotencyKey = entry.IdempotencyKey
  return dispatcher.Dispatch(ctx, entry.Request)
}
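The `AlreadyCommitted` helper in the skeleton is a placeholder. A minimal in-process sketch looks like this; note that a real deployment needs durable, shared storage (for example Redis `SETNX` or a unique database constraint) so the check survives restarts and works across workers:

```go
package main

import (
  "fmt"
  "sync"
)

// IdempotencyGuard records which idempotency keys have already
// committed. In-memory and illustrative only.
type IdempotencyGuard struct {
  mu        sync.Mutex
  committed map[string]bool
}

func NewIdempotencyGuard() *IdempotencyGuard {
  return &IdempotencyGuard{committed: make(map[string]bool)}
}

// MarkCommitted records a key after its side effect durably succeeds.
func (g *IdempotencyGuard) MarkCommitted(key string) {
  g.mu.Lock()
  defer g.mu.Unlock()
  g.committed[key] = true
}

// AlreadyCommitted reports whether a key's side effect already ran.
func (g *IdempotencyGuard) AlreadyCommitted(key string) bool {
  g.mu.Lock()
  defer g.mu.Unlock()
  return g.committed[key]
}

func main() {
  guard := NewIdempotencyGuard()
  key := "run_2f91:step_3"

  fmt.Println(guard.AlreadyCommitted(key)) // false: safe to dispatch
  guard.MarkCommitted(key)
  fmt.Println(guard.AlreadyCommitted(key)) // true: replay is a no-op
}
```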

DLQ entry shape for triage (JSON)

dlq-entry.json
JSON
{
  "job_id": "job_74c2",
  "topic": "tool.github.pr.create",
  "status": "FAILED",
  "reason_code": "max_scheduling_retries",
  "reason": "max scheduling retries exceeded (attempts=50)",
  "attempts": 50,
  "idempotency_key": "run_2f91:step_3",
  "policy_snapshot": "sha256:ab91...",
  "replay_status": "pending_review"
}

Limitations and tradeoffs

  • Auto replay lowers pager load, but it can delay detection of systemic defects.
  • Strict approval gates improve safety, but they increase MTTR for low-risk transient failures.
  • Reason-code quality is a hard dependency; vague errors turn replay into guesswork.
  • Current consistency tests document a DLQ caveat: if the DLQ emit fails after the terminal state transition, the job can stay terminal while the DLQ record is lost (`engine_consistency_test.go`, BUG-8).
  • DLQ add/trim is not fully transactional in the documented tests; TTL mitigates orphaned entries but does not remove the risk entirely.

Next step

Run this as a one-sprint reliability upgrade:

  1. Define a reason-code taxonomy and a default replay action per reason.
  2. Enforce policy re-evaluation plus idempotency on every replay request.
  3. Alert on DLQ growth rate and replay-denial rate, not only queue depth.
  4. Rehearse one incident where replay is intentionally denied by policy.
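Step 3 can start as simple window math rather than a new observability stack. A sketch of both signals, with made-up thresholds that you should tune against your own baseline:

```go
package main

import "fmt"

// dlqAlerts flags the two signals step 3 calls out: DLQ growth rate
// over a time window and the replay-denial rate. The thresholds
// (1 entry/min, 10% denials) are illustrative assumptions only.
func dlqAlerts(depthStart, depthEnd, windowMinutes, replays, denials int) []string {
  var alerts []string

  // Growth rate: entries added per minute, independent of total depth.
  growthPerMin := float64(depthEnd-depthStart) / float64(windowMinutes)
  if growthPerMin > 1.0 {
    alerts = append(alerts, fmt.Sprintf("dlq_growth_rate=%.1f/min", growthPerMin))
  }

  // Denial rate: policy saying no at replay time is a governance signal.
  if replays > 0 {
    denialRate := float64(denials) / float64(replays)
    if denialRate > 0.10 {
      alerts = append(alerts, fmt.Sprintf("replay_denial_rate=%.0f%%", denialRate*100))
    }
  }
  return alerts
}

func main() {
  // 15 new DLQ entries in 10 minutes; 4 of 20 replays denied by policy.
  for _, alert := range dlqAlerts(5, 20, 10, 20, 4) {
    fmt.Println(alert)
  }
}
```

Alerting on rate rather than depth distinguishes "a burst of new failures is arriving" from "old entries are awaiting triage", which need different responses.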

Continue with AI Agent Rollback and Compensation and AI Agent Circuit Breaker Pattern.

Replay with intent

If replay means “try again and hope,” your DLQ is writing tomorrow’s incident report today.