The production problem
A generic DLQ setup assumes retrying is cheap. Autonomous agents break that assumption. A replayed tool call can create a second ticket, send a second email, or push a second config change.
Example: 1,000 nightly jobs with a 2% failure rate. Blind replay of the 20 failed jobs is manageable when every failure is transient; it is dangerous when even two or three of them are in an unknown commit state.
The goal is not “empty DLQ quickly.” The goal is “recover safely with an audit trail.”
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Amazon SQS DLQ redrive (Developer Guide) | Operational redrive controls: destination queue choice, velocity control, task limits, and queue permissions. | No policy re-evaluation model for autonomous agents before replaying side-effecting actions. |
| AWS Compute Blog: replay with backoff | Practical replay counter and exponential backoff with jitter, plus a final human-operated queue. | No cross-system idempotency contract or governance gate tied to replay risk class. |
| RabbitMQ at-least-once dead lettering | Critical durability tradeoff between at-most-once and at-least-once dead lettering. | No control-plane workflow for approval-required replay when commit state is uncertain. |
Failure triage model
Treat replay as a policy decision, not a queue operation.
| Failure class | Signal | Replay policy |
|---|---|---|
| Transient infrastructure | timeout, dependency_unavailable, no_workers | Auto replay with capped exponential backoff using the same idempotency key |
| Poison payload | schema_invalid, parse_error, tool_contract_mismatch | Do not replay until payload or adapter mapping is fixed |
| Policy and governance | denied, approval_required, policy_snapshot_mismatch | Re-evaluate against latest policy and require explicit approval for high risk |
| Unknown commit state | partial_external_write, unknown_commit_state | Run compensating read/check first; default to manual decision |
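To make the "policy decision, not queue operation" framing concrete, here is a minimal table-driven triage sketch. The `ReplayAction` values and the `Triage` function are illustrative names for this example, not part of any documented API; the mapping simply mirrors the table above, with unknown reason codes defaulting to manual review.

```go
package triage

// ReplayAction is what the control plane should do with a DLQ entry.
// These values mirror the triage table above and are illustrative only.
type ReplayAction string

const (
	AutoReplay      ReplayAction = "auto_replay"
	Quarantine      ReplayAction = "quarantine"
	RequireApproval ReplayAction = "require_approval"
	ManualReview    ReplayAction = "manual_review"
)

// defaultPolicy maps reason codes to replay actions, following the table above.
var defaultPolicy = map[string]ReplayAction{
	// Transient infrastructure: safe to auto-replay with the same idempotency key.
	"timeout":                AutoReplay,
	"dependency_unavailable": AutoReplay,
	"no_workers":             AutoReplay,
	// Poison payload: fix the payload or adapter mapping before any replay.
	"schema_invalid":         Quarantine,
	"parse_error":            Quarantine,
	"tool_contract_mismatch": Quarantine,
	// Policy and governance: re-evaluate against latest policy and gate on approval.
	"denied":                   RequireApproval,
	"approval_required":        RequireApproval,
	"policy_snapshot_mismatch": RequireApproval,
	// Unknown commit state: never replay automatically.
	"partial_external_write": ManualReview,
	"unknown_commit_state":   ManualReview,
}

// Triage resolves a reason code to a replay action, defaulting to manual review
// so an unrecognized or misspelled reason code can never trigger an automatic
// side-effecting replay.
func Triage(reasonCode string) ReplayAction {
	if action, ok := defaultPolicy[reasonCode]; ok {
		return action
	}
	return ManualReview
}
```

Defaulting to manual review is the safety valve: a new reason code added upstream degrades to a human decision, not to a blind replay.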
Cordum runtime evidence
These controls are verified against current source and docs, not inferred from marketing copy.
| Control | Current behavior | Evidence | Why it matters |
|---|---|---|---|
| Retry ceiling before DLQ | Scheduler caps scheduling attempts at 50 with 1s-30s backoff (~25 min). | core/controlplane/scheduler/engine.go | Puts a deterministic boundary on auto-retry loops. |
| Retry exhaustion reason code | After ceiling, state moves to FAILED and DLQ emit uses reason `max_scheduling_retries`. | core/controlplane/scheduler/engine.go | Makes replay routing rule-based instead of log scraping. |
| DLQ persistence ordering | Bus writes DLQ before `msg.Term()`. DLQ write failure triggers `NakWithDelay(5s)`. | core/infra/bus/nats.go | Reduces message-loss risk between termination and DLQ persistence. |
| DLQ storage layout | Redis stores `dlq:entry:<jobID>` plus `dlq:index`; default entry TTL is 30 days. | docs/redis-operations.md | Enables fast triage queries while bounding storage growth. |
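The retry ceiling and backoff window above are easy to mirror in your own consumers. The sketch below is not the Cordum scheduler; it is only an illustration of the documented shape (50-attempt cap, backoff clamped to 1s-30s) with full jitter added, using hypothetical names.

```go
package replaybackoff

import (
	"math/rand"
	"time"
)

const (
	maxAttempts = 50               // documented scheduling-attempt ceiling
	baseDelay   = 1 * time.Second  // documented lower bound of the backoff window
	maxDelay    = 30 * time.Second // documented upper bound of the backoff window
)

// NextDelay returns the delay before the given attempt (1-based), and false once
// the ceiling is reached, at which point the job should move to FAILED and be
// emitted to the DLQ with a machine-readable reason code.
func NextDelay(attempt int) (time.Duration, bool) {
	if attempt >= maxAttempts {
		return 0, false
	}
	// Double the delay per attempt, clamped to the 1s-30s window.
	delay := baseDelay
	for i := 1; i < attempt && delay < maxDelay; i++ {
		delay *= 2
	}
	if delay > maxDelay {
		delay = maxDelay
	}
	// Full jitter in [baseDelay, delay] keeps correlated failures from
	// hammering a recovering dependency in lockstep.
	return baseDelay + time.Duration(rand.Int63n(int64(delay-baseDelay)+1)), true
}
```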
Known caveat (important)
`TestDLQEmitFailureDoesNotBlockStateTransition` documents a bug path: the terminal state can be written before the DLQ emit succeeds. If the emit fails and a redelivery then sees the terminal state, the DLQ entry can be missed entirely. Plan replay runbooks accordingly.
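A cheap mitigation is a periodic reconciliation pass that flags terminal jobs with no matching DLQ record, so a lost emit surfaces as an alert instead of a silent gap. The following is a sketch only: `Store`, `ListFailedJobs`, `HasDLQEntry`, and the `Job` shape are hypothetical hooks into your job store and DLQ store, not Cordum APIs.

```go
package reconcile

import (
	"context"
	"fmt"
	"time"
)

// Job is a minimal view of a terminal job record (hypothetical shape).
type Job struct {
	ID string
}

// Store abstracts the job store and DLQ store; both methods are hypothetical.
type Store interface {
	ListFailedJobs(ctx context.Context, since time.Time) ([]Job, error)
	HasDLQEntry(ctx context.Context, jobID string) (bool, error) // e.g. an existence check on the DLQ entry key
}

// ReconcileMissingDLQEntries flags FAILED jobs that have no DLQ record, covering
// the emit-after-terminal-state path described above. report is whatever alerting
// hook the runbook uses.
func ReconcileMissingDLQEntries(ctx context.Context, s Store, since time.Time, report func(Job)) error {
	jobs, err := s.ListFailedJobs(ctx, since)
	if err != nil {
		return fmt.Errorf("list failed jobs: %w", err)
	}
	for _, job := range jobs {
		ok, err := s.HasDLQEntry(ctx, job.ID)
		if err != nil {
			return fmt.Errorf("check dlq entry for %s: %w", job.ID, err)
		}
		if !ok {
			// Terminal state with no DLQ record: surface it for manual triage
			// rather than letting the failure vanish from replay tooling.
			report(job)
		}
	}
	return nil
}
```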
Implementation examples
DLQ triage policy (YAML)
```yaml
version: v1
dlq_rules:
  - match:
      reason_code: max_scheduling_retries
    action: auto_replay
    max_replays: 2
    backoff: exponential_jitter
  - match:
      reason_code: schema_invalid
    action: quarantine
    owner: integration-team
  - match:
      reason_code: denied
    action: require_approval
    approver_group: platform-governance
  - match:
      reason_code: unknown_commit_state
    action: manual_review
```
Replay worker skeleton (Go)
```go
func ReplayDLQ(ctx context.Context, entry DLQEntry) error {
	// Skip work that already committed under this idempotency key.
	if AlreadyCommitted(entry.IdempotencyKey) {
		return nil
	}
	// Re-evaluate against the latest policy before any side-effecting dispatch.
	decision, err := policyClient.Evaluate(ctx, entry.Request)
	if err != nil {
		return fmt.Errorf("policy evaluate: %w", err)
	}
	if decision == "deny" {
		return fmt.Errorf("replay denied by policy")
	}
	// Preserve the original idempotency key to avoid duplicate side effects.
	entry.Request.IdempotencyKey = entry.IdempotencyKey
	return dispatcher.Dispatch(ctx, entry.Request)
}
```
DLQ entry shape for triage (JSON)
```json
{
  "job_id": "job_74c2",
  "topic": "tool.github.pr.create",
  "status": "FAILED",
  "reason_code": "max_scheduling_retries",
  "reason": "max scheduling retries exceeded (attempts=50)",
  "attempts": 50,
  "idempotency_key": "run_2f91:step_3",
  "policy_snapshot": "sha256:ab91...",
  "replay_status": "pending_review"
}
```
Limitations and tradeoffs
- Auto replay lowers pager load, but it can delay detection of systemic defects.
- Strict approval gates improve safety, but they increase MTTR for low-risk transient failures.
- Reason-code quality is a hard dependency; vague errors turn replay into guesswork.
- Current consistency tests document a DLQ caveat: if DLQ emit fails after the terminal state transition, the job can stay terminal while the DLQ record is lost (`engine_consistency_test.go`, BUG-8).
- DLQ add/trim is not fully transactional in documented tests; TTL mitigates orphaned entries but does not remove the risk entirely.
Next step
Run this as a one-sprint reliability upgrade:
1. Define a reason-code taxonomy and a default replay action per reason.
2. Enforce policy re-evaluation plus idempotency on every replay request.
3. Alert on DLQ growth rate and replay-denial rate, not only queue depth (see the metrics sketch after this list).
4. Rehearse one incident where replay is intentionally denied by policy.
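For step 3, two counters are usually enough to alert on rate rather than depth. The sketch below assumes a Prometheus-style metrics stack, which is an assumption of this example and not something the Cordum docs prescribe; all metric and function names are illustrative.

```go
package dlqmetrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// dlqEntriesTotal counts DLQ writes by reason code so growth rate can be
	// alerted on per failure class, not just on total queue depth.
	dlqEntriesTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "dlq_entries_total",
		Help: "DLQ entries written, labeled by reason code.",
	}, []string{"reason_code"})

	// replayDeniedTotal counts replays blocked by policy re-evaluation; a rising
	// rate usually means policy drift between enqueue time and replay time.
	replayDeniedTotal = promauto.NewCounter(prometheus.CounterOpts{
		Name: "dlq_replay_denied_total",
		Help: "Replay requests denied by policy re-evaluation.",
	})
)

// RecordDLQEntry is called after a DLQ write succeeds.
func RecordDLQEntry(reasonCode string) {
	dlqEntriesTotal.WithLabelValues(reasonCode).Inc()
}

// RecordReplayDenied is called when a replay is denied by policy.
func RecordReplayDenied() {
	replayDeniedTotal.Inc()
}
```

With counters like these, alert on windowed rates (for example, the 15-minute rate of each metric) rather than on absolute counts, so a systemic defect shows up as a spike even while replay keeps the queue shallow.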
Continue with AI Agent Rollback and Compensation, then AI Agent Circuit Breaker Pattern.