The production problem
Most teams treat DLQ as the end of a reliability story. It is usually the middle.
A failed job only becomes actionable if failure context survives transport incidents, broker hiccups, and store timeouts.
If DLQ emission fails during an outage, incident responders lose the evidence trail and spend the next hour reconstructing state from logs.
That reconstruction step is where MTTR gets expensive.
What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| AWS SQS dead-letter queues | How messages move to a DLQ after `maxReceiveCount`, plus retention and redrive guidance. | Does not cover the reliability of the DLQ emission step itself when your own control plane publishes DLQ events. |
| Google Pub/Sub dead-letter topics | Configuring dead-letter topics and max delivery attempts for undeliverable messages. | No design pattern for dual-write DLQ persistence when bus and store can fail independently. |
| RabbitMQ dead-letter exchanges | DLX routing mechanics and policy configuration details. | Focuses on broker routing semantics, not application-level retry and telemetry when DLQ publishing fails. |
Cordum runtime mechanics
| Boundary | Current behavior | Why it matters |
|---|---|---|
| DLQ write order | `emitDLQ` writes to `DLQSink` first (if configured), then publishes `JobResult` to `sys.job.dlq`. | Durable store can still capture failures when bus publish is unavailable. |
| Retry policy | `emitDLQWithRetry` retries once after `500ms` and then stops. | Simple and bounded, but short outages can still drop bus-level DLQ events. |
| Observability | Final retry failure increments `IncDLQEmitFailure(topic)` metric. | You get a signal for DLQ-emission degradation instead of silent loss. |
| Timeout budget | Sink add uses `storeOpTimeout = 2s` context deadline. | Prevents long blocking in failure paths but can reject slow storage during incident conditions. |
| Test coverage | Hardening tests verify success-on-second-try, permanent-failure metric increment, and sink persistence when bus is down. | Behavior is intentional and regression-protected. |
DLQ code paths
Sink-first emit path
```go
// core/controlplane/scheduler/engine.go (excerpt)
// dlqSubject ("sys.job.dlq") and packet (the marshaled JobResult) are defined
// elsewhere in the file.
func (e *Engine) emitDLQ(jobID, topic string, status pb.JobStatus, reason, reasonCode string) error {
	// Sink-first: persist to the durable DLQ store before touching the bus.
	if e.dlqSink != nil {
		ctx, cancel := context.WithTimeout(e.ctx, storeOpTimeout)
		err := e.dlqSink.Add(ctx, DLQEntry{JobID: jobID, Topic: topic, Status: status.String(), Reason: reason, ReasonCode: reasonCode})
		cancel()
		if err != nil {
			return fmt.Errorf("dlq sink add failed: %w", err)
		}
	}
	// No bus configured: the sink write alone counts as success.
	if e.bus == nil {
		return nil
	}
	return e.bus.Publish(dlqSubject, packet)
}
```
Single-retry wrapper
```go
// core/controlplane/scheduler/engine.go (excerpt)
func (e *Engine) emitDLQWithRetry(jobID, topic string, status pb.JobStatus, reason, reasonCode string) error {
	err := e.emitDLQ(jobID, topic, status, reason, reasonCode)
	if err == nil {
		return nil
	}
	// One bounded retry after 500ms; respect engine shutdown while waiting.
	retryTimer := time.NewTimer(500 * time.Millisecond)
	select {
	case <-e.ctx.Done():
		retryTimer.Stop()
		return e.ctx.Err()
	case <-retryTimer.C:
	}
	err = e.emitDLQ(jobID, topic, status, reason, reasonCode)
	// Final failure is surfaced as a metric rather than silently dropped.
	if err != nil && e.metrics != nil {
		e.metrics.IncDLQEmitFailure(topic)
	}
	return err
}
```
Hardening tests that lock behavior
```go
// core/controlplane/scheduler/engine_hardening_test.go (excerpt)
func TestDLQEmitRetrySucceedsOnSecondAttempt(t *testing.T) {
	bus := &failNBus{failCount: 1} // first publish fails, second succeeds
	err := engine.emitDLQWithRetry("job-retry", "job.test", pb.JobStatus_JOB_STATUS_FAILED, "test", "test_code")
	// expect success after one retry
}

func TestDLQEmitRetryPermanentFailureMetric(t *testing.T) {
	bus := &failNBus{failCount: 999} // publish never succeeds
	err := engine.emitDLQWithRetry("job-perm", "job.test", pb.JobStatus_JOB_STATUS_FAILED, "test", "test_code")
	// expect IncDLQEmitFailure called once
}

func TestDLQEmitPersistsToSinkWhenBusUnavailable(t *testing.T) {
	// bus publish fails, sink still stores entry
}
```
Validation runbook
Test DLQ-emission reliability with fault injection before you need it during an incident.
```shell
# 1) Force bus publish failure while keeping DLQ sink healthy
#    - verify DLQ API still lists new entries
# 2) Measure retry envelope
#    - first emit failure
#    - one retry after 500ms
#    - no further retries
# 3) Alerting check
#    - confirm dlq emit failure metric increments on final failure
# 4) Incident triage rule
#    - if sink writes succeed but bus publish fails: use API/Redis-based DLQ triage
#    - if sink and bus both fail: escalate as control-plane visibility incident
# 5) Recovery test
#    - restore bus
#    - verify new DLQ events resume without backlog corruption
```
Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| One retry at 500ms (current) | Bounded latency and simple failure behavior. | Insufficient for outages longer than sub-second jitter windows. |
| Multi-attempt exponential retry | Higher success probability during brief infrastructure turbulence. | Longer scheduler critical-path latency and more retry traffic under systemic outage. |
| Sink-only durability with async bus fanout | Stronger forensic durability when bus is unstable. | Adds eventual-consistency complexity between storage and event-stream consumers. |
- A single retry keeps code simple, but it assumes DLQ transport failures are short and rare.
- Sink-first ordering helps durability, but downstream consumers that only watch `sys.job.dlq` can still miss events during bus outages.
- Current tests validate core behavior, but there is still no long-outage soak test for repeated DLQ emit failures under sustained pressure.
Next step
Implement this next:
1. Add topic-scoped `dlq_emit_retry_attempts` and `dlq_emit_retry_delay` config knobs.
2. Emit a structured audit event whenever `IncDLQEmitFailure` increments.
3. Add a periodic reconciler that re-publishes sink-only DLQ entries when bus health recovers.
4. Run chaos drills where bus is down for 5-10 minutes and measure forensic data loss.
Continue with AI Agent DLQ Replay Patterns and AI Agent Poison Message Handling.