
AI Agent DLQ Emission Reliability

Dead-letter handling is not done when you classify a failure. It is done when failure evidence is durable and queryable.

Deep Dive · 11 min read · Apr 2026
TL;DR
- Cordum emits DLQ data through two paths: an optional durable `DLQSink.Add(...)` write and a bus publish to `sys.job.dlq`.
- `emitDLQWithRetry` performs one retry after 500ms, then increments `IncDLQEmitFailure(topic)` on final failure.
- If bus publish is down but the sink is up, entry persistence still succeeds and API-based DLQ triage can continue.
- Generic DLQ guidance explains routing rules but rarely covers DLQ-emission failure behavior as a first-class reliability concern.
Failure mode

The primary failure has already happened. If DLQ emission fails too, operators lose forensic context exactly when they need it most.

Current behavior

Cordum retries DLQ emission once after 500ms and records a dedicated metric on final failure.

Operational payoff

Sink-first write order reduces visibility loss when the event bus is degraded.

Scope

This guide focuses on DLQ emission reliability in the scheduler failure path, not generic queue retry policy tuning.

The production problem

Most teams treat DLQ as the end of a reliability story. It is usually the middle.

A failed job only becomes actionable if failure context survives transport incidents, broker hiccups, and store timeouts.

If DLQ emission fails during an outage, incident responders lose the evidence trail and spend the next hour reconstructing state from logs.

That reconstruction step is where MTTR gets expensive.

What top results cover and miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| AWS SQS dead-letter queues | How messages move to a DLQ after `maxReceiveCount`, plus retention and redrive guidance. | Does not cover the reliability of the DLQ emission step itself when your own control plane publishes DLQ events. |
| Google Pub/Sub dead-letter topics | Configuring dead-letter topics and max delivery attempts for undeliverable messages. | No design pattern for dual-write DLQ persistence when bus and store can fail independently. |
| RabbitMQ dead-letter exchanges | DLX routing mechanics and policy configuration details. | Focuses on broker routing semantics, not application-level retry and telemetry when DLQ publishing fails. |

Cordum runtime mechanics

| Boundary | Current behavior | Why it matters |
| --- | --- | --- |
| DLQ write order | `emitDLQ` writes to `DLQSink` first (if configured), then publishes `JobResult` to `sys.job.dlq`. | The durable store can still capture failures when bus publish is unavailable. |
| Retry policy | `emitDLQWithRetry` retries once after `500ms` and then stops. | Simple and bounded, but short outages can still drop bus-level DLQ events. |
| Observability | Final retry failure increments the `IncDLQEmitFailure(topic)` metric. | You get a signal for DLQ-emission degradation instead of silent loss. |
| Timeout budget | Sink add uses a `storeOpTimeout = 2s` context deadline. | Prevents long blocking in failure paths but can reject slow storage during incident conditions. |
| Test coverage | Hardening tests verify success-on-second-try, permanent-failure metric increment, and sink persistence when the bus is down. | Behavior is intentional and regression-protected. |

DLQ code paths

Sink-first emit path

core/controlplane/scheduler/engine.go
```go
// core/controlplane/scheduler/engine.go (excerpt)
func (e *Engine) emitDLQ(jobID, topic string, status pb.JobStatus, reason, reasonCode string) error {
	// Sink-first: persist to the durable store before attempting the bus publish.
	if e.dlqSink != nil {
		ctx, cancel := context.WithTimeout(e.ctx, storeOpTimeout)
		err := e.dlqSink.Add(ctx, DLQEntry{JobID: jobID, Topic: topic, Status: status.String(), Reason: reason, ReasonCode: reasonCode})
		cancel()
		if err != nil {
			return fmt.Errorf("dlq sink add failed: %w", err)
		}
	}
	if e.bus == nil {
		return nil
	}
	// packet construction (marshaling the JobResult) is elided in this excerpt.
	return e.bus.Publish(dlqSubject, packet)
}
```

Single-retry wrapper

core/controlplane/scheduler/engine.go
```go
// core/controlplane/scheduler/engine.go (excerpt)
func (e *Engine) emitDLQWithRetry(jobID, topic string, status pb.JobStatus, reason, reasonCode string) error {
	err := e.emitDLQ(jobID, topic, status, reason, reasonCode)
	if err == nil {
		return nil
	}
	// Exactly one retry after 500ms; respect engine shutdown while waiting.
	retryTimer := time.NewTimer(500 * time.Millisecond)
	select {
	case <-e.ctx.Done():
		retryTimer.Stop()
		return e.ctx.Err()
	case <-retryTimer.C:
	}
	err = e.emitDLQ(jobID, topic, status, reason, reasonCode)
	if err != nil && e.metrics != nil {
		e.metrics.IncDLQEmitFailure(topic)
	}
	return err
}
```

Hardening tests that lock behavior

core/controlplane/scheduler/engine_hardening_test.go
```go
// core/controlplane/scheduler/engine_hardening_test.go (excerpt)
func TestDLQEmitRetrySucceedsOnSecondAttempt(t *testing.T) {
	bus := &failNBus{failCount: 1} // fail the first publish only
	err := engine.emitDLQWithRetry("job-retry", "job.test", pb.JobStatus_JOB_STATUS_FAILED, "test", "test_code")
	// expect success after one retry
}

func TestDLQEmitRetryPermanentFailureMetric(t *testing.T) {
	bus := &failNBus{failCount: 999} // fail every publish
	err := engine.emitDLQWithRetry("job-perm", "job.test", pb.JobStatus_JOB_STATUS_FAILED, "test", "test_code")
	// expect IncDLQEmitFailure called once
}

func TestDLQEmitPersistsToSinkWhenBusUnavailable(t *testing.T) {
	// bus publish fails, sink still stores entry
}
```
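The `failNBus` double used above is not shown in the excerpt. A minimal sketch, assuming the bus exposes a `Publish(subject string, data []byte) error` method, might look like this:

```go
package main

import (
	"errors"
	"fmt"
)

// failNBus fails the first failCount Publish calls, then succeeds,
// letting tests exercise both the retry-success and permanent-failure paths.
type failNBus struct {
	failCount int
	calls     int
}

func (b *failNBus) Publish(subject string, data []byte) error {
	b.calls++
	if b.calls <= b.failCount {
		return errors.New("bus unavailable")
	}
	return nil
}

func main() {
	bus := &failNBus{failCount: 1}
	fmt.Println(bus.Publish("sys.job.dlq", nil) != nil) // first call fails: true
	fmt.Println(bus.Publish("sys.job.dlq", nil) == nil) // retry succeeds: true
}
```

Setting `failCount: 999` makes every publish fail, which is how the permanent-failure test drives the metric increment.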

Validation runbook

Test DLQ-emission reliability with fault injection before you need it during an incident.

dlq-emission-reliability-runbook.sh
```bash
# 1) Force bus publish failure while keeping the DLQ sink healthy
#    - verify the DLQ API still lists new entries

# 2) Measure the retry envelope
#    - first emit failure
#    - one retry after 500ms
#    - no further retries

# 3) Alerting check
#    - confirm the dlq-emit-failure metric increments on final failure

# 4) Incident triage rule
#    - if sink writes succeed but bus publish fails: use API/Redis-based DLQ triage
#    - if sink and bus both fail: escalate as a control-plane visibility incident

# 5) Recovery test
#    - restore the bus
#    - verify new DLQ events resume without backlog corruption
```

Limitations and tradeoffs

| Approach | Upside | Downside |
| --- | --- | --- |
| One retry at 500ms (current) | Bounded latency and simple failure behavior. | Cannot survive outages longer than the roughly one-second retry window. |
| Multi-attempt exponential retry | Higher success probability during brief infrastructure turbulence. | Longer scheduler critical-path latency and more retry traffic during a systemic outage. |
| Sink-only durability with async bus fanout | Stronger forensic durability when the bus is unstable. | Adds eventual-consistency complexity between storage and event-stream consumers. |
- A single retry keeps code simple, but it assumes DLQ transport failures are short and rare.
- Sink-first ordering helps durability, but downstream consumers that only watch `sys.job.dlq` can still miss events during bus outages.
- Current tests validate core behavior, but there is still no long-outage soak test for repeated DLQ emit failures under sustained pressure.

Next step

Implement this next:

1. Add topic-scoped `dlq_emit_retry_attempts` and `dlq_emit_retry_delay` config knobs.
2. Emit a structured audit event whenever `IncDLQEmitFailure` increments.
3. Add a periodic reconciler that re-publishes sink-only DLQ entries when bus health recovers.
4. Run chaos drills where the bus is down for 5-10 minutes and measure forensic data loss.

Continue with AI Agent DLQ Replay Patterns and AI Agent Poison Message Handling.

DLQ reliability is second-order reliability

First your job fails. Then your failure-reporting path fails. Design for the second failure before production does it for you.