
AI Agent Retry Intent Propagation (2026): From `RetryAfter` to JetStream `NakWithDelay`

Your retry policy can look correct on paper and still fail in production if delay intent disappears between scheduler and broker. This guide shows how to preserve the contract end to end.

Reliability · 13 min read · Updated Apr 2026

TL;DR

  • Retries fail quietly when delay intent is dropped during error wrapping or refactors.
  • Cordum maps retry intent by interface (`RetryDelay()`), not by concrete error type.
  • `processBusMsg` routes retryable errors with delay > 0 to `msg.NakWithDelay(delay)`.
  • Treat retry mapping as a contract with tests, not as helper-function trivia.

The production problem

Retry behavior breaks most often at boundaries, not inside retry helpers. The scheduler classifies a transient fault, but the bus handler no longer sees retry metadata and ACKs the message.

That bug is expensive because it looks like success from the queue perspective. No redelivery, no escalation, and no quick signal that work was skipped.

Cordum avoids this by carrying delay intent through an interface contract (`RetryDelay()`), then converting that into explicit JetStream action (`NakWithDelay`).

What top-ranking sources cover vs. miss

| Source | What it covers | What it misses |
| --- | --- | --- |
| RFC 9110 (`Retry-After`) | Defines retry timing semantics as HTTP-date or delay-seconds for follow-up requests. | Does not explain internal propagation of retry delay across scheduler, queue handler, and broker acks. |
| gRPC Retry Guide | Client retry model, commit points, and retry policy controls like backoff and throttling. | No guidance for translating internal retry intent into JetStream `NakWithDelay` behavior. |
| NATS JetStream Consumers | Ack types, `AckWait`, `MaxDeliver`, `BackOff`, and an explicit note that `nak` is immediate unless `nakWithDelay` is used. | No application-level contract for carrying retry delay from business error classification into broker nack calls. |
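
The RFC 9110 semantics and the internal contract can meet at the classification step. Here is a minimal sketch of that bridge, assuming an upstream dependency that returns a `Retry-After` header; `parseRetryAfter` is a hypothetical helper for illustration, not a Cordum API:

```go
// parseRetryAfter is a hypothetical helper: it translates an RFC 9110
// Retry-After header (delay-seconds or HTTP-date) into a duration that
// can feed the internal RetryAfter(err, delay) contract.
func parseRetryAfter(h http.Header, now time.Time) (time.Duration, bool) {
	v := h.Get("Retry-After")
	if v == "" {
		return 0, false
	}
	// delay-seconds form, e.g. "Retry-After: 120"
	if secs, err := strconv.Atoi(v); err == nil && secs >= 0 {
		return time.Duration(secs) * time.Second, true
	}
	// HTTP-date form, e.g. "Retry-After: Fri, 31 Dec 2027 23:59:59 GMT"
	if t, err := http.ParseTime(v); err == nil {
		if d := t.Sub(now); d > 0 {
			return d, true
		}
		return 0, true // date already passed: retry immediately
	}
	return 0, false
}
```

A caller in the scheduler path could then do `return RetryAfter(err, delay)` when `parseRetryAfter(resp.Header, time.Now())` reports a delay, so external timing hints enter the same internal contract as everything else.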

Retry intent contract

The control-plane contract is simple: classify transient failures once, preserve delay value, and map to queue action without reinterpretation.

| Stage | Mechanism | Risk if broken |
| --- | --- | --- |
| Scheduler classification | `RetryAfter(err, delay)` in scheduler paths | Transient faults become terminal ACKs |
| Cross-package extraction | `errors.As` to an interface with `RetryDelay() time.Duration` | Type-coupled refactors break delay extraction |
| Bus action mapping | `processBusMsg` maps retryable delay > 0 to `msgActionNakDelay` | Delay semantics collapse into immediate retries |
| JetStream ack path | `msg.NakWithDelay(delay)` | Redelivery pacing drifts from intended backoff |
| Corrupt payload handling | `poisonUnmarshalThreshold = 3`, then `Term()` | Poison messages can starve consumers |

Failure modes

| Fault | Observed symptom | Operational effect |
| --- | --- | --- |
| Error converted to plain string | `RetryDelay()` no longer discoverable by `errors.As` | Handler treats a retryable fault as non-retryable and ACKs it |
| Negative delay emitted | Delay is clamped to zero in retry helpers | Immediate NAK loop can spike redeliveries |
| No max-delivery cap | Poison packets keep returning | Pending window gets crowded; useful work slows |
| NAK used without delay for transient faults | Tight retry loop under shared outage | Queue churn and noisy logs |
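
The first failure mode is worth seeing concretely. A minimal sketch, using the helpers defined in the next section: wrapping with `%w` keeps `RetryDelay()` reachable for `errors.As`, while flattening to a string silently drops the intent.

```go
base := scheduler.RetryAfter(errors.New("downstream 503"), 30*time.Second)

// Wrapping with %w preserves the error chain, so errors.As still
// finds the RetryDelay() implementation.
wrapped := fmt.Errorf("publish stage: %w", base)
if delay, ok := bus.RetryDelay(wrapped); ok {
	fmt.Println(delay) // 30s: intent survives the wrap
}

// Converting to a plain string severs the chain: the delay is gone
// and the handler path would ACK this as non-retryable.
flattened := errors.New(wrapped.Error())
if _, ok := bus.RetryDelay(flattened); !ok {
	fmt.Println("retry intent lost")
}
```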

Implementation examples

retry-contract.go

```go
// core/controlplane/scheduler/retry.go

// retryableError's shape is implied by the constructor below and by
// the RetryDelay() contract consumed in core/infra/bus.
type retryableError struct {
	err   error
	delay time.Duration
}

func (e *retryableError) Error() string             { return e.err.Error() }
func (e *retryableError) Unwrap() error             { return e.err }
func (e *retryableError) RetryDelay() time.Duration { return e.delay }

func RetryAfter(err error, delay time.Duration) error {
	if err == nil {
		err = errors.New("retry requested")
	}
	if delay < 0 {
		delay = 0
	}
	return &retryableError{err: err, delay: delay}
}

// core/infra/bus/retry.go

// RetryDelay extracts the requested delay without importing the
// scheduler package: any error implementing the interface matches.
func RetryDelay(err error) (time.Duration, bool) {
	type retryDelayProvider interface {
		RetryDelay() time.Duration
	}
	var rd retryDelayProvider
	if errors.As(err, &rd) {
		delay := rd.RetryDelay()
		if delay < 0 {
			delay = 0
		}
		return delay, true
	}
	return 0, false
}
```
retry-mapping.go

```go
// core/infra/bus/nats.go

// msgAction enumerates the ack decisions; the names below are the
// ones referenced throughout this guide.
type msgAction int

const (
	msgActionAck msgAction = iota
	msgActionNak
	msgActionNakDelay
	msgActionTerm
)

const poisonUnmarshalThreshold uint64 = 3

func processBusMsg(data []byte, handler func(*pb.BusPacket) error, numDelivered uint64) (msgAction, time.Duration) {
	var packet pb.BusPacket
	if err := proto.Unmarshal(data, &packet); err != nil {
		// Corrupt payload: delayed retries, then terminate as poison.
		if numDelivered > poisonUnmarshalThreshold {
			return msgActionTerm, 0
		}
		return msgActionNakDelay, 5 * time.Second
	}

	if err := handler(&packet); err != nil {
		if delay, ok := RetryDelay(err); ok {
			if delay > 0 {
				return msgActionNakDelay, delay
			}
			return msgActionNak, 0
		}
		// Non-retryable by classification: acknowledge deliberately.
		return msgActionAck, 0
	}
	return msgActionAck, 0
}
```
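
The subscribe callback then executes the decision. A minimal sketch, assuming the legacy nats.go JetStream API (`js.Subscribe` with `nats.ManualAck()`); Cordum's actual wiring may differ in details:

```go
// Sketch only: logging and shutdown handling trimmed.
_, err := js.Subscribe(subject, func(msg *nats.Msg) {
	meta, merr := msg.Metadata() // JetStream metadata carries NumDelivered
	if merr != nil {
		return
	}
	action, delay := processBusMsg(msg.Data, handler, meta.NumDelivered)
	switch action {
	case msgActionAck:
		_ = msg.Ack()
	case msgActionNak:
		_ = msg.Nak() // immediate redelivery
	case msgActionNakDelay:
		_ = msg.NakWithDelay(delay) // redeliver after the requested delay
	case msgActionTerm:
		_ = msg.Term() // stop redelivering a poison message
	}
}, nats.ManualAck())
if err != nil {
	log.Fatal(err)
}
```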
validation-runbook.sh

```bash
# 1) Verify constants in bus runtime
rg --line-number "maxJSRedeliveries|defaultAckWait|poisonUnmarshalThreshold" core/infra/bus/nats.go

# 2) Verify retry contract exists in both scheduler and bus helpers
# (parentheses and dots escaped so rg treats them literally)
rg --line-number "func RetryAfter\(|RetryDelay\(\) time\.Duration" core/controlplane/scheduler/retry.go core/infra/bus/retry.go

# 3) Verify mapping from RetryDelay -> NakWithDelay
rg --line-number "processBusMsg|RetryDelay\(|msgActionNakDelay|NakWithDelay" core/infra/bus/nats.go

# 4) Verify docs claim matches runtime behavior
rg --line-number "retry after|NAK-with-delay" docs/AGENT_PROTOCOL.md
```

Operational defaults

Bus defaults

`defaultAckWait` is 10 minutes and max redeliveries are capped at 100 in JetStream mode. That gives long-running handlers room to finish while still bounding poison retries.
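
A sketch of how those two knobs could be applied at subscribe time, again assuming the legacy nats.go JetStream options; the constant names mirror the ones the runbook greps for:

```go
const (
	defaultAckWait    = 10 * time.Minute
	maxJSRedeliveries = 100
)

_, err := js.Subscribe(subject, callback,
	nats.ManualAck(),                   // ack decisions stay with processBusMsg
	nats.AckWait(defaultAckWait),       // long handlers get 10m before redelivery
	nats.MaxDeliver(maxJSRedeliveries), // hard cap so poison messages cannot loop forever
)
```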

Decode protection

Unmarshal failures are NAKed with a 5-second delay up to the third delivery, then terminated with `Term()`. It is a practical guard against queue starvation caused by corrupt-payload loops.
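
A quick check of that escalation path, assuming the `processBusMsg` signature and constants from the examples above:

```go
// Corrupt bytes should produce a delayed NAK up to the threshold,
// then a Term on the next delivery.
func TestPoisonPayloadEscalatesToTerm(t *testing.T) {
	garbage := []byte{0xFF, 0x00, 0xBA, 0xD1} // invalid protobuf wire data
	noopHandler := func(*pb.BusPacket) error { return nil }

	for delivered := uint64(1); delivered <= poisonUnmarshalThreshold; delivered++ {
		action, delay := processBusMsg(garbage, noopHandler, delivered)
		if action != msgActionNakDelay || delay != 5*time.Second {
			t.Fatalf("delivery %d: got (%v, %v), want delayed NAK", delivered, action, delay)
		}
	}
	action, _ := processBusMsg(garbage, noopHandler, poisonUnmarshalThreshold+1)
	if action != msgActionTerm {
		t.Fatalf("got %v, want Term after threshold", action)
	}
}
```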

| Control | Default | Why it exists |
| --- | --- | --- |
| `defaultAckWait` | 10m | Gives worker handlers a large timeout window before redelivery |
| `maxJSRedeliveries` | 100 | Caps redelivery loops so poison messages cannot block forever |
| Poison threshold | 3 failed unmarshal deliveries | Moves from delayed retries to `Term()` when the payload looks permanently corrupt |
| Corrupt payload delay | 5s | Prevents a tight loop on transient decode anomalies |
| Interface extraction | `errors.As` + `RetryDelay()` | Keeps retry intent usable across package boundaries |
| Delay clamp | negative → 0 | Blocks invalid negative durations from entering nack calls |

Limitations and tradeoffs

Delay tuning debt

Bad retry delays can be as harmful as no retries. Too short causes churn; too long hurts latency.
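
If you need a starting shape for those delays, capped exponential backoff with full jitter is a common choice. A sketch; the constants are illustrative starting points, not Cordum defaults:

```go
// backoffDelay computes a full-jitter delay in [0, min(base<<attempt, maxDelay)],
// suitable as the delay argument to RetryAfter.
func backoffDelay(attempt int) time.Duration {
	const (
		base     = 500 * time.Millisecond
		maxDelay = 2 * time.Minute
	)
	d := base << attempt       // 0.5s, 1s, 2s, 4s, ...
	if d <= 0 || d > maxDelay { // <= 0 guards shift overflow on large attempts
		d = maxDelay
	}
	return time.Duration(rand.Int63n(int64(d) + 1))
}
```

A scheduler path would then return `RetryAfter(err, backoffDelay(attempt))`, keeping tuning in one place instead of scattering delays across handlers.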

Contract fragility

The interface contract is robust but still optional by convention. Contract tests are mandatory if you want this to survive refactors.

Bounded retries

Max-delivery caps protect throughput, but they also mean some transient issues will hit terminal handling sooner than teams expect.

Frequently Asked Questions

Why not use concrete error types everywhere?
Concrete type coupling across packages tends to break during refactors. The interface contract (`RetryDelay()`) keeps mapping stable with less coupling.
Is `Nak()` enough for transient failures?
Usually no. Immediate retries can hammer shared dependencies. Delay-aware NAKs reduce churn and protect downstream systems.
How do we detect broken retry intent fast?
Add contract tests that force scheduler `RetryAfter` errors through `processBusMsg` and assert `msgActionNakDelay` with the expected duration.
What if we acknowledge non-retryable errors by design?
That is valid. Just make sure classification is explicit and covered by tests so transient failures do not silently fall into the non-retryable bucket.

Next step

Add one contract test this week that injects a scheduler `RetryAfter` error and asserts JetStream action is delayed NAK with the same duration. That single test prevents expensive silent regressions.
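
A sketch of that test, using the names from the examples above (`scheduler.RetryAfter`, `processBusMsg`, `msgActionNakDelay`); adjust imports and packages to your layout:

```go
// The contract under test: a scheduler RetryAfter error must come out
// of processBusMsg as a delayed NAK with the same duration.
func TestRetryIntentReachesNakWithDelay(t *testing.T) {
	want := 45 * time.Second
	failing := func(*pb.BusPacket) error {
		return scheduler.RetryAfter(errors.New("downstream 503"), want)
	}

	packet, err := proto.Marshal(&pb.BusPacket{}) // valid, decodable payload
	if err != nil {
		t.Fatal(err)
	}
	action, delay := processBusMsg(packet, failing, 1)
	if action != msgActionNakDelay || delay != want {
		t.Fatalf("got (%v, %v), want (msgActionNakDelay, %v)", action, delay, want)
	}
}
```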

Sources