Deep Dive

AI Agent Retry Intent Propagation

Retries fail quietly when intent gets lost between layers. In agent control planes, that usually appears as mystery ACK behavior.

11 min read · Mar 2026
TL;DR
- Cordum defines retryable errors in more than one package, but transport mapping still works via the `RetryDelay()` interface contract.
- `processBusMsg` maps retryable errors with a positive delay to `msgActionNakDelay`, which calls JetStream `NakWithDelay(delay)`.
- If a code path converts retry errors to plain strings, retry intent is lost and messages can be ACKed instead of retried.
- Add cross-package contract tests so retry semantics stay stable during refactors.
Failure mode

A transient error becomes non-retryable if retry metadata is stripped during error wrapping or conversion.

Current behavior

Retry intent is inferred by interface (`RetryDelay()`) rather than concrete error type coupling.

Operational payoff

You can evolve scheduler internals without breaking transport retries, as long as the contract remains intact.

Scope

This guide covers retry intent handoff from scheduler logic to bus transport behavior in JetStream-backed paths.

The production problem

A transient error is useless if the transport cannot tell it is transient.

In control planes, retries often cross three boundaries: scheduler logic, bus abstraction, and broker acknowledgment API.

If retry intent is dropped in that chain, a message can be ACKed and effectively lost from retry flow. No panic. No crash. Just wrong behavior.

This is why retry intent must be treated as a contract, not an implementation detail.

What top results cover and miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| RFC 9110 `Retry-After` | Defines retry timing semantics for HTTP follow-up requests using delay-seconds or an HTTP-date. | Does not address internal retry intent propagation across scheduler, message bus, and broker ack semantics. |
| gRPC retry guide | Covers retry policy knobs, jitter, throttling, and transparent retry behavior. | Focuses on RPC clients, not broker-driven redelivery paths with explicit ack or nak decisions. |
| NATS JetStream consumers | Explains `AckWait`, `MaxDeliver`, and that a NAK redelivers immediately unless `NakWithDelay` is used. | Does not describe how application error contracts should carry delay intent into `NakWithDelay`. |

Cordum runtime mechanics

| Boundary | Current behavior | Why it matters |
| --- | --- | --- |
| Scheduler error classification | Scheduler returns `RetryAfter(err, delay)` for transient paths like lock contention or store/publish failures. | Retry intent is declared where the failure is understood best. |
| Cross-package extraction | Bus helper uses `errors.As` against an interface exposing `RetryDelay() time.Duration`. | Retry metadata survives package boundaries without sharing a concrete error type. |
| Message action mapping | `processBusMsg` maps retryable delay > 0 to `msgActionNakDelay`; delay == 0 maps to immediate `Nak`. | Delay semantics become explicit redelivery behavior. |
| JetStream execution | `msgActionNakDelay` path calls `msg.NakWithDelay(delay)`. | Broker redelivery pacing follows handler intent. |
| Protocol contract | Cordum protocol docs state retryable handler errors are translated into NAK-with-delay. | Behavior is not just code convention; it is documented platform semantics. |

Code path and failure mode

Scheduler declares retry intent

core/controlplane/scheduler/retry.go

```go
// core/controlplane/scheduler/retry.go (excerpt)
// Error() and Unwrap() are included so the type satisfies the error
// interface and keeps the wrapped chain visible to errors.Is / errors.As.
type retryableError struct {
	err   error
	delay time.Duration
}

func (e *retryableError) Error() string             { return e.err.Error() }
func (e *retryableError) Unwrap() error             { return e.err }
func (e *retryableError) RetryDelay() time.Duration { return e.delay }

func RetryAfter(err error, delay time.Duration) error {
	if err == nil {
		err = errors.New("retry requested")
	}
	if delay < 0 {
		delay = 0
	}
	return &retryableError{err: err, delay: delay}
}
```

Bus extracts delay by interface contract

core/infra/bus/retry.go

```go
// core/infra/bus/retry.go (excerpt)
func RetryDelay(err error) (time.Duration, bool) {
	type retryDelayProvider interface {
		RetryDelay() time.Duration
	}
	var rd retryDelayProvider
	if errors.As(err, &rd) {
		delay := rd.RetryDelay()
		if delay < 0 {
			delay = 0
		}
		return delay, true
	}
	return 0, false
}
```

Transport maps to `NakWithDelay`

core/infra/bus/nats.go

```go
// core/infra/bus/nats.go (excerpt)
if err := handler(&packet); err != nil {
	if delay, ok := RetryDelay(err); ok {
		if delay > 0 {
			return msgActionNakDelay, delay
		}
		return msgActionNak, 0
	}
	return msgActionAck, 0
}

// later, in the action switch:
case msgActionNakDelay:
	_ = msg.NakWithDelay(delay)
```

Common regression pattern

anti-pattern.go

```go
// Intent-losing anti-pattern (do not do this)
if err := handler(&packet); err != nil {
	// %v stringifies and drops the Unwrap chain plus the RetryDelay() contract
	return fmt.Errorf("handler failed: %v", err)
}
```

Note the subtlety: `fmt.Errorf("...: %w", err)` preserves unwrap semantics, while `fmt.Errorf("...: %v", err)` does not.

Validation runbook

Validate retry-intent continuity as a contract test, not only as unit behavior inside one package.

retry-intent-runbook.sh

```bash
# 1) Contract test across package boundaries
# - create scheduler RetryAfter error
# - pass into bus RetryDelay
# - expect delay, ok=true

# 2) Transport mapping test
# - handler returns retryable error with 1500ms
# - expect processBusMsg => msgActionNakDelay + 1500ms

# 3) Wrapper safety test
# - wrap with fmt.Errorf("...: %w", err) => delay preserved
# - wrap with fmt.Errorf("...: %v", err) => delay lost

# 4) Runtime observability
# - watch logs for "bus: nak-with-delay"
# - sample delays and confirm they match scheduler policy classes

# 5) Refactor gate
# - block merges that alter RetryDelay contract without updating protocol docs
```

Limitations and tradeoffs

| Approach | Upside | Downside |
| --- | --- | --- |
| Interface contract (`RetryDelay()`) | Loose coupling across packages and transports. | Contract can be broken accidentally by unsafe wrapping or conversion. |
| Shared concrete retry error type | Stronger compile-time coupling and simpler detection. | Higher package coupling and harder long-term modularity. |
| String-based retry parsing | Fast to prototype. | Brittle, locale-sensitive, and unsafe for production semantics. |
- Interface contracts reduce coupling, but they demand strict wrapper hygiene in all intermediate layers.
- I found package-level tests for retry behavior, but no explicit cross-package contract test that scheduler retry errors survive bus extraction paths.
- Transport logs expose `nak-with-delay`, but they do not prove policy correctness unless you compare delays against scheduler intent classes.

Next step

Implement this next:

1. Add a cross-package test: a scheduler `RetryAfter` error must be detected by bus `RetryDelay`.
2. Add a regression test for wrapper behavior (`%w` allowed, `%v` loses intent).
3. Export per-delay-class metrics from the scheduler and compare with observed `nak-with-delay` logs.
4. Treat retry contract changes as protocol changes and update docs in the same pull request.

Continue with AI Agent Run Lock Busy Retries and AI Agent Safety Unavailable Retry Strategy.

Retry intent is part of your protocol

If a transient failure reaches the broker as a plain error, you did not fail at retry policy. You failed at error contract design.