## The production problem
A transient error is useless if the transport cannot tell it is transient.
In control planes, retries often cross three boundaries: scheduler logic, bus abstraction, and broker acknowledgment API.
If retry intent is dropped anywhere in that chain, a message can be ACKed and silently removed from the retry flow. No panic. No crash. Just wrong behavior.
This is why retry intent must be treated as a contract, not an implementation detail.
## What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| RFC 9110 Retry-After | Defines retry timing semantics for HTTP follow-up requests using delay-seconds or HTTP-date. | Does not address internal retry intent propagation across scheduler, message bus, and broker ack semantics. |
| gRPC Retry guide | Covers retry policy knobs, jitter, throttling, and transparent retry behavior. | Focuses on RPC clients, not broker-driven redelivery paths with explicit ack or nak decisions. |
| NATS JetStream Consumers | Explains AckWait, MaxDeliver, and that `nak` is immediate unless `nakWithDelay` is used. | Does not describe how application error contracts should carry delay intent into `nakWithDelay`. |
## Cordum runtime mechanics
| Boundary | Current behavior | Why it matters |
|---|---|---|
| Scheduler error classification | Scheduler returns `RetryAfter(err, delay)` for transient paths like lock contention or store/publish failures. | Retry intent is declared where the failure is understood best. |
| Cross-package extraction | Bus helper uses `errors.As` against an interface exposing `RetryDelay() time.Duration`. | Retry metadata survives package boundaries without sharing a concrete error type. |
| Message action mapping | `processBusMsg` maps retryable delay > 0 to `msgActionNakDelay`; delay == 0 maps to immediate `Nak`. | Delay semantics become explicit redelivery behavior. |
| JetStream execution | `msgActionNakDelay` path calls `msg.NakWithDelay(delay)`. | Broker redelivery pacing follows handler intent. |
| Protocol contract | Cordum protocol docs state retryable handler errors are translated into NAK-with-delay. | Behavior is not just code convention; it is documented platform semantics. |
## Code path and failure mode
### Scheduler declares retry intent
```go
// core/controlplane/scheduler/retry.go (excerpt)
type retryableError struct {
	err   error
	delay time.Duration
}

func (e *retryableError) RetryDelay() time.Duration { return e.delay }

func RetryAfter(err error, delay time.Duration) error {
	if err == nil {
		err = errors.New("retry requested")
	}
	if delay < 0 {
		delay = 0
	}
	return &retryableError{err: err, delay: delay}
}
```

### Bus extracts delay by interface contract
```go
// core/infra/bus/retry.go (excerpt)
func RetryDelay(err error) (time.Duration, bool) {
	type retryDelayProvider interface {
		RetryDelay() time.Duration
	}
	var rd retryDelayProvider
	if errors.As(err, &rd) {
		delay := rd.RetryDelay()
		if delay < 0 {
			delay = 0
		}
		return delay, true
	}
	return 0, false
}
```

### Transport maps to `NakWithDelay`
```go
// core/infra/bus/nats.go (excerpt)
if err := handler(&packet); err != nil {
	if delay, ok := RetryDelay(err); ok {
		if delay > 0 {
			return msgActionNakDelay, delay
		}
		return msgActionNak, 0
	}
	return msgActionAck, 0
}

// ... later, when the chosen action is executed:
case msgActionNakDelay:
	_ = msg.NakWithDelay(delay)
```

### Common regression pattern
```go
// Intent-losing anti-pattern (do not do this)
if err := handler(&packet); err != nil {
	// %v stringifies and drops Unwrap chain + RetryDelay() contract
	return fmt.Errorf("handler failed: %v", err)
}
```

Note the subtlety: `fmt.Errorf("...: %w", err)` preserves unwrap semantics, while `fmt.Errorf("...: %v", err)` does not.
## Validation runbook
Validate retry-intent continuity as a contract test, not only as unit behavior inside one package.
```sh
# 1) Contract test across package boundaries
#    - create scheduler RetryAfter error
#    - pass into bus RetryDelay
#    - expect delay, ok=true
# 2) Transport mapping test
#    - handler returns retryable error with 1500ms
#    - expect processBusMsg => msgActionNakDelay + 1500ms
# 3) Wrapper safety test
#    - wrap with fmt.Errorf("...: %w", err) => delay preserved
#    - wrap with fmt.Errorf("...: %v", err) => delay lost
# 4) Runtime observability
#    - watch logs for "bus: nak-with-delay"
#    - sample delays and confirm they match scheduler policy classes
# 5) Refactor gate
#    - block merges that alter RetryDelay contract without updating protocol docs
```

## Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Interface contract (`RetryDelay()`) | Loose coupling across packages and transports. | Contract can be broken accidentally by unsafe wrapping or conversion. |
| Shared concrete retry error type | Stronger compile-time coupling and simpler detection. | Higher package coupling and harder long-term modularity. |
| String-based retry parsing | Fast to prototype. | Brittle, locale-sensitive, and unsafe for production semantics. |
- Interface contracts reduce coupling, but they demand strict wrapper hygiene in all intermediate layers.
- I found package-level tests for retry behavior, but no explicit cross-package contract test that scheduler retry errors survive bus extraction paths.
- Transport logs expose `nak-with-delay`, but they do not prove policy correctness unless you compare delays against scheduler intent classes.
## Next step
Implement this next:
1. Add a cross-package test: scheduler `RetryAfter` error must be detected by bus `RetryDelay`.
2. Add a regression test for wrapper behavior (`%w` allowed, `%v` loses intent).
3. Export per-delay-class metrics from scheduler and compare with observed `nak-with-delay` logs.
4. Treat retry contract changes as protocol changes and update docs in the same pull request.
Continue with AI Agent Run Lock Busy Retries and AI Agent Safety Unavailable Retry Strategy.