TL;DR
- Retries fail quietly when delay intent is dropped during error wrapping or refactors.
- Cordum maps retry intent by interface (`RetryDelay()`), not by concrete error type.
- `processBusMsg` routes retryable errors with delay > 0 to `msg.NakWithDelay(delay)`.
- Treat retry mapping as a contract with tests, not as helper-function trivia.
The production problem
Retry behavior breaks most often at boundaries, not inside retry helpers. The scheduler classifies a transient fault, but the bus handler no longer sees retry metadata and ACKs the message.
That bug is expensive because it looks like success from the queue's perspective. No redelivery, no escalation, and no quick signal that work was skipped.
Cordum avoids this by carrying delay intent through an interface contract (`RetryDelay()`), then converting that into explicit JetStream action (`NakWithDelay`).
What top-ranking sources cover vs. miss
| Source | What it covers | What it misses |
|---|---|---|
| RFC 9110 (Retry-After) | Defines retry timing semantics as HTTP-date or delay-seconds for follow-up requests. | Does not explain internal propagation of retry delay across scheduler, queue handler, and broker acks. |
| gRPC Retry Guide | Client retry model, commit points, and retry policy controls like backoff and throttling. | No guidance for translating internal retry intent into JetStream `NakWithDelay` behavior. |
| NATS JetStream Consumers | Ack types, `AckWait`, `MaxDeliver`, `BackOff`, and explicit note that `nak` is immediate unless `nakWithDelay` is used. | No application-level contract for carrying retry delay from business error classification into broker nack calls. |
Retry intent contract
The control-plane contract is simple: classify transient failures once, preserve the delay value, and map it to a queue action without reinterpretation.
| Stage | Mechanism | Risk if broken |
|---|---|---|
| Scheduler classification | `RetryAfter(err, delay)` in scheduler paths | Transient faults become terminal ACKs |
| Cross-package extraction | `errors.As` to interface with `RetryDelay() time.Duration` | Type-coupled refactors break delay extraction |
| Bus action mapping | `processBusMsg` maps retryable delay > 0 to `msgActionNakDelay` | Delay semantics collapse into immediate retries |
| JetStream ack path | `msg.NakWithDelay(delay)` | Redelivery pacing drifts from intended backoff |
| Corrupt payload handling | `poisonUnmarshalThreshold=3`, then `Term()` | Poison messages can starve consumers |
Failure modes
| Fault | Observed symptom | Operational effect |
|---|---|---|
| Error converted to plain string | `RetryDelay()` no longer discoverable by `errors.As` | Retryable fault is misread as terminal and ACKed by mistake |
| Negative delay emitted | Delay is clamped to zero in retry helpers | Immediate NAK loop can spike redeliveries |
| No max delivery cap | Poison packets keep returning | Pending window gets crowded; useful work slows |
| NAK used without delay for transient faults | Tight retry loop under shared outage | Queue churn and noisy logs |
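The first row is cheap to reproduce. A self-contained sketch, with a hypothetical `retryable` type standing in for the scheduler's error, showing how stringifying an error severs the `errors.As` chain:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryable is a hypothetical stand-in for the scheduler's retryable error type.
type retryable struct{ delay time.Duration }

func (r *retryable) Error() string             { return "transient fault" }
func (r *retryable) RetryDelay() time.Duration { return r.delay }

func main() {
	base := &retryable{delay: 30 * time.Second}

	wrapped := fmt.Errorf("handler failed: %w", base) // %w keeps the error chain intact
	flattened := errors.New(wrapped.Error())          // stringified: the chain is gone

	var rd interface{ RetryDelay() time.Duration }
	fmt.Println(errors.As(wrapped, &rd))   // true: delay intent survives wrapping
	fmt.Println(errors.As(flattened, &rd)) // false: the handler path ACKs a transient fault
}
```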
Implementation examples
```go
// core/controlplane/scheduler/retry.go

// RetryAfter wraps err with an explicit retry delay. A nil err still
// produces a retryable error; negative delays are clamped to zero.
func RetryAfter(err error, delay time.Duration) error {
	if err == nil {
		err = errors.New("retry requested")
	}
	if delay < 0 {
		delay = 0
	}
	return &retryableError{err: err, delay: delay}
}
```
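The snippet returns a `retryableError` that is not shown above. A minimal sketch of the shape it needs so the bus-side extraction works; this is an assumed shape, not necessarily the repo's exact type:

```go
// Assumed shape of retryableError; the real type lives in the scheduler package.
type retryableError struct {
	err   error
	delay time.Duration
}

func (e *retryableError) Error() string             { return e.err.Error() }
func (e *retryableError) Unwrap() error             { return e.err }   // keeps errors.As traversal working
func (e *retryableError) RetryDelay() time.Duration { return e.delay } // satisfies the bus-side interface
```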
```go
// core/infra/bus/retry.go

// RetryDelay extracts retry intent from any error in the chain that
// exposes RetryDelay(), regardless of concrete type or origin package.
func RetryDelay(err error) (time.Duration, bool) {
	type retryDelayProvider interface {
		RetryDelay() time.Duration
	}
	var rd retryDelayProvider
	if errors.As(err, &rd) {
		delay := rd.RetryDelay()
		if delay < 0 {
			delay = 0
		}
		return delay, true
	}
	return 0, false
}
```
```go
// core/infra/bus/nats.go
const poisonUnmarshalThreshold uint64 = 3

// processBusMsg decodes and handles one bus packet, returning the queue
// action plus, for delayed NAKs, the delay to pass to NakWithDelay.
func processBusMsg(data []byte, handler func(*pb.BusPacket) error, numDelivered uint64) (msgAction, time.Duration) {
	var packet pb.BusPacket
	if err := proto.Unmarshal(data, &packet); err != nil {
		// Corrupt payload: delayed retries up to the poison threshold, then Term.
		if numDelivered > poisonUnmarshalThreshold {
			return msgActionTerm, 0
		}
		return msgActionNakDelay, 5 * time.Second
	}
	if err := handler(&packet); err != nil {
		if delay, ok := RetryDelay(err); ok {
			if delay > 0 {
				return msgActionNakDelay, delay
			}
			return msgActionNak, 0 // retryable with no delay: immediate redelivery
		}
		return msgActionAck, 0 // non-retryable error: ACK so the queue stops redelivering
	}
	return msgActionAck, 0
}
```
```go
// Later in subscribe callback:
// case msgActionNakDelay:
//     _ = msg.NakWithDelay(delay)
```

To verify the wiring in a checkout:

```bash
# 1) Verify constants in bus runtime
rg --line-number "maxJSRedeliveries|defaultAckWait|poisonUnmarshalThreshold" core/infra/bus/nats.go

# 2) Verify retry contract exists in both scheduler and bus helpers
rg --line-number "func RetryAfter\(|RetryDelay\(\) time.Duration" core/controlplane/scheduler/retry.go core/infra/bus/retry.go

# 3) Verify mapping from RetryDelay -> NakWithDelay
rg --line-number "processBusMsg|RetryDelay\(|msgActionNakDelay|NakWithDelay" core/infra/bus/nats.go

# 4) Verify docs claim matches runtime behavior
rg --line-number "retry after|NAK-with-delay" docs/AGENT_PROTOCOL.md
```
Operational defaults
`defaultAckWait` is 10 minutes and max redeliveries are capped at 100 in JetStream mode. That gives long-running handlers room to finish while still bounding poison retries.
Unmarshal failures are NAKed with a 5-second delay through the third delivery, then terminated with `Term()`. It is a practical guard against queue starvation caused by corrupt-payload loops.
| Control | Default | Why it exists |
|---|---|---|
| `defaultAckWait` | 10m | Gives worker handlers a large timeout window before redelivery |
| `maxJSRedeliveries` | 100 | Caps redelivery loops so poison messages cannot block forever |
| Poison threshold | 3 failed unmarshal deliveries | Moves from delayed retries to `Term()` when payload looks permanently corrupt |
| Corrupt payload delay | 5s | Prevents tight loop on transient decode anomalies |
| Interface extraction | `errors.As` + `RetryDelay()` | Keeps retry intent usable across package boundaries |
| Delay clamp | negative -> 0 | Blocks invalid negative durations from entering nack calls |
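A sketch of wiring those defaults and the action mapping into a JetStream subscription with the `nats.go` client. It assumes `processBusMsg` and the `msgAction` constants from the snippets above are in scope in package bus; the subject name and `handlePacket` are placeholders:

```go
// Sketch only: lives alongside the nats.go snippet above.
func subscribeBus(js nats.JetStreamContext, handlePacket func(*pb.BusPacket) error) (*nats.Subscription, error) {
	return js.Subscribe("bus.packets", func(msg *nats.Msg) {
		md, _ := msg.Metadata() // JetStream metadata carries NumDelivered
		action, delay := processBusMsg(msg.Data, handlePacket, md.NumDelivered)
		switch action {
		case msgActionAck:
			_ = msg.Ack()
		case msgActionNak:
			_ = msg.Nak() // immediate redelivery
		case msgActionNakDelay:
			_ = msg.NakWithDelay(delay) // paced redelivery
		case msgActionTerm:
			_ = msg.Term() // stop redelivering a poison payload
		}
	},
		nats.ManualAck(),                // the switch above owns every ack decision
		nats.AckWait(10*time.Minute),    // defaultAckWait
		nats.MaxDeliver(100),            // maxJSRedeliveries
	)
}
```

`nats.ManualAck()` matters here: with automatic acks, the action mapping would never take effect.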
Limitations and tradeoffs
Bad retry delays can be as harmful as no retries. Too short causes churn; too long hurts latency.
The interface contract is robust but still optional by convention. Contract tests are mandatory if you want this to survive refactors.
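A minimal contract-test sketch along those lines, assuming the two helper packages shown earlier; the module path is hypothetical:

```go
package bus_test

import (
	"errors"
	"fmt"
	"testing"
	"time"

	"example.com/cordum/core/controlplane/scheduler" // hypothetical module path
	"example.com/cordum/core/infra/bus"              // hypothetical module path
)

// The contract: delay intent set by the scheduler must survive %w wrapping
// across package boundaries and come back out of bus.RetryDelay unchanged.
func TestRetryDelaySurvivesWrapping(t *testing.T) {
	err := scheduler.RetryAfter(errors.New("db timeout"), 30*time.Second)
	wrapped := fmt.Errorf("job enqueue: %w", err)

	delay, ok := bus.RetryDelay(wrapped)
	if !ok || delay != 30*time.Second {
		t.Fatalf("retry intent lost: ok=%v delay=%v", ok, delay)
	}
}
```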
Max-delivery caps protect throughput, but they also mean some transient issues will hit terminal handling sooner than teams expect.