TL;DR
- Retries fail quietly when delay intent is dropped during error wrapping or refactors.
- Cordum maps retry intent by interface (`RetryDelay()`), not by concrete error type.
- `processBusMsg` routes retryable errors with delay > 0 to `msg.NakWithDelay(delay)`.
- Treat retry mapping as a contract with tests, not as helper-function trivia.
The production problem
Retry behavior breaks most often at boundaries, not inside retry helpers. The scheduler classifies a transient fault, but the bus handler no longer sees retry metadata and ACKs the message.
That bug is expensive because it looks like success from the queue's perspective. No redelivery, no escalation, and no quick signal that work was skipped.
Cordum avoids this by carrying delay intent through an interface contract (`RetryDelay()`), then converting that into explicit JetStream action (`NakWithDelay`).
What top-ranking sources cover vs. miss
| Source | What it covers | What it misses |
|---|---|---|
| RFC 9110 (Retry-After) | Defines retry timing semantics as HTTP-date or delay-seconds for follow-up requests. | Does not explain internal propagation of retry delay across scheduler, queue handler, and broker acks. |
| gRPC Retry Guide | Client retry model, commit points, and retry policy controls like backoff and throttling. | No guidance for translating internal retry intent into JetStream `NakWithDelay` behavior. |
| NATS JetStream Consumers | Ack types, `AckWait`, `MaxDeliver`, `BackOff`, and explicit note that `nak` is immediate unless `nakWithDelay` is used. | No application-level contract for carrying retry delay from business error classification into broker nack calls. |
Retry intent contract
The control-plane contract is simple: classify transient failures once, preserve the delay value, and map it to a queue action without reinterpretation.
| Stage | Mechanism | Risk if broken |
|---|---|---|
| Scheduler classification | `RetryAfter(err, delay)` in scheduler paths | Transient faults become terminal ACKs |
| Cross-package extraction | `errors.As` to interface with `RetryDelay() time.Duration` | Type-coupled refactors break delay extraction |
| Bus action mapping | `processBusMsg` maps retryable delay > 0 to `msgActionNakDelay` | Delay semantics collapse into immediate retries |
| JetStream ack path | `msg.NakWithDelay(delay)` | Redelivery pacing drifts from intended backoff |
| Corrupt payload handling | `poisonUnmarshalThreshold=3`, then `Term()` | Poison messages can starve consumers |
Failure modes
| Fault | Observed symptom | Operational effect |
|---|---|---|
| Error converted to plain string | `RetryDelay()` no longer discoverable by `errors.As` | Retryable fault is misread as terminal and ACKed by mistake |
| Negative delay emitted | Delay is clamped to zero in retry helpers | Immediate NAK loop can spike redeliveries |
| No max delivery cap | Poison packets keep returning | Pending window gets crowded; useful work slows |
| NAK used without delay for transient faults | Tight retry loop under shared outage | Queue churn and noisy logs |
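The first row is cheap to reproduce. A self-contained sketch, with a hypothetical `retryable` type standing in for the scheduler's error, showing how stringifying an error severs the `errors.As` chain:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryable is a hypothetical stand-in for the scheduler's retryable error type.
type retryable struct{ delay time.Duration }

func (r *retryable) Error() string             { return "transient fault" }
func (r *retryable) RetryDelay() time.Duration { return r.delay }

func main() {
	base := &retryable{delay: 30 * time.Second}

	wrapped := fmt.Errorf("handler failed: %w", base) // %w keeps the error chain intact
	flattened := errors.New(wrapped.Error())          // stringified: the chain is gone

	var rd interface{ RetryDelay() time.Duration }
	fmt.Println(errors.As(wrapped, &rd))   // true: delay intent survives wrapping
	fmt.Println(errors.As(flattened, &rd)) // false: the handler path ACKs a transient fault
}
```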
Implementation examples
```go
// core/controlplane/scheduler/retry.go

// RetryAfter wraps err with an explicit retry delay. A nil err still
// produces a retryable error; negative delays are clamped to zero.
func RetryAfter(err error, delay time.Duration) error {
	if err == nil {
		err = errors.New("retry requested")
	}
	if delay < 0 {
		delay = 0
	}
	return &retryableError{err: err, delay: delay}
}
```
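The snippet returns a `retryableError` that is not shown above. A minimal sketch of the shape it needs so the bus-side extraction works; this is an assumed shape, not necessarily the repo's exact type:

```go
// Assumed shape of retryableError; the real type lives in the scheduler package.
type retryableError struct {
	err   error
	delay time.Duration
}

func (e *retryableError) Error() string             { return e.err.Error() }
func (e *retryableError) Unwrap() error             { return e.err }   // keeps errors.As traversal working
func (e *retryableError) RetryDelay() time.Duration { return e.delay } // satisfies the bus-side interface
```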
```go
// core/infra/bus/retry.go

// RetryDelay extracts retry intent from any error in the chain that
// exposes RetryDelay(), regardless of concrete type or origin package.
func RetryDelay(err error) (time.Duration, bool) {
	type retryDelayProvider interface {
		RetryDelay() time.Duration
	}
	var rd retryDelayProvider
	if errors.As(err, &rd) {
		delay := rd.RetryDelay()
		if delay < 0 {
			delay = 0
		}
		return delay, true
	}
	return 0, false
}
```
```go
// core/infra/bus/nats.go
const poisonUnmarshalThreshold uint64 = 3

// processBusMsg decodes and handles one bus packet, returning the queue
// action plus, for delayed NAKs, the delay to pass to NakWithDelay.
func processBusMsg(data []byte, handler func(*pb.BusPacket) error, numDelivered uint64) (msgAction, time.Duration) {
	var packet pb.BusPacket
	if err := proto.Unmarshal(data, &packet); err != nil {
		// Corrupt payload: delayed retries up to the poison threshold, then Term.
		if numDelivered > poisonUnmarshalThreshold {
			return msgActionTerm, 0
		}
		return msgActionNakDelay, 5 * time.Second
	}
	if err := handler(&packet); err != nil {
		if delay, ok := RetryDelay(err); ok {
			if delay > 0 {
				return msgActionNakDelay, delay
			}
			return msgActionNak, 0 // retryable with no delay: immediate redelivery
		}
		return msgActionAck, 0 // non-retryable error: ACK so the queue stops redelivering
	}
	return msgActionAck, 0
}
```
```go
// Later in subscribe callback:
// case msgActionNakDelay:
//     _ = msg.NakWithDelay(delay)
```

To verify the wiring in a checkout:

```bash
# 1) Verify constants in bus runtime
rg --line-number "maxJSRedeliveries|defaultAckWait|poisonUnmarshalThreshold" core/infra/bus/nats.go

# 2) Verify retry contract exists in both scheduler and bus helpers
rg --line-number "func RetryAfter\(|RetryDelay\(\) time.Duration" core/controlplane/scheduler/retry.go core/infra/bus/retry.go

# 3) Verify mapping from RetryDelay -> NakWithDelay
rg --line-number "processBusMsg|RetryDelay\(|msgActionNakDelay|NakWithDelay" core/infra/bus/nats.go

# 4) Verify docs claim matches runtime behavior
rg --line-number "retry after|NAK-with-delay" docs/AGENT_PROTOCOL.md
```
Operational defaults
`defaultAckWait` is 10 minutes and max redeliveries are capped at 100 in JetStream mode. That gives long-running handlers room to finish while still bounding poison retries.
Unmarshal failures are NAKed with a 5-second delay through the third delivery, then terminated with `Term()`. It is a practical guard against queue starvation caused by corrupt-payload loops.
| Control | Default | Why it exists |
|---|---|---|
| `defaultAckWait` | 10m | Gives worker handlers a large timeout window before redelivery |
| `maxJSRedeliveries` | 100 | Caps redelivery loops so poison messages cannot block forever |
| Poison threshold | 3 failed unmarshal deliveries | Moves from delayed retries to `Term()` when payload looks permanently corrupt |
| Corrupt payload delay | 5s | Prevents tight loop on transient decode anomalies |
| Interface extraction | `errors.As` + `RetryDelay()` | Keeps retry intent usable across package boundaries |
| Delay clamp | negative -> 0 | Blocks invalid negative durations from entering nack calls |
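A sketch of wiring those defaults and the action mapping into a JetStream subscription with the `nats.go` client. It assumes `processBusMsg` and the `msgAction` constants from the snippets above are in scope in package bus; the subject name and `handlePacket` are placeholders:

```go
// Sketch only: lives alongside the nats.go snippet above.
func subscribeBus(js nats.JetStreamContext, handlePacket func(*pb.BusPacket) error) (*nats.Subscription, error) {
	return js.Subscribe("bus.packets", func(msg *nats.Msg) {
		md, _ := msg.Metadata() // JetStream metadata carries NumDelivered
		action, delay := processBusMsg(msg.Data, handlePacket, md.NumDelivered)
		switch action {
		case msgActionAck:
			_ = msg.Ack()
		case msgActionNak:
			_ = msg.Nak() // immediate redelivery
		case msgActionNakDelay:
			_ = msg.NakWithDelay(delay) // paced redelivery
		case msgActionTerm:
			_ = msg.Term() // stop redelivering a poison payload
		}
	},
		nats.ManualAck(),                // the switch above owns every ack decision
		nats.AckWait(10*time.Minute),    // defaultAckWait
		nats.MaxDeliver(100),            // maxJSRedeliveries
	)
}
```

`nats.ManualAck()` matters here: with automatic acks, the action mapping would never take effect.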
Limitations and tradeoffs
Bad retry delays can be as harmful as no retries. Too short causes churn; too long hurts latency.
The interface contract is robust but still optional by convention. Contract tests are mandatory if you want this to survive refactors.
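A minimal contract-test sketch along those lines, assuming the two helper packages shown earlier; the module path is hypothetical:

```go
package bus_test

import (
	"errors"
	"fmt"
	"testing"
	"time"

	"example.com/cordum/core/controlplane/scheduler" // hypothetical module path
	"example.com/cordum/core/infra/bus"              // hypothetical module path
)

// The contract: delay intent set by the scheduler must survive %w wrapping
// across package boundaries and come back out of bus.RetryDelay unchanged.
func TestRetryDelaySurvivesWrapping(t *testing.T) {
	err := scheduler.RetryAfter(errors.New("db timeout"), 30*time.Second)
	wrapped := fmt.Errorf("job enqueue: %w", err)

	delay, ok := bus.RetryDelay(wrapped)
	if !ok || delay != 30*time.Second {
		t.Fatalf("retry intent lost: ok=%v delay=%v", ok, delay)
	}
}
```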
Max-delivery caps protect throughput, but they also mean some transient issues will hit terminal handling sooner than teams expect.