The production problem
During safety outages, every scheduler replica asks the same question: retry now or wait. If they all retry with the same cadence, they can hit the recovering dependency in synchronized waves.
This can delay recovery and inflate queue latency. The problem is not only service health. It is retry coordination under partial failure.
For governance systems, retry behavior also changes policy outcomes because fail-open mode can bypass checks while dependency health is uncertain.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| AWS Builders Library: backoff with jitter | Why retries amplify overload and why jitter spreads retry bursts. | No pre-dispatch governance decision path tied to outage retry behavior. |
| Google Cloud retry strategy | Retryability by status code, idempotency boundaries, anti-patterns, and jitter guidance. | No distributed scheduler semantics for safety-check outages and bypass labeling. |
| Resilience4j Retry docs | Retry config knobs: max attempts, wait duration, interval functions, exponential/randomized intervals. | No integration model for safety policy engines where unavailable checks change dispatch outcomes. |
The gap is control-plane semantics: how retry delay strategy intersects with fail-open policy and bypass observability.
Retry model comparison
| Model | Advantage | Risk |
|---|---|---|
| Fixed delay (5s) | Simple, predictable, easy to reason about | Synchronized retries across replicas can re-hit dependency in waves |
| Exponential + jitter | Reduces correlated retry spikes during partial recovery | Higher tail latency and more tuning complexity |
| Fail-open (no requeue wait) | Maintains throughput when safety dependency is down | Allows policy bypass and needs strict observability and time limits |
Cordum runtime behavior
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Safety unavailable requeue | Engine returns `RetryAfter(..., 5s)` when `SafetyUnavailable` and fail mode is closed. | Controls retry pressure but uses uniform delay across replicas. |
| Fail-open branch | When input fail mode is open, decision is forced to allow and bypass labels are attached. | Preserves flow during outage while marking governance bypass context. |
| Bypass telemetry | Fail-open increments `cordum:scheduler:input_fail_open_total` counter path. | Provides explicit signal for alerting and post-incident review. |
| Backoff helper | `backoffDelay` implements `min(base*2^attempt + jitter, max)` with jitter from crypto RNG. | Existing primitive can reduce synchronized retry behavior when integrated. |
| Backoff constants | Base `1s`, max `30s`, jitter max `500ms` in scheduler backoff utility. | Gives bounded adaptive delay model for non-safety-unavailable retry paths. |
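The bypass-telemetry row implies a labeled counter incremented on every fail-open decision. The scheduler's real metrics backend is not shown in the excerpts, so the sketch below uses a plain in-memory counter; `bypassCounter` and its label values are illustrative stand-ins for `cordum:scheduler:input_fail_open_total`.

```go
package main

import (
	"fmt"
	"sync"
)

// bypassCounter is an illustrative stand-in for the scheduler's
// input_fail_open_total metric: a labeled counter incremented on
// every fail-open bypass, safe for concurrent scheduler goroutines.
type bypassCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

func (c *bypassCounter) Inc(reason string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.counts == nil {
		c.counts = map[string]int{}
	}
	c.counts[reason]++
}

func main() {
	var c bypassCounter
	// Each fail-open decision both allows dispatch and leaves an audit trail.
	c.Inc("safety_unavailable")
	c.Inc("safety_unavailable")
	fmt.Println(c.counts["safety_unavailable"]) // 2
}
```

The important property is that the counter is incremented on the same code path that forces the allow decision, so alerting and post-incident review see every bypass.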
Implementation examples
Current fixed-delay unavailable path (Go)
```go
const safetyThrottleDelay = 5 * time.Second

case SafetyUnavailable:
	if e.isInputFailOpen() {
		record.Decision = SafetyAllow
		record.Reason = "fail-open: safety unavailable -- " + record.Reason
	} else {
		return RetryAfter(fmt.Errorf("safety unavailable: %s", record.Reason), safetyThrottleDelay)
	}
```

Existing jittered backoff utility (Go)
```go
const (
	backoffBase      = 1 * time.Second
	backoffMax       = 30 * time.Second
	backoffJitterMax = 500 * time.Millisecond
)

func backoffDelay(attempt int, base, maxDelay time.Duration) time.Duration {
	delay := base << attempt
	if delay > maxDelay || delay <= 0 { // cap growth and guard shift overflow
		delay = maxDelay
	}
	jitter := cryptoJitter(backoffJitterMax)
	total := delay + jitter
	if total > maxDelay {
		total = maxDelay
	}
	return total
}
```

Outage retry runbook baseline
```sh
# 1) Keep default fail-closed for production
export POLICY_CHECK_FAIL_MODE=closed

# 2) Alert on bypass activity if fail-open is ever enabled
#    metric: cordum_scheduler_input_fail_open_total

# 3) During outage simulation, watch retry pressure and queue growth
#    - Safety unavailable retry delay currently fixed at 5s
#    - Compare against jittered backoff experiment in staging

# 4) If enabling fail-open, set explicit expiry and rollback time
#    (for example: 30-minute emergency window)
```
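Wiring the existing backoff helper into the safety-unavailable path mostly requires threading an attempt count through the dispatch record. The sketch below is a hypothetical integration under stated assumptions: `dispatchRecord` and its `SafetyAttempts` field are invented for illustration (the real record type is not shown in the excerpts), and `math/rand` stands in for the crypto RNG the real helper uses.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Assumed constants mirroring the scheduler's backoff utility.
const (
	backoffBase      = 1 * time.Second
	backoffMax       = 30 * time.Second
	backoffJitterMax = 500 * time.Millisecond
)

// backoffDelay mirrors the documented helper:
// min(base*2^attempt + jitter, max).
func backoffDelay(attempt int, base, maxDelay time.Duration) time.Duration {
	delay := base << attempt
	if delay > maxDelay || delay <= 0 {
		delay = maxDelay
	}
	jitter := time.Duration(rand.Int63n(int64(backoffJitterMax)))
	total := delay + jitter
	if total > maxDelay {
		total = maxDelay
	}
	return total
}

// dispatchRecord is a hypothetical slice of the scheduler record;
// SafetyAttempts is an assumed field, not confirmed by the excerpts.
type dispatchRecord struct {
	SafetyAttempts int
	Reason         string
}

// requeueDelay shows the proposed change: the uniform 5s requeue becomes a
// per-record jittered delay that grows with consecutive safety failures.
func requeueDelay(record *dispatchRecord) time.Duration {
	record.SafetyAttempts++
	return backoffDelay(record.SafetyAttempts, backoffBase, backoffMax)
}

func main() {
	rec := &dispatchRecord{Reason: "kernel timeout"}
	for i := 0; i < 4; i++ {
		fmt.Printf("requeue %d after %v\n", i+1, requeueDelay(rec))
	}
}
```

Because the delay is derived per record rather than from a shared constant, replicas requeuing the same backlog no longer share a cadence, which is the synchronized-wave failure mode the fixed `5s` path creates.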
Limitations and tradeoffs
- Fixed delays are easier to audit, but can create correlated retry bursts.
- Jitter improves recovery smoothness, but complicates exact retry timing expectations.
- Fail-open protects throughput, but weakens strict pre-dispatch safety guarantees.
- Retry policy changes should be validated with queue, latency, and bypass metrics together.
If every replica retries at exactly `5s`, dependency recovery can look worse than the underlying incident because load re-arrives in lockstep.
Next step
Run a controlled outage experiment this week:
1. Simulate Safety Kernel unavailability and record queue lag under fixed `5s` retry.
2. Compare with jittered backoff in staging to measure retry burst flattening.
3. Keep fail-closed default for production and define formal approval for temporary fail-open.
4. Alert on `input_fail_open_total` and include bypass labels in incident timelines.
Continue with Safety Circuit Breaker Tuning and Fail-Open Alerting.