
AI Agent Safety Unavailable Retry Strategy

Retry policy decides whether outage pain is concentrated or amplified.

Deep Dive · 10 min read · Mar 2026
TL;DR
  - Safety service outages are retry-policy events, not only availability events.
  - Cordum currently requeues `SafetyUnavailable` with a fixed `5s` delay in fail-closed mode.
  - Cordum also has jittered exponential backoff primitives (`1s` base, `30s` max, `500ms` jitter) for other retry paths.
  - Fail-open mode avoids requeue waits but must be time-boxed and monitored as a governance bypass condition.
Fixed delay path

`safetyThrottleDelay = 5s` is used for safety unavailable requeue in fail-closed mode.

Jitter exists

Scheduler backoff utility already supports exponential backoff with random jitter.

Policy effect

`POLICY_CHECK_FAIL_MODE=open` trades strict pre-dispatch checks for continuity under outage.

Scope

This guide focuses on scheduler behavior when Safety Kernel checks are unavailable. It compares fixed delay and jittered backoff using current Cordum implementation details.

The production problem

During safety outages, every scheduler replica asks the same question: retry now or wait. If they all retry with the same cadence, they can hit the recovering dependency in synchronized waves.

This can delay recovery and inflate queue latency. The problem is not only service health. It is retry coordination under partial failure.

For governance systems, retry behavior also changes policy outcomes because fail-open mode can bypass checks while dependency health is uncertain.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| AWS Builders Library: backoff with jitter | Why retries amplify overload and why jitter spreads retry bursts. | No pre-dispatch governance decision path tied to outage retry behavior. |
| Google Cloud retry strategy | Retryability by status code, idempotency boundaries, anti-patterns, and jitter guidance. | No distributed scheduler semantics for safety-check outages and bypass labeling. |
| Resilience4j Retry docs | Retry config knobs: max attempts, wait duration, interval functions, exponential/randomized intervals. | No integration model for safety policy engines where unavailable checks change dispatch outcomes. |

The gap is control-plane semantics: how retry delay strategy intersects with fail-open policy and bypass observability.

Retry model comparison

| Model | Advantage | Risk |
| --- | --- | --- |
| Fixed delay (5s) | Simple, predictable, easy to reason about | Synchronized retries across replicas can re-hit dependency in waves |
| Exponential + jitter | Reduces correlated retry spikes during partial recovery | Higher tail latency and more tuning complexity |
| Fail-open (no requeue wait) | Maintains throughput when safety dependency is down | Allows policy bypass and needs strict observability and time limits |

Cordum runtime behavior

| Boundary | Current behavior | Operational impact |
| --- | --- | --- |
| Safety unavailable requeue | Engine returns `RetryAfter(..., 5s)` when `SafetyUnavailable` and fail mode is closed. | Controls retry pressure but uses uniform delay across replicas. |
| Fail-open branch | When input fail mode is open, decision is forced to allow and bypass labels are attached. | Preserves flow during outage while marking governance bypass context. |
| Bypass telemetry | Fail-open increments `cordum:scheduler:input_fail_open_total` counter path. | Provides explicit signal for alerting and post-incident review. |
| Backoff helper | `backoffDelay` implements `min(base*2^attempt + jitter, max)` with jitter from crypto RNG. | Existing primitive can reduce synchronized retry behavior when integrated. |
| Backoff constants | Base `1s`, max `30s`, jitter max `500ms` in scheduler backoff utility. | Gives bounded adaptive delay model for non-safety-unavailable retry paths. |

Implementation examples

Current fixed-delay unavailable path (Go)

safety_unavailable_fixed_delay.go
Go
const safetyThrottleDelay = 5 * time.Second

// Excerpt from the engine's safety-check result handling.
case SafetyUnavailable:
  if e.isInputFailOpen() {
    // Fail-open: force allow and label the bypass in the decision record.
    record.Decision = SafetyAllow
    record.Reason = "fail-open: safety unavailable -- " + record.Reason
  } else {
    // Fail-closed: requeue with the same uniform 5s delay on every replica.
    return RetryAfter(fmt.Errorf("safety unavailable: %s", record.Reason), safetyThrottleDelay)
  }

Existing jittered backoff utility (Go)

scheduler_backoff_with_jitter.go
Go
const (
  backoffBase      = 1 * time.Second
  backoffMax       = 30 * time.Second
  backoffJitterMax = 500 * time.Millisecond
)

func backoffDelay(attempt int, base, maxDelay time.Duration) time.Duration {
  // Exponential growth: base * 2^attempt, guarded against shift overflow.
  delay := base << attempt
  if delay > maxDelay || delay <= 0 {
    delay = maxDelay
  }
  // Random jitter in [0, backoffJitterMax) decorrelates replicas.
  jitter := cryptoJitter(backoffJitterMax)
  total := delay + jitter
  if total > maxDelay {
    total = maxDelay
  }
  return total
}

Outage retry runbook baseline

safety_unavailable_retry_runbook.sh
Bash
# 1) Keep default fail-closed for production
export POLICY_CHECK_FAIL_MODE=closed

# 2) Alert on bypass activity if fail-open is ever enabled
# metric: cordum_scheduler_input_fail_open_total

# 3) During outage simulation, watch retry pressure and queue growth
# - Safety unavailable retry delay currently fixed at 5s
# - Compare against jittered backoff experiment in staging

# 4) If enabling fail-open, set explicit expiry and rollback time
# (for example: 30-minute emergency window)

Limitations and tradeoffs

  - Fixed delays are easier to audit, but can create correlated retry bursts.
  - Jitter improves recovery smoothness, but complicates exact retry timing expectations.
  - Fail-open protects throughput, but weakens strict pre-dispatch safety guarantees.
  - Retry policy changes should be validated with queue, latency, and bypass metrics together.

If every replica retries at exactly `5s`, dependency recovery can look worse than the underlying incident because load re-arrives in lockstep.

Next step

Run a controlled outage experiment this week:

  1. Simulate Safety Kernel unavailability and record queue lag under fixed `5s` retry.
  2. Compare with jittered backoff in staging to measure retry burst flattening.
  3. Keep fail-closed default for production and define formal approval for temporary fail-open.
  4. Alert on `input_fail_open_total` and include bypass labels in incident timelines.

Continue with Safety Circuit Breaker Tuning and Fail-Open Alerting.

Retry policy is load policy

During outages, your delay function shapes recovery traffic more than your incident notes do.