The production problem
During safety outages, every scheduler replica asks the same question: retry now or wait. If they all retry with the same cadence, they can hit the recovering dependency in synchronized waves.
This can delay recovery and inflate queue latency. The problem is not only service health. It is retry coordination under partial failure.
For governance systems, retry behavior also changes policy outcomes because fail-open mode can bypass checks while dependency health is uncertain.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| AWS Builders Library: backoff with jitter | Why retries amplify overload and why jitter spreads retry bursts. | No pre-dispatch governance decision path tied to outage retry behavior. |
| Google Cloud retry strategy | Retryability by status code, idempotency boundaries, anti-patterns, and jitter guidance. | No distributed scheduler semantics for safety-check outages and bypass labeling. |
| Resilience4j Retry docs | Retry config knobs: max attempts, wait duration, interval functions, exponential/randomized intervals. | No integration model for safety policy engines where unavailable checks change dispatch outcomes. |
The gap is control-plane semantics: how retry delay strategy intersects with fail-open policy and bypass observability.
Retry model comparison
| Model | Advantage | Risk |
|---|---|---|
| Fixed delay (5s) | Simple, predictable, easy to reason about | Synchronized retries across replicas can re-hit dependency in waves |
| Exponential + jitter | Reduces correlated retry spikes during partial recovery | Higher tail latency and more tuning complexity |
| Fail-open (no requeue wait) | Maintains throughput when safety dependency is down | Allows policy bypass and needs strict observability and time limits |
Cordum runtime behavior
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Safety unavailable requeue | Engine returns `RetryAfter(..., 5s)` when `SafetyUnavailable` and fail mode is closed. | Controls retry pressure but uses uniform delay across replicas. |
| Fail-open branch | When input fail mode is open, decision is forced to allow and bypass labels are attached. | Preserves flow during outage while marking governance bypass context. |
| Bypass telemetry | Fail-open increments `cordum:scheduler:input_fail_open_total` counter path. | Provides explicit signal for alerting and post-incident review. |
| Backoff helper | `backoffDelay` implements `min(base*2^attempt + jitter, max)` with jitter from crypto RNG. | Existing primitive can reduce synchronized retry behavior when integrated. |
| Backoff constants | Base `1s`, max `30s`, jitter max `500ms` in scheduler backoff utility. | Gives bounded adaptive delay model for non-safety-unavailable retry paths. |
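The bypass-telemetry row implies a labeled counter incremented on every fail-open decision. The scheduler's real metrics backend is not shown in the excerpts, so the sketch below uses a plain in-memory counter; `bypassCounter` and its label values are illustrative stand-ins for `cordum:scheduler:input_fail_open_total`.

```go
package main

import (
	"fmt"
	"sync"
)

// bypassCounter is an illustrative stand-in for the scheduler's
// input_fail_open_total metric: a labeled counter incremented on
// every fail-open bypass, safe for concurrent scheduler goroutines.
type bypassCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

func (c *bypassCounter) Inc(reason string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.counts == nil {
		c.counts = map[string]int{}
	}
	c.counts[reason]++
}

func main() {
	var c bypassCounter
	// Each fail-open decision both allows dispatch and leaves an audit trail.
	c.Inc("safety_unavailable")
	c.Inc("safety_unavailable")
	fmt.Println(c.counts["safety_unavailable"]) // 2
}
```

The important property is that the counter is incremented on the same code path that forces the allow decision, so alerting and post-incident review see every bypass.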
Implementation examples
Current fixed-delay unavailable path (Go)
```go
const safetyThrottleDelay = 5 * time.Second

case SafetyUnavailable:
	if e.isInputFailOpen() {
		record.Decision = SafetyAllow
		record.Reason = "fail-open: safety unavailable -- " + record.Reason
	} else {
		return RetryAfter(fmt.Errorf("safety unavailable: %s", record.Reason), safetyThrottleDelay)
	}
```

Existing jittered backoff utility (Go)
```go
const (
	backoffBase      = 1 * time.Second
	backoffMax       = 30 * time.Second
	backoffJitterMax = 500 * time.Millisecond
)

func backoffDelay(attempt int, base, maxDelay time.Duration) time.Duration {
	delay := base << attempt
	if delay > maxDelay || delay <= 0 { // cap growth and guard shift overflow
		delay = maxDelay
	}
	jitter := cryptoJitter(backoffJitterMax)
	total := delay + jitter
	if total > maxDelay {
		total = maxDelay
	}
	return total
}
```

Outage retry runbook baseline
```sh
# 1) Keep default fail-closed for production
export POLICY_CHECK_FAIL_MODE=closed

# 2) Alert on bypass activity if fail-open is ever enabled
#    metric: cordum_scheduler_input_fail_open_total

# 3) During outage simulation, watch retry pressure and queue growth
#    - Safety unavailable retry delay currently fixed at 5s
#    - Compare against jittered backoff experiment in staging

# 4) If enabling fail-open, set explicit expiry and rollback time
#    (for example: 30-minute emergency window)
```
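Wiring the existing backoff helper into the safety-unavailable path mostly requires threading an attempt count through the dispatch record. The sketch below is a hypothetical integration under stated assumptions: `dispatchRecord` and its `SafetyAttempts` field are invented for illustration (the real record type is not shown in the excerpts), and `math/rand` stands in for the crypto RNG the real helper uses.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Assumed constants mirroring the scheduler's backoff utility.
const (
	backoffBase      = 1 * time.Second
	backoffMax       = 30 * time.Second
	backoffJitterMax = 500 * time.Millisecond
)

// backoffDelay mirrors the documented helper:
// min(base*2^attempt + jitter, max).
func backoffDelay(attempt int, base, maxDelay time.Duration) time.Duration {
	delay := base << attempt
	if delay > maxDelay || delay <= 0 {
		delay = maxDelay
	}
	jitter := time.Duration(rand.Int63n(int64(backoffJitterMax)))
	total := delay + jitter
	if total > maxDelay {
		total = maxDelay
	}
	return total
}

// dispatchRecord is a hypothetical slice of the scheduler record;
// SafetyAttempts is an assumed field, not confirmed by the excerpts.
type dispatchRecord struct {
	SafetyAttempts int
	Reason         string
}

// requeueDelay shows the proposed change: the uniform 5s requeue becomes a
// per-record jittered delay that grows with consecutive safety failures.
func requeueDelay(record *dispatchRecord) time.Duration {
	record.SafetyAttempts++
	return backoffDelay(record.SafetyAttempts, backoffBase, backoffMax)
}

func main() {
	rec := &dispatchRecord{Reason: "kernel timeout"}
	for i := 0; i < 4; i++ {
		fmt.Printf("requeue %d after %v\n", i+1, requeueDelay(rec))
	}
}
```

Because the delay is derived per record rather than from a shared constant, replicas requeuing the same backlog no longer share a cadence, which is the synchronized-wave failure mode the fixed `5s` path creates.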
Limitations and tradeoffs
- Fixed delays are easier to audit, but can create correlated retry bursts.
- Jitter improves recovery smoothness, but complicates exact retry timing expectations.
- Fail-open protects throughput, but weakens strict pre-dispatch safety guarantees.
- Retry policy changes should be validated with queue, latency, and bypass metrics together.
If every replica retries at exactly `5s`, dependency recovery can look worse than the underlying incident because load re-arrives in lockstep.
Next step
Run a controlled outage experiment this week:
1. Simulate Safety Kernel unavailability and record queue lag under fixed `5s` retry.
2. Compare with jittered backoff in staging to measure retry burst flattening.
3. Keep fail-closed default for production and define formal approval for temporary fail-open.
4. Alert on `input_fail_open_total` and include bypass labels in incident timelines.
Continue with Safety Circuit Breaker Tuning and Fail-Open Alerting.