The production problem
Teams copy a timeout value from a sample config, then forget it. Six weeks later, policy latency rises and one of two bad outcomes appears: blocked queues or silent policy bypass.
Tight timeouts turn normal latency spikes into `SafetyUnavailable` storms. Loose timeouts keep workers waiting while backlog grows. Neither outcome is acceptable for autonomous systems with real side effects.
Timeout tuning is not a transport tweak. It is a governance control that defines when availability is allowed to outrank safety checks.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| gRPC Deadlines guide | Deadline propagation mechanics and why callers should set deadlines intentionally. | No operation-level guidance for fail-open/fail-closed branches or multi-stage timeout caps in AI dispatch loops. |
| Envoy ext_authz filter docs | Concrete authz timeout configuration and `failure_mode_allow` behavior. | No control-plane playbook for combining timeout budgets with scheduler requeue logic and submit-vs-dispatch stage drift. |
| OPA Envoy performance docs | Benchmark scenarios and metrics (`end-to-end`, policy eval, handler cost). | No practical method for mapping benchmark latency to production timeout defaults by risk tier with nested timeout clamps. |
The gap is an end-to-end method: derive timeout budgets from measured latency, then bind those budgets to explicit fail-mode behavior and alerts.
Timeout budget model
Use one timeout policy per risk class, not one timeout for every request.
| Operation path | Timeout target | Fail mode | Retry rule |
|---|---|---|---|
| Submit-time policy check (gateway) | 400-800ms target p99 | Closed for mutating endpoints; open only by incident override | At most one retry if deadline headroom remains >250ms |
| Pre-dispatch policy check (scheduler) | Effective cap is 2s today (`min(3s scheduler outer, 2s safety client)`) | `POLICY_CHECK_FAIL_MODE=closed` in production | Requeue with bounded delay when safety is unavailable |
| Submit + dispatch combined safety path | Gateway submit eval uses 5s; dispatch check effective cap is 2s | Keep both stages closed in production unless incident override is explicit | Treat stage mismatch as drift and tune both paths together |
| High-risk external side effects | 1000-2000ms budget with extra tail reserve | Closed mandatory | Retry only with idempotency and strict attempt cap |
| Low-risk advisory paths | 250-500ms | Open can be acceptable with paging on bypass metrics | Fast fail if budget exhausted |
Budget formula
```
# Safety check timeout budget (per operation class)
# inputs from observability over last 7 days:
#   p99_eval_ms:    policy evaluation p99
#   p99_network_ms: transport + queueing p99
#   jitter_ms:      rollout jitter reserve
#   margin_ms:      explicit safety margin
timeout_ms = p99_eval_ms + p99_network_ms + jitter_ms + margin_ms

# Practical floor/ceiling guards
timeout_ms = max(timeout_ms, 250)
timeout_ms = min(timeout_ms, 3000)

# Stage caps in current Cordum code:
#   gateway submit-time policy evaluate: 5000ms
#   scheduler pre-dispatch outer guard:  3000ms
#   scheduler safety client inner cap:   2000ms
effective_dispatch_timeout_ms = min(timeout_ms, 3000, 2000)

# Example:
#   p99_eval=420, p99_network=110, jitter=80, margin=140
#   timeout = 420 + 110 + 80 + 140 = 750ms
```
Cordum runtime behavior
These values come from current Cordum code and docs. Tune around them, but account for them explicitly during rollout planning.
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Gateway submit-time timeout | Gateway wraps policy `Evaluate` with `context.WithTimeout(..., 5s)` before persistence/publish. | Submit-time decisions can succeed while dispatch-time checks still time out if budgets are misaligned. |
| Scheduler outer safety timeout | Scheduler sets `safetyCheckTimeout = 3s` around the safety check path. | Dispatch loop has a hard upper guard, but inner clients can still cut earlier. |
| Safety client inner timeout | Safety client wraps requests with `context.WithTimeout(ctx, 2s)` before gRPC `Check`. | Effective dispatch-time safety timeout is 2s; inner timeout usually fires before the 3s outer guard. |
| Timeout outcome | On timeout, scheduler marks decision as `SafetyUnavailable` and logs a warning. | Behavior then follows configured input fail mode, not implicit retry loops. |
| Closed mode behavior | With `POLICY_CHECK_FAIL_MODE=closed` (default), unavailable checks requeue with backoff. | Availability drops during outage, but unsafe dispatch is avoided. |
| Open mode behavior | With `POLICY_CHECK_FAIL_MODE=open`, jobs proceed and `cordum_scheduler_input_fail_open_total` increments. | Throughput is preserved, but policy bypass risk rises with outage duration. |
| Store and lock operations | `storeOpTimeout = 2s` bounds Redis operations in scheduler code paths. | Prevents lock/store stalls from masking policy timeout issues. |
Implementation examples
Adaptive timeout chooser (Go)
```go
func chooseSafetyTimeout(p99Eval, p99Network time.Duration, critical bool) time.Duration {
	jitterReserve := 80 * time.Millisecond
	margin := 140 * time.Millisecond
	base := p99Eval + p99Network + jitterReserve + margin
	if base < 250*time.Millisecond {
		base = 250 * time.Millisecond
	}
	if critical && base < 800*time.Millisecond {
		base = 800 * time.Millisecond
	}
	if base > 3*time.Second { // align with scheduler guardrail envelope
		base = 3 * time.Second
	}
	return base
}
```
Multi-stage timeout clamp (Go)
```go
func effectiveDispatchSafetyTimeout(parent context.Context) time.Duration {
	const (
		schedulerOuterCap = 3 * time.Second
		safetyClientCap   = 2 * time.Second
		minFloor          = 250 * time.Millisecond
		deadlineReserve   = 150 * time.Millisecond
	)
	budget := minDuration(schedulerOuterCap, safetyClientCap)
	if dl, ok := parent.Deadline(); ok {
		remaining := time.Until(dl) - deadlineReserve
		if remaining < budget {
			budget = remaining
		}
	}
	if budget < minFloor {
		return minFloor
	}
	return budget
}

func minDuration(a, b time.Duration) time.Duration {
	if a < b {
		return a
	}
	return b
}
```
Operational runbook
```shell
# 1) Watch for policy bypass during incident (PromQL)
sum(rate(cordum_scheduler_input_fail_open_total[5m])) by (topic)

# 2) Inspect timeout warnings from scheduler
kubectl logs deploy/cordum-scheduler -n cordum | grep -E "safety check timed out|safety kernel unavailable"

# 3) Verify fail mode settings during outage
kubectl exec -n cordum deploy/cordum-scheduler -- printenv POLICY_CHECK_FAIL_MODE
kubectl exec -n cordum deploy/cordum-api-gateway -- printenv GATEWAY_POLICY_FAIL_MODE

# 4) If bypass counter is rising in production, switch to closed and drain
kubectl set env deployment/cordum-scheduler -n cordum POLICY_CHECK_FAIL_MODE=closed
```
Limitations and tradeoffs
- Higher timeout budgets reduce false timeouts but increase queue occupancy during policy latency spikes.
- Lower budgets reduce stall time but increase `SafetyUnavailable` frequency.
- Fail-open protects throughput but can bypass deny/approval decisions.
- Fail-closed preserves policy integrity but can degrade availability during kernel outages.
- Splitting timeout caps between submit and dispatch checks can create hard-to-debug stage drift.
If `cordum_scheduler_input_fail_open_total` rises and nobody is paged, your control plane is running without effective pre-dispatch governance.
Next step
Do this in the next reliability cycle:
1. Export 7-day p95/p99 latency for policy evaluation per topic.
2. Set timeout budgets with the formula above and document risk-tier ownership.
3. Audit stage caps (`5s` gateway submit, `3s` scheduler outer, `2s` client inner) and align them intentionally.
4. Keep `POLICY_CHECK_FAIL_MODE=closed` in production unless the incident commander approves an override.
5. Alert on any non-zero 5-minute rate of `cordum_scheduler_input_fail_open_total`.
Continue with AI Agent Fail-Open vs Fail-Closed and AI Agent gRPC Deadline Budgeting.