The production problem
Teams copy a timeout value from a sample config and forget it. Six weeks later, policy latency rises and one of two bad outcomes appears: blocked queues or silent policy bypass.
Tight timeouts turn normal latency spikes into `SafetyUnavailable` storms. Loose timeouts keep workers waiting while backlog grows. Neither outcome is acceptable for autonomous systems with real side effects.
Timeout tuning is not a transport tweak. It is a governance control that defines when availability is allowed to outrank safety checks.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| gRPC Deadlines guide | Deadline propagation mechanics and why callers should set deadlines intentionally. | No operation-level guidance for fail-open/fail-closed branches in AI dispatch loops. |
| Envoy ext_authz filter docs | Concrete authz timeout configuration and `failure_mode_allow` behavior. | No control-plane playbook for combining timeout budgets with scheduler requeue logic. |
| OPA Envoy performance docs | Benchmark scenarios and metrics (`end-to-end`, policy eval, handler cost). | No practical method for mapping benchmark latency to production timeout defaults by risk tier. |
The gap is an end-to-end method: derive timeout budgets from measured latency, then bind those budgets to explicit fail-mode behavior and alerts.
Timeout budget model
Use one timeout policy per risk class, not one timeout for every request.
| Operation path | Timeout target | Fail mode | Retry rule |
|---|---|---|---|
| Submit-time policy check (gateway) | 400-800ms target p99 | Closed for mutating endpoints; open only by incident override | At most one retry if deadline headroom remains >250ms |
| Pre-dispatch policy check (scheduler) | Start from Cordum 3s baseline, then tune per topic latency | `POLICY_CHECK_FAIL_MODE=closed` in production | Requeue with bounded delay when safety is unavailable |
| High-risk external side effects | 1000-2000ms budget with extra tail reserve | Closed mandatory | Retry only with idempotency and strict attempt cap |
| Low-risk advisory paths | 250-500ms | Open can be acceptable with paging on bypass metrics | Fast fail if budget exhausted |
Budget formula
```
# Safety check timeout budget (per operation class)
# inputs from observability over last 7 days:
#   p99_eval_ms:    policy evaluation p99
#   p99_network_ms: transport + queueing p99
#   jitter_ms:      rollout jitter reserve
#   margin_ms:      explicit safety margin
timeout_ms = p99_eval_ms + p99_network_ms + jitter_ms + margin_ms

# Practical floor/ceiling guards
timeout_ms = max(timeout_ms, 250)
timeout_ms = min(timeout_ms, 3000)

# Example:
#   p99_eval=420, p99_network=110, jitter=80, margin=140
#   timeout = 750ms
```
Cordum runtime behavior
These numbers are current in Cordum code and docs. Tune around them, but do not ignore them during rollout planning.
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Scheduler safety check timeout | `safetyCheckTimeout = 3s` and `context.WithTimeout(...)` around safety check RPC. | Long policy calls are cut off deterministically instead of blocking worker loops. |
| Timeout outcome | On timeout, scheduler marks decision as `SafetyUnavailable` and logs a warning. | Behavior then follows configured input fail mode, not implicit retry loops. |
| Closed mode behavior | With `POLICY_CHECK_FAIL_MODE=closed` (default), unavailable checks requeue with backoff. | Availability drops during outage, but unsafe dispatch is avoided. |
| Open mode behavior | With `POLICY_CHECK_FAIL_MODE=open`, jobs proceed and `cordum_scheduler_input_fail_open_total` increments. | Throughput is preserved, but policy bypass risk rises with outage duration. |
| Store and lock operations | `storeOpTimeout = 2s` bounds Redis operations in scheduler code paths. | Prevents lock/store stalls from masking policy timeout issues. |
Implementation examples
Adaptive timeout chooser (Go)
```go
func chooseSafetyTimeout(p99Eval, p99Network time.Duration, critical bool) time.Duration {
	jitterReserve := 80 * time.Millisecond
	margin := 140 * time.Millisecond
	base := p99Eval + p99Network + jitterReserve + margin
	if base < 250*time.Millisecond {
		base = 250 * time.Millisecond
	}
	if critical && base < 800*time.Millisecond {
		base = 800 * time.Millisecond
	}
	if base > 3*time.Second { // align with scheduler guardrail envelope
		base = 3 * time.Second
	}
	return base
}
```

Operational runbook
```sh
# 1) Watch for policy bypass during incident (PromQL)
sum(rate(cordum_scheduler_input_fail_open_total[5m])) by (topic)

# 2) Inspect timeout warnings from scheduler
kubectl logs deploy/cordum-scheduler -n cordum | grep -E "safety check timed out|safety kernel unavailable"

# 3) Verify fail mode settings during outage
kubectl exec -n cordum deploy/cordum-scheduler -- printenv POLICY_CHECK_FAIL_MODE
kubectl exec -n cordum deploy/cordum-api-gateway -- printenv GATEWAY_POLICY_FAIL_MODE

# 4) If bypass counter is rising in production, switch to closed and drain
kubectl set env deployment/cordum-scheduler -n cordum POLICY_CHECK_FAIL_MODE=closed
```
Limitations and tradeoffs
- Higher timeout budgets reduce false timeouts but increase queue occupancy under policy latency spikes.
- Lower budgets reduce stall time but increase `SafetyUnavailable` frequency.
- Fail-open protects throughput but can bypass deny/approval decisions.
- Fail-closed preserves policy integrity but can degrade availability during kernel outages.
If `cordum_scheduler_input_fail_open_total` rises and nobody is paged, your control plane is running without effective pre-dispatch governance.
Next step
Do this in the next reliability cycle:
1. Export 7-day p95/p99 latency for policy evaluation per topic.
2. Set timeout budgets with the formula above and document risk-tier ownership.
3. Keep `POLICY_CHECK_FAIL_MODE=closed` for production unless incident commander approves override.
4. Alert on any non-zero 5-minute rate of `cordum_scheduler_input_fail_open_total`.
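The alerting step can be expressed as a Prometheus rule; the group name, severity label, and `for` window below are illustrative and should match your paging conventions:

```yaml
groups:
  - name: cordum-policy-bypass
    rules:
      - alert: PolicyFailOpenBypass
        # Any sustained non-zero rate means jobs dispatched without a
        # pre-dispatch policy decision.
        expr: sum(rate(cordum_scheduler_input_fail_open_total[5m])) by (topic) > 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Policy bypass via fail-open on topic {{ $labels.topic }}"
```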
Continue with AI Agent Fail-Open vs Fail-Closed and AI Agent gRPC Deadline Budgeting.