The production problem
Teams copy a timeout value from a sample config and forget it. Six weeks later, policy latency rises and one of two bad outcomes appears: blocked queues or silent policy bypass.
Tight timeouts turn normal latency spikes into `SafetyUnavailable` storms. Loose timeouts keep workers waiting while backlog grows. Neither outcome is acceptable for autonomous systems with real side effects.
Timeout tuning is not a transport tweak. It is a governance control that defines when availability is allowed to outrank safety checks.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| gRPC Deadlines guide | Deadline propagation mechanics and why callers should set deadlines intentionally. | No operation-level guidance for fail-open/fail-closed branches in AI dispatch loops. |
| Envoy ext_authz filter docs | Concrete authz timeout configuration and `failure_mode_allow` behavior. | No control-plane playbook for combining timeout budgets with scheduler requeue logic. |
| OPA Envoy performance docs | Benchmark scenarios and metrics (`end-to-end`, policy eval, handler cost). | No practical method for mapping benchmark latency to production timeout defaults by risk tier. |
The gap is an end-to-end method: derive timeout budgets from measured latency, then bind those budgets to explicit fail-mode behavior and alerts.
Timeout budget model
Use one timeout policy per risk class, not one timeout for every request.
| Operation path | Timeout target | Fail mode | Retry rule |
|---|---|---|---|
| Submit-time policy check (gateway) | 400-800ms target p99 | Closed for mutating endpoints; open only by incident override | At most one retry if deadline headroom remains >250ms |
| Pre-dispatch policy check (scheduler) | Start from Cordum 3s baseline, then tune per topic latency | `POLICY_CHECK_FAIL_MODE=closed` in production | Requeue with bounded delay when safety is unavailable |
| High-risk external side effects | 1000-2000ms budget with extra tail reserve | Closed mandatory | Retry only with idempotency and strict attempt cap |
| Low-risk advisory paths | 250-500ms | Open can be acceptable with paging on bypass metrics | Fast fail if budget exhausted |
Budget formula
```
# Safety check timeout budget (per operation class)
# inputs from observability over last 7 days:
#   p99_eval_ms:    policy evaluation p99
#   p99_network_ms: transport + queueing p99
#   jitter_ms:      rollout jitter reserve
#   margin_ms:      explicit safety margin
timeout_ms = p99_eval_ms + p99_network_ms + jitter_ms + margin_ms

# Practical floor/ceiling guards
timeout_ms = max(timeout_ms, 250)
timeout_ms = min(timeout_ms, 3000)

# Example:
#   p99_eval=420, p99_network=110, jitter=80, margin=140
#   timeout = 750ms
```
Cordum runtime behavior
These numbers are current in Cordum code and docs. Tune around them, but do not ignore them during rollout planning.
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Scheduler safety check timeout | `safetyCheckTimeout = 3s` and `context.WithTimeout(...)` around safety check RPC. | Long policy calls are cut off deterministically instead of blocking worker loops. |
| Timeout outcome | On timeout, scheduler marks decision as `SafetyUnavailable` and logs a warning. | Behavior then follows configured input fail mode, not implicit retry loops. |
| Closed mode behavior | With `POLICY_CHECK_FAIL_MODE=closed` (default), unavailable checks requeue with backoff. | Availability drops during outage, but unsafe dispatch is avoided. |
| Open mode behavior | With `POLICY_CHECK_FAIL_MODE=open`, jobs proceed and `cordum_scheduler_input_fail_open_total` increments. | Throughput is preserved, but policy bypass risk rises with outage duration. |
| Store and lock operations | `storeOpTimeout = 2s` bounds Redis operations in scheduler code paths. | Prevents lock/store stalls from masking policy timeout issues. |
Implementation examples
Adaptive timeout chooser (Go)
```go
func chooseSafetyTimeout(p99Eval, p99Network time.Duration, critical bool) time.Duration {
	jitterReserve := 80 * time.Millisecond
	margin := 140 * time.Millisecond
	base := p99Eval + p99Network + jitterReserve + margin
	if base < 250*time.Millisecond {
		base = 250 * time.Millisecond
	}
	if critical && base < 800*time.Millisecond {
		base = 800 * time.Millisecond
	}
	if base > 3*time.Second { // align with scheduler guardrail envelope
		base = 3 * time.Second
	}
	return base
}
```

Operational runbook
```sh
# 1) Watch for policy bypass during incident (PromQL)
sum(rate(cordum_scheduler_input_fail_open_total[5m])) by (topic)

# 2) Inspect timeout warnings from scheduler
kubectl logs deploy/cordum-scheduler -n cordum | grep -E "safety check timed out|safety kernel unavailable"

# 3) Verify fail mode settings during outage
kubectl exec -n cordum deploy/cordum-scheduler -- printenv POLICY_CHECK_FAIL_MODE
kubectl exec -n cordum deploy/cordum-api-gateway -- printenv GATEWAY_POLICY_FAIL_MODE

# 4) If bypass counter is rising in production, switch to closed and drain
kubectl set env deployment/cordum-scheduler -n cordum POLICY_CHECK_FAIL_MODE=closed
```
Limitations and tradeoffs
- Higher timeout budgets reduce false timeouts but increase queue occupancy under policy latency spikes.
- Lower budgets reduce stall time but increase `SafetyUnavailable` frequency.
- Fail-open protects throughput but can bypass deny/approval decisions.
- Fail-closed preserves policy integrity but can degrade availability during kernel outages.
If `cordum_scheduler_input_fail_open_total` rises and nobody is paged, your control plane is running without effective pre-dispatch governance.
Next step
Do this in the next reliability cycle:
1. Export 7-day p95/p99 latency for policy evaluation per topic.
2. Set timeout budgets with the formula above and document risk-tier ownership.
3. Keep `POLICY_CHECK_FAIL_MODE=closed` for production unless incident commander approves override.
4. Alert on any non-zero 5-minute rate of `cordum_scheduler_input_fail_open_total`.
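The alerting step can be expressed as a Prometheus rule; the group name, severity label, and `for` window below are illustrative and should match your paging conventions:

```yaml
groups:
  - name: cordum-policy-bypass
    rules:
      - alert: PolicyFailOpenBypass
        # Any sustained non-zero rate means jobs dispatched without a
        # pre-dispatch policy decision.
        expr: sum(rate(cordum_scheduler_input_fail_open_total[5m])) by (topic) > 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Policy bypass via fail-open on topic {{ $labels.topic }}"
```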
Continue with AI Agent Fail-Open vs Fail-Closed and AI Agent gRPC Deadline Budgeting.