AI Agent Safety Check Timeout Tuning

Timeouts are policy decisions. Treat them like governance, not a magic constant.

Guide · 11 min read · Mar 2026
TL;DR
  • A single global timeout is usually wrong for safety checks across mixed-risk operations.
  • Cordum scheduler enforces a 3s safety check timeout and turns expired checks into `SafetyUnavailable`.
  • In `POLICY_CHECK_FAIL_MODE=open`, unavailable checks are allowed through and counted in `cordum_scheduler_input_fail_open_total`.
  • Timeout tuning only works if fail-open paths have explicit monitoring and escalation thresholds.
Budget by risk

Use different timeout envelopes for read-like, internal write, and external side-effect paths.

Fail mode discipline

Timeout handling must map to explicit fail-open or fail-closed policy, not ad hoc retries.

Bounded recovery

Use short requeue windows and alert on bypass counters before risk accumulates.

Scope

This guide focuses on pre-dispatch and submit-time safety checks in autonomous AI control planes where timeout values directly affect both policy integrity and queue stability.

The production problem

Teams copy a timeout value from a sample config, then forget it. Six weeks later, policy latency rises and one of two bad outcomes appears: blocked queues or silent policy bypass.

Tight timeouts turn normal latency spikes into `SafetyUnavailable` storms. Loose timeouts keep workers waiting while backlog grows. Neither outcome is acceptable for autonomous systems with real side effects.

Timeout tuning is not a transport tweak. It is a governance control that defines when availability is allowed to outrank safety checks.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| gRPC Deadlines guide | Deadline propagation mechanics and why callers should set deadlines intentionally. | No operation-level guidance for fail-open/fail-closed branches in AI dispatch loops. |
| Envoy ext_authz filter docs | Concrete authz timeout configuration and `failure_mode_allow` behavior. | No control-plane playbook for combining timeout budgets with scheduler requeue logic. |
| OPA Envoy performance docs | Benchmark scenarios and metrics (`end-to-end`, policy eval, handler cost). | No practical method for mapping benchmark latency to production timeout defaults by risk tier. |

The gap is an end-to-end method: derive timeout budgets from measured latency, then bind those budgets to explicit fail-mode behavior and alerts.

Timeout budget model

Use one timeout policy per risk class, not one timeout for every request.

| Operation path | Timeout target | Fail mode | Retry rule |
| --- | --- | --- | --- |
| Submit-time policy check (gateway) | 400-800ms target p99 | Closed for mutating endpoints; open only by incident override | At most one retry if deadline headroom remains >250ms |
| Pre-dispatch policy check (scheduler) | Start from Cordum 3s baseline, then tune per topic latency | `POLICY_CHECK_FAIL_MODE=closed` in production | Requeue with bounded delay when safety is unavailable |
| High-risk external side effects | 1000-2000ms budget with extra tail reserve | Closed mandatory | Retry only with idempotency and strict attempt cap |
| Low-risk advisory paths | 250-500ms | Open can be acceptable with paging on bypass metrics | Fast fail if budget exhausted |

Budget formula

timeout_budget.txt
Text
# Safety check timeout budget (per operation class)
# inputs from observability over last 7 days:
# p99_eval_ms: policy evaluation p99
# p99_network_ms: transport + queueing p99
# jitter_ms: rollout jitter reserve
# margin_ms: explicit safety margin

timeout_ms = p99_eval_ms + p99_network_ms + jitter_ms + margin_ms

# Practical floor/ceiling guards
timeout_ms = max(timeout_ms, 250)
timeout_ms = min(timeout_ms, 3000)

# Example:
# p99_eval=420, p99_network=110, jitter=80, margin=140
# timeout = 750ms

Cordum runtime behavior

These numbers are current in Cordum code and docs. Tune around them, but do not ignore them during rollout planning.

| Boundary | Current behavior | Operational impact |
| --- | --- | --- |
| Scheduler safety check timeout | `safetyCheckTimeout = 3s` and `context.WithTimeout(...)` around safety check RPC. | Long policy calls are cut off deterministically instead of blocking worker loops. |
| Timeout outcome | On timeout, scheduler marks decision as `SafetyUnavailable` and logs a warning. | Behavior then follows configured input fail mode, not implicit retry loops. |
| Closed mode behavior | With `POLICY_CHECK_FAIL_MODE=closed` (default), unavailable checks requeue with backoff. | Availability drops during outage, but unsafe dispatch is avoided. |
| Open mode behavior | With `POLICY_CHECK_FAIL_MODE=open`, jobs proceed and `cordum_scheduler_input_fail_open_total` increments. | Throughput is preserved, but policy bypass risk rises with outage duration. |
| Store and lock operations | `storeOpTimeout = 2s` bounds Redis operations in scheduler code paths. | Prevents lock/store stalls from masking policy timeout issues. |

Implementation examples

Adaptive timeout chooser (Go)

safety_timeout.go
Go
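// Package name is a placeholder; use the package that owns your timeout config.
package safety

import "time"

// chooseSafetyTimeout derives a safety check timeout from measured p99
// latencies. The result is clamped to [250ms, 3s], and critical paths get
// at least 800ms.
func chooseSafetyTimeout(p99Eval, p99Network time.Duration, critical bool) time.Duration {
  jitterReserve := 80 * time.Millisecond // rollout jitter reserve
  margin := 140 * time.Millisecond       // explicit safety margin
  base := p99Eval + p99Network + jitterReserve + margin

  // Floor: never go below 250ms, whatever the measurements say.
  if base < 250*time.Millisecond {
    base = 250 * time.Millisecond
  }

  // Critical paths get a higher floor to absorb tail latency.
  if critical && base < 800*time.Millisecond {
    base = 800 * time.Millisecond
  }

  if base > 3*time.Second { // align with scheduler guardrail envelope
    base = 3 * time.Second
  }
  return base
}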

Operational runbook

policy_timeout_runbook.sh
Bash
# 1) Watch for policy bypass during incident.
#    This is a PromQL query: run it in Prometheus or Grafana, not in the shell.
#    sum(rate(cordum_scheduler_input_fail_open_total[5m])) by (topic)

# 2) Inspect timeout warnings from scheduler
kubectl logs deploy/cordum-scheduler -n cordum | grep -E "safety check timed out|safety kernel unavailable"

# 3) Verify fail mode settings during outage
kubectl exec -n cordum deploy/cordum-scheduler -- printenv POLICY_CHECK_FAIL_MODE
kubectl exec -n cordum deploy/cordum-api-gateway -- printenv GATEWAY_POLICY_FAIL_MODE

# 4) If bypass counter is rising in production, switch to closed and drain
kubectl set env deployment/cordum-scheduler -n cordum POLICY_CHECK_FAIL_MODE=closed

Limitations and tradeoffs

  • Higher timeout budgets reduce false timeouts but increase queue occupancy under policy latency spikes.
  • Lower budgets reduce stall time but increase `SafetyUnavailable` frequency.
  • Fail-open protects throughput but can bypass deny/approval decisions.
  • Fail-closed preserves policy integrity but can degrade availability during kernel outages.

If `cordum_scheduler_input_fail_open_total` rises and nobody is paged, your control plane is running without effective pre-dispatch governance.

Next step

Do this in the next reliability cycle:

  1. Export 7-day p95/p99 latency for policy evaluation per topic.
  2. Set timeout budgets with the formula above and document risk-tier ownership.
  3. Keep `POLICY_CHECK_FAIL_MODE=closed` for production unless the incident commander approves an override.
  4. Alert on any non-zero 5-minute rate of `cordum_scheduler_input_fail_open_total`.

Continue with AI Agent Fail-Open vs Fail-Closed and AI Agent gRPC Deadline Budgeting.

Timeouts are governance controls

Put explicit ownership on timeout and fail-mode decisions before the next outage chooses for you.