The production problem
One caller sets a 1-second deadline. Downstream services each use their own default timeout. Retries kick in without checking remaining budget. That is how a single slow hop becomes a full-chain timeout event.
Most teams notice this only after rollout or traffic spikes because median latency looks fine. Tail latency quietly consumes the budget.
Deadline strategy has to be explicit across the call graph.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| gRPC deadlines guide | Deadline propagation concepts and why callers should set deadlines. | No concrete budget split strategy for multi-hop AI control planes. |
| gRPC retry guide | Retry primitives and policy controls. | No production method for retry gating on remaining deadline budget. |
| gRPC status codes | Meaning of `DEADLINE_EXCEEDED`, `UNAVAILABLE`, and related outcomes. | No operation-level decision matrix combining status code + budget state. |
The gap is budget allocation discipline: how much time each hop gets and when retries are still legal.
Deadline budget math
```
# End-to-end deadline budget example
# Caller deadline: 1200ms
ingress_budget  = 480ms   # 40%
core_rpc_budget = 540ms   # 45%
retry_reserve   = 120ms   # 10%
response_tail   = 60ms    # 5%

# Retry rule:
# Only retry if remaining_deadline > core_rpc_min + jitter_reserve
# Example threshold: remaining_deadline > 220ms
```
| Stage | Budget share | Purpose | Failure if mis-sized |
|---|---|---|---|
| Ingress handler | 40% | Auth, validation, routing, initial policy checks. | Slow preflight burns downstream budget before core work starts. |
| Core dependency RPC | 45% | Main business call (for example policy check, context fetch, write path). | Insufficient budget forces immediate `DEADLINE_EXCEEDED` under normal latency tail. |
| Retry reserve | 10% | One bounded retry with jitter for transient transport failures. | No reserve means retries violate caller deadline and worsen load. |
| Response/cleanup | 5% | Marshal response, metrics flush, and final state transition. | No tail reserve causes success-path calls to timeout at response edge. |
Cordum timeout baseline
These values are live in the current codebase and are useful anchors for budget design.
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Scheduler store operations | `storeOpTimeout = 2s` for many lock/store interactions in scheduler engine. | Keeps internal lock/store calls bounded under contention. |
| Scheduler safety checks | `safetyCheckTimeout = 3s` for pre-dispatch policy evaluation path. | Prevents long policy stalls from blocking scheduler worker loops. |
| Workflow handler budget | Workflow result handling uses a 30s handler context timeout. | Longer path budget for workflow step completion and state updates. |
| Service shutdown envelope | Core services drain gracefully within 15s during SIGTERM windows. | Deadlines longer than shutdown window need caller-side retry/continuation logic. |
Implementation examples
Remaining-budget-aware retry (Go)
```go
import (
	"context"
	"math/rand"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func withDeadlineRetry(ctx context.Context, req *pb.PolicyCheckRequest, c pb.SafetyKernelClient) (*pb.PolicyCheckResponse, error) {
	attempts := 0
	for {
		attempts++
		resp, err := c.Check(ctx, req)
		if err == nil {
			return resp, nil
		}
		st, ok := status.FromError(err)
		if !ok {
			return nil, err // not a gRPC status: never retry
		}
		// Only transport-level transients are retry candidates.
		if st.Code() != codes.Unavailable && st.Code() != codes.DeadlineExceeded {
			return nil, err
		}
		// Retry is only legal with a known deadline and at most one prior attempt.
		dl, hasDL := ctx.Deadline()
		if !hasDL || attempts >= 2 {
			return nil, err
		}
		remaining := time.Until(dl)
		if remaining < 220*time.Millisecond { // budget floor for one safe retry
			return nil, err
		}
		jitter := time.Duration(rand.Int63n(int64(40 * time.Millisecond)))
		time.Sleep(80*time.Millisecond + jitter)
	}
}
```

Deadline drift runbook
```shell
# Check timeout-related errors during rollout
kubectl logs deploy/cordum-api-gateway -n cordum | grep -E "DEADLINE_EXCEEDED|UNAVAILABLE|CANCELLED"
kubectl logs deploy/cordum-scheduler -n cordum | grep -E "safety check timeout|storeOpTimeout|retry"

# Trigger controlled rollout
kubectl rollout restart deployment/cordum-api-gateway -n cordum
kubectl rollout status deployment/cordum-api-gateway -n cordum

# Confirm caller budgets and retry counts via metrics/logs
curl -s http://localhost:9092/metrics | grep -E "grpc|timeout|retry"
```
Limitations and tradeoffs
- Larger deadlines improve success under tail latency but can hide slow dependency regressions.
- Smaller deadlines protect upstream latency SLOs but increase timeout error rate during spikes.
- Retry reserves improve resilience but consume budget that could be used by primary execution.
- Strict fail-fast logic lowers blast radius but may reject recoverable transient calls.
If you cannot explain where each millisecond goes, your deadline policy is probably guessing in production.
Next step
Run this in one sprint:
1. Trace one critical request path and list every hop with p95 and p99 latency.
2. Assign deadline shares explicitly and codify them in code comments and config.
3. Add a guard that blocks retries when the remaining deadline drops below the safety floor.
4. Run a rollout drill and confirm timeout/retry counters stay inside the error budget.
Continue with *AI Agent gRPC CANCELLED and UNAVAILABLE* and *AI Agent Lock TTL Tuning*.