The production problem
One caller sets a 1-second deadline. Downstream services each use their own default timeout. Retries kick in without checking remaining budget. That is how a single slow hop becomes a full-chain timeout event.
Most teams notice this only after rollout or traffic spikes because median latency looks fine. Tail latency quietly consumes the budget.
Deadline strategy has to be explicit across the call graph.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| gRPC deadlines guide | Deadline propagation model and guidance to set deadlines on client calls. | No concrete hop-by-hop budget split for ingress, core call, retry reserve, and response tail. |
| gRPC retry guide | Retry primitives, policy controls, and retry-throttling behavior. | No production rule for blocking retries when remaining deadline drops below a safety floor. |
| Timeouts, Retries, and Backoff playbook | Practical timeout budget model, retry decision table, and jitter backoff patterns. | No codebase-verified timeout anchors for Cordum services and no rollout runbook for control-plane paths. |
The gap is budget allocation discipline: how much time each hop gets and when retries are still legal.
Deadline budget math
```text
# End-to-end deadline budget example
# Caller deadline: 1200ms
ingress_budget  = 480ms   # 40%
core_rpc_budget = 540ms   # 45%
retry_reserve   = 120ms   # 10%
response_tail   =  60ms   # 5%

# Retry rule:
# Only retry if remaining_deadline > core_rpc_min + jitter_reserve
# Example threshold: remaining_deadline > 220ms
```
| Stage | Budget share | Purpose | Failure if mis-sized |
|---|---|---|---|
| Ingress handler | 40% | Auth, validation, routing, initial policy checks. | Slow preflight burns downstream budget before core work starts. |
| Core dependency RPC | 45% | Main business call (for example policy check, context fetch, write path). | Insufficient budget forces immediate `DEADLINE_EXCEEDED` under normal latency tail. |
| Retry reserve | 10% | One bounded retry with jitter for transient transport failures. | No reserve means retries violate caller deadline and worsen load. |
| Response/cleanup | 5% | Marshal response, metrics flush, and final state transition. | No tail reserve causes success-path calls to time out at the response edge. |
Cordum timeout baseline
These values are live in the current codebase and are useful anchors for budget design.
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Scheduler store operations | `storeOpTimeout = 2s` for many lock/store interactions in scheduler engine. | Keeps internal lock/store calls bounded under contention. |
| Scheduler safety checks | `safetyCheckTimeout = 3s` for pre-dispatch policy evaluation path. | Prevents long policy stalls from blocking scheduler worker loops. |
| Workflow handler budget | Workflow result handling uses a 30s handler context timeout. | Longer path budget for workflow step completion and state updates. |
| Service shutdown envelope | Core services drain gracefully within 15s during SIGTERM windows. | Deadlines longer than shutdown window need caller-side retry/continuation logic. |
Implementation examples
Remaining-budget-aware retry (Go)
```go
import (
	"context"
	"math/rand"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func withDeadlineRetry(ctx context.Context, req *pb.PolicyCheckRequest, c pb.SafetyKernelClient) (*pb.PolicyCheckResponse, error) {
	attempts := 0
	for {
		attempts++
		resp, err := c.Check(ctx, req)
		if err == nil {
			return resp, nil
		}
		st, ok := status.FromError(err)
		if !ok {
			return nil, err // not a gRPC status error; never retry
		}
		if st.Code() != codes.Unavailable && st.Code() != codes.DeadlineExceeded {
			return nil, err // only transient transport codes are retryable
		}
		dl, hasDL := ctx.Deadline()
		if !hasDL || attempts >= 2 {
			return nil, err // no deadline or attempt cap reached: fail fast
		}
		remaining := time.Until(dl)
		if remaining < 220*time.Millisecond { // budget floor for one safe retry
			return nil, err
		}
		jitter := time.Duration(rand.Int63n(int64(40 * time.Millisecond)))
		time.Sleep(80*time.Millisecond + jitter)
	}
}
```

Deadline drift runbook
```shell
# Check timeout-related errors during rollout
kubectl logs deploy/cordum-api-gateway -n cordum | grep -E "DEADLINE_EXCEEDED|UNAVAILABLE|CANCELLED"
kubectl logs deploy/cordum-scheduler -n cordum | grep -E "safety check timeout|storeOpTimeout|retry"

# Trigger controlled rollout
kubectl rollout restart deployment/cordum-api-gateway -n cordum
kubectl rollout status deployment/cordum-api-gateway -n cordum

# Confirm caller budgets and retry counts via metrics/logs
curl -s http://localhost:9092/metrics | grep -E "grpc|timeout|retry"
```
Limitations and tradeoffs
- Larger deadlines improve success under tail latency but can hide slow dependency regressions.
- Smaller deadlines protect upstream latency SLOs but increase timeout error rates during spikes.
- Retry reserves improve resilience but consume budget that could be used by primary execution.
- Strict fail-fast logic lowers blast radius but may reject recoverable transient calls.
If you cannot explain where each millisecond goes, your deadline policy is probably guessing in production.
Next step
Run this in one sprint:
1. Trace one critical request path and list every hop with its p95 and p99 latency.
2. Assign deadline shares explicitly and codify them in code comments and config.
3. Add a guard that blocks retries when the remaining deadline drops below the safety floor.
4. Run a rollout drill and confirm timeout/retry counters stay inside the error budget.
Continue with AI Agent gRPC CANCELLED and UNAVAILABLE and AI Agent Lock TTL Tuning.