The production problem
One caller sets a 1-second deadline. Downstream services each use their own default timeout. Retries kick in without checking remaining budget. That is how a single slow hop becomes a full-chain timeout event.
Most teams notice this only after rollout or traffic spikes because median latency looks fine. Tail latency quietly consumes the budget.
Deadline strategy has to be explicit across the call graph.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| gRPC deadlines guide | Deadline propagation concepts and why callers should set deadlines. | No concrete budget split strategy for multi-hop AI control planes. |
| gRPC retry guide | Retry primitives and policy controls. | No production method for retry gating on remaining deadline budget. |
| gRPC status codes | Meaning of `DEADLINE_EXCEEDED`, `UNAVAILABLE`, and related outcomes. | No operation-level decision matrix combining status code + budget state. |
The gap is budget allocation discipline: how much time each hop gets and when retries are still legal.
Deadline budget math
```
# End-to-end deadline budget example
# Caller deadline: 1200ms
ingress_budget  = 480ms   # 40%
core_rpc_budget = 540ms   # 45%
retry_reserve   = 120ms   # 10%
response_tail   = 60ms    # 5%

# Retry rule:
# Only retry if remaining_deadline > core_rpc_min + jitter_reserve
# Example threshold: remaining_deadline > 220ms
```
| Stage | Budget share | Purpose | Failure if mis-sized |
|---|---|---|---|
| Ingress handler | 40% | Auth, validation, routing, initial policy checks. | Slow preflight burns downstream budget before core work starts. |
| Core dependency RPC | 45% | Main business call (for example policy check, context fetch, write path). | Insufficient budget forces immediate `DEADLINE_EXCEEDED` under normal latency tail. |
| Retry reserve | 10% | One bounded retry with jitter for transient transport failures. | No reserve means retries violate caller deadline and worsen load. |
| Response/cleanup | 5% | Marshal response, metrics flush, and final state transition. | No tail reserve causes success-path calls to timeout at response edge. |
Cordum timeout baseline
These values are live in the current codebase and are useful anchors for budget design.
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Scheduler store operations | `storeOpTimeout = 2s` for many lock/store interactions in scheduler engine. | Keeps internal lock/store calls bounded under contention. |
| Scheduler safety checks | `safetyCheckTimeout = 3s` for pre-dispatch policy evaluation path. | Prevents long policy stalls from blocking scheduler worker loops. |
| Workflow handler budget | Workflow result handling uses a 30s handler context timeout. | Longer path budget for workflow step completion and state updates. |
| Service shutdown envelope | Core services drain gracefully within 15s during SIGTERM windows. | Deadlines longer than shutdown window need caller-side retry/continuation logic. |
Implementation examples
Remaining-budget-aware retry (Go)
```go
import (
	"context"
	"math/rand"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func withDeadlineRetry(ctx context.Context, req *pb.PolicyCheckRequest, c pb.SafetyKernelClient) (*pb.PolicyCheckResponse, error) {
	attempts := 0
	for {
		attempts++
		resp, err := c.Check(ctx, req)
		if err == nil {
			return resp, nil
		}
		st, ok := status.FromError(err)
		if !ok {
			return nil, err // not a gRPC status: never retry
		}
		// Only transport-level transients are retry candidates.
		if st.Code() != codes.Unavailable && st.Code() != codes.DeadlineExceeded {
			return nil, err
		}
		// Retry is only legal with a known deadline and at most one prior attempt.
		dl, hasDL := ctx.Deadline()
		if !hasDL || attempts >= 2 {
			return nil, err
		}
		remaining := time.Until(dl)
		if remaining < 220*time.Millisecond { // budget floor for one safe retry
			return nil, err
		}
		jitter := time.Duration(rand.Int63n(int64(40 * time.Millisecond)))
		time.Sleep(80*time.Millisecond + jitter)
	}
}
```

Deadline drift runbook
```shell
# Check timeout-related errors during rollout
kubectl logs deploy/cordum-api-gateway -n cordum | grep -E "DEADLINE_EXCEEDED|UNAVAILABLE|CANCELLED"
kubectl logs deploy/cordum-scheduler -n cordum | grep -E "safety check timeout|storeOpTimeout|retry"

# Trigger controlled rollout
kubectl rollout restart deployment/cordum-api-gateway -n cordum
kubectl rollout status deployment/cordum-api-gateway -n cordum

# Confirm caller budgets and retry counts via metrics/logs
curl -s http://localhost:9092/metrics | grep -E "grpc|timeout|retry"
```
Limitations and tradeoffs
- Larger deadlines improve success under tail latency but can hide slow dependency regressions.
- Smaller deadlines protect upstream latency SLOs but increase timeout error rate during spikes.
- Retry reserves improve resilience but consume budget that could be used by primary execution.
- Strict fail-fast logic lowers blast radius but may reject recoverable transient calls.
If you cannot explain where each millisecond goes, your deadline policy is probably guessing in production.
Next step
Run this in one sprint:
1. Trace one critical request path and list every hop with p95 and p99 latency.
2. Assign deadline shares explicitly and codify them in code comments and config.
3. Add a guard that blocks retries when the remaining deadline drops below the safety floor.
4. Run a rollout drill and confirm timeout/retry counters stay inside the error budget.
Continue with *AI Agent gRPC CANCELLED and UNAVAILABLE* and *AI Agent Lock TTL Tuning*.