The production problem
Capacity planning failures in agent systems rarely look like one big crash. They show up as rising dispatch latency, retry churn, and deferred work that never catches up.
Most teams still size for average traffic and rely on autoscaling to save them. That works until retries, policy outages, or long-tail job durations break the scaling signal.
You need an explicit model that links worker count to reliability metrics and recovery behavior.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Google SRE Book: Demand forecasting and capacity planning | Clear requirements for demand forecasting, load testing, and provisioning ownership. | No guidance for policy-gated autonomous workflows with retries and replay behavior. |
| Google SRE Workbook: Data processing capacity planning | Concrete example: provision around 50% CPU at peak and beware runaway autoscaling. | No generic model for agent pipeline stages (dispatch, policy, output checks). |
| AWS Well-Architected Analytics Lens BP 11.2 | Practical right-sizing and autoscaling guidance for predictable and spiky workloads. | No reliability budgeting link between scaling behavior and autonomous side-effect safety. |
Capacity model
Use queueing math as the baseline, then add headroom for retries and safety paths. Do not jump straight to autoscaler tuning.
| Model layer | Formula | Target value | Why it matters |
|---|---|---|---|
| Ingress rate | jobs_per_second (lambda) | Use p95 traffic, not daily average | Average hides burst pressure. |
| Service time | avg_execution_seconds (W) | Use p90 service time for conservative sizing | Long-tail jobs distort capacity fast. |
| Worker count | ceil((lambda * W) / target_utilization) | Target utilization 0.60-0.75 | Lower target gives burst headroom. |
| Retry overhead | base_workers * retry_multiplier | Start with 1.10 to 1.30 multiplier | Retry storms are real capacity demand. |
Sample worker sizing outcomes:
| Ingress rate | Service time | Utilization target | Required workers |
|---|---|---|---|
| 80 jobs/s | 0.35s | 0.65 | 44 |
| 120 jobs/s | 0.40s | 0.65 | 74 |
| 200 jobs/s | 0.55s | 0.70 | 158 |
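The sample rows above follow directly from the worker-count formula; note that they exclude the retry multiplier, which the sizing helper in the implementation section applies on top. A minimal check:

```typescript
// Base worker count: ceil((lambda * W) / target_utilization).
// Reproduces the sample table rows; retry headroom is not included here.
function baseWorkers(lambda: number, w: number, utilization: number): number {
  return Math.ceil((lambda * w) / utilization);
}

console.log(baseWorkers(80, 0.35, 0.65)); // 44
console.log(baseWorkers(120, 0.4, 0.65)); // 74
console.log(baseWorkers(200, 0.55, 0.7)); // 158
```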
Cordum runtime implications
| Implication | Current behavior | Why it matters |
|---|---|---|
| Failure-rate guardrail | Existing alert threshold uses failed ratio > 10% over 5m | Capacity decisions should reduce this sustained risk signal, not only reduce queue depth. |
| Latency guardrail | Dispatch p99 warning threshold is > 1s | A useful early signal that worker pools are under-provisioned. |
| Retry budget pressure | Max scheduling retries = 50, backoff 1s-30s, `retryDelayNoWorkers` = 2s | Retry mechanics directly affect effective throughput and backlog shape. |
| Policy dependency capacity impact | `POLICY_CHECK_FAIL_MODE=closed` defaults to requeue on policy outage | Policy outages can consume capacity through safe requeue loops. |
| Recovery debt tracking | `cordum_scheduler_stale_jobs` + `cordum_scheduler_orphan_replayed_total` | Capacity planning should include post-incident recovery window, not only steady-state traffic. |
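The retry budget in the table above translates into a concrete per-job delay ceiling. A minimal sketch: only the 50-retry cap and the 1s-30s bounds come from the table; the doubling schedule is an assumption for illustration.

```typescript
// Worst-case cumulative backoff time for a single job, assuming the
// delay doubles from baseSeconds until it hits capSeconds (assumed
// schedule; the table only specifies max retries = 50 and 1s-30s bounds).
function worstCaseRetrySeconds(
  maxRetries: number,
  baseSeconds: number,
  capSeconds: number,
): number {
  let total = 0;
  let delay = baseSeconds;
  for (let i = 0; i < maxRetries; i++) {
    total += delay;
    delay = Math.min(delay * 2, capSeconds);
  }
  return total;
}

console.log(worstCaseRetrySeconds(50, 1, 30)); // 1381 seconds, about 23 minutes
```

Under these assumptions, one job can sit in backoff for roughly 23 minutes, which is why the recovery window deserves explicit capacity, not just steady-state traffic.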
Implementation examples
Worker sizing helper (TypeScript)
```typescript
type SizingInput = {
  ingressPerSecond: number; // lambda
  avgServiceSeconds: number; // W
  targetUtilization: number; // e.g. 0.65
  retryMultiplier?: number; // e.g. 1.2
};

export function requiredWorkers(input: SizingInput): number {
  const base = (input.ingressPerSecond * input.avgServiceSeconds) / input.targetUtilization;
  const retries = input.retryMultiplier ?? 1.0;
  return Math.ceil(base * retries);
}

// Example:
// 120 jobs/s * 0.40s / 0.65 = 73.8 -> 74 workers
// retry multiplier 1.2 -> 89 workers
```

Capacity planning policy config (YAML)
```yaml
capacity_planning:
  target_utilization: 0.65
  retry_multiplier: 1.2
  guardrails:
    dispatch_p99_seconds_warn: 1
    failed_ratio_5m_warn: 0.10
    stale_jobs_warn: 50
  policy_dependency:
    fail_mode: closed
    max_tolerated_safety_unavailable_rate_5m: 0.05
  headroom:
    minimum_spare_workers_percent: 20
    burst_window_minutes: 15
```

Core capacity validation queries (PromQL)
```promql
# Dispatch p99
histogram_quantile(0.99, rate(cordum_scheduler_dispatch_latency_seconds_bucket[5m]))

# Failed completion ratio
rate(cordum_jobs_completed_total{status="failed"}[5m])
  / clamp_min(rate(cordum_jobs_completed_total[5m]), 0.001)

# Safety dependency degradation
rate(cordum_safety_unavailable_total[5m])

# Recovery debt
cordum_scheduler_stale_jobs
rate(cordum_scheduler_orphan_replayed_total[5m])
```

Limitations and tradeoffs
- Simple sizing formulas assume stationarity; real workloads can shift faster than planning windows.
- Conservative utilization targets increase reliability but can reduce cost efficiency.
- Retry multipliers are rough estimates until measured under incident-like conditions.
- Autoscaling can still overshoot when its metric no longer tracks useful work.
Next step
Run this in one sprint:
1. Baseline p95 ingress and p90 service time for your top three topics.
2. Compute worker targets with utilization 0.65 and retry multiplier 1.2.
3. Add guardrails for dispatch p99, failed ratio, and stale jobs.
4. Validate the plan with one controlled load test and one dependency-degradation drill.
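Steps 1 and 2 can be sketched with the sizing formula from the capacity model. The topic names and baseline numbers below are illustrative placeholders, not Cordum defaults:

```typescript
// Per-topic worker targets from measured p95 ingress and p90 service time.
// All topic names and numbers are hypothetical examples.
type TopicBaseline = {
  topic: string;
  p95IngressPerSecond: number;
  p90ServiceSeconds: number;
};

const TARGET_UTILIZATION = 0.65;
const RETRY_MULTIPLIER = 1.2;

function workerTarget(b: TopicBaseline): number {
  const base =
    (b.p95IngressPerSecond * b.p90ServiceSeconds) / TARGET_UTILIZATION;
  return Math.ceil(base * RETRY_MULTIPLIER);
}

const baselines: TopicBaseline[] = [
  { topic: "summarize", p95IngressPerSecond: 120, p90ServiceSeconds: 0.4 },
  { topic: "classify", p95IngressPerSecond: 80, p90ServiceSeconds: 0.35 },
  { topic: "enrich", p95IngressPerSecond: 200, p90ServiceSeconds: 0.55 },
];

for (const b of baselines) {
  console.log(`${b.topic}: ${workerTarget(b)} workers`);
}
```

Feed the resulting targets into the guardrail config from step 3, then confirm them under the load test and degradation drill in step 4.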
Continue with AI Agent Chaos Engineering Playbook and AI Agent Backpressure and Queue Drain Strategy.