The production problem
Capacity planning failures in agent systems rarely look like one big crash. They show up as rising dispatch latency, retry churn, and deferred work that never catches up.
Most teams still size for average traffic and rely on autoscaling to save them. That works until retries, policy outages, or long-tail job durations break the scaling signal.
You need an explicit model that links worker count to reliability metrics and recovery behavior.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Google SRE Book: Demand forecasting and capacity planning | Clear requirements for demand forecasting, load testing, and provisioning ownership. | No guidance for policy-gated autonomous workflows with retries and replay behavior. |
| Google SRE Workbook: Data processing capacity planning | Concrete example: provision around 50% CPU at peak and beware runaway autoscaling. | No generic model for agent pipeline stages (dispatch, policy, output checks). |
| AWS Well-Architected Analytics Lens BP 11.2 | Practical right-sizing and autoscaling guidance for predictable and spiky workloads. | No reliability budgeting link between scaling behavior and autonomous side-effect safety. |
Capacity model
Use queueing math as the baseline, then add headroom for retries and safety paths. Do not jump straight to autoscaler tuning.
| Model layer | Formula | Target value | Why it matters |
|---|---|---|---|
| Ingress rate | jobs_per_second (lambda) | Use p95 traffic, not daily average | Average hides burst pressure. |
| Service time | avg_execution_seconds (W) | Use p90 service time for conservative sizing | Long-tail jobs distort capacity fast. |
| Worker count | ceil((lambda * W) / target_utilization) | Target utilization 0.60-0.75 | Lower target gives burst headroom. |
| Retry overhead | base_workers * retry_multiplier | Start with 1.10 to 1.30 multiplier | Retry storms are real capacity demand. |
Sample worker sizing outcomes:
| Ingress rate | Service time | Utilization target | Required workers |
|---|---|---|---|
| 80 jobs/s | 0.35s | 0.65 | 44 |
| 120 jobs/s | 0.40s | 0.65 | 74 |
| 200 jobs/s | 0.55s | 0.70 | 158 |
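The sample rows above follow directly from the worker-count formula; note that they exclude the retry multiplier, which the sizing helper in the implementation section applies on top. A minimal check:

```typescript
// Base worker count: ceil((lambda * W) / target_utilization).
// Reproduces the sample table rows; retry headroom is not included here.
function baseWorkers(lambda: number, w: number, utilization: number): number {
  return Math.ceil((lambda * w) / utilization);
}

console.log(baseWorkers(80, 0.35, 0.65)); // 44
console.log(baseWorkers(120, 0.4, 0.65)); // 74
console.log(baseWorkers(200, 0.55, 0.7)); // 158
```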
Cordum runtime implications
| Implication | Current behavior | Why it matters |
|---|---|---|
| Failure-rate guardrail | Existing alert threshold uses failed ratio > 10% over 5m | Capacity decisions should reduce this sustained risk signal, not only reduce queue depth. |
| Latency guardrail | Dispatch p99 warning threshold is > 1s | A useful early signal that worker pools are under-provisioned. |
| Retry budget pressure | Max scheduling retries = 50, backoff 1s-30s, `retryDelayNoWorkers` = 2s | Retry mechanics directly affect effective throughput and backlog shape. |
| Policy dependency capacity impact | `POLICY_CHECK_FAIL_MODE=closed` defaults to requeue on policy outage | Policy outages can consume capacity through safe requeue loops. |
| Recovery debt tracking | `cordum_scheduler_stale_jobs` + `cordum_scheduler_orphan_replayed_total` | Capacity planning should include post-incident recovery window, not only steady-state traffic. |
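The retry budget in the table above translates into a concrete per-job delay ceiling. A minimal sketch: only the 50-retry cap and the 1s-30s bounds come from the table; the doubling schedule is an assumption for illustration.

```typescript
// Worst-case cumulative backoff time for a single job, assuming the
// delay doubles from baseSeconds until it hits capSeconds (assumed
// schedule; the table only specifies max retries = 50 and 1s-30s bounds).
function worstCaseRetrySeconds(
  maxRetries: number,
  baseSeconds: number,
  capSeconds: number,
): number {
  let total = 0;
  let delay = baseSeconds;
  for (let i = 0; i < maxRetries; i++) {
    total += delay;
    delay = Math.min(delay * 2, capSeconds);
  }
  return total;
}

console.log(worstCaseRetrySeconds(50, 1, 30)); // 1381 seconds, about 23 minutes
```

Under these assumptions, one job can sit in backoff for roughly 23 minutes, which is why the recovery window deserves explicit capacity, not just steady-state traffic.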
Implementation examples
Worker sizing helper (TypeScript)
```typescript
type SizingInput = {
  ingressPerSecond: number; // lambda
  avgServiceSeconds: number; // W
  targetUtilization: number; // e.g. 0.65
  retryMultiplier?: number; // e.g. 1.2
};

export function requiredWorkers(input: SizingInput): number {
  const base = (input.ingressPerSecond * input.avgServiceSeconds) / input.targetUtilization;
  const retries = input.retryMultiplier ?? 1.0;
  return Math.ceil(base * retries);
}

// Example:
// 120 jobs/s * 0.40s / 0.65 = 73.8 -> 74 workers
// retry multiplier 1.2 -> 89 workers
```

Capacity planning policy config (YAML)
```yaml
capacity_planning:
  target_utilization: 0.65
  retry_multiplier: 1.2
  guardrails:
    dispatch_p99_seconds_warn: 1
    failed_ratio_5m_warn: 0.10
    stale_jobs_warn: 50
  policy_dependency:
    fail_mode: closed
    max_tolerated_safety_unavailable_rate_5m: 0.05
  headroom:
    minimum_spare_workers_percent: 20
    burst_window_minutes: 15
```

Core capacity validation queries (PromQL)
```promql
# Dispatch p99
histogram_quantile(0.99, rate(cordum_scheduler_dispatch_latency_seconds_bucket[5m]))

# Failed completion ratio
rate(cordum_jobs_completed_total{status="failed"}[5m])
  / clamp_min(rate(cordum_jobs_completed_total[5m]), 0.001)

# Safety dependency degradation
rate(cordum_safety_unavailable_total[5m])

# Recovery debt
cordum_scheduler_stale_jobs
rate(cordum_scheduler_orphan_replayed_total[5m])
```

Limitations and tradeoffs
- Simple sizing formulas assume stationarity; real workloads can shift faster than planning windows.
- Conservative utilization targets increase reliability but can reduce cost efficiency.
- Retry multipliers are rough estimates until measured under incident-like conditions.
- Autoscaling can still overshoot when its metric no longer tracks useful work.
Next step
Run this in one sprint:
1. Baseline p95 ingress and p90 service time for your top three topics.
2. Compute worker targets with utilization 0.65 and retry multiplier 1.2.
3. Add guardrails for dispatch p99, failed ratio, and stale jobs.
4. Validate the plan with one controlled load test and one dependency-degradation drill.
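Steps 1 and 2 can be sketched with the sizing formula from the capacity model. The topic names and baseline numbers below are illustrative placeholders, not Cordum defaults:

```typescript
// Per-topic worker targets from measured p95 ingress and p90 service time.
// All topic names and numbers are hypothetical examples.
type TopicBaseline = {
  topic: string;
  p95IngressPerSecond: number;
  p90ServiceSeconds: number;
};

const TARGET_UTILIZATION = 0.65;
const RETRY_MULTIPLIER = 1.2;

function workerTarget(b: TopicBaseline): number {
  const base =
    (b.p95IngressPerSecond * b.p90ServiceSeconds) / TARGET_UTILIZATION;
  return Math.ceil(base * RETRY_MULTIPLIER);
}

const baselines: TopicBaseline[] = [
  { topic: "summarize", p95IngressPerSecond: 120, p90ServiceSeconds: 0.4 },
  { topic: "classify", p95IngressPerSecond: 80, p90ServiceSeconds: 0.35 },
  { topic: "enrich", p95IngressPerSecond: 200, p90ServiceSeconds: 0.55 },
];

for (const b of baselines) {
  console.log(`${b.topic}: ${workerTarget(b)} workers`);
}
```

Feed the resulting targets into the guardrail config from step 3, then confirm them under the load test and degradation drill in step 4.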
Continue with AI Agent Chaos Engineering Playbook and AI Agent Backpressure and Queue Drain Strategy.