The production problem
Many agent teams define one availability SLO, wire one alert, and call it done. Incident reviews then show a different story: policy outages, retry storms, and stale queue debt all mixed into one number.
That makes the error budget useless. You cannot decide whether to freeze releases, tune policies, or add scheduler capacity.
A practical agent SLO model needs separate budgets for reliability failure, governance dependency degradation, and latency.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Google SRE Workbook: Error Budget Policy | Defines release policy based on budget spend, with clear operational consequences. | No agent-specific split between policy denials, dependency outages, and dispatch failures. |
| Google SRE Workbook: Alerting on SLOs | Excellent burn-rate math, including 14.4x and 6x paging thresholds. | No mapping to queue-driven control planes with scheduler, replay, and policy-gate metrics. |
| Atlassian: What is an error budget? | Clear downtime math for common uptime objectives. | Stops at uptime framing; does not cover autonomous execution side effects or policy outcomes. |
SLO and budget model
For autonomous agents, treat budgets as operational decision tools. One budget per failure family keeps paging, release policy, and remediation ownership clear.
| Budget | Target | What burns it | Primary owner |
|---|---|---|---|
| Reliability budget | 99.9% over 30d | Execution failures and timeouts only | On-call SRE / platform team |
| Governance availability budget | 99.95% over 30d | Policy service unavailable (`safety_unavailable`) events | Safety kernel + platform team |
| Latency budget | p99 dispatch < 1s | Sustained p99 dispatch latency threshold breaches | Scheduler owners |
| Replay debt budget | Near-zero orphan replay trend | Growing orphan replay counts and stale jobs backlog | Operations + incident response |
Burn-rate alerting should follow multi-window guidance: page on fast burn, ticket on slow sustained burn. For a 99.9% SLO, 14.4x and 6x thresholds are practical starting points.
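To make those multipliers concrete, here is a small arithmetic sketch (plain math, no Cordum specifics) of what 14.4x and 6x mean for a 99.9% objective over a 30-day window:

```python
# Burn-rate arithmetic for a 99.9% SLO over a 30-day rolling window.
SLO_TARGET = 0.999
WINDOW_HOURS = 30 * 24            # 720 hours in the SLO window
ERROR_BUDGET = 1 - SLO_TARGET     # 0.1% allowed failure ratio

def budget_spent(burn_rate: float, hours: float) -> float:
    """Fraction of the total error budget consumed at a constant burn rate."""
    return burn_rate * hours / WINDOW_HOURS

def hours_to_exhaustion(burn_rate: float) -> float:
    """Hours until the whole budget is gone at a constant burn rate."""
    return WINDOW_HOURS / burn_rate

print(round(budget_spent(14.4, 1), 4))      # 14.4x for 1h burns 2% of the budget
print(round(budget_spent(6, 6), 4))         # 6x for 6h burns 5% of the budget
print(round(hours_to_exhaustion(14.4), 1))  # at 14.4x, the full budget is gone in 50h
```

This is why 14.4x pages: left alone, it drains the entire 30-day budget in about two days.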
Cordum metric mapping
| Signal | Metric or setting | Why it matters |

|---|---|---|
| Primary denominator | `cordum_jobs_completed_total` | Completed jobs give a stable denominator for failure-ratio based reliability SLOs. |
| Governance outage signal | `cordum_safety_unavailable_total` | Separates policy dependency outages from business-level execution failure. |
| Latency SLO signal | `cordum_scheduler_dispatch_latency_seconds` (p99 alert threshold: > 1s) | Captures scheduler health before users see major queue delay. |
| Replay debt signal | `cordum_scheduler_orphan_replayed_total` and `cordum_scheduler_stale_jobs` | Shows hidden reliability debt after partitions or scheduler stalls. |
| Retry boundary context | Max scheduling retries 50, exponential backoff 1s-30s, `retryDelayNoWorkers` 2s | Defines how quickly failures can amplify burn during sustained incidents. |
Default scheduler behavior also affects budget burn shape: retry attempts can reach 50 with exponential backoff from 1s to 30s, and no-worker retries pause 2s between attempts.
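As a sketch of how those defaults bound amplification (assuming doubling backoff between the documented 1s floor and 30s cap; only the bounds and the retry count come from the text, the growth curve is an assumption):

```python
# Worst-case retry timeline for one failing job under the stated scheduler
# defaults: up to 50 retries, exponential backoff bounded by 1s..30s.
# The doubling growth curve between the bounds is an assumption.
def backoff_schedule(base: float = 1.0, cap: float = 30.0,
                     max_retries: int = 50) -> list[float]:
    delays, delay = [], base
    for _ in range(max_retries):
        delays.append(min(delay, cap))
        delay *= 2
    return delays

delays = backoff_schedule()
print(len(delays))   # 50 retry attempts
print(sum(delays))   # 1381.0 seconds: one job can keep failing for ~23 minutes
```

During a sustained dependency outage, each stuck job can therefore keep contributing failures for roughly 23 minutes before exhausting retries, which shapes how quickly the short burn-rate windows fill.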
Implementation examples
Prometheus burn-rate rules (YAML)
```yaml
groups:
  - name: cordum-slo-burn
    rules:
      - alert: CordumReliabilityBurnRatePage
        expr: |
          (
            (
              rate(cordum_jobs_completed_total{status="failed"}[1h])
              / clamp_min(rate(cordum_jobs_completed_total[1h]), 0.001)
            ) > (14.4 * 0.001)
            and
            (
              rate(cordum_jobs_completed_total{status="failed"}[5m])
              / clamp_min(rate(cordum_jobs_completed_total[5m]), 0.001)
            ) > (14.4 * 0.001)
          )
          or
          (
            (
              rate(cordum_jobs_completed_total{status="failed"}[6h])
              / clamp_min(rate(cordum_jobs_completed_total[6h]), 0.001)
            ) > (6 * 0.001)
            and
            (
              rate(cordum_jobs_completed_total{status="failed"}[30m])
              / clamp_min(rate(cordum_jobs_completed_total[30m]), 0.001)
            ) > (6 * 0.001)
          )
        labels:
          severity: page
        annotations:
          summary: "Reliability budget burn is too fast"
          description: "30-day 99.9% SLO burn-rate page condition matched."
      - alert: CordumReliabilityBurnRateTicket
        expr: |
          (
            (
              rate(cordum_jobs_completed_total{status="failed"}[3d])
              / clamp_min(rate(cordum_jobs_completed_total[3d]), 0.001)
            ) > (1 * 0.001)
            and
            (
              rate(cordum_jobs_completed_total{status="failed"}[6h])
              / clamp_min(rate(cordum_jobs_completed_total[6h]), 0.001)
            ) > (1 * 0.001)
          )
        labels:
          severity: ticket
        annotations:
          summary: "Reliability budget is draining steadily"
          description: "Sustained 3-day burn; open reliability debt ticket."
```

Release gate script by burn-rate (Bash)
```bash
#!/usr/bin/env bash
set -euo pipefail

PROM_URL="${PROM_URL:-http://prometheus:9090}"

# Burn rate = observed 6h failure ratio divided by the 0.1% error budget.
QUERY='(
  rate(cordum_jobs_completed_total{status="failed"}[6h])
  / clamp_min(rate(cordum_jobs_completed_total[6h]), 0.001)
) / 0.001'

BURN_RATE=$(curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" \
  | jq -r '.data.result[0].value[1] // "0"')

echo "current_6h_burn_rate=$BURN_RATE"

# Nonzero exit codes let CI pipelines block or restrict deploys directly.
if awk "BEGIN {exit !($BURN_RATE >= 6)}"; then
  echo "release_mode=freeze"
  exit 2
fi
if awk "BEGIN {exit !($BURN_RATE >= 2)}"; then
  echo "release_mode=restricted"
  exit 1
fi
echo "release_mode=normal"
```

OpenSLO object for reliability budget (YAML)
```yaml
apiVersion: openslo/v1
kind: SLO
metadata:
  name: cordum-control-plane-reliability
spec:
  service: cordum-control-plane
  indicator:
    ratioMetric:
      counter: true
      good:
        metricSource:
          type: Prometheus
          spec:
            query: sum(rate(cordum_jobs_completed_total{status!="failed"}[5m]))
      total:
        metricSource:
          type: Prometheus
          spec:
            query: sum(rate(cordum_jobs_completed_total[5m]))
  objectives:
    - displayName: "30-day reliability"
      target: 0.999
  timeWindow:
    - duration: 30d
  alertPolicies:
    - kind: burnrate
      name: fast-burn
      threshold: "14.4x in 1h + 5m confirmation"
```

Limitations and tradeoffs
- Low-traffic services can page too often with ratio-based burn alerts.
- Too many SLOs can dilute ownership and slow incident response.
- Excluding policy denials from the reliability budget can hide policy quality issues if you do not track them separately.
- Burn-rate math is simple; keeping alert routes and suppression clean is the hard part.
Next step
Run this in one sprint:
1. Define one 30-day reliability SLO and one 30-day governance-availability SLO.
2. Add 14.4x and 6x burn-rate page alerts plus a 3-day ticket alert.
3. Wire release policy states (`normal`, `restricted`, `freeze`) to burn thresholds.
4. Review two incidents and confirm each budget points to the right owner.
Continue with *AI Agent Backpressure and Queue Drain Strategy* and *AI Agent Priority Queues and Fair Scheduling*.