
AI Agent SLOs and Error Budgets

Set reliability targets that reflect how autonomous systems actually fail.

Guide · 12 min read · Mar 2026
TL;DR
  • Most teams burn error budget on the wrong numerator.
  • Policy denials and reliability failures should not share one budget.
  • Multi-window burn-rate paging catches incidents faster than static percentage alerts.
  • Use an explicit budget policy to decide when deploys slow down or freeze.
Budget separation

Track reliability burn separately from governance and latency regressions.

Burn-rate paging

Use 1h and 6h windows for pages, then 3d for ticket-level reliability debt.

Operational realism

Map SLOs to the metrics and thresholds your scheduler already emits.

Scope

This guide is for teams running autonomous agents where retries, policy checks, and distributed queueing can distort naive uptime-style SLOs.

The production problem

Many agent teams define one availability SLO, wire one alert, and call it done. Incident reviews then show a different story: policy outages, retry storms, and stale queue debt all mixed into one number.

That makes the error budget useless. You cannot decide whether to freeze releases, tune policies, or add scheduler capacity.

A practical agent SLO model needs separate budgets for reliability failure, governance dependency degradation, and latency.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Google SRE Workbook: Error Budget Policy | Defines release policy based on budget spend, with clear operational consequences. | No agent-specific split between policy denials, dependency outages, and dispatch failures. |
| Google SRE Workbook: Alerting on SLOs | Excellent burn-rate math, including 14.4x and 6x paging thresholds. | No mapping to queue-driven control planes with scheduler, replay, and policy-gate metrics. |
| Atlassian: What is an error budget? | Clear downtime math for common uptime objectives. | Stops at uptime framing; does not cover autonomous execution side effects or policy outcomes. |

SLO and budget model

For autonomous agents, treat budgets as operational decision tools. One budget per failure family keeps paging, release policy, and remediation ownership clear.

| Budget | Target | What burns it | Primary owner |
| --- | --- | --- | --- |
| Reliability budget | 99.9% over 30d | Execution failures and timeouts only | On-call SRE / platform team |
| Governance availability budget | 99.95% over 30d | Policy service unavailable (`safety_unavailable`) events | Safety kernel + platform team |
| Latency budget | p99 dispatch < 1s | Sustained p99 dispatch latency threshold breaches | Scheduler owners |
| Replay debt budget | Near-zero orphan replay trend | Growing orphan replay counts and stale jobs backlog | Operations + incident response |
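To make the percentage targets above concrete, it helps to translate them into an allowance of bad events over the window. A minimal sketch, assuming an illustrative volume of one million completed jobs per 30 days (not a Cordum default):

```python
# Sketch: turn a window SLO into a concrete allowance of bad events.
# The monthly job volume below is an assumed example for illustration.

def error_budget(slo: float, total_events: int) -> int:
    """Number of bad events the budget tolerates over the SLO window."""
    return round((1 - slo) * total_events)

monthly_jobs = 1_000_000  # assumed traffic, not a real measurement

# 99.9% reliability target -> 1,000 failed jobs allowed per 30 days
print("failed jobs allowed:", error_budget(0.999, monthly_jobs))

# 99.95% governance-availability target -> 500 safety_unavailable events
print("safety_unavailable events allowed:", error_budget(0.9995, monthly_jobs))
```

Framing budgets as event counts makes the ownership column actionable: each team can see how many of "their" events remain before policy consequences kick in.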

Burn-rate alerting should follow multi-window guidance: page on fast burn, ticket on slow sustained burn. For a 99.9% SLO, 14.4x and 6x thresholds are practical starting points.
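The 14.4x and 6x numbers are not arbitrary: they correspond to fixed fractions of the 30-day budget consumed within the alert window, following the Google SRE Workbook derivation. A quick sketch of the arithmetic:

```python
# Sketch: why 14.4x and 6x are the standard multi-window thresholds
# for a 30-day SLO window (per the Google SRE Workbook approach).

WINDOW_HOURS = 30 * 24  # 720 hours in the 30-day SLO window

def budget_consumed(burn_rate: float, alert_window_hours: float) -> float:
    """Fraction of the whole 30-day error budget consumed if this burn
    rate is sustained for the full alert window."""
    return burn_rate * alert_window_hours / WINDOW_HOURS

# 14.4x sustained for 1h consumes ~2% of the monthly budget -> page
print(budget_consumed(14.4, 1))
# 6x sustained for 6h consumes ~5% of the monthly budget -> page
print(budget_consumed(6, 6))
# 1x sustained for 3d consumes ~10% of the monthly budget -> ticket
print(budget_consumed(1, 72))
```

The shorter confirmation windows (5m and 30m) exist only to shorten alert reset time; they do not change the budget math.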

Cordum metric mapping

| Signal | Metric or setting | Why it matters |
| --- | --- | --- |
| Primary denominator | `cordum_jobs_completed_total` | Completed jobs give a stable denominator for failure-ratio-based reliability SLOs. |
| Governance outage signal | `cordum_safety_unavailable_total` | Separates policy dependency outages from business-level execution failure. |
| Latency SLO signal | `cordum_scheduler_dispatch_latency_seconds` (p99 alert threshold: > 1s) | Captures scheduler health before users see major queue delay. |
| Replay debt signal | `cordum_scheduler_orphan_replayed_total` and `cordum_scheduler_stale_jobs` | Shows hidden reliability debt after partitions or scheduler stalls. |
| Retry boundary context | Max scheduling retries 50, exponential backoff 1s-30s, `retryDelayNoWorkers` 2s | Defines how quickly failures can amplify burn during sustained incidents. |

Default scheduler behavior also affects budget burn shape: retry attempts can reach 50 with exponential backoff from 1s to 30s, and no-worker retries pause 2s between attempts.
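A rough sketch of that burn shape, assuming the backoff doubles from the 1s base up to the 30s cap (the doubling factor is an assumption; the scheduler may use a different base):

```python
# Sketch: worst-case time one job can spend in scheduling retries under
# the defaults above (50 attempts, exponential backoff from 1s capped at
# 30s). The doubling factor is an assumed detail, not a confirmed default.

MAX_RETRIES = 50
BASE_DELAY_S = 1.0
MAX_DELAY_S = 30.0

def worst_case_retry_seconds() -> float:
    total, delay = 0.0, BASE_DELAY_S
    for _ in range(MAX_RETRIES):
        total += delay
        delay = min(delay * 2, MAX_DELAY_S)  # cap reached after a few attempts
    return total

# Backoff caps quickly, so a stuck job applies near-constant retry
# pressure for roughly 23 minutes before exhausting its attempts.
print(worst_case_retry_seconds())
```

The practical consequence: during a sustained dependency outage, retry traffic plateaus rather than backing off indefinitely, so burn rate stays high until the incident is resolved or retries are exhausted.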

Implementation examples

Prometheus burn-rate rules (YAML)

cordum-slo-burn-rules.yaml
YAML
groups:
  - name: cordum-slo-burn
    rules:
      - alert: CordumReliabilityBurnRatePage
        expr: |
          (
            (
              rate(cordum_jobs_completed_total{status="failed"}[1h])
              / clamp_min(rate(cordum_jobs_completed_total[1h]), 0.001)
            ) > (14.4 * 0.001)
            and
            (
              rate(cordum_jobs_completed_total{status="failed"}[5m])
              / clamp_min(rate(cordum_jobs_completed_total[5m]), 0.001)
            ) > (14.4 * 0.001)
          )
          or
          (
            (
              rate(cordum_jobs_completed_total{status="failed"}[6h])
              / clamp_min(rate(cordum_jobs_completed_total[6h]), 0.001)
            ) > (6 * 0.001)
            and
            (
              rate(cordum_jobs_completed_total{status="failed"}[30m])
              / clamp_min(rate(cordum_jobs_completed_total[30m]), 0.001)
            ) > (6 * 0.001)
          )
        labels:
          severity: page
        annotations:
          summary: "Reliability budget burn is too fast"
          description: "30-day 99.9% SLO burn-rate page condition matched."

      - alert: CordumReliabilityBurnRateTicket
        expr: |
          (
            (
              rate(cordum_jobs_completed_total{status="failed"}[3d])
              / clamp_min(rate(cordum_jobs_completed_total[3d]), 0.001)
            ) > (1 * 0.001)
            and
            (
              rate(cordum_jobs_completed_total{status="failed"}[6h])
              / clamp_min(rate(cordum_jobs_completed_total[6h]), 0.001)
            ) > (1 * 0.001)
          )
        labels:
          severity: ticket
        annotations:
          summary: "Reliability budget is draining steadily"
          description: "Sustained 3-day burn; open reliability debt ticket."

Release gate script by burn-rate (Bash)

release-budget-gate.sh
Bash
#!/usr/bin/env bash
set -euo pipefail

PROM_URL="${PROM_URL:-http://prometheus:9090}"
QUERY='(
  rate(cordum_jobs_completed_total{status="failed"}[6h])
  / clamp_min(rate(cordum_jobs_completed_total[6h]), 0.001)
) / 0.001'

BURN_RATE=$(curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" \
  | jq -r '.data.result[0].value[1] // "0"')

echo "current_6h_burn_rate=$BURN_RATE"

if awk "BEGIN {exit !($BURN_RATE >= 6)}"; then
  echo "release_mode=freeze"
  exit 2
fi

if awk "BEGIN {exit !($BURN_RATE >= 2)}"; then
  echo "release_mode=restricted"
  exit 1
fi

echo "release_mode=normal"

OpenSLO object for reliability budget (YAML)

cordum-control-plane-slo.yaml
YAML
apiVersion: openslo/v1
kind: SLO
metadata:
  name: cordum-control-plane-reliability
spec:
  service: cordum-control-plane
  indicator:
    ratioMetric:
      counter: true
      good:
        metricSource:
          type: Prometheus
          spec:
            query: sum(rate(cordum_jobs_completed_total{status!="failed"}[5m]))
      total:
        metricSource:
          type: Prometheus
          spec:
            query: sum(rate(cordum_jobs_completed_total[5m]))
  objectives:
    - displayName: "30-day reliability"
      target: 0.999
      timeWindow:
        - duration: 30d
          isRolling: true
  alertPolicies:
    - kind: burnrate
      name: fast-burn
      threshold: "14.4x in 1h + 5m confirmation"

Limitations and tradeoffs

  • Low-traffic services can page too often with ratio-based burn alerts.
  • Too many SLOs can dilute ownership and slow incident response.
  • Excluding policy denials from the reliability budget can hide policy quality issues if you do not track them separately.
  • Burn-rate math is simple; keeping alert routes and suppression clean is the hard part.
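The low-traffic caveat is easy to see with numbers. A minimal sketch with illustrative traffic levels, using the 14.4x page threshold from earlier:

```python
# Sketch: why ratio-based burn alerts over-page at low traffic. With only
# a handful of jobs in the 5m confirmation window, a single failure pushes
# the failure ratio far past the page threshold. Traffic numbers are
# illustrative, not measured.

PAGE_THRESHOLD = 14.4 * 0.001  # 14.4x burn of a 99.9% SLO = 1.44% failure ratio

def failure_ratio(failed: int, total: int) -> float:
    return failed / max(total, 1)

low_traffic = failure_ratio(1, 10)       # one failure among 10 jobs in 5m
high_traffic = failure_ratio(1, 10_000)  # same single failure at high volume

print(low_traffic > PAGE_THRESHOLD)   # a lone failure trips the page condition
print(high_traffic > PAGE_THRESHOLD)  # the same failure is absorbed quietly
```

Common mitigations are a minimum-traffic guard clause in the alert expression or longer confirmation windows for low-volume services.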

Next step

Run this in one sprint:

  1. Define one 30-day reliability SLO and one 30-day governance-availability SLO.
  2. Add 14.4x and 6x burn-rate page alerts plus a 3-day ticket alert.
  3. Wire release policy states (`normal`, `restricted`, `freeze`) to burn thresholds.
  4. Review two incidents and confirm each budget points to the right owner.

Continue with AI Agent Backpressure and Queue Drain Strategy and AI Agent Priority Queues and Fair Scheduling.

Error budgets should change behavior

If your budget policy does not affect paging and releases, it is reporting, not governance.