The production problem
Many agent teams define one availability SLO, wire one alert, and call it done. Incident reviews then show a different story: policy outages, retry storms, and stale queue debt all mixed into one number.
That makes the error budget useless. You cannot decide whether to freeze releases, tune policies, or add scheduler capacity.
A practical agent SLO model needs separate budgets for reliability failure, governance dependency degradation, and latency.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Google SRE Workbook: Error Budget Policy | Defines release policy based on budget spend, with clear operational consequences. | No agent-specific split between policy denials, dependency outages, and dispatch failures. |
| Google SRE Workbook: Alerting on SLOs | Excellent burn-rate math, including 14.4x and 6x paging thresholds. | No mapping to queue-driven control planes with scheduler, replay, and policy-gate metrics. |
| Atlassian: What is an error budget? | Clear downtime math for common uptime objectives. | Stops at uptime framing; does not cover autonomous execution side effects or policy outcomes. |
SLO and budget model
For autonomous agents, treat budgets as operational decision tools. One budget per failure family keeps paging, release policy, and remediation ownership clear.
| Budget | Target | What burns it | Primary owner |
|---|---|---|---|
| Reliability budget | 99.9% over 30d | Execution failures and timeouts only | On-call SRE / platform team |
| Governance availability budget | 99.95% over 30d | Policy service unavailable (`safety_unavailable`) events | Safety kernel + platform team |
| Latency budget | p99 dispatch < 1s | Sustained p99 dispatch latency threshold breaches | Scheduler owners |
| Replay debt budget | Near-zero orphan replay trend | Growing orphan replay counts and stale jobs backlog | Operations + incident response |
Burn-rate alerting should follow multi-window guidance: page on fast burn, ticket on slow sustained burn. For a 99.9% SLO, 14.4x and 6x thresholds are practical starting points.
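To make those multipliers concrete, here is a small arithmetic sketch (plain math, no Cordum specifics) of what 14.4x and 6x mean for a 99.9% objective over a 30-day window:

```python
# Burn-rate arithmetic for a 99.9% SLO over a 30-day rolling window.
SLO_TARGET = 0.999
WINDOW_HOURS = 30 * 24            # 720 hours in the SLO window
ERROR_BUDGET = 1 - SLO_TARGET     # 0.1% allowed failure ratio

def budget_spent(burn_rate: float, hours: float) -> float:
    """Fraction of the total error budget consumed at a constant burn rate."""
    return burn_rate * hours / WINDOW_HOURS

def hours_to_exhaustion(burn_rate: float) -> float:
    """Hours until the whole budget is gone at a constant burn rate."""
    return WINDOW_HOURS / burn_rate

print(round(budget_spent(14.4, 1), 4))      # 14.4x for 1h burns 2% of the budget
print(round(budget_spent(6, 6), 4))         # 6x for 6h burns 5% of the budget
print(round(hours_to_exhaustion(14.4), 1))  # at 14.4x, the full budget is gone in 50h
```

This is why 14.4x pages: left alone, it drains the entire 30-day budget in about two days.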
Cordum metric mapping
| Signal | Metric or setting | Why it matters |

|---|---|---|
| Primary denominator | `cordum_jobs_completed_total` | Completed jobs give a stable denominator for failure-ratio based reliability SLOs. |
| Governance outage signal | `cordum_safety_unavailable_total` | Separates policy dependency outages from business-level execution failure. |
| Latency SLO signal | `cordum_scheduler_dispatch_latency_seconds` (p99 alert threshold: > 1s) | Captures scheduler health before users see major queue delay. |
| Replay debt signal | `cordum_scheduler_orphan_replayed_total` and `cordum_scheduler_stale_jobs` | Shows hidden reliability debt after partitions or scheduler stalls. |
| Retry boundary context | Max scheduling retries 50, exponential backoff 1s-30s, `retryDelayNoWorkers` 2s | Defines how quickly failures can amplify burn during sustained incidents. |
Default scheduler behavior also affects budget burn shape: retry attempts can reach 50 with exponential backoff from 1s to 30s, and no-worker retries pause 2s between attempts.
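As a sketch of how those defaults bound amplification (assuming doubling backoff between the documented 1s floor and 30s cap; only the bounds and the retry count come from the text, the growth curve is an assumption):

```python
# Worst-case retry timeline for one failing job under the stated scheduler
# defaults: up to 50 retries, exponential backoff bounded by 1s..30s.
# The doubling growth curve between the bounds is an assumption.
def backoff_schedule(base: float = 1.0, cap: float = 30.0,
                     max_retries: int = 50) -> list[float]:
    delays, delay = [], base
    for _ in range(max_retries):
        delays.append(min(delay, cap))
        delay *= 2
    return delays

delays = backoff_schedule()
print(len(delays))   # 50 retry attempts
print(sum(delays))   # 1381.0 seconds: one job can keep failing for ~23 minutes
```

During a sustained dependency outage, each stuck job can therefore keep contributing failures for roughly 23 minutes before exhausting retries, which shapes how quickly the short burn-rate windows fill.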
Implementation examples
Prometheus burn-rate rules (YAML)
```yaml
groups:
  - name: cordum-slo-burn
    rules:
      - alert: CordumReliabilityBurnRatePage
        expr: |
          (
            (
              rate(cordum_jobs_completed_total{status="failed"}[1h])
              / clamp_min(rate(cordum_jobs_completed_total[1h]), 0.001)
            ) > (14.4 * 0.001)
            and
            (
              rate(cordum_jobs_completed_total{status="failed"}[5m])
              / clamp_min(rate(cordum_jobs_completed_total[5m]), 0.001)
            ) > (14.4 * 0.001)
          )
          or
          (
            (
              rate(cordum_jobs_completed_total{status="failed"}[6h])
              / clamp_min(rate(cordum_jobs_completed_total[6h]), 0.001)
            ) > (6 * 0.001)
            and
            (
              rate(cordum_jobs_completed_total{status="failed"}[30m])
              / clamp_min(rate(cordum_jobs_completed_total[30m]), 0.001)
            ) > (6 * 0.001)
          )
        labels:
          severity: page
        annotations:
          summary: "Reliability budget burn is too fast"
          description: "30-day 99.9% SLO burn-rate page condition matched."
      - alert: CordumReliabilityBurnRateTicket
        expr: |
          (
            (
              rate(cordum_jobs_completed_total{status="failed"}[3d])
              / clamp_min(rate(cordum_jobs_completed_total[3d]), 0.001)
            ) > (1 * 0.001)
            and
            (
              rate(cordum_jobs_completed_total{status="failed"}[6h])
              / clamp_min(rate(cordum_jobs_completed_total[6h]), 0.001)
            ) > (1 * 0.001)
          )
        labels:
          severity: ticket
        annotations:
          summary: "Reliability budget is draining steadily"
          description: "Sustained 3-day burn; open reliability debt ticket."
```

Release gate script by burn-rate (Bash)
```bash
#!/usr/bin/env bash
set -euo pipefail

PROM_URL="${PROM_URL:-http://prometheus:9090}"

# Burn rate = observed 6h failure ratio divided by the 0.1% error budget.
QUERY='(
  rate(cordum_jobs_completed_total{status="failed"}[6h])
  / clamp_min(rate(cordum_jobs_completed_total[6h]), 0.001)
) / 0.001'

BURN_RATE=$(curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" \
  | jq -r '.data.result[0].value[1] // "0"')

echo "current_6h_burn_rate=$BURN_RATE"

# Nonzero exit codes let CI pipelines block or restrict deploys directly.
if awk "BEGIN {exit !($BURN_RATE >= 6)}"; then
  echo "release_mode=freeze"
  exit 2
fi
if awk "BEGIN {exit !($BURN_RATE >= 2)}"; then
  echo "release_mode=restricted"
  exit 1
fi
echo "release_mode=normal"
```

OpenSLO object for reliability budget (YAML)
```yaml
apiVersion: openslo/v1
kind: SLO
metadata:
  name: cordum-control-plane-reliability
spec:
  service: cordum-control-plane
  indicator:
    ratioMetric:
      counter: true
      good:
        metricSource:
          type: Prometheus
          spec:
            query: sum(rate(cordum_jobs_completed_total{status!="failed"}[5m]))
      total:
        metricSource:
          type: Prometheus
          spec:
            query: sum(rate(cordum_jobs_completed_total[5m]))
  objectives:
    - displayName: "30-day reliability"
      target: 0.999
  timeWindow:
    - duration: 30d
  alertPolicies:
    - kind: burnrate
      name: fast-burn
      threshold: "14.4x in 1h + 5m confirmation"
```

Limitations and tradeoffs
- Low-traffic services can page too often with ratio-based burn alerts.
- Too many SLOs can dilute ownership and slow incident response.
- Excluding policy denials from the reliability budget can hide policy quality issues if you do not track them separately.
- Burn-rate math is simple; keeping alert routes and suppression clean is the hard part.
Next step
Run this in one sprint:
1. Define one 30-day reliability SLO and one 30-day governance-availability SLO.
2. Add 14.4x and 6x burn-rate page alerts plus a 3-day ticket alert.
3. Wire release policy states (`normal`, `restricted`, `freeze`) to burn thresholds.
4. Review two incidents and confirm each budget points to the right owner.
Continue with *AI Agent Backpressure and Queue Drain Strategy* and *AI Agent Priority Queues and Fair Scheduling*.