Guide

AI Agent SLA vs SLO vs SLI

Define contract, target, and metric boundaries before your next reliability review.

Guide · 11 min read · Mar 2026
TL;DR
  • SLA is a customer promise with consequences, usually credits.
  • SLO is an internal reliability target that should trip action before the SLA is at risk.
  • SLI is a formula, not a dashboard chart title.
  • Autonomous agents need separate accounting for policy outcomes and execution failures.
Contract clarity

Tie SLA language to measurable SLIs, or escalation turns into debate.

Operational target

Set SLO tighter than SLA so teams get warning time instead of legal surprises.

Metric discipline

Use a strict numerator and denominator. If you cannot write the query, you do not have an SLI.

Scope

This guide focuses on autonomous AI agent control planes where policy denials, deferred jobs, and retries can blur reliability accountability.

The production problem

Teams often announce an SLA before they define SLI formulas. Later, a customer asks why uptime looked healthy while automation silently deferred thousands of jobs.

The root issue is usually metric mixing: policy denials, governance outages, and execution failures pushed into one percentage.

If contract language, internal targets, and formulas are not aligned, incident response gets political fast.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Google Cloud: Practical guide to setting SLOs | Strong process for choosing user journeys, SLIs, and realistic SLO targets. | Does not explain how to split policy decisions from reliability errors in autonomous execution pipelines. |
| Honeycomb: SLOs, SLAs, SLIs - what is the difference? | Clear conceptual definitions and relationship between contracts and engineering targets. | No implementation guidance for queue-based agent systems with retries and governance gates. |
| Nobl9: SLO vs SLA | Good framing for legal commitments versus internal reliability goals. | No practical formulas for policy-denied, quarantined, and deferred autonomous job paths. |

SLA/SLO/SLI model

| Layer | Purpose | Typical target | Owner |
| --- | --- | --- | --- |
| SLA (external) | Customer contract with potential service credits | 99.5% monthly availability | Legal + product + platform |
| SLO (internal) | Engineering target that protects the SLA | 99.9% over 30 days | SRE + platform |
| SLI (formula) | Good events divided by total eligible events | Depends on SLO and user journey | Service owner |
| Policy quality metric | Track denied/quarantined outcomes separately from reliability | Stable, explainable trend | Security/governance team |

Quick downtime math for a 30-day month (43,200 minutes) helps product and legal align expectations before contract negotiation.

| Availability target | Monthly downtime budget | Plain language | Yearly downtime budget |
| --- | --- | --- | --- |
| 99.5% | 216 minutes | 3h 36m | ~43h 48m |
| 99.9% | 43.2 minutes | 43m 12s | ~8h 45m |
| 99.95% | 21.6 minutes | 21m 36s | ~4h 23m |
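The downtime figures above follow from one multiplication, so they are easy to regenerate for other targets or windows. The following is a minimal sketch, assuming a 30-day month and a 365-day year as in the table; the function names `downtimeBudgetMinutes` and `formatDuration` are illustrative and not part of any library.

```typescript
// Window lengths used in this guide's downtime table (assumed, not contractual).
const MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60; // 43,200
const MINUTES_PER_YEAR = 365 * 24 * 60; // 525,600

// Allowed downtime in minutes for an availability target over a window.
function downtimeBudgetMinutes(availabilityPercent: number, windowMinutes: number): number {
  if (availabilityPercent < 0 || availabilityPercent > 100) {
    throw new RangeError("availabilityPercent must be between 0 and 100");
  }
  return (windowMinutes * (100 - availabilityPercent)) / 100;
}

// Render a minute count as hours, minutes, and seconds (rounded to the nearest second).
function formatDuration(minutes: number): string {
  const totalSeconds = Math.round(minutes * 60);
  const h = Math.floor(totalSeconds / 3600);
  const m = Math.floor((totalSeconds % 3600) / 60);
  const s = totalSeconds % 60;
  return `${h}h ${m}m ${s}s`;
}

// Print the monthly and yearly budgets for the targets discussed in this guide.
for (const target of [99.5, 99.9, 99.95]) {
  const monthly = downtimeBudgetMinutes(target, MINUTES_PER_30_DAY_MONTH);
  const yearly = downtimeBudgetMinutes(target, MINUTES_PER_YEAR);
  console.log(`${target}%: ${formatDuration(monthly)} per month, ${formatDuration(yearly)} per year`);
}
```

Calendar months are not all 30 days long, so a contract that says "monthly" should state which window length the budget is computed against.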

Cordum metric mapping

| Implication | Metric mapping | Why it matters |
| --- | --- | --- |
| Reliability denominator | `cordum_jobs_completed_total` | Completed jobs by status make failure-ratio SLI formulas explicit and auditable. |
| Reliability failure numerator | `cordum_jobs_completed_total{status="failed"}` | Uses the same status signal as existing failure-rate alerting in production docs. |
| Governance dependency degradation | `cordum_safety_unavailable_total` | Policy-kernel outage is a separate operational risk from business logic failures. |
| Policy outcome trend | `cordum_safety_denied_total` and `cordum_output_policy_quarantined_total` | Denials/quarantines can be healthy governance behavior and should not automatically burn reliability SLO. |
| Latency objective | `cordum_scheduler_dispatch_latency_seconds` with p99 target below 1s | Dispatch latency regression is usually visible before hard failures spike. |

A practical rule: denials and quarantines are governance quality metrics unless your customer contract says otherwise. Failed execution and latency regression usually belong in reliability SLO accounting.
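To make that rule concrete, here is a minimal sketch of separate accounting: failed executions count against the reliability SLI, while denials and quarantines are tallied on their own and stay out of the denominator. The outcome labels and the `summarizeOutcomes` helper are hypothetical; map them to whatever statuses your control plane actually emits, and adjust the eligibility rule if your contract says denials count.

```typescript
// Hypothetical outcome labels for illustration only.
type JobOutcome = "succeeded" | "failed" | "denied" | "quarantined";

interface OutcomeSummary {
  reliabilitySli: number | null; // good / eligible; null when there are no eligible events
  eligible: number; // succeeded + failed
  failed: number;
  denied: number; // governance outcome, excluded from the SLI
  quarantined: number; // governance outcome, excluded from the SLI
}

function summarizeOutcomes(outcomes: JobOutcome[]): OutcomeSummary {
  let succeeded = 0;
  let failed = 0;
  let denied = 0;
  let quarantined = 0;

  for (const outcome of outcomes) {
    switch (outcome) {
      case "succeeded":
        succeeded += 1;
        break;
      case "failed":
        failed += 1;
        break;
      case "denied":
        denied += 1;
        break;
      case "quarantined":
        quarantined += 1;
        break;
    }
  }

  // Only executed jobs are eligible events for the reliability SLI.
  const eligible = succeeded + failed;
  return {
    reliabilitySli: eligible === 0 ? null : succeeded / eligible,
    eligible,
    failed,
    denied,
    quarantined,
  };
}
```

Returning `null` for an empty window is a deliberate choice in this sketch: reporting 100% with zero eligible events would hide the case where every job was denied or deferred.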

Implementation examples

Reliability SLI formula (PromQL)

reliability-sli.promql
PromQL
# Reliability SLI (30-day rolling)
# Good = all completed jobs minus failed jobs
# increase() over [30d] counts events across the whole SLO window.
1 - (
  sum(increase(cordum_jobs_completed_total{status="failed"}[30d]))
  / clamp_min(sum(increase(cordum_jobs_completed_total[30d])), 1)
)

SLA to SLO contract mapping (YAML)

agent-reliability-contract.yaml
YAML
reliability_contract:
  sla:
    target: 99.5
    window: 30d
    credit_policy:
      - threshold: "<99.5"
        credit_percent: 10
      - threshold: "<99.0"
        credit_percent: 25
  slo:
    target: 99.9
    window: 30d
    burn_policy:
      restricted_release_at: "2x budget burn"
      release_freeze_at: "6x budget burn"
  sli_formulas:
    # Each SLI is good events divided by total eligible events.
    reliability:
      numerator: "completed jobs minus failed jobs"
      denominator: "all completed jobs"
    governance_availability:
      numerator: "jobs_received minus safety_unavailable events"
      denominator: "jobs_received"
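The `burn_policy` thresholds in the contract above can be evaluated mechanically. The sketch below assumes that "2x budget burn" means a burn rate of 2, where burn rate is the observed failure ratio divided by the failure ratio the SLO allows; the source does not define the term, so treat this reading, the `burnRate` and `releaseState` names, and the state labels as assumptions.

```typescript
type ReleaseState = "normal" | "restricted_release" | "release_freeze";

// Burn rate = observed failure ratio / failure ratio allowed by the SLO.
// A burn rate of 1 spends the error budget exactly over the SLO window.
function burnRate(failedEvents: number, totalEvents: number, sloPercent: number): number {
  if (sloPercent >= 100) {
    throw new RangeError("sloPercent must be below 100 to leave an error budget");
  }
  if (totalEvents <= 0) return 0; // assumption: no traffic burns no budget
  const allowedFailureRatio = (100 - sloPercent) / 100;
  return failedEvents / totalEvents / allowedFailureRatio;
}

// Thresholds mirror restricted_release_at (2x) and release_freeze_at (6x).
function releaseState(rate: number): ReleaseState {
  if (rate >= 6) return "release_freeze";
  if (rate >= 2) return "restricted_release";
  return "normal";
}

// Example: 30 failures in 10,000 completed jobs against a 99.9% SLO
// burns budget at roughly 3x the sustainable rate.
console.log(releaseState(burnRate(30, 10000, 99.9)));
```

Which window the burn rate is measured over (one hour, six hours, the full 30 days) changes how quickly the policy reacts, so the contract file should state it alongside the thresholds.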

Monthly SLA credit calculation (TypeScript)

sla-credit.ts
TypeScript
type CreditRule = { threshold: number; creditPercent: number };

// Ordered from most to least severe so the first match is the largest applicable credit.
const CREDIT_RULES: CreditRule[] = [
  { threshold: 99.0, creditPercent: 25 },
  { threshold: 99.5, creditPercent: 10 },
];

export function calculateMonthlyCredit(slaPercent: number, measuredPercent: number, monthlyBillUsd: number): number {
  // No credit when the measured value meets or exceeds the contracted SLA.
  if (measuredPercent >= slaPercent) return 0;

  // Pick the first (most severe) tier the measured value falls below.
  const matched = CREDIT_RULES.find((r) => measuredPercent < r.threshold);
  if (!matched) return 0;

  return (monthlyBillUsd * matched.creditPercent) / 100;
}

Limitations and tradeoffs

  • A strict SLA can push teams toward risk-accepting behavior if governance metrics are ignored.
  • Too many SLO layers create maintenance overhead and alert fatigue.
  • Contract language that is too broad can force denials/quarantines into uptime accounting.
  • Downtime math is simple; agreeing on eligible events is where most teams struggle.

Next step

Run this in one sprint:

  1. Write one-page definitions for SLA, SLO, and SLI per critical journey.
  2. Publish exact numerator/denominator queries for each SLI.
  3. Add a separate governance quality panel for denied/quarantined trends.
  4. Simulate one month of traffic and verify your credit policy output before signing contracts.
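The simulation in the last step can start very small. The sketch below builds a synthetic 30-day month with one bad day, computes the measured percentage from completed and failed job counts, and looks up the credit using the same tiers as the example contract in this guide. The traffic volumes, the `DailyCounts` shape, and all function names are made up for illustration.

```typescript
// Hypothetical daily sample: completed and failed job counts.
interface DailyCounts {
  completed: number;
  failed: number;
}

// Credit tiers mirroring the example contract; most severe first.
const TIERS = [
  { below: 99.0, creditPercent: 25 },
  { below: 99.5, creditPercent: 10 },
];

// Measured reliability for the month as a percentage of completed jobs.
function measuredPercent(days: DailyCounts[]): number {
  const completed = days.reduce((sum, d) => sum + d.completed, 0);
  const failed = days.reduce((sum, d) => sum + d.failed, 0);
  if (completed === 0) return 100; // assumption: no eligible events means no breach
  return (100 * (completed - failed)) / completed;
}

// Credit owed in USD for a measured percentage.
function creditFor(measured: number, monthlyBillUsd: number): number {
  const tier = TIERS.find((t) => measured < t.below);
  return tier ? (monthlyBillUsd * tier.creditPercent) / 100 : 0;
}

// 29 healthy days (10,000 jobs, 5 failures each) plus one day with a chosen failure count.
function simulateMonth(badDayFailures: number): DailyCounts[] {
  const days: DailyCounts[] = Array.from({ length: 29 }, () => ({ completed: 10000, failed: 5 }));
  days.push({ completed: 10000, failed: badDayFailures });
  return days;
}

for (const badDayFailures of [5, 2000, 5000]) {
  const measured = measuredPercent(simulateMonth(badDayFailures));
  console.log(`${badDayFailures} bad-day failures: ${measured.toFixed(3)}% measured, credit $${creditFor(measured, 5000)}`);
}
```

Replaying real historical counts through the same functions is the more useful check; the synthetic month only confirms that the tier boundaries behave the way legal and product expect.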

Continue with AI Agent SLOs and Error Budgets and AI Agent Fail-Open vs Fail-Closed.

Reliability words need formulas

If the contract language cannot be mapped to a query, incident calls will become legal interpretation sessions.