## The production problem
Teams often announce an SLA before they define SLI formulas. Later, a customer asks why uptime looked healthy while automation silently deferred thousands of jobs.
The root issue is usually metric mixing: policy denials, governance outages, and execution failures all collapsed into a single percentage.
If contract language, internal targets, and formulas are not aligned, incident response gets political fast.
## What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Google Cloud: Practical guide to setting SLOs | Strong process for choosing user journeys, SLIs, and realistic SLO targets. | Does not explain how to split policy decisions from reliability errors in autonomous execution pipelines. |
| Honeycomb: SLOs, SLAs, SLIs - what is the difference? | Clear conceptual definitions and relationship between contracts and engineering targets. | No implementation guidance for queue-based agent systems with retries and governance gates. |
| Nobl9: SLO vs SLA | Good framing for legal commitments versus internal reliability goals. | No practical formulas for policy-denied, quarantined, and deferred autonomous job paths. |
## SLA/SLO/SLI model
| Layer | Purpose | Typical target | Owner |
|---|---|---|---|
| SLA (external) | Customer contract with potential service credits | 99.5% monthly availability | Legal + product + platform |
| SLO (internal) | Engineering target that protects the SLA | 99.9% over 30 days | SRE + platform |
| SLI (formula) | Good events divided by total eligible events | Depends on SLO and user journey | Service owner |
| Policy quality metric | Track denied/quarantined outcomes separately from reliability | Stable, explainable trend | Security/governance team |
Quick downtime math for a 30-day month (43,200 minutes) helps product and legal align expectations before contract negotiation.
| Availability target | Monthly downtime budget | Plain language | Yearly downtime budget |
|---|---|---|---|
| 99.5% | 216 minutes | 3h 36m | ~43h 48m |
| 99.9% | 43.2 minutes | 43m 12s | ~8h 45m |
| 99.95% | 21.6 minutes | 21m 36s | ~4h 23m |
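The budget figures above follow directly from the window size; a minimal sketch, where `downtimeBudgetMinutes` is an illustrative helper (not a Cordum API):

```typescript
// Downtime budget for a given availability target over a window.
// windowMinutes: 43_200 for a 30-day month, 525_600 for a 365-day year.
function downtimeBudgetMinutes(
  availabilityPercent: number,
  windowMinutes: number,
): number {
  return ((100 - availabilityPercent) / 100) * windowMinutes;
}

// 30-day month:
// downtimeBudgetMinutes(99.5, 43_200)  ≈ 216 minutes (3h 36m)
// downtimeBudgetMinutes(99.9, 43_200)  ≈ 43.2 minutes
// downtimeBudgetMinutes(99.95, 43_200) ≈ 21.6 minutes
```

Running this for both the monthly and yearly windows before contract review makes the table above reproducible rather than folklore.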
## Cordum metric mapping
| Implication | Metric mapping | Why it matters |
|---|---|---|
| Reliability denominator | `cordum_jobs_completed_total` | Completed jobs by status make failure-ratio SLI formulas explicit and auditable. |
| Reliability failure numerator | `cordum_jobs_completed_total{status="failed"}` | Uses the same status signal as existing failure-rate alerting in production docs. |
| Governance dependency degradation | `cordum_safety_unavailable_total` | Policy-kernel outage is a separate operational risk from business logic failures. |
| Policy outcome trend | `cordum_safety_denied_total` and `cordum_output_policy_quarantined_total` | Denials/quarantines can be healthy governance behavior and should not automatically burn reliability SLO. |
| Latency objective | `cordum_scheduler_dispatch_latency_seconds` with p99 target below 1s | Dispatch latency regression is usually visible before hard failures spike. |
A practical rule: denials and quarantines are governance quality metrics unless your customer contract says otherwise. Failed execution and latency regression usually belong in reliability SLO accounting.
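That routing rule can be sketched as a small classifier. The status names mirror the metric labels above, but the function and the `OutcomeBucket` type are hypothetical illustrations, not Cordum code:

```typescript
type OutcomeBucket = "reliability_bad" | "governance_signal" | "good";

// Route a job outcome into the correct accounting bucket.
// Denied/quarantined jobs feed governance quality dashboards;
// only hard execution failures burn the reliability error budget.
function classifyOutcome(status: string): OutcomeBucket {
  switch (status) {
    case "failed":
      return "reliability_bad"; // burns the reliability SLO
    case "denied":
    case "quarantined":
      return "governance_signal"; // tracked, but not an SLO violation
    default:
      return "good";
  }
}
```

Keeping this mapping in one place makes the contract-to-metric decision auditable instead of implicit in dashboard queries.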
## Implementation examples

### Reliability SLI formula (PromQL)
```promql
# Reliability SLI over the 30-day SLO window
# Good = all completed jobs minus failed jobs
1 - (
  sum(rate(cordum_jobs_completed_total{status="failed"}[30d]))
  / clamp_min(sum(rate(cordum_jobs_completed_total[30d])), 0.001)
)
```

### SLA to SLO contract mapping (YAML)
```yaml
reliability_contract:
  sla:
    target: 99.5
    window: 30d
    credit_policy:
      - threshold: "<99.5"
        credit_percent: 10
      - threshold: "<99.0"
        credit_percent: 25
  slo:
    target: 99.9
    window: 30d
    burn_policy:
      restricted_release_at: "2x budget burn"
      release_freeze_at: "6x budget burn"
  sli_formulas:
    reliability:
      numerator: "failed completed jobs"
      denominator: "all completed jobs"
    governance_availability:
      numerator: "safety_unavailable events"
      denominator: "jobs_received"
```

### Monthly SLA credit calculation (TypeScript)
```typescript
type CreditRule = { threshold: number; creditPercent: number };

// Rules must stay sorted by ascending threshold so the strictest
// (highest-credit) tier matches first.
const CREDIT_RULES: CreditRule[] = [
  { threshold: 99.0, creditPercent: 25 },
  { threshold: 99.5, creditPercent: 10 },
];

export function calculateMonthlyCredit(
  slaPercent: number,
  measuredPercent: number,
  monthlyBillUsd: number,
): number {
  if (measuredPercent >= slaPercent) return 0;
  const matched = CREDIT_RULES.find((r) => measuredPercent < r.threshold);
  if (!matched) return 0;
  return (monthlyBillUsd * matched.creditPercent) / 100;
}
```

## Limitations and tradeoffs
- A strict SLA can push teams toward risk-accepting behavior if governance metrics are ignored.
- Too many SLO layers create maintenance overhead and alert fatigue.
- Contract language that is too broad can force denials/quarantines into uptime accounting.
- Downtime math is simple; agreeing on eligible events is where most teams struggle.
## Next step
Run this in one sprint:
1. Write one-page definitions for SLA, SLO, and SLI per critical journey.
2. Publish exact numerator/denominator queries for each SLI.
3. Add a separate governance quality panel for denied/quarantined trends.
4. Simulate one month of traffic and verify your credit policy output before signing contracts.
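Step 4 can be dry-run offline; a minimal sketch, assuming a hypothetical $10,000 monthly bill and the credit tiers from the contract mapping above (job volume and failure rate are illustrative):

```typescript
// Simulate a month of jobs and compute measured availability (%).
function simulateMonth(totalJobs: number, failureRate: number): number {
  const failed = Math.round(totalJobs * failureRate);
  return (1 - failed / totalJobs) * 100;
}

// Apply the credit tiers: <99.0% -> 25%, <99.5% -> 10%, else no credit.
function creditUsd(measuredPercent: number, billUsd: number): number {
  if (measuredPercent >= 99.5) return 0; // SLA met, no credit
  if (measuredPercent < 99.0) return billUsd * 0.25;
  return billUsd * 0.1; // between 99.0% and 99.5%
}

const measured = simulateMonth(1_000_000, 0.004); // 0.4% failure rate
console.log(creditUsd(measured, 10_000)); // 99.6% measured -> $0 credit
```

Sweeping the failure rate across plausible values shows exactly where credit payouts begin, which is the conversation to have before signing.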
Continue with *AI Agent SLOs and Error Budgets* and *AI Agent Fail-Open vs Fail-Closed*.