## The production problem
Teams often announce an SLA before they define SLI formulas. Later, a customer asks why uptime looked healthy while automation silently deferred thousands of jobs.
The root issue is usually metric mixing: policy denials, governance outages, and execution failures all collapsed into a single percentage.
If contract language, internal targets, and formulas are not aligned, incident response gets political fast.
## What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Google Cloud: Practical guide to setting SLOs | Strong process for choosing user journeys, SLIs, and realistic SLO targets. | Does not explain how to split policy decisions from reliability errors in autonomous execution pipelines. |
| Honeycomb: SLOs, SLAs, SLIs - what is the difference? | Clear conceptual definitions and relationship between contracts and engineering targets. | No implementation guidance for queue-based agent systems with retries and governance gates. |
| Nobl9: SLO vs SLA | Good framing for legal commitments versus internal reliability goals. | No practical formulas for policy-denied, quarantined, and deferred autonomous job paths. |
## SLA/SLO/SLI model
| Layer | Purpose | Typical target | Owner |
|---|---|---|---|
| SLA (external) | Customer contract with potential service credits | 99.5% monthly availability | Legal + product + platform |
| SLO (internal) | Engineering target that protects the SLA | 99.9% over 30 days | SRE + platform |
| SLI (formula) | Good events divided by total eligible events | Depends on SLO and user journey | Service owner |
| Policy quality metric | Track denied/quarantined outcomes separately from reliability | Stable, explainable trend | Security/governance team |
Quick downtime math for a 30-day month (43,200 minutes) helps product and legal align expectations before contract negotiation.
| Availability target | Monthly downtime budget | Plain language | Yearly downtime budget |
|---|---|---|---|
| 99.5% | 216 minutes | 3h 36m | ~43h 48m |
| 99.9% | 43.2 minutes | 43m 12s | ~8h 45m |
| 99.95% | 21.6 minutes | 21m 36s | ~4h 23m |
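The budget figures above follow directly from the window size; a minimal sketch, where `downtimeBudgetMinutes` is an illustrative helper (not a Cordum API):

```typescript
// Downtime budget for a given availability target over a window.
// windowMinutes: 43_200 for a 30-day month, 525_600 for a 365-day year.
function downtimeBudgetMinutes(
  availabilityPercent: number,
  windowMinutes: number,
): number {
  return ((100 - availabilityPercent) / 100) * windowMinutes;
}

// 30-day month:
// downtimeBudgetMinutes(99.5, 43_200)  ≈ 216 minutes (3h 36m)
// downtimeBudgetMinutes(99.9, 43_200)  ≈ 43.2 minutes
// downtimeBudgetMinutes(99.95, 43_200) ≈ 21.6 minutes
```

Running this for both the monthly and yearly windows before contract review makes the table above reproducible rather than folklore.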
## Cordum metric mapping
| Implication | Metric mapping | Why it matters |
|---|---|---|
| Reliability denominator | `cordum_jobs_completed_total` | Completed jobs by status make failure-ratio SLI formulas explicit and auditable. |
| Reliability failure numerator | `cordum_jobs_completed_total{status="failed"}` | Uses the same status signal as existing failure-rate alerting in production docs. |
| Governance dependency degradation | `cordum_safety_unavailable_total` | Policy-kernel outage is a separate operational risk from business logic failures. |
| Policy outcome trend | `cordum_safety_denied_total` and `cordum_output_policy_quarantined_total` | Denials/quarantines can be healthy governance behavior and should not automatically burn reliability SLO. |
| Latency objective | `cordum_scheduler_dispatch_latency_seconds` with p99 target below 1s | Dispatch latency regression is usually visible before hard failures spike. |
A practical rule: denials and quarantines are governance quality metrics unless your customer contract says otherwise. Failed execution and latency regression usually belong in reliability SLO accounting.
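That routing rule can be sketched as a small classifier. The status names mirror the metric labels above, but the function and the `OutcomeBucket` type are hypothetical illustrations, not Cordum code:

```typescript
type OutcomeBucket = "reliability_bad" | "governance_signal" | "good";

// Route a job outcome into the correct accounting bucket.
// Denied/quarantined jobs feed governance quality dashboards;
// only hard execution failures burn the reliability error budget.
function classifyOutcome(status: string): OutcomeBucket {
  switch (status) {
    case "failed":
      return "reliability_bad"; // burns the reliability SLO
    case "denied":
    case "quarantined":
      return "governance_signal"; // tracked, but not an SLO violation
    default:
      return "good";
  }
}
```

Keeping this mapping in one place makes the contract-to-metric decision auditable instead of implicit in dashboard queries.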
## Implementation examples

### Reliability SLI formula (PromQL)
```promql
# Reliability SLI over the 30-day SLO window
# Good = all completed jobs minus failed jobs
1 - (
  sum(rate(cordum_jobs_completed_total{status="failed"}[30d]))
  / clamp_min(sum(rate(cordum_jobs_completed_total[30d])), 0.001)
)
```

### SLA to SLO contract mapping (YAML)
```yaml
reliability_contract:
  sla:
    target: 99.5
    window: 30d
    credit_policy:
      - threshold: "<99.5"
        credit_percent: 10
      - threshold: "<99.0"
        credit_percent: 25
  slo:
    target: 99.9
    window: 30d
    burn_policy:
      restricted_release_at: "2x budget burn"
      release_freeze_at: "6x budget burn"
  sli_formulas:
    reliability:
      numerator: "failed completed jobs"
      denominator: "all completed jobs"
    governance_availability:
      numerator: "safety_unavailable events"
      denominator: "jobs_received"
```

### Monthly SLA credit calculation (TypeScript)
```typescript
type CreditRule = { threshold: number; creditPercent: number };

// Rules must stay sorted by ascending threshold so the strictest
// (highest-credit) tier matches first.
const CREDIT_RULES: CreditRule[] = [
  { threshold: 99.0, creditPercent: 25 },
  { threshold: 99.5, creditPercent: 10 },
];

export function calculateMonthlyCredit(
  slaPercent: number,
  measuredPercent: number,
  monthlyBillUsd: number,
): number {
  if (measuredPercent >= slaPercent) return 0;
  const matched = CREDIT_RULES.find((r) => measuredPercent < r.threshold);
  if (!matched) return 0;
  return (monthlyBillUsd * matched.creditPercent) / 100;
}
```

## Limitations and tradeoffs
- A strict SLA can push teams toward risk-accepting behavior if governance metrics are ignored.
- Too many SLO layers create maintenance overhead and alert fatigue.
- Contract language that is too broad can force denials/quarantines into uptime accounting.
- Downtime math is simple; agreeing on eligible events is where most teams struggle.
## Next step
Run this in one sprint:
1. Write one-page definitions for SLA, SLO, and SLI per critical journey.
2. Publish exact numerator/denominator queries for each SLI.
3. Add a separate governance quality panel for denied/quarantined trends.
4. Simulate one month of traffic and verify your credit policy output before signing contracts.
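Step 4 can be dry-run offline; a minimal sketch, assuming a hypothetical $10,000 monthly bill and the credit tiers from the contract mapping above (job volume and failure rate are illustrative):

```typescript
// Simulate a month of jobs and compute measured availability (%).
function simulateMonth(totalJobs: number, failureRate: number): number {
  const failed = Math.round(totalJobs * failureRate);
  return (1 - failed / totalJobs) * 100;
}

// Apply the credit tiers: <99.0% -> 25%, <99.5% -> 10%, else no credit.
function creditUsd(measuredPercent: number, billUsd: number): number {
  if (measuredPercent >= 99.5) return 0; // SLA met, no credit
  if (measuredPercent < 99.0) return billUsd * 0.25;
  return billUsd * 0.1; // between 99.0% and 99.5%
}

const measured = simulateMonth(1_000_000, 0.004); // 0.4% failure rate
console.log(creditUsd(measured, 10_000)); // 99.6% measured -> $0 credit
```

Sweeping the failure rate across plausible values shows exactly where credit payouts begin, which is the conversation to have before signing.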
Continue with *AI Agent SLOs and Error Budgets* and *AI Agent Fail-Open vs Fail-Closed*.