The $10K wake-up call
Here is how $10,000 disappears in less than a week. Three agents run in parallel, each researching a batch of target companies. Each research call fans out to 5 sub-calls: company scrape, people lookup, tech stack analysis, signal detection, and brief generation. Each sub-call consumes roughly 50,000 tokens at $0.015 per 1,000 tokens.
This failure mode is multiplicative: fan-out plus retries plus long contexts plus concurrency. The technical bug can be small. The invoice impact is not.
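The multiplication is easy to make concrete. A minimal back-of-the-envelope sketch, using the illustrative figures above plus an assumed retry rate (the `RETRY_MULTIPLIER` is our assumption, not a number from the incident):

```python
# Back-of-the-envelope cost model for the fan-out scenario above.
# All constants are illustrative; RETRY_MULTIPLIER is an assumption.
TOKENS_PER_SUBCALL = 50_000
PRICE_PER_1K_TOKENS = 0.015          # USD
SUBCALLS_PER_COMPANY = 5             # scrape, people, tech stack, signals, brief
RETRY_MULTIPLIER = 1.5               # assume ~half of sub-calls are retried once

def cost_per_company() -> float:
    per_subcall = TOKENS_PER_SUBCALL / 1_000 * PRICE_PER_1K_TOKENS
    return per_subcall * SUBCALLS_PER_COMPANY * RETRY_MULTIPLIER

def companies_until(budget_usd: float) -> int:
    return int(budget_usd / cost_per_company())

# Each company costs ~$5.63 with retries. Three agents draining a shared
# queue reach $10,000 after roughly 1,777 companies -- well under a week
# of unattended research.
```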
This scaling behavior is visible in provider docs too. Anthropic notes that each agent teammate keeps its own context window, teammates can continue consuming tokens when left active, and plan-mode teams can use roughly 7x tokens versus standard sessions. Source: Claude Code cost documentation. Exact multipliers vary by prompt and model, but the direction is predictable: autonomy amplifies spend.
What top FinOps guides still miss for agents
Current FinOps guidance is strong at organizational process design. The blind spot is runtime enforcement in autonomous agent pipelines, where spend explodes in minutes instead of billing cycles.
| Source | What it covers well | Gap for production agent fleets |
|---|---|---|
| FinOps Foundation: FinOps for AI Overview | Strong business-finance alignment model, KPI framing, and organizational ownership patterns. | No dispatch-time policy decision model for autonomous agents before token spend occurs. |
| FinOps Foundation: FinOps for AI Technology Category | Clear taxonomy for AI cost/usage data (including token-oriented metering complexity). | No runtime enforcement pattern for per-action deny/approve/throttle decisions in agent workflows. |
| TechTarget: 7 practical tips for agentic AI cost optimization | Actionable operating advice: scenario-based TCO, model right-sizing, autonomy limits, and explicit cost/error budgets. | No concrete pre-dispatch control-plane contract for immutable cost evidence and approval-bound execution. |
Practical minimum: every expensive action must emit one immutable cost-evidence record with decision and reviewer. Without that, post-incident cost reviews degrade into guesswork.
```json
{
  "run_id": "run_01JTS1A4J1V4G6WVE3AX4QYHXF",
  "agent_id": "research-agent",
  "action": "job.batch-generate",
  "estimated_cost_usd": 78.40,
  "decision": "require_approval",
  "policy_version": "v1.9.0",
  "reviewer": "[email protected]",
  "outcome": "approved"
}
```

Why traditional monitoring fails for AI agents
Cloud FinOps works because workloads are predictable. A Kubernetes cluster runs N pods, each consuming roughly the same resources. Dashboards show trailing metrics, and trailing metrics work when next month looks like last month.
Agents break this model. They are autonomous and concurrent. One prompt can trigger a chain of tool calls that fans out exponentially. By the time your dashboard updates, the spend has already happened. You are reading the bill, not preventing it.
Provider-level limits help, but they are usually scoped at account, project, or workspace level. They do not know your business semantics: whether this action is a safe read, a high-cost batch mutation, or a risky external side effect. Agent frameworks provide iteration controls and observability, but production dollar governance is still typically implemented outside the framework.
You would not wait for the AWS bill to discover your Lambda costs tripled. Agent cost governance requires the same shift: from trailing dashboards to leading controls. Evaluate cost before execution, not after.
Three layers of AI agent cost governance
Effective cost governance needs three layers. Each catches failures the others miss.
Layer 1: per-call limits. Cap tokens per individual LLM call. Set max_runtime_sec on every job. Kill calls that exceed thresholds before they finish. This layer catches the single expensive call.
Layer 2: per-agent budgets. Cap total spend per agent per time window. A research agent gets $50/hour; a drafting agent gets $20/hour. When the budget is exhausted, the agent pauses. This layer catches agent loops.
Layer 3: fleet-wide budgets. Cap total spend across all agents, with graceful degradation. When the fleet budget hits 80%, throttle non-critical agents; at 95%, pause everything except approved workflows. This layer catches fan-out explosions.
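The three layers compose into a single dispatch-time decision. A minimal sketch, assuming illustrative thresholds and hypothetical `AgentBudget`/`FleetBudget` types (not a real API):

```python
from dataclasses import dataclass

@dataclass
class AgentBudget:
    hourly_limit_usd: float
    spent_this_hour_usd: float = 0.0

@dataclass
class FleetBudget:
    limit_usd: float
    spent_usd: float = 0.0

MAX_COST_PER_CALL_USD = 2.00  # layer 1 cap (illustrative)

def evaluate(call_cost: float, agent: AgentBudget, fleet: FleetBudget,
             critical: bool = False) -> str:
    # Layer 1: a single expensive call is rejected outright.
    if call_cost > MAX_COST_PER_CALL_USD:
        return "deny"
    # Layer 2: an agent that would exceed its hourly budget is paused.
    if agent.spent_this_hour_usd + call_cost > agent.hourly_limit_usd:
        return "pause_agent"
    # Layer 3: fleet-wide graceful degradation at 80% / 95% utilization.
    util = (fleet.spent_usd + call_cost) / fleet.limit_usd
    if util >= 0.95 and not critical:
        return "pause"
    if util >= 0.80 and not critical:
        return "throttle"
    return "allow"
```

Each branch maps to one layer, so a failure mode missed by one check is caught by the next.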
Policy-as-code for agent cost control
Cost governance belongs in your policy-as-code alongside security and compliance rules. Version-controlled YAML, reviewed in pull requests, enforced before execution. Not a dashboard setting that someone forgets to update.
Cordum's Safety Kernel evaluates cost policy on every job before it runs. Here is what agent cost governance looks like as code:
```yaml
# safety.yaml - agent cost governance
version: v1
rules:
  - id: throttle-llm-calls
    match:
      topics: ["job.*.generate", "job.*.synthesize", "job.*.research"]
      risk_tags: ["high-cost"]
    decision: allow_with_constraints
    constraints:
      max_concurrent: 3
      rate_limit: "20/hour"
      max_runtime_sec: 120
    reason: "LLM calls throttled to prevent runaway spend"
  - id: approve-expensive-batch
    match:
      topics: ["job.*.batch-generate", "job.*.bulk-enrich"]
      risk_tags: ["high-cost", "batch"]
    decision: require_approval
    constraints:
      max_runtime_sec: 600
    reason: "Batch LLM operations above $50 estimated cost need review"
  - id: deny-unbounded-loop
    match:
      topics: ["job.*.recursive-search", "job.*.agent-loop"]
      risk_tags: ["unbounded"]
    decision: deny
    reason: "Unbounded recursive agent loops blocked by policy"
  - id: allow-cached-reads
    match:
      topics: ["job.*.read", "job.*.get", "job.*.list"]
      risk_tags: []
    decision: allow
    reason: "Read operations and cached results pass through"
```

LLM calls are throttled to 20 per hour with a maximum of 3 concurrent. Batch operations above an estimated cost threshold pause for human review. Unbounded recursive loops are blocked outright. Read operations and cached results pass through with no overhead.
Per-topic timeouts provide a second layer of defense. If a call exceeds its timeout, the Safety Kernel kills it regardless of what the agent thinks it is doing:
```yaml
# overlays/timeouts.patch.yaml
topics:
  "job.research.company":
    timeout_seconds: 120
    max_retries: 1
  "job.draft.email":
    timeout_seconds: 60
    max_retries: 1
  "job.enrich.contacts":
    timeout_seconds: 30
    max_retries: 0
  "job.generate.report":
    timeout_seconds: 180
    max_retries: 1
```

Approval gates for expensive agent actions
Some agent actions should not run without a human reviewing the cost implication. A batch enrichment job that will process 10,000 contacts at $0.02 each costs $200. That is worth a 30-second review before execution.
The approval gate pattern extends naturally to cost governance. When an agent submits a job with estimated cost above a threshold, the Safety Kernel returns REQUIRE_APPROVAL. The job pauses. A human sees the estimated cost, the number of sub-calls, and the policy that triggered the gate. They approve, modify, or deny. Total review time: under a minute. Cost of not reviewing: potentially thousands of dollars.
This pattern is most useful for volume-sensitive workflows. An agent can behave correctly and still overspend if the batch size is larger than expected. Cost-based approval gates catch that class of failure before money is spent.
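The gate itself is a small pre-dispatch check. A minimal sketch, assuming a hypothetical `gate` helper and the $50 threshold from the policy above (field names mirror the cost-evidence record shown earlier, but this is illustrative, not Cordum's actual API):

```python
import uuid

APPROVAL_THRESHOLD_USD = 50.0  # mirrors the $50 batch threshold in the policy

def gate(agent_id: str, action: str, unit_cost_usd: float, units: int) -> dict:
    """Estimate cost before dispatch and emit a cost-evidence record."""
    estimated = round(unit_cost_usd * units, 2)
    decision = "require_approval" if estimated >= APPROVAL_THRESHOLD_USD else "allow"
    return {
        "run_id": f"run_{uuid.uuid4().hex}",
        "agent_id": agent_id,
        "action": action,
        "estimated_units": units,
        "estimated_cost_usd": estimated,
        "decision": decision,
    }

# 10,000 contacts at $0.02 each -> $200.00 estimated, so the job pauses
# with decision "require_approval" until a human signs off.
record = gate("enrichment-agent", "job.enrich.bulk", 0.02, 10_000)
```

The key property is that the record exists before any tokens are spent, so the reviewer sees the estimate, not the bill.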
The cost audit trail
Every agent action should record its cost alongside the action itself. Not in a separate billing system that requires a join query. In the same audit trail entry: what the agent did, what policy was evaluated, what the decision was, how many tokens were consumed, and what the cost was.
This enables three things traditional monitoring cannot:
Chargeback per agent. Which agent spent the most this week? Which workflow has the highest cost per run? Which team's agents are most efficient? These answers come from the audit trail, not from parsing provider invoices.
Cost anomaly detection. When an agent's cost per action spikes 3x from its baseline, the system flags it. Not the next billing cycle. Immediately. Because the cost data is in the event stream, not in a monthly invoice.
Budget forecasting. If you know each agent's average cost per action and how many actions it runs per day, you can forecast spend with real data. Not estimates from a spreadsheet, but numbers from production.
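Because cost sits in the event stream, anomaly detection reduces to comparing each new action against the agent's trailing baseline. A minimal sketch of the 3x-spike check, assuming a simple mean baseline (a real system would likely use a windowed or robust statistic):

```python
from statistics import mean

SPIKE_FACTOR = 3.0  # flag when cost per action exceeds 3x the baseline

def flag_anomaly(history_usd: list[float], latest_usd: float) -> bool:
    """Compare the newest cost-per-action against the trailing baseline."""
    if not history_usd:
        return False  # no baseline yet, nothing to compare against
    baseline = mean(history_usd)
    return latest_usd > SPIKE_FACTOR * baseline

# An agent that normally spends ~$0.40 per action suddenly spends $1.50:
flag_anomaly([0.38, 0.41, 0.40, 0.42], 1.50)  # True -- flagged immediately
```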
Getting started with agent FinOps
Three steps, in order of impact.