Operations

Agent FinOps: Stop AI Agents from Burning $10K

When agents autonomously chain API calls, costs compound faster than dashboards can show. The fix is policy-level budget enforcement.

Apr 1, 2026 · 11 min read · By Zvi
- Per-Action: cap tokens per call
- Per-Agent: budget per agent per window
- Fleet-Level: total spend with throttling
TL;DR

AI agent cost governance is the new FinOps. Agents that autonomously fan out, chain tool calls, and spawn sub-agents can burn through budgets in hours. Traditional monitoring shows you the damage after the fact. Pre-execution policy enforcement catches it before tokens are spent.

- Agentic AI cost is not only inference. Orchestration, memory, retries, and oversight become major spend drivers in production.
- Practical cost control starts with explicit autonomy limits: retries, recursion depth, tool-call caps, and token budgets per task.
- Anthropic reports average Claude Code usage around $6/developer/day, with 90% of users below $12/day, while agent-team plan mode can use roughly 7x the tokens.
- Three governance layers work together: per-action limits, per-agent budgets, and fleet-level throttling before execution.
Context

The FinOps Foundation treats AI spending as a distinct operating problem and recommends dedicated governance and cost-accountability practices. Their AI technology category guidance also highlights that AI cost data needs a separate lens from traditional cloud metrics. The gap is runtime: most teams still discover bad cost behavior after execution, not before dispatch.

The $10K wake-up call

Here is how $10,000 disappears in less than a week. Three agents run in parallel, each researching a batch of target companies. Each research call fans out to 5 sub-calls: company scrape, people lookup, tech stack analysis, signal detection, and brief generation. Each sub-call consumes roughly 50,000 tokens at $0.015 per 1,000 tokens.

3 agents x 5 sub-calls = 15 LLM calls per round
15 calls x 50K tokens x $0.015/1K = $11.25 per round
20 rounds of fan-out per batch = $225 per batch run
10 batch runs per day = $2,250 per day
$10,000 in 4.5 days
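The arithmetic above is worth encoding as a quick estimator you can run against your own workload. A minimal sketch, using the example's numbers; every parameter (agent count, fan-out, token volume, price) is one you would swap for your own:

```python
def fanout_cost_per_day(agents: int, fanout: int, tokens_per_call: int,
                        usd_per_1k_tokens: float, rounds_per_batch: int,
                        batches_per_day: int) -> float:
    """Estimate daily spend for a fan-out agent workload."""
    calls_per_round = agents * fanout
    cost_per_round = calls_per_round * (tokens_per_call / 1000) * usd_per_1k_tokens
    return cost_per_round * rounds_per_batch * batches_per_day

# The example from the text: 3 agents, 5 sub-calls, 50K tokens at $0.015/1K
daily = fanout_cost_per_day(3, 5, 50_000, 0.015, rounds_per_batch=20, batches_per_day=10)
print(f"${daily:,.2f}/day")                   # $2,250.00/day
print(f"{10_000 / daily:.1f} days to $10K")   # 4.4 days
```

Running the estimator before launching a new batch workflow turns "how bad could it be?" into a number.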

This failure mode is multiplicative: fan-out plus retries plus long contexts plus concurrency. The technical bug can be small. The invoice impact is not.

This scaling behavior is visible in provider docs too. Anthropic notes that each agent teammate keeps its own context window, teammates can continue consuming tokens when left active, and plan-mode teams can use roughly 7x tokens versus standard sessions. Source: Claude Code cost documentation. Exact multipliers vary by prompt and model, but the direction is predictable: autonomy amplifies spend.

What top FinOps guides still miss for agents

Current FinOps guidance is strong at organizational process design. The blind spot is runtime enforcement in autonomous agent pipelines, where spend explodes in minutes instead of billing cycles.

| Source | What it covers well | Gap for production agent fleets |
| --- | --- | --- |
| FinOps Foundation: FinOps for AI Overview | Strong business-finance alignment model, KPI framing, and organizational ownership patterns. | No dispatch-time policy decision model for autonomous agents before token spend occurs. |
| FinOps Foundation: FinOps for AI Technology Category | Clear taxonomy for AI cost/usage data (including token-oriented metering complexity). | No runtime enforcement pattern for per-action deny/approve/throttle decisions in agent workflows. |
| TechTarget: 7 practical tips for agentic AI cost optimization | Actionable operating advice: scenario-based TCO, model right-sizing, autonomy limits, and explicit cost/error budgets. | No concrete pre-dispatch control-plane contract for immutable cost evidence and approval-bound execution. |

Practical minimum: every expensive action must emit one immutable cost-evidence record with decision and reviewer. Without that, post-incident cost reviews degrade into guesswork.

Cost evidence record for expensive agent actions
{
  "run_id": "run_01JTS1A4J1V4G6WVE3AX4QYHXF",
  "agent_id": "research-agent",
  "action": "job.batch-generate",
  "estimated_cost_usd": 78.40,
  "decision": "require_approval",
  "policy_version": "v1.9.0",
  "reviewer": "[email protected]",
  "outcome": "approved"
}
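Emitting such a record can be as simple as appending one JSON line per expensive action to an append-only log. A sketch under stated assumptions: `emit_cost_evidence` and the log path are illustrative names, not a real Cordum API, and the field names simply mirror the record above:

```python
import json
from datetime import datetime, timezone

def emit_cost_evidence(log_path: str, record: dict) -> dict:
    """Append one cost-evidence record as a JSON line; never rewrite past lines."""
    record = dict(record, recorded_at=datetime.now(timezone.utc).isoformat())
    with open(log_path, "a") as log:  # append-only keeps the trail tamper-evident
        log.write(json.dumps(record, sort_keys=True) + "\n")
    return record

evidence = emit_cost_evidence("cost_evidence.jsonl", {
    "run_id": "run_01JTS1A4J1V4G6WVE3AX4QYHXF",
    "agent_id": "research-agent",
    "action": "job.batch-generate",
    "estimated_cost_usd": 78.40,
    "decision": "require_approval",
    "policy_version": "v1.9.0",
})
```

A JSONL file is the simplest shape; production systems would write to an append-only store, but the contract is the same: one record per expensive action, written at decision time.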

Why traditional monitoring fails for AI agents

Cloud FinOps works because workloads are predictable. A Kubernetes cluster runs N pods, each consuming roughly the same resources. Dashboards show trailing metrics, and trailing metrics work when next month looks like last month.

Agents break this model. They are autonomous and concurrent. One prompt can trigger a chain of tool calls that fans out exponentially. By the time your dashboard updates, the spend has already happened. You are reading the bill, not preventing it.

Provider-level limits help, but they are usually scoped at account, project, or workspace level. They do not know your business semantics: whether this action is a safe read, a high-cost batch mutation, or a risky external side effect. Agent frameworks provide iteration controls and observability, but production dollar governance is still typically implemented outside the framework.

You would not wait for the AWS bill to discover your Lambda costs tripled. Agent cost governance requires the same shift: from trailing dashboards to leading controls. Evaluate cost before execution, not after.

Three layers of AI agent cost governance

Effective cost governance needs three layers. Each catches failures the others miss.

Layer 1: Per-Action

Cap tokens per individual LLM call. Set max_runtime_sec on every job. Kill calls that exceed thresholds before they finish. This catches the single expensive call.

Layer 2: Per-Agent

Total spend per agent per time window. A research agent gets $50/hour. A drafting agent gets $20/hour. When the budget is exhausted, the agent pauses. This catches agent loops.

Layer 3: Fleet-Level

Total spend across all agents with graceful degradation. When fleet budget hits 80%, throttle non-critical agents. At 95%, pause everything except approved workflows. This catches fan-out explosions.
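The three layers compose naturally into a single pre-dispatch check. A minimal sketch, assuming the thresholds from this section; the class and decision names are illustrative, not Cordum's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class CostGovernor:
    """Pre-dispatch checks: per-action cap, per-agent budget, fleet budget."""
    max_tokens_per_call: int       # Layer 1: per-action
    agent_budgets_usd: dict        # Layer 2: budget per agent per window
    fleet_budget_usd: float        # Layer 3: total across all agents
    agent_spend: dict = field(default_factory=dict)
    fleet_spend: float = 0.0

    def decide(self, agent_id: str, est_tokens: int, est_cost_usd: float) -> str:
        if est_tokens > self.max_tokens_per_call:
            return "deny"                            # the single expensive call
        spent = self.agent_spend.get(agent_id, 0.0)
        if spent + est_cost_usd > self.agent_budgets_usd.get(agent_id, 0.0):
            return "pause_agent"                     # the agent loop
        projected = self.fleet_spend + est_cost_usd
        if projected > self.fleet_budget_usd * 0.95:
            return "pause_fleet"                     # 95%: approved workflows only
        if projected > self.fleet_budget_usd * 0.80:
            return "throttle"                        # 80%: slow non-critical agents
        return "allow"

    def record(self, agent_id: str, cost_usd: float) -> None:
        """Book actual spend after the action completes."""
        self.agent_spend[agent_id] = self.agent_spend.get(agent_id, 0.0) + cost_usd
        self.fleet_spend += cost_usd
```

The key property: `decide` runs before dispatch, against projected spend, so the fan-out explosion is refused rather than reported.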

Policy-as-code for agent cost control

Cost governance belongs in your policy-as-code alongside security and compliance rules. Version-controlled YAML, reviewed in pull requests, enforced before execution. Not a dashboard setting that someone forgets to update.

Cordum's Safety Kernel evaluates cost policy on every job before it runs. Here is what agent cost governance looks like as code:

safety.yaml - agent cost governance
# safety.yaml - agent cost governance
version: v1
rules:
  - id: throttle-llm-calls
    match:
      topics: ["job.*.generate", "job.*.synthesize", "job.*.research"]
      risk_tags: ["high-cost"]
    decision: allow_with_constraints
    constraints:
      max_concurrent: 3
      rate_limit: "20/hour"
      max_runtime_sec: 120
    reason: "LLM calls throttled to prevent runaway spend"

  - id: approve-expensive-batch
    match:
      topics: ["job.*.batch-generate", "job.*.bulk-enrich"]
      risk_tags: ["high-cost", "batch"]
    decision: require_approval
    constraints:
      max_runtime_sec: 600
    reason: "Batch LLM operations above $50 estimated cost need review"

  - id: deny-unbounded-loop
    match:
      topics: ["job.*.recursive-search", "job.*.agent-loop"]
      risk_tags: ["unbounded"]
    decision: deny
    reason: "Unbounded recursive agent loops blocked by policy"

  - id: allow-cached-reads
    match:
      topics: ["job.*.read", "job.*.get", "job.*.list"]
      risk_tags: []
    decision: allow
    reason: "Read operations and cached results pass through"

LLM calls are throttled to 20 per hour with a maximum of 3 concurrent. Batch operations above an estimated cost threshold pause for human review. Unbounded recursive loops are blocked outright. Read operations and cached results pass through with no overhead.
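A first-match evaluator for this kind of rule file fits in a few lines. This is an assumption-laden illustration, not the Safety Kernel's implementation; it treats topic patterns as `fnmatch`-style globs and defaults to deny when nothing matches:

```python
from fnmatch import fnmatch

def evaluate(rules: list, topic: str, risk_tags: set) -> dict:
    """Return the first rule whose topic glob and risk tags both match."""
    for rule in rules:
        topics_ok = any(fnmatch(topic, pat) for pat in rule["match"]["topics"])
        tags_ok = set(rule["match"].get("risk_tags", [])) <= risk_tags
        if topics_ok and tags_ok:
            return {"decision": rule["decision"], "rule_id": rule["id"],
                    "constraints": rule.get("constraints", {})}
    return {"decision": "deny", "rule_id": None, "constraints": {}}  # default-deny

# Two rules mirroring the safety.yaml above, as parsed dicts
rules = [
    {"id": "throttle-llm-calls",
     "match": {"topics": ["job.*.generate", "job.*.research"],
               "risk_tags": ["high-cost"]},
     "decision": "allow_with_constraints",
     "constraints": {"max_concurrent": 3, "rate_limit": "20/hour"}},
    {"id": "allow-cached-reads",
     "match": {"topics": ["job.*.read", "job.*.list"], "risk_tags": []},
     "decision": "allow"},
]
print(evaluate(rules, "job.company.research", {"high-cost"})["decision"])
# allow_with_constraints
```

Because rules evaluate in order and unmatched topics fall through to deny, adding a new agent capability requires an explicit, reviewable policy change before it can spend anything.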

Per-topic timeouts provide a second layer of defense. If a call exceeds its timeout, the Safety Kernel kills it regardless of what the agent thinks it is doing:

Per-topic timeout configuration
# overlays/timeouts.patch.yaml
topics:
  "job.research.company":
    timeout_seconds: 120
    max_retries: 1
  "job.draft.email":
    timeout_seconds: 60
    max_retries: 1
  "job.enrich.contacts":
    timeout_seconds: 30
    max_retries: 0
  "job.generate.report":
    timeout_seconds: 180
    max_retries: 1
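Enforcing those per-topic deadlines from outside the agent can be sketched with a worker pool and a future timeout. The function names are illustrative; note that a thread-based cancel is best-effort, which is why a real kernel would run the call in a killable worker process:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

TIMEOUTS = {  # mirrors overlays/timeouts.patch.yaml
    "job.research.company": {"timeout_seconds": 120, "max_retries": 1},
    "job.enrich.contacts":  {"timeout_seconds": 30,  "max_retries": 0},
}

def run_with_deadline(topic: str, fn, *args):
    """Run fn under the topic's timeout, retrying up to max_retries."""
    cfg = TIMEOUTS[topic]
    attempts = cfg["max_retries"] + 1
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(attempts):
            future = pool.submit(fn, *args)
            try:
                return future.result(timeout=cfg["timeout_seconds"])
            except FutureTimeout:
                future.cancel()  # best-effort; a real kernel kills the worker
    raise RuntimeError(f"{topic}: exceeded {cfg['timeout_seconds']}s x{attempts}")
```

The enforcement lives in the dispatcher, not the agent, so a confused agent cannot opt out of its own deadline.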

Approval gates for expensive agent actions

Some agent actions should not run without a human reviewing the cost implication. A batch enrichment job that will process 10,000 contacts at $0.02 each costs $200. That is worth a 30-second review before execution.

The approval gate pattern extends naturally to cost governance. When an agent submits a job with estimated cost above a threshold, the Safety Kernel returns REQUIRE_APPROVAL. The job pauses. A human sees the estimated cost, the number of sub-calls, and the policy that triggered the gate. They approve, modify, or deny. Total review time: under a minute. Cost of not reviewing: potentially thousands of dollars.

This pattern is most useful for volume-sensitive workflows. An agent can behave correctly and still overspend if the batch size is larger than expected. Cost-based approval gates catch that class of failure before money is spent.
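The gate itself reduces to a small state machine: estimate, compare, pause, decide. A sketch with hypothetical names (`submit_job`, `review`, and the $50 threshold come from this article's example, not a real Cordum interface):

```python
APPROVAL_THRESHOLD_USD = 50.0  # illustrative threshold from the text

def submit_job(job: dict, estimated_cost_usd: float, review_queue: list) -> str:
    """Gate expensive jobs: cheap ones dispatch, costly ones wait for a human."""
    if estimated_cost_usd <= APPROVAL_THRESHOLD_USD:
        return "dispatched"
    review_queue.append({"job": job, "estimated_cost_usd": estimated_cost_usd,
                         "status": "pending"})
    return "require_approval"

def review(entry: dict, approved: bool) -> str:
    """Human decision: approve releases the job, deny drops it."""
    entry["status"] = "approved" if approved else "denied"
    return "dispatched" if approved else "blocked"

queue = []
# 10,000 contacts x $0.02 each = $200 estimated: pauses for review
status = submit_job({"topic": "job.enrich.contacts", "items": 10_000},
                    10_000 * 0.02, queue)
```

Everything the reviewer needs (estimated cost, item count, triggering policy) is already in the queued entry, which is what keeps the review under a minute.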

The cost audit trail

Every agent action should record its cost alongside the action itself. Not in a separate billing system that requires a join query. In the same audit trail entry: what the agent did, what policy was evaluated, what the decision was, how many tokens were consumed, and what the cost was.

This enables three things traditional monitoring cannot:

Chargeback per agent. Which agent spent the most this week? Which workflow has the highest cost per run? Which team's agents are most efficient? These answers come from the audit trail, not from parsing provider invoices.

Cost anomaly detection. When an agent's cost per action spikes 3x from its baseline, the system flags it. Not the next billing cycle. Immediately. Because the cost data is in the event stream, not in a monthly invoice.

Budget forecasting. If you know each agent's average cost per action and how many actions it runs per day, you can forecast spend with real data. Not estimates from a spreadsheet, but numbers from production.
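With per-action cost in the event stream, the anomaly check and the forecast are each a few lines. A sketch: the 3x baseline multiplier comes from the text above; the sample numbers and function names are illustrative:

```python
from statistics import mean

def is_cost_anomaly(history_usd: list, latest_usd: float,
                    multiplier: float = 3.0) -> bool:
    """Flag an action whose cost spikes past multiplier x the agent's baseline."""
    baseline = mean(history_usd)
    return latest_usd > multiplier * baseline

def forecast_monthly_spend(avg_cost_per_action_usd: float,
                           actions_per_day: float) -> float:
    """Project spend from production averages, not spreadsheet estimates."""
    return avg_cost_per_action_usd * actions_per_day * 30

print(is_cost_anomaly([0.40, 0.45, 0.38, 0.42], 1.50))  # True: ~3.6x baseline
print(round(forecast_monthly_spend(0.42, 500), 2))       # 6300.0
```

Because both read from the same audit trail entries, the spike is flagged on the next event, not on the next invoice.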

Getting started with agent FinOps

Three steps, in order of impact.

1
Set per-topic timeouts today. Every LLM-calling job should have a max_runtime_sec. Research calls get 120 seconds. Drafting calls get 60. Simple enrichments get 30. If a call cannot complete in its window, something is wrong. Kill it. This alone prevents the worst runaway scenarios.
2
Add throttle rules for high-cost topics. Rate-limit your most expensive agent actions. 20 research calls per hour, 3 concurrent max. This prevents fan-out explosions while still allowing agents to work. Review and adjust limits weekly based on actual usage from the audit trail.
3
Require approval for batch operations. Any job that will process more than 100 items or exceed $50 estimated cost should pause for human review. A 30-second approval decision prevents a $500 surprise. Set up a weekly cost review using audit trail data to identify optimization opportunities. Read our quickstart guide to configure these controls in five minutes.
By Zvi, CTO & Co-founder, Cordum

Decade of experience building security infrastructure at enterprise scale. Now building the governance layer for autonomous AI agents.

Stop the bleed before it starts

Add cost governance to your agent stack. Timeouts, throttles, and approval gates in five minutes.
