
Infrastructure Automation AI Agent Guardrails

The failure mode is simple: one bad retry can mutate production twice. Guardrails must be explicit at every control point.

Guide · 12 min read · Apr 2026
TL;DR
  • One policy check is not enough. You need one gate at submit-time and another at dispatch-time.
  • Cordum runs submit-time checks before state persistence, then dispatch-time checks with a 3s engine timeout and fail-mode controls.
  • Circuit breaking is mandatory for stability: in the scheduler's safety client, 3 failures open the circuit for 30s.
Submit-time gate

Reject, throttle, or route to approval before writing job state

Dispatch-time gate

Re-check policy right before worker dispatch

Retry-safe execution

Idempotency keys + bounded retries + DLQ fallback

Scope

This guide is for autonomous infrastructure actions, not for manual IaC review flows. The focus is policy decisions, approval transitions, and runtime behavior under partial outages.

The problem

Infra teams already know how to block obvious bad plans. The harder problem starts after a plan passes. The agent submits, queue latency grows, dependencies shift, the safety service hiccups, and retries kick in.

If you only gate once, you create a blind window between plan validation and actual dispatch. That window is where expensive incidents happen. The postmortem usually says retry loop. It rarely says brilliant architecture choice.

What top articles miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| AWS Prescriptive Guidance: Control Tower + Terraform | Strong account-level controls, behavior types (preventive, detective, proactive), and Terraform rollout workflow. | No action-level runtime governance for autonomous agents once an operation is triggered. |
| Pulumi Policies docs | Policy packs, preventative vs audit modes, and local/cloud policy workflows. | Limited guidance on approval lock semantics and retry safety after a policy decision is approved. |
| HashiCorp policy-as-code references | Sentinel policy sets, workspace scope, and policy lifecycle discipline. | No dual-gate execution model that handles kernel outages, circuit-open states, or dispatcher backoff behavior. |

The gap is consistent: good policy authoring guidance, weak runtime governance guidance. Production automation needs both.

Dual-gate model

The practical model is two independent checks with clear failure semantics:

  • Gate A (submit-time): evaluate before writing state or publishing to the bus.
  • Gate B (dispatch-time): evaluate again immediately before worker routing.
  • Approval binding: if approval is needed, bind it to policy snapshot and job hash.
  • Retry safety: idempotency key must survive retries and restarts.

Cordum docs and code implement this split explicitly across gateway and scheduler, with separate fail-mode controls and safety-client circuit behavior.
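To make the failure semantics concrete, here is a minimal sketch of the two gates and the approval binding. Every type and function name in it is hypothetical and chosen for illustration; it is not Cordum's API. It assumes a policy check that either returns a decision or throws when the safety service is unreachable, and it assumes that a `require_human` decision with no valid approval is rejected at dispatch.

```typescript
// Sketch only: all names here are hypothetical, not Cordum's API.
type Decision = "allow" | "deny" | "throttle" | "require_human";
type FailMode = "closed" | "open";

interface Job {
  id: string;
  topic: string;
  jobHash: string; // hash of the job payload, computed by the caller
}

interface Approval {
  policySnapshot: string;
  jobHash: string;
}

// Returns a decision, or throws when the safety service is unavailable.
type PolicyCheck = (job: Job) => Decision;

type SubmitResult = "queued" | "rejected" | "throttled" | "awaiting_approval";
type DispatchResult = "dispatched" | "requeued" | "rejected" | "bypassed";

// Gate A: runs before any state is written or published.
function submitGate(job: Job, check: PolicyCheck): SubmitResult {
  const decision = check(job);
  if (decision === "deny") return "rejected";
  if (decision === "throttle") return "throttled";
  if (decision === "require_human") return "awaiting_approval";
  return "queued";
}

// Gate B: runs again right before worker routing.
function dispatchGate(
  job: Job,
  check: PolicyCheck,
  failMode: FailMode,
  currentSnapshot: string,
  approval?: Approval,
): DispatchResult {
  // An approval only counts for the exact job hash and policy snapshot it was granted on.
  if (
    approval &&
    (approval.jobHash !== job.jobHash || approval.policySnapshot !== currentSnapshot)
  ) {
    return "rejected";
  }
  let decision: Decision;
  try {
    decision = check(job);
  } catch {
    // Safety unavailable: fail-closed requeues, fail-open bypasses the check.
    return failMode === "closed" ? "requeued" : "bypassed";
  }
  if (decision === "allow") return "dispatched";
  if (decision === "require_human" && approval) return "dispatched";
  if (decision === "throttle") return "requeued";
  return "rejected"; // deny, or require_human without a valid approval
}
```

The point of the sketch is that Gate B never trusts Gate A's result: a job that was approved under an older policy snapshot is rejected at dispatch rather than run.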

Runtime controls with real values

| Control point | Behavior | Observed values | Why it matters |
| --- | --- | --- | --- |
| Submit-time (gateway) | Policy is evaluated before persistence and publish. | 5s evaluation timeout; deny=403, throttle=429, require_human=APPROVAL state | Prevents unsafe jobs from entering queue state. |
| Dispatch-time (scheduler) | Policy is checked again before worker routing. | 2s safety client timeout + 3s engine defense timeout | Catches drift between submit and dispatch windows. |
| Safety unavailable behavior | Fail mode determines requeue or bypass. | `POLICY_CHECK_FAIL_MODE` set to `closed` or `open`; default `closed` | Makes outage behavior explicit instead of implicit. |
| Circuit breaker | Distributed breaker for safety calls. | 3 failures open the circuit for 30s; half-open max probes=3 | Stops cluster-wide retry storms during kernel incidents. |
| Scheduling retry cap | Bounded scheduling retries before fail. | maxSchedulingRetries=50 (~25 min with backoff) | Prevents jobs from looping forever in degraded states. |
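The circuit breaker row can be read as a small state machine. The sketch below is a single-process illustration using the thresholds from the table (3 failures, 30s open, 3 half-open probes). The class and method names are hypothetical, and a distributed breaker as described above would need shared state that this sketch does not model. Whether the real breaker counts consecutive or windowed failures is not stated here; this version assumes consecutive.

```typescript
// Sketch only: single-process breaker; names are hypothetical, not Cordum's API.
type BreakerState = "closed" | "open" | "half_open";

class SafetyBreaker {
  private readonly failureThreshold: number;
  private readonly openMs: number;
  private readonly maxProbes: number;
  private state: BreakerState = "closed";
  private failures = 0;
  private openedAt = 0;
  private probes = 0;

  constructor(failureThreshold = 3, openMs = 30_000, maxProbes = 3) {
    this.failureThreshold = failureThreshold;
    this.openMs = openMs;
    this.maxProbes = maxProbes;
  }

  // Returns true if a safety call may be attempted at time `now` (milliseconds).
  allow(now: number): boolean {
    if (this.state === "open" && now - this.openedAt >= this.openMs) {
      // Open window elapsed: let a limited number of probes through.
      this.state = "half_open";
      this.probes = 0;
    }
    if (this.state === "open") return false;
    if (this.state === "half_open") {
      if (this.probes >= this.maxProbes) return false;
      this.probes += 1;
    }
    return true;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(now: number): void {
    this.failures += 1;
    // A failed probe, or the Nth consecutive failure, opens the circuit.
    if (this.state === "half_open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = now;
      this.failures = 0;
    }
  }

  current(): BreakerState {
    return this.state;
  }
}
```

Passing `now` in explicitly keeps the breaker deterministic under test. While the circuit is open, the caller falls back to whatever the configured fail mode dictates instead of calling the safety service.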

Implementation code

1) Policy decisions by action and environment

infra-policy.yaml
YAML
version: v1
rules:
  - id: deny-prod-public-ingress
    when:
      topic: infra.security_group.update
      env: production
      cidr: "0.0.0.0/0"
    decision: deny

  - id: require-approval-prod-delete
    when:
      topic: infra.resource.delete
      env: production
    decision: require_human

  - id: throttle-bulk-recreate
    when:
      topic: infra.cluster.recreate
      env: production
    decision: throttle
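
To reason about which rule fires for a given request, a small evaluator can help. The sketch below transcribes the three rules by hand and assumes simple semantics: every `when` field must equal the request field, the first matching rule wins, and anything unmatched is allowed. Those semantics are assumptions for illustration, not Cordum's documented behavior. The submit-time outcomes are the ones listed in the runtime controls table.

```typescript
// Sketch only: matching semantics and the default-allow fallback are assumptions.
type Decision = "allow" | "deny" | "throttle" | "require_human";

interface Rule {
  id: string;
  when: Record<string, string>;
  decision: Decision;
}

// The three rules from infra-policy.yaml, transcribed by hand.
const rules: Rule[] = [
  {
    id: "deny-prod-public-ingress",
    when: { topic: "infra.security_group.update", env: "production", cidr: "0.0.0.0/0" },
    decision: "deny",
  },
  {
    id: "require-approval-prod-delete",
    when: { topic: "infra.resource.delete", env: "production" },
    decision: "require_human",
  },
  {
    id: "throttle-bulk-recreate",
    when: { topic: "infra.cluster.recreate", env: "production" },
    decision: "throttle",
  },
];

// First rule whose `when` fields all match the request wins; otherwise allow.
function evaluate(
  request: Record<string, string>,
  ruleSet: Rule[],
): { decision: Decision; ruleId?: string } {
  for (const rule of ruleSet) {
    const matches = Object.keys(rule.when).every((key) => request[key] === rule.when[key]);
    if (matches) return { decision: rule.decision, ruleId: rule.id };
  }
  return { decision: "allow" };
}

// Submit-time outcome per decision, as listed in the runtime controls table.
function submitOutcome(decision: Decision): string {
  switch (decision) {
    case "deny":
      return "HTTP 403";
    case "throttle":
      return "HTTP 429";
    case "require_human":
      return "APPROVAL state";
    default:
      return "accepted";
  }
}
```

For example, `evaluate({ topic: "infra.resource.delete", env: "staging" }, rules)` matches no rule under these assumptions, which is a reminder that the sample policy only covers production.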

2) Idempotent orchestrator wrapper

orchestrator.ts
TypeScript
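// Assumed to exist in your codebase: `db`, a Prisma-style client with a unique
// index on actions.idempotencyKey, and `cordum`, an initialized Cordum client.
// The module paths below are placeholders.
import { db } from "./db";
import { cordum } from "./cordum-client";

type InfraAction = {
  approvalId: string;
  planHash: string;
  topic: string;
  payload: unknown;
};

export async function runInfraAction(action: InfraAction) {
  // Same approval + same plan always yields the same key, across retries and restarts.
  const idempotencyKey = action.approvalId + ":" + action.planHash;

  // If this action was already submitted, return the recorded job instead of resubmitting.
  const existing = await db.actions.findUnique({ where: { idempotencyKey } });
  if (existing) return { jobId: existing.jobId, status: existing.status };

  // Submit with idempotency metadata so retries do not repeat side effects.
  const response = await cordum.jobs.submit({
    topic: action.topic,
    labels: { change_source: "infra-agent" },
    metadata: { idempotency_key: idempotencyKey },
    payload: action.payload,
  });

  // Record the submission; the unique index rejects a concurrent duplicate insert.
  await db.actions.create({
    data: { idempotencyKey, jobId: response.jobId, status: "submitted" },
  });

  return { jobId: response.jobId, status: "submitted" };
}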
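
Idempotency keys only pay off when retries are bounded and exhausted work lands somewhere visible. The following is a minimal sketch of bounded retries with capped exponential backoff and a DLQ fallback. The backoff parameters (1s base, doubling, 30s cap), the `sendToDlq` callback, and the injected `sleep` are all assumptions for illustration; the guide quotes `maxSchedulingRetries=50` for the scheduler but does not give the actual backoff schedule.

```typescript
// Sketch only: backoff parameters and the DLQ callback are assumptions.
function backoffMs(attempt: number, baseMs = 1_000, capMs = 30_000): number {
  // attempt is 1-based: 1s, 2s, 4s, ... capped at capMs.
  return Math.min(capMs, baseMs * 2 ** (attempt - 1));
}

async function withBoundedRetries<T>(
  run: () => Promise<T>, // must be idempotent, e.g. keyed by an idempotency key
  sendToDlq: (lastError: unknown) => Promise<void>,
  sleep: (ms: number) => Promise<void>,
  maxRetries = 50,
): Promise<T | undefined> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await run();
    } catch (err) {
      lastError = err;
      // No point sleeping after the final attempt.
      if (attempt < maxRetries) await sleep(backoffMs(attempt));
    }
  }
  // Retries exhausted: park the work for triage instead of looping forever.
  await sendToDlq(lastError);
  return undefined;
}

// Total time spent waiting if every attempt fails, under the assumed schedule.
function totalBackoffMs(maxRetries = 50): number {
  let total = 0;
  for (let attempt = 1; attempt < maxRetries; attempt++) {
    total += backoffMs(attempt);
  }
  return total;
}
```

Under these assumed parameters, 50 failed attempts wait 1 + 2 + 4 + 8 + 16 seconds and then 44 times 30 seconds, about 22.5 minutes in total. That is the same order of magnitude as the ~25 min quoted for the scheduler, though the real schedule may differ.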

3) Operator checks

ops-checks.sh
Bash
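# Alert if any infra-topic jobs bypass safety in fail-open mode.
# The expression is PromQL, so run it through the Prometheus HTTP API (or use it in an alerting rule).
# PROMETHEUS_URL is an assumed variable pointing at your Prometheus server.
curl -sG "$PROMETHEUS_URL/api/v1/query" \
  --data-urlencode 'query=sum(rate(cordum_scheduler_input_fail_open_total{topic=~"infra.*"}[5m])) > 0' | jq '.data.result'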

# Sanity check policy snapshot before rollout.
curl -s "$CORDUM_API/api/v1/policy/snapshots" | jq .

# Simulate policy decision in CI before merge.
curl -s -X POST "$CORDUM_API/api/v1/policy/simulate" \
  -H "Content-Type: application/json" \
  -d '{"topic":"infra.resource.delete","tenant":"prod"}' | jq .

Limitations and tradeoffs

  • Two gates add latency. Each check is usually fast, but the worst case is bounded by the timeouts above (5s at submit, 2s + 3s at dispatch). You trade that for fewer high-cost incidents.
  • Fail-open is useful for availability testing, risky for production mutation topics.
  • Strict deny rules can block urgent remediation if emergency paths are not pre-modeled.
  • Retry caps reduce infinite loops but can push unresolved work into DLQ triage load.
  • Approval gates protect production and can still become a human bottleneck during spikes.

Next step

Run this one-week rollout:

  1. Classify your top 20 infra actions into low, medium, high risk.
  2. Enforce submit-time deny/throttle/require_human policies for high-risk actions.
  3. Turn on dispatch-time safety checks with `POLICY_CHECK_FAIL_MODE=closed` in production.
  4. Add alerting on `cordum_scheduler_input_fail_open_total` and test outage behavior.
  5. Make idempotency key coverage a release gate for every mutating action path.
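
Steps 1, 2, and 5 of the rollout can start as a small catalogue check in CI. The sketch below is one possible shape: the risk tiers, the tier-to-decision mapping, and the `ActionSpec` fields are hypothetical examples, not values taken from Cordum.

```typescript
// Sketch only: tiers, mapping, and catalogue fields are hypothetical.
type Risk = "low" | "medium" | "high";
type Decision = "allow" | "throttle" | "require_human" | "deny";

interface ActionSpec {
  topic: string;
  risk: Risk;
  mutating: boolean;
  hasIdempotencyKey: boolean;
}

// Step 2: a default submit-time decision per tier; specific rules can still override it.
const defaultDecision: Record<Risk, Decision> = {
  low: "allow",
  medium: "throttle",
  high: "require_human",
};

// Step 5: list mutating actions that would fail the release gate.
function missingIdempotency(actions: ActionSpec[]): string[] {
  return actions
    .filter((action) => action.mutating && !action.hasIdempotencyKey)
    .map((action) => action.topic);
}
```

A CI job could fail the build whenever `missingIdempotency` returns a non-empty list, which turns step 5 into an enforced gate rather than a convention.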

Continue with pre-dispatch governance for AI agents and policy-as-code patterns.

Operate safely

Fast automation is useful. Predictable automation is what keeps you out of incident review at 03:00.