## The problem
Infra teams already know how to block obviously bad plans. The harder problem starts after a plan passes. The agent submits, queue latency grows, dependencies shift, the safety service hiccups, and retries kick in.
If you gate only once, you create a blind window between plan validation and actual dispatch. That window is where expensive incidents happen. The postmortem usually says "retry loop." It rarely says "brilliant architecture choice."
## What top articles miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| AWS Prescriptive Guidance: Control Tower + Terraform | Strong account-level controls, behavior types (preventive, detective, proactive), and Terraform rollout workflow. | No action-level runtime governance for autonomous agents once an operation is triggered. |
| Pulumi Policies docs | Policy packs, preventative vs audit modes, and local/cloud policy workflows. | Limited guidance on approval lock semantics and retry safety after a policy decision is approved. |
| HashiCorp policy-as-code references | Sentinel policy sets, workspace scope, and policy lifecycle discipline. | No dual-gate execution model that handles kernel outages, circuit-open states, or dispatcher backoff behavior. |
The gap is consistent: good policy authoring guidance, weak runtime governance guidance. Production automation needs both.
## Dual-gate model
The practical model is two independent checks with clear failure semantics:
- Gate A (submit-time): evaluate before writing state or publishing to the bus.
- Gate B (dispatch-time): evaluate again immediately before worker routing.
- Approval binding: if approval is needed, bind it to the policy snapshot and job hash.
- Retry safety: the idempotency key must survive retries and restarts.
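In code, the split looks roughly like the sketch below. The `Decision` type, `Job` shape, and `evaluate` callback are illustrative stand-ins for whatever policy client you wire in, not a real Cordum API.

```typescript
// Hedged sketch: type and function names here are illustrative, not Cordum's API.
type Decision = "allow" | "deny" | "throttle" | "require_human";

interface Job {
  topic: string;
  payload: unknown;
  idempotencyKey: string; // must survive retries and restarts
}

type Evaluate = (job: Job) => Promise<Decision>;

// Gate A: evaluate before writing state or publishing to the bus.
export async function submitGate(job: Job, evaluate: Evaluate): Promise<Decision> {
  const decision = await evaluate(job);
  if (decision !== "allow") return decision; // nothing persisted, nothing published
  // ...persist job state and publish to the bus here...
  return "allow";
}

// Gate B: evaluate again immediately before worker routing, catching drift
// (policy changes, environment changes) since Gate A passed.
export async function dispatchGate(job: Job, evaluate: Evaluate): Promise<Decision> {
  const decision = await evaluate(job);
  if (decision !== "allow") return decision; // requeue or park per fail-mode policy
  // ...route to the worker here...
  return "allow";
}
```

The key property is independence: Gate B does not trust Gate A's answer, because the world can change between the two calls.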
Cordum's docs and code implement this split explicitly across the gateway and the scheduler, with separate fail-mode controls and safety-client circuit behavior.
## Runtime controls with real values
| Control point | Behavior | Observed values | Why it matters |
|---|---|---|---|
| Submit-time (gateway) | Policy is evaluated before persistence and publish. | 5s evaluation timeout; deny=403, throttle=429, require_human=APPROVAL state | Prevents unsafe jobs from entering queue state. |
| Dispatch-time (scheduler) | Policy is checked again before worker routing. | 2s safety client timeout + 3s engine defense timeout | Catches drift between submit and dispatch windows. |
| Safety unavailable behavior | Fail mode determines requeue or bypass. | `POLICY_CHECK_FAIL_MODE=closed|open`; default closed | Makes outage behavior explicit instead of implicit. |
| Circuit breaker | Distributed breaker for safety calls. | 3 failures opens for 30s; half-open max probes=3 | Stops cluster-wide retry storms during kernel incidents. |
| Scheduling retry cap | Bounded scheduling retries before fail. | maxSchedulingRetries=50 (~25 min with backoff) | Prevents jobs from looping forever in degraded states. |
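The "~25 min with backoff" figure for 50 scheduling retries implies a backoff curve that caps out around 30 seconds per attempt. The exact curve is not documented in the table, so the model below, exponential from 1s capped at 30s, is an assumption used only to sanity-check the numbers.

```typescript
// Assumed backoff model: exponential starting at 1s, capped at 30s per attempt.
// Only the retry cap (50) and the rough total (~25 min) come from the table above.
function totalBackoffSeconds(maxRetries: number, baseSec = 1, capSec = 30): number {
  let total = 0;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    total += Math.min(baseSec * 2 ** attempt, capSec);
  }
  return total;
}

// 50 retries under this model spend roughly 23 minutes in backoff,
// in the same ballpark as the ~25 minutes quoted above.
```

If your scheduler uses jittered or linear backoff the shape changes, but the bounded-total property is the point: a capped retry budget turns "stuck forever" into a finite triage event.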
## Implementation code
### 1) Policy decisions by action and environment

```yaml
version: v1
rules:
  - id: deny-prod-public-ingress
    when:
      topic: infra.security_group.update
      env: production
      cidr: "0.0.0.0/0"
    decision: deny
  - id: require-approval-prod-delete
    when:
      topic: infra.resource.delete
      env: production
    decision: require_human
  - id: throttle-bulk-recreate
    when:
      topic: infra.cluster.recreate
      env: production
    decision: throttle
```

### 2) Idempotent orchestrator wrapper
```typescript
type InfraAction = {
  approvalId: string;
  planHash: string;
  topic: string;
  payload: unknown;
};

export async function runInfraAction(action: InfraAction) {
  const idempotencyKey = action.approvalId + ":" + action.planHash;
  const existing = await db.actions.findUnique({ where: { idempotencyKey } });
  if (existing) return existing.result;
  // Submit with idempotency metadata so retries do not repeat side effects.
  const response = await cordum.jobs.submit({
    topic: action.topic,
    labels: { change_source: "infra-agent" },
    metadata: { idempotency_key: idempotencyKey },
    payload: action.payload,
  });
  await db.actions.create({
    data: { idempotencyKey, jobId: response.jobId, status: "submitted" },
  });
  return response;
}
```

### 3) Operator checks
```shell
# Alert if any production jobs bypass safety in fail-open mode (PromQL).
sum(rate(cordum_scheduler_input_fail_open_total{topic=~"infra.*"}[5m])) > 0

# Sanity check policy snapshot before rollout.
curl -s "$CORDUM_API/api/v1/policy/snapshots" | jq .

# Simulate policy decision in CI before merge.
curl -s -X POST "$CORDUM_API/api/v1/policy/simulate" \
  -H "Content-Type: application/json" \
  -d '{"topic":"infra.resource.delete","tenant":"prod"}' | jq .
```

## Limitations and tradeoffs
- Two gates add latency. You trade a few milliseconds for fewer high-cost incidents.
- Fail-open is useful for availability testing but risky for production mutation topics.
- Strict deny rules can block urgent remediation if emergency paths are not pre-modeled.
- Retry caps prevent infinite loops but can push unresolved work into DLQ triage load.
- Approval gates protect production but can still become a human bottleneck during spikes.
## Next step
Run this one-week rollout:
1. Classify your top 20 infra actions into low, medium, and high risk.
2. Enforce submit-time deny/throttle/require_human policies for high-risk actions.
3. Turn on dispatch-time safety checks with `POLICY_CHECK_FAIL_MODE=closed` in production.
4. Add alerting on `cordum_scheduler_input_fail_open_total` and test outage behavior.
5. Make idempotency key coverage a release gate for every mutating action path.
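The classification in step 1 can live as a tiny lookup that the later steps build on. The sketch below is illustrative: only the first three topics appear in the policy examples earlier in this article, and the rest, along with the tier-to-gate mapping, are assumptions.

```typescript
type Risk = "low" | "medium" | "high";

// Illustrative classification; only the first three topics come from the
// policy rules above, the remaining entries are hypothetical examples.
const riskByTopic: Record<string, Risk> = {
  "infra.resource.delete": "high",
  "infra.security_group.update": "high",
  "infra.cluster.recreate": "high",
  "infra.dns.record_update": "medium",
  "infra.tag.update": "low",
};

// Decide whether a topic needs a human approval gate at submit time.
// Unknown topics default to high risk, mirroring the fail-closed default.
function needsApproval(topic: string): boolean {
  return (riskByTopic[topic] ?? "high") === "high";
}
```

Defaulting unclassified topics to high risk keeps the list honest: a new mutating action is blocked behind approval until someone deliberately tiers it down.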
Continue with pre-dispatch governance for AI agents and policy-as-code patterns.