The production problem
Most policy projects fail after the first rules file lands in Git.
The syntax works. The runtime path does not.
Teams discover late that policy updates are unsigned, approval decisions use stale snapshots, and fail-mode defaults are unclear during outages.
That is how a policy platform becomes a logging platform.
What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Open Policy Agent docs | Policy-as-code foundations and decoupling authorization from app logic. | No agent-control-plane walkthrough for policy snapshots, approval drift checks, and sink-specific AI constraints. |
| Cedar policy language reference | Authorization model, schema validation, and policy readability at scale. | No execution-path guidance for queue-driven AI jobs where decisions must survive retries and delayed approvals. |
| AWS AgentCore policy blog | Policy authoring options and real-time interception ideas for agent tooling. | No explicit fail-mode contract and no open code path for policy-signature enforcement and approval snapshot conflicts. |
Policy model that survives production
A practical custom policy system needs five properties:
Deterministic decisions. Snapshot lineage. Source integrity. Rollout simulation. Drift-safe approvals.
| Boundary | Current behavior | Why it matters |
|---|---|---|
| Decision contract | `allow`, `deny`, `require_approval`, `throttle`, `allow_with_constraints` map to strict protobuf enums. | No ambiguous runtime behavior when policy authors add new rules. |
| Snapshot lineage | Each evaluation returns `PolicySnapshot`; approval path checks snapshot consistency before final approval. | Prevents approval on stale policy after security changes. |
| Source integrity | Policy can load from file or URL, but signature verification is required in production. | Blocks silent policy tampering and accidental unsigned rollout. |
| Transport hardening | Production rejects `http://` policy URLs and blocks private hosts unless explicitly enabled. | Reduces SSRF and policy-supply-chain attack surface. |
| Pre-rollout simulation | Gateway exposes `/api/v1/policy/evaluate`, `/api/v1/policy/simulate`, and `/api/v1/policy/explain` endpoints for dry-run analysis. | Lets teams catch blast-radius errors before policy publish. |
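To make the transport-hardening row concrete, here is a minimal sketch of a URL check that rejects `http://` in production and blocks private hosts unless explicitly enabled. The function name `validatePolicyURL` and its parameters are assumptions for illustration, not the project's actual API. The sketch inspects only literal IPs and `localhost`; it does not resolve DNS names, which a real check would likely need to do.

```go
package main

import (
	"errors"
	"net"
	"net/url"
	"strings"
)

// validatePolicyURL is a hypothetical helper (not taken from the source
// codebase). It rejects plain-http policy URLs in production and refuses
// loopback/private/link-local targets unless allowPrivate is set.
func validatePolicyURL(raw string, production, allowPrivate bool) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	switch u.Scheme {
	case "https":
		// always acceptable
	case "http":
		if production {
			return errors.New("plain http policy URLs are rejected in production")
		}
	default:
		return errors.New("unsupported policy URL scheme")
	}

	host := u.Hostname()
	if host == "" {
		return errors.New("policy URL has no host")
	}
	if allowPrivate {
		return nil
	}
	if strings.EqualFold(host, "localhost") {
		return errors.New("policy URL points at a local host")
	}
	// Only literal IPs are checked here; hostnames that resolve to private
	// ranges would need a DNS lookup, which this sketch leaves out.
	if ip := net.ParseIP(host); ip != nil {
		if ip.IsLoopback() || ip.IsPrivate() || ip.IsLinkLocalUnicast() || ip.IsUnspecified() {
			return errors.New("policy URL points at a private address")
		}
	}
	return nil
}
```

Calling `validatePolicyURL("http://policies.example.com/safety.yaml", true, false)` returns an error, while the same URL with `https://` passes. The example host is a placeholder.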
Concrete code paths
Custom policy skeleton
```yaml
# config/safety.yaml (excerpt)
default_decision: deny
rules:
  - id: workflow-approval-gate
    match:
      topics:
        - job.cordum.approval-gate
    decision: require_approval
    reason: "Workflow approval gates require explicit human authorization."
input_policy:
  fail_mode: closed
output_policy:
  enabled: true
  fail_mode: closed
```
Decision mapping and response shape
```go
// core/controlplane/safetykernel/kernel.go (excerpt)
switch policyDecision.Decision {
case "deny":
	decision = pb.DecisionType_DECISION_TYPE_DENY
case "require_approval":
	decision = pb.DecisionType_DECISION_TYPE_REQUIRE_HUMAN
case "throttle":
	decision = pb.DecisionType_DECISION_TYPE_THROTTLE
case "allow_with_constraints":
	decision = pb.DecisionType_DECISION_TYPE_ALLOW_WITH_CONSTRAINTS
case "allow":
	decision = pb.DecisionType_DECISION_TYPE_ALLOW
}

resp := &pb.PolicyCheckResponse{
	Decision:         decision,
	PolicySnapshot:   snapshot,
	RuleId:           ruleID,
	ApprovalRequired: approvalRequired,
}
```
Policy source and snapshot build
```go
// core/controlplane/safetykernel/kernel.go (excerpt)

// policySourceFromEnv prefers SAFETY_POLICY_URL over the file path.
func policySourceFromEnv(path string) string {
	if raw := strings.TrimSpace(os.Getenv("SAFETY_POLICY_URL")); raw != "" {
		return raw
	}
	return strings.TrimSpace(path)
}

// loadPolicyBundle verifies the signature before parsing the policy.
func loadPolicyBundle(source string) (*config.SafetyPolicy, string, error) {
	data, err := readPolicySource(source)
	if err != nil {
		return nil, "", err
	}
	if err := verifyPolicySignature(data, source); err != nil {
		return nil, "", err
	}
	policy, err := config.ParseSafetyPolicy(data)
	// snapshot = "<version>:<sha256>"
	// ... (remainder elided in this excerpt)
}
```
Signature enforcement (Ed25519)
```go
// core/controlplane/safetykernel/kernel.go (excerpt)
func verifyPolicySignature(data []byte, source string) error {
	requireSignature := env.IsProduction() || env.Bool("SAFETY_POLICY_SIGNATURE_REQUIRED")
	// ... (loading of pubRaw, pubKey, and sig elided in this excerpt)
	if pubRaw == "" && requireSignature {
		return errors.New("policy signature required but SAFETY_POLICY_PUBLIC_KEY not configured")
	}
	if !ed25519.Verify(ed25519.PublicKey(pubKey), data, sig) {
		return errors.New("policy signature verification failed")
	}
	return nil
}
```
Approval drift guard
```go
// core/controlplane/gateway/handlers_approvals.go (excerpt)
if currentSnapshot == "" || snapshotBase(currentSnapshot) != snapshotBase(policySnapshot) {
	result = handlerResult{http.StatusConflict, "policy snapshot changed; re-evaluate before approving"}
	return nil
}
```
Validation runbook
Run this checklist before every policy publish:
```shell
# 1) Validate policy source and signature enforcement
go test ./core/controlplane/safetykernel -run TestVerifyPolicySignature -count=1
go test ./core/controlplane/safetykernel -run TestVerifyPolicySignatureProductionRequiresKey -count=1
go test ./core/controlplane/safetykernel -run TestFetchPolicyURLRejectsHTTPInProduction -count=1

# 2) Validate policy decision APIs
curl -sS -X POST http://localhost:8081/api/v1/policy/evaluate \
  -H "Content-Type: application/json" \
  -d '{"topic":"job.default","tenant":"default","principal_id":"ops-admin"}'
curl -sS -X POST http://localhost:8081/api/v1/policy/simulate \
  -H "Content-Type: application/json" \
  -d '{"topic":"job.default","tenant":"default","principal_id":"ops-admin"}'

# 3) Validate approval snapshot drift protection
go test ./core/controlplane/gateway -run TestApprovalsRequireCurrentPolicySnapshot -count=1

# 4) Validate scheduler fail mode remains closed by default
go test ./core/controlplane/scheduler -run TestSafetyUnavailable_FailClosed -count=1
```
Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Loose policy rules + manual review | Fast initial setup. | High human load, inconsistent enforcement, poor incident reproducibility. |
| Deterministic policy + snapshot enforcement (current) | Consistent behavior across retries, approvals, and distributed replicas. | Requires disciplined policy testing and version management. |
| Unsigned remote policy fetch | Operational convenience for quick edits. | Unsafe supply chain surface; production drift and tamper risk. |
- First-match rules in YAML are powerful and dangerous. Rule ordering errors create hidden policy regressions.
- Strict signatures improve integrity but force stronger key-management discipline.
- Simulation endpoints reduce risk, but only if your test payloads reflect real tenant traffic.
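The rule-ordering risk is easier to see with a small model. The sketch below is an illustration under simplifying assumptions: `Rule`, `firstMatch`, and `shadowedRules` are hypothetical names, and matching is reduced to exact topic strings, with an empty topic list treated as a catch-all. The real matcher presumably supports more fields than this. It shows first-match evaluation plus a lint that flags rules that can never fire because earlier rules already cover every topic they list.

```go
package main

// Rule is a simplified, hypothetical stand-in for one entry under `rules:`.
type Rule struct {
	ID       string
	Topics   []string // exact topic matches; an empty list matches every topic
	Decision string
}

// matches reports whether the rule applies to the given topic.
func (r Rule) matches(topic string) bool {
	if len(r.Topics) == 0 {
		return true
	}
	for _, t := range r.Topics {
		if t == topic {
			return true
		}
	}
	return false
}

// firstMatch returns the decision and ID of the first matching rule,
// falling back to defaultDecision when no rule applies.
func firstMatch(rules []Rule, topic, defaultDecision string) (decision, ruleID string) {
	for _, r := range rules {
		if r.matches(topic) {
			return r.Decision, r.ID
		}
	}
	return defaultDecision, ""
}

// isShadowed reports whether r can never fire given the rules before it.
func isShadowed(earlier []Rule, r Rule) bool {
	if len(r.Topics) == 0 {
		// A catch-all is only shadowed by an earlier catch-all.
		for _, e := range earlier {
			if len(e.Topics) == 0 {
				return true
			}
		}
		return false
	}
	for _, t := range r.Topics {
		covered := false
		for _, e := range earlier {
			if e.matches(t) {
				covered = true
				break
			}
		}
		if !covered {
			return false
		}
	}
	return true
}

// shadowedRules lists the IDs of rules that earlier rules fully cover.
func shadowedRules(rules []Rule) []string {
	var shadowed []string
	for i, r := range rules {
		if isShadowed(rules[:i], r) {
			shadowed = append(shadowed, r.ID)
		}
	}
	return shadowed
}
```

With a catch-all `allow` rule placed ahead of `workflow-approval-gate`, `firstMatch` returns `allow` for `job.cordum.approval-gate` and `shadowedRules` reports the approval gate as unreachable. Swapping the two rules restores the `require_approval` decision. A lint like this could run in CI alongside the simulation endpoints.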
FAQ
What makes AI policy enforcement deterministic?
A fixed decision contract, explicit fail-mode behavior, snapshotted policy lineage, and replay-safe state transitions.
Why are policy snapshots critical for approvals?
Approvals may happen minutes later. Snapshot checks prevent approving under a different policy than the one originally evaluated.
Do I need signature verification in non-production?
You can relax it in dev, but production should always enforce signatures to prevent policy tampering.
Next step
Do this in your next sprint:
1. Add signature enforcement to staging first, then production.
2. Require policy simulation evidence in code review before any policy merge.
3. Add an approval-SLA alert for snapshot mismatch conflicts.
4. Keep `default_decision: deny` and `input_policy.fail_mode: closed` unless risk owners approve exceptions.
Continue with "AI Agent Policy Simulation" and "Prompt Injection vs Out-of-Process Governance".