Deep Dive

AI Agent Policy Decision Cache Invalidation

Fast policy checks are useful. Stale policy checks are expensive.

11 min read · Mar 2026
TL;DR
  • TTL-only caching is not enough for policy decisions that can change between deploys.
  • Cordum uses snapshot-prefixed request hashes and a policyVersion guard to avoid stale hits.
  • Cached policy responses strip `approval_ref` and re-bind it to the current `job_id` on cache hit.
  • Velocity-sensitive policy rules bypass the cache entirely to avoid rate-limit correctness issues.
Snapshot keying

Cache keys include policy snapshot so reloads naturally create misses.

Version guard

Each entry is tagged with policyVersion and dropped if versions diverge.

Safe rebinding

Approval references are rebuilt per request, not persisted inside cache entries.

Scope

This guide focuses on input-policy decision caching inside a Safety Kernel, where stale cache hits can directly change autonomous dispatch behavior.

The production problem

Policy engines get cache layers for speed. Then policy rules change. If invalidation is loose, old decisions keep leaking into new traffic.

For autonomous agents, this is not a cosmetic bug. A stale `allow` can skip an approval gate. A stale `deny` can block revenue traffic. Both are policy incidents.

You need invalidation that is explicit, testable, and resilient to concurrent reload races.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Azure Cache-Aside pattern | Cache staleness risks, invalidation order, and expiration strategy tradeoffs. | No policy-engine semantics like approval references and snapshot-aware governance. |
| Redis client-side caching | Server-assisted tracking and invalidation messages when tracked keys change. | No treatment of policy reload boundaries or per-request identity rebinding. |
| Google Media CDN cache invalidation | Operational invalidation scope, latency, and origin load impact during purge. | No strategy for correctness-critical policy decisions in control-plane hot paths. |

The gap is governance correctness: policy caches must preserve identity-sensitive fields and invalidate on rule lineage changes, not only on wall-clock expiry.

Invalidation model

| Strategy | Strength | Risk |
| --- | --- | --- |
| TTL-only expiry | Simple and low implementation cost | Serves stale policy after rule change until TTL expires |
| Explicit purge on policy update | Immediate invalidation after reload | Requires reliable fanout to all replicas |
| Snapshot-prefixed key | Automatic miss when snapshot changes | Old entries remain until eviction unless purged |
| Version guard in entry | Protects against races and partial invalidation | Slight lookup overhead per cache hit attempt |
| Sensitive field strip/rebind | Prevents cross-request identity leakage | Requires precise request rehydration logic |

Cordum runtime behavior

| Boundary | Current behavior | Operational impact |
| --- | --- | --- |
| Cache controls | `SAFETY_DECISION_CACHE_TTL` and `SAFETY_DECISION_CACHE_MAX_SIZE` (default max: 10000). | Controls freshness and memory bounds of decision cache. |
| Cache key design | Key is `<snapshot>:<sha256(deterministic_request)>` with `job_id` cleared before hashing. | Reuse across equivalent requests while binding to policy snapshot lineage. |
| Approval safety | Cached response stores empty `approval_ref`; on hit, kernel rebinds approval reference to current job. | Prevents stale approval handles from leaking across jobs. |
| Version guard | Each entry stores `policyVersion`; mismatched version causes delete+miss on lookup. | Protects against stale entries during concurrent reload windows. |
| Policy update path | `setPolicy()` increments `policyVersion`, updates snapshot history, then clears cache immediately. | Hard invalidation on policy change, plus key-space miss from snapshot shift. |
| Velocity rules | Decision cache is bypassed when active policy contains velocity checks. | Avoids incorrect reuse for rate-sensitive decisions. |

Implementation examples

Snapshot-prefixed deterministic key

cache_key.go
Go
func cacheKeyForRequest(req *pb.PolicyCheckRequest, snapshot string) string {
  clone := proto.Clone(req).(*pb.PolicyCheckRequest)
  clone.JobId = "" // enable reuse across equivalent jobs

  // Deterministic marshaling keeps equivalent requests byte-identical;
  // the marshal error is ignored in this sketch since the clone is a valid message.
  data, _ := proto.MarshalOptions{Deterministic: true}.Marshal(clone)
  sum := sha256.Sum256(data)
  return snapshot + ":" + hex.EncodeToString(sum[:])
}

Version guard at read time

cache_guard.go
Go
// getCachedDecision assumes the caller serializes access to s.cache;
// a hit must pass both the version guard and the TTL check.
func (s *server) getCachedDecision(key string) *pb.PolicyCheckResponse {
  currentVersion := s.policyVersion.Load()
  entry, ok := s.cache[key]
  if !ok {
    return nil
  }
  if entry.policyVersion != currentVersion {
    delete(s.cache, key)
    return nil
  }
  if time.Now().After(entry.expires) {
    delete(s.cache, key)
    return nil
  }
  return clonePolicyResponse(entry.resp)
}

Ops runbook checks

decision_cache_runbook.sh
Bash
# 1) Verify cache settings
kubectl exec -n cordum deploy/cordum-safety-kernel -- printenv SAFETY_DECISION_CACHE_TTL
kubectl exec -n cordum deploy/cordum-safety-kernel -- printenv SAFETY_DECISION_CACHE_MAX_SIZE

# 2) Roll policy update and confirm invalidation log
kubectl logs deploy/cordum-safety-kernel -n cordum | grep -E "policy updated, cache invalidated|policy snapshot updated"

# 3) Check snapshot history consistency (if grpcurl is available)
grpcurl -plaintext localhost:50051 cordum.protocol.pb.v1.SafetyKernel/ListSnapshots

# 4) Run two equivalent checks with different job_ids; expect same decision snapshot
# but approval_ref bound to each current job_id

Limitations and tradeoffs

  • More invalidation safeguards add CPU and lock contention on cache paths.
  • Snapshot-prefixed keys increase churn and can lower hit ratio after frequent policy updates.
  • Bypassing cache for velocity rules protects correctness but increases latency for those requests.
  • Oversized TTL reduces load but raises stale-decision blast radius if invalidation signaling fails.

A policy cache that cannot prove freshness is a risk multiplier, not a performance feature.

Next step

Ship this hardening checklist in your next sprint:

  1. Set explicit cache TTL and max size in production env.
  2. Add policy reload integration test that asserts cache invalidation and version increment.
  3. Verify approval-required decisions rebind `approval_ref` per request after cache hit.
  4. Add dashboard panel for policy snapshot changes and cache hit/miss trends.

Continue with LLM Safety Kernel and AI Agent Safety Check Timeout Tuning.

Fast and fresh or fast and wrong

Decision-cache speed only matters if policy freshness is provable under reload and failover.