The production problem
Policy engines get cache layers for speed. Then policy rules change. If invalidation is loose, old decisions keep leaking into new traffic.
For autonomous agents, this is not a cosmetic bug. A stale `allow` can skip an approval gate. A stale `deny` can block revenue traffic. Both are policy incidents.
You need invalidation that is explicit, testable, and resilient to concurrent reload races.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Azure Cache-Aside pattern | Cache staleness risks, invalidation order, and expiration strategy tradeoffs. | No policy-engine semantics like approval references, snapshot lineage, or replica convergence checks. |
| Redis client-side caching | Server-assisted tracking and invalidation messages when tracked keys change. | No treatment of policy reload boundaries, approval identity rebinding, or fail-safe bypass for velocity rules. |
| Google Media CDN cache invalidation | Operational invalidation scope, latency, and origin load impact during purge. | No strategy for correctness-critical policy decisions where each kernel replica holds its own cache map. |
The gap is governance correctness: policy caches must preserve identity-sensitive fields and invalidate on rule lineage changes, not only on wall-clock expiry.
They also skip a practical operations question: how do you prove every replica converged to the same policy snapshot before high-risk traffic ramps?
Invalidation model
| Strategy | Strength | Risk |
|---|---|---|
| TTL-only expiry | Simple and low implementation cost | Serves stale policy after rule change until TTL expires |
| Explicit purge on policy update | Immediate invalidation after reload | Requires reliable fanout to all replicas |
| Snapshot-prefixed key | Automatic miss when snapshot changes | Old entries remain until eviction unless purged |
| Version guard in entry | Protects against races and partial invalidation | Slight lookup overhead per cache hit attempt |
| Sensitive field strip/rebind | Prevents cross-request identity leakage | Requires precise request rehydration logic |
| Replica snapshot parity checks | Detects lagging replicas before stale decisions hit production traffic | Adds operational overhead and alert noise if thresholds are too strict |
Cordum runtime behavior
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Cache controls | `SAFETY_DECISION_CACHE_TTL` and `SAFETY_DECISION_CACHE_MAX_SIZE` (default max: 10000). | Controls freshness and memory bounds of decision cache. |
| Cache key design | Key is `<snapshot>:<sha256(deterministic_request)>` with `job_id` cleared before hashing. | Reuse across equivalent requests while binding to policy snapshot lineage. |
| Approval safety | Cached response stores empty `approval_ref`; on hit, kernel rebinds approval reference to current job. | Prevents stale approval handles from leaking across jobs. |
| Version guard | Each entry stores `policyVersion`; mismatched version causes delete+miss on lookup. | Protects against stale entries during concurrent reload windows. |
| Policy update path | `setPolicy()` increments `policyVersion`, updates snapshot history, then clears cache immediately. | Hard invalidation on policy change, plus key-space miss from snapshot shift. |
| Cache residency | Decision cache is process-local in each Safety Kernel replica; Redis stores snapshots, not decision entries. | You get low-latency lookups, but update fanout lag can create short-lived replica skew. |
| Capacity eviction | On max size, kernel sweeps expired entries first, then evicts the entry closest to expiry. | Memory stays bounded; near-expiry hot keys can churn under high-cardinality traffic. |
| Velocity rules | Decision cache is bypassed when active policy contains velocity checks. | Avoids incorrect reuse for rate-sensitive decisions. |
Worst-case stale window budget
```text
max_stale_window <= policy_reload_interval + update_fanout_delay + in_flight_request_time

# Example with defaults:
# 30s (reload interval) + 5s fanout + 1s in-flight ~= 36s worst-case window
```
Implementation examples
Snapshot-prefixed deterministic key
```go
func cacheKeyForRequest(req *pb.PolicyCheckRequest, snapshot string) string {
	clone := proto.Clone(req).(*pb.PolicyCheckRequest)
	clone.JobId = "" // enable reuse across equivalent jobs
	// Deterministic marshaling keeps equivalent requests hashing identically.
	data, err := proto.MarshalOptions{Deterministic: true}.Marshal(clone)
	if err != nil {
		return "" // caller treats an empty key as uncacheable
	}
	sum := sha256.Sum256(data)
	return snapshot + ":" + hex.EncodeToString(sum[:])
}
```
Version guard at read time
```go
func (s *server) getCachedDecision(key string) *pb.PolicyCheckResponse {
	// Caller must hold s.mu: the cache map is not safe for concurrent use.
	currentVersion := s.policyVersion.Load()
	entry, ok := s.cache[key]
	if !ok {
		return nil
	}
	// Version guard: drop entries written under an older policy.
	if entry.policyVersion != currentVersion {
		delete(s.cache, key)
		return nil
	}
	if time.Now().After(entry.expires) {
		delete(s.cache, key)
		return nil
	}
	// Return a clone so callers can rebind approval_ref without
	// mutating the cached copy.
	return clonePolicyResponse(entry.resp)
}
```
Ops runbook checks
```shell
# 1) Verify cache settings
kubectl exec -n cordum deploy/cordum-safety-kernel -- printenv SAFETY_DECISION_CACHE_TTL
kubectl exec -n cordum deploy/cordum-safety-kernel -- printenv SAFETY_DECISION_CACHE_MAX_SIZE

# 2) Roll policy update and confirm invalidation log
kubectl logs deploy/cordum-safety-kernel -n cordum | grep -E "policy updated, cache invalidated|policy snapshot updated"

# 3) Check snapshot history consistency (if grpcurl is available)
grpcurl -plaintext localhost:50051 cordum.protocol.pb.v1.SafetyKernel/ListSnapshots

# 4) Run two equivalent checks with different job_ids; expect the same decision snapshot
#    but approval_ref bound to each current job_id
```
Replica snapshot skew probe
```shell
pods=$(kubectl get pods -n cordum -l app=cordum-safety-kernel -o name)
for pod in $pods; do
  head_snapshot=$(kubectl exec -n cordum "$pod" -- grpcurl -plaintext 127.0.0.1:50051 \
    cordum.protocol.pb.v1.SafetyKernel/ListSnapshots | jq -r '.snapshots[0] // "none"')
  echo "$pod $head_snapshot"
done | sort -k2
# Gate deploy rollout if head snapshot count > 1
```
Limitations and tradeoffs
- More invalidation safeguards add CPU and lock contention on cache paths.
- Snapshot-prefixed keys increase churn and can lower hit ratio after frequent policy updates.
- Bypassing the cache for velocity rules protects correctness but increases latency for those requests.
- An oversized TTL reduces load but raises the stale-decision blast radius if update fanout lags on one replica.
- Local decision caches avoid distributed lock cost, but you must alert on cross-replica snapshot drift.
A policy cache that cannot prove freshness is a risk multiplier, not a performance feature.
Next step
Ship this hardening checklist in your next sprint:
1. Set explicit cache TTL and max size in the production environment.
2. Add a policy reload integration test that asserts cache invalidation and version increment.
3. Verify approval-required decisions rebind `approval_ref` per request after a cache hit.
4. Add a dashboard panel for policy snapshot changes, cache hit/miss trends, and velocity bypass counts.
5. Block rollout if replicas report different head snapshots for longer than your stale-window budget.
Continue with LLM Safety Kernel and AI Agent Safety Check Timeout Tuning.