## The production problem
Policy engines get cache layers for speed. Then policy rules change. If invalidation is loose, old decisions keep leaking into new traffic.
For autonomous agents, this is not a cosmetic bug. A stale `allow` can skip an approval gate. A stale `deny` can block revenue traffic. Both are policy incidents.
You need invalidation that is explicit, testable, and resilient to concurrent reload races.
## What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Azure Cache-Aside pattern | Cache staleness risks, invalidation order, and expiration strategy tradeoffs. | No policy-engine semantics like approval references and snapshot-aware governance. |
| Redis client-side caching | Server-assisted tracking and invalidation messages when tracked keys change. | No treatment of policy reload boundaries or per-request identity rebinding. |
| Google Media CDN cache invalidation | Operational invalidation scope, latency, and origin load impact during purge. | No strategy for correctness-critical policy decisions in control-plane hot paths. |
The gap is governance correctness: policy caches must preserve identity-sensitive fields and invalidate on rule lineage changes, not only on wall-clock expiry.
## Invalidation model
| Strategy | Strength | Risk |
|---|---|---|
| TTL-only expiry | Simple and low implementation cost | Serves stale policy after rule change until TTL expires |
| Explicit purge on policy update | Immediate invalidation after reload | Requires reliable fanout to all replicas |
| Snapshot-prefixed key | Automatic miss when snapshot changes | Old entries remain until eviction unless purged |
| Version guard in entry | Protects against races and partial invalidation | Slight lookup overhead per cache hit attempt |
| Sensitive field strip/rebind | Prevents cross-request identity leakage | Requires precise request rehydration logic |
## Cordum runtime behavior
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Cache controls | `SAFETY_DECISION_CACHE_TTL` and `SAFETY_DECISION_CACHE_MAX_SIZE` (default max: 10000). | Controls freshness and memory bounds of decision cache. |
| Cache key design | Key is `<snapshot>:<sha256(deterministic_request)>` with `job_id` cleared before hashing. | Reuse across equivalent requests while binding to policy snapshot lineage. |
| Approval safety | Cached response stores empty `approval_ref`; on hit, kernel rebinds approval reference to current job. | Prevents stale approval handles from leaking across jobs. |
| Version guard | Each entry stores `policyVersion`; mismatched version causes delete+miss on lookup. | Protects against stale entries during concurrent reload windows. |
| Policy update path | `setPolicy()` increments `policyVersion`, updates snapshot history, then clears cache immediately. | Hard invalidation on policy change, plus key-space miss from snapshot shift. |
| Velocity rules | Decision cache is bypassed when active policy contains velocity checks. | Avoids incorrect reuse for rate-sensitive decisions. |
## Implementation examples

### Snapshot-prefixed deterministic key

```go
func cacheKeyForRequest(req *pb.PolicyCheckRequest, snapshot string) string {
	// Clone so the caller's request is never mutated.
	clone := proto.Clone(req).(*pb.PolicyCheckRequest)
	clone.JobId = "" // enable reuse across equivalent jobs
	// Deterministic marshaling keeps the hash stable across processes.
	data, _ := proto.MarshalOptions{Deterministic: true}.Marshal(clone)
	sum := sha256.Sum256(data)
	return snapshot + ":" + hex.EncodeToString(sum[:])
}
```

### Version guard at read time
```go
// getCachedDecision returns a defensive copy of a cached decision, or nil
// on a miss. It assumes the caller synchronizes access to s.cache. The two
// guards below evict entries written under an older policy version or past
// their TTL, so neither can be served during a reload window.
func (s *server) getCachedDecision(key string) *pb.PolicyCheckResponse {
	currentVersion := s.policyVersion.Load()
	entry, ok := s.cache[key]
	if !ok {
		return nil
	}
	if entry.policyVersion != currentVersion {
		// Written under an older policy: treat as a miss and evict.
		delete(s.cache, key)
		return nil
	}
	if time.Now().After(entry.expires) {
		delete(s.cache, key)
		return nil
	}
	// Clone so callers cannot mutate the cached response.
	return clonePolicyResponse(entry.resp)
}
```

### Ops runbook checks
```shell
# 1) Verify cache settings
kubectl exec -n cordum deploy/cordum-safety-kernel -- printenv SAFETY_DECISION_CACHE_TTL
kubectl exec -n cordum deploy/cordum-safety-kernel -- printenv SAFETY_DECISION_CACHE_MAX_SIZE

# 2) Roll policy update and confirm invalidation log
kubectl logs deploy/cordum-safety-kernel -n cordum | grep -E "policy updated, cache invalidated|policy snapshot updated"

# 3) Check snapshot history consistency (if grpcurl is available)
grpcurl -plaintext localhost:50051 cordum.protocol.pb.v1.SafetyKernel/ListSnapshots

# 4) Run two equivalent checks with different job_ids; expect the same decision
#    snapshot, but approval_ref bound to each current job_id
```
## Limitations and tradeoffs
- More invalidation safeguards add CPU and lock contention on cache paths.
- Snapshot-prefixed keys increase churn and can lower hit ratio after frequent policy updates.
- Bypassing cache for velocity rules protects correctness but increases latency for those requests.
- Oversized TTL reduces load but raises stale-decision blast radius if invalidation signaling fails.
A policy cache that cannot prove freshness is a risk multiplier, not a performance feature.
## Next step
Ship this hardening checklist in your next sprint:
1. Set explicit cache TTL and max size in production env.
2. Add policy reload integration test that asserts cache invalidation and version increment.
3. Verify approval-required decisions rebind `approval_ref` per request after cache hit.
4. Add dashboard panel for policy snapshot changes and cache hit/miss trends.
Continue with LLM Safety Kernel and AI Agent Safety Check Timeout Tuning.