The production problem
Policy engines get cache layers for speed. Then policy rules change. If invalidation is loose, old decisions keep leaking into new traffic.
For autonomous agents, this is not a cosmetic bug. A stale `allow` can skip an approval gate. A stale `deny` can block revenue traffic. Both are policy incidents.
You need invalidation that is explicit, testable, and resilient to concurrent reload races.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Azure Cache-Aside pattern | Cache staleness risks, invalidation order, and expiration strategy tradeoffs. | No policy-engine semantics like approval references, snapshot lineage, or replica convergence checks. |
| Redis client-side caching | Server-assisted tracking and invalidation messages when tracked keys change. | No treatment of policy reload boundaries, approval identity rebinding, or fail-safe bypass for velocity rules. |
| Google Media CDN cache invalidation | Operational invalidation scope, latency, and origin load impact during purge. | No strategy for correctness-critical policy decisions where each kernel replica holds its own cache map. |
The gap is governance correctness: policy caches must preserve identity-sensitive fields and invalidate on rule lineage changes, not only on wall-clock expiry.
They also skip a practical operations question: how do you prove every replica converged to the same policy snapshot before high-risk traffic ramps?
Invalidation model
| Strategy | Strength | Risk |
|---|---|---|
| TTL-only expiry | Simple and low implementation cost | Serves stale policy after rule change until TTL expires |
| Explicit purge on policy update | Immediate invalidation after reload | Requires reliable fanout to all replicas |
| Snapshot-prefixed key | Automatic miss when snapshot changes | Old entries remain until eviction unless purged |
| Version guard in entry | Protects against races and partial invalidation | Slight lookup overhead per cache hit attempt |
| Sensitive field strip/rebind | Prevents cross-request identity leakage | Requires precise request rehydration logic |
| Replica snapshot parity checks | Detects lagging replicas before stale decisions hit production traffic | Adds operational overhead and alert noise if thresholds are too strict |
Cordum runtime behavior
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Cache controls | `SAFETY_DECISION_CACHE_TTL` and `SAFETY_DECISION_CACHE_MAX_SIZE` (default max: 10000). | Controls freshness and memory bounds of decision cache. |
| Cache key design | Key is `<snapshot>:<sha256(deterministic_request)>` with `job_id` cleared before hashing. | Reuse across equivalent requests while binding to policy snapshot lineage. |
| Approval safety | Cached response stores empty `approval_ref`; on hit, kernel rebinds approval reference to current job. | Prevents stale approval handles from leaking across jobs. |
| Version guard | Each entry stores `policyVersion`; mismatched version causes delete+miss on lookup. | Protects against stale entries during concurrent reload windows. |
| Policy update path | `setPolicy()` increments `policyVersion`, updates snapshot history, then clears cache immediately. | Hard invalidation on policy change, plus key-space miss from snapshot shift. |
| Cache residency | Decision cache is process-local in each Safety Kernel replica; Redis stores snapshots, not decision entries. | You get low-latency lookups, but update fanout lag can create short-lived replica skew. |
| Capacity eviction | On max size, kernel sweeps expired entries first, then evicts the entry closest to expiry. | Memory stays bounded; near-expiry hot keys can churn under high-cardinality traffic. |
| Velocity rules | Decision cache is bypassed when active policy contains velocity checks. | Avoids incorrect reuse for rate-sensitive decisions. |
Worst-case stale window budget
```text
max_stale_window <= policy_reload_interval + update_fanout_delay + in_flight_request_time

# Example with defaults:
# 30s (reload interval) + 5s fanout + 1s in-flight ~= 36s worst-case window
```
Implementation examples
Snapshot-prefixed deterministic key
```go
func cacheKeyForRequest(req *pb.PolicyCheckRequest, snapshot string) string {
	clone := proto.Clone(req).(*pb.PolicyCheckRequest)
	clone.JobId = "" // enable reuse across equivalent jobs
	// Deterministic marshaling keeps equivalent requests hashing identically.
	data, err := proto.MarshalOptions{Deterministic: true}.Marshal(clone)
	if err != nil {
		return "" // caller treats an empty key as uncacheable
	}
	sum := sha256.Sum256(data)
	return snapshot + ":" + hex.EncodeToString(sum[:])
}
```
Version guard at read time
```go
func (s *server) getCachedDecision(key string) *pb.PolicyCheckResponse {
	// Caller must hold s.mu: the cache map is not safe for concurrent use.
	currentVersion := s.policyVersion.Load()
	entry, ok := s.cache[key]
	if !ok {
		return nil
	}
	// Version guard: drop entries written under an older policy.
	if entry.policyVersion != currentVersion {
		delete(s.cache, key)
		return nil
	}
	if time.Now().After(entry.expires) {
		delete(s.cache, key)
		return nil
	}
	// Return a clone so callers can rebind approval_ref without
	// mutating the cached copy.
	return clonePolicyResponse(entry.resp)
}
```
Ops runbook checks
```shell
# 1) Verify cache settings
kubectl exec -n cordum deploy/cordum-safety-kernel -- printenv SAFETY_DECISION_CACHE_TTL
kubectl exec -n cordum deploy/cordum-safety-kernel -- printenv SAFETY_DECISION_CACHE_MAX_SIZE

# 2) Roll policy update and confirm invalidation log
kubectl logs deploy/cordum-safety-kernel -n cordum | grep -E "policy updated, cache invalidated|policy snapshot updated"

# 3) Check snapshot history consistency (if grpcurl is available)
grpcurl -plaintext localhost:50051 cordum.protocol.pb.v1.SafetyKernel/ListSnapshots

# 4) Run two equivalent checks with different job_ids; expect the same decision snapshot
#    but approval_ref bound to each current job_id
```
Replica snapshot skew probe
```shell
pods=$(kubectl get pods -n cordum -l app=cordum-safety-kernel -o name)
for pod in $pods; do
  head_snapshot=$(kubectl exec -n cordum "$pod" -- grpcurl -plaintext 127.0.0.1:50051 \
    cordum.protocol.pb.v1.SafetyKernel/ListSnapshots | jq -r '.snapshots[0] // "none"')
  echo "$pod $head_snapshot"
done | sort -k2
# Gate deploy rollout if head snapshot count > 1
```
Limitations and tradeoffs
- More invalidation safeguards add CPU and lock contention on cache paths.
- Snapshot-prefixed keys increase churn and can lower hit ratio after frequent policy updates.
- Bypassing the cache for velocity rules protects correctness but increases latency for those requests.
- An oversized TTL reduces load but raises the stale-decision blast radius if update fanout lags on one replica.
- Local decision caches avoid distributed lock cost, but you must alert on cross-replica snapshot drift.
A policy cache that cannot prove freshness is a risk multiplier, not a performance feature.
Next step
Ship this hardening checklist in your next sprint:
1. Set explicit cache TTL and max size in the production environment.
2. Add a policy reload integration test that asserts cache invalidation and version increment.
3. Verify approval-required decisions rebind `approval_ref` per request after a cache hit.
4. Add a dashboard panel for policy snapshot changes, cache hit/miss trends, and velocity bypass counts.
5. Block rollout if replicas report different head snapshots for longer than your stale-window budget.
Continue with LLM Safety Kernel and AI Agent Safety Check Timeout Tuning.