Policy decisions for every job.
The Safety Kernel is a gRPC service that evaluates every job before dispatch. Decisions are recorded with a reason, constraints, matched rule, and policy snapshot hash. Checks complete in < 5ms. On outage, schedulers fail-closed by default.
ALLOW: Job proceeds immediately. Decision recorded with rule_id and snapshot hash.
DENY: Job rejected. Reason logged. DLQ entry created with matched rule_id.
REQUIRE_APPROVAL: Job paused until a human approves. Bound to policy snapshot + job hash.
ALLOW_WITH_CONSTRAINTS: Job runs with enforced budget, sandbox, toolchain, or diff limits.
Rate-limited. Scheduler retries after 5s backoff.
Safety Kernel unreachable. Falls back to last-known-good policy.
service SafetyKernel {
  rpc Check(PolicyCheckRequest) returns (PolicyCheckResponse);
  rpc Evaluate(PolicyCheckRequest) returns (PolicyCheckResponse);
  rpc Explain(PolicyCheckRequest) returns (PolicyCheckResponse);
  rpc Simulate(PolicyCheckRequest) returns (PolicyCheckResponse);
  rpc ListSnapshots(ListSnapshotsRequest) returns (ListSnapshotsResponse);
}
- Check & Evaluate for runtime decisions
- Explain decisions with matched rules and reasoning
- Simulate dry-run policy changes
- Snapshot hashing with hot reload every 30s
Policy Rule Syntax
Rules are defined in YAML with match conditions and decisions. First matching rule wins. No match defaults to ALLOW.
version: v1
default_tenant: default
tenants:
  default:
    allow_topics:
      - "job.incident.*"
      - "job.read.*"
    deny_topics:
      - "sys.*"
      - "job.admin.*"
    mcp:
      allow_servers: [claude, gpt-4]
      deny_tools: [delete_database, drop_table]
  prod:
    allow_topics:
      - "job.prod.*"
      - "job.infra.*"
    deny_topics:
      - "job.experimental"
rules:
  - id: deny-prod-from-service
    decision: deny
    reason: "Only humans can modify production"
    match:
      tenants: [prod]
      actor_types: [service]
      topics: ["job.prod.*"]
  - id: require-approval-destructive
    decision: require_approval
    reason: "Destructive operations require human approval"
    match:
      risk_tags: [destructive, write]
      topics: ["job.delete.*", "job.drop.*"]
  - id: constrain-heavy-compute
    decision: allow_with_constraints
    match:
      risk_tags: [heavy-compute]
    constraints:
      budgets:
        max_runtime_ms: 3600000
        max_retries: 3
        max_artifact_bytes: 1073741824
      sandbox:
        isolated: true
        network_allowlist: [github.com, api.example.com]
        fs_read_write: [/tmp/work]
  - id: constrain-patches
    decision: allow_with_constraints
    match:
      capability: "*.patch.*"
    constraints:
      diff:
        max_lines: 500
        deny_path_globs: ["/etc/*", "/var/secrets/*"]
  - id: secrets-require-approval
    decision: require_approval
    reason: "Jobs handling secrets need human approval"
    match:
      secrets_present: true
Match Conditions
All conditions in a rule must match for the rule to apply. Fields are optional — omitted fields match everything.
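The matching semantics described so far (every listed condition must hold, omitted fields match everything, the first matching rule wins, and no match defaults to ALLOW) can be sketched as follows. This is a minimal illustration, not the kernel's implementation: the function names and the flat `job` dictionary are hypothetical, only a few match fields are covered, and the glob translation assumes `*` stops at `/` while `?` matches any single character.

```python
import re


def glob_to_regex(pattern):
    """Translate a glob: '*' is any run of non-'/' characters, '?' is one character."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append("[^/]*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def rule_matches(match, job):
    """Return True when every condition present in `match` holds for `job`."""
    if "tenants" in match:
        allowed = [t.lower() for t in match["tenants"]]
        if job.get("tenant", "").lower() not in allowed:
            return False
    if "topics" in match:
        topic = job.get("topic", "")
        if not any(glob_to_regex(p).fullmatch(topic) for p in match["topics"]):
            return False
    if "actor_types" in match:
        if job.get("actor_type") not in match["actor_types"]:
            return False
    if "risk_tags" in match:
        # ANY semantics: one shared tag is enough.
        if not set(match["risk_tags"]) & set(job.get("risk_tags", [])):
            return False
    if "secrets_present" in match:
        if bool(job.get("secrets_present", False)) != match["secrets_present"]:
            return False
    return True


def evaluate(rules, job):
    """First matching rule wins; no match defaults to ALLOW."""
    for rule in rules:
        if rule_matches(rule.get("match", {}), job):
            return rule["decision"].upper(), rule["id"]
    return "ALLOW", None


rules = [
    {"id": "deny-prod-from-service", "decision": "deny",
     "match": {"tenants": ["prod"], "actor_types": ["service"],
               "topics": ["job.prod.*"]}},
    {"id": "secrets-require-approval", "decision": "require_approval",
     "match": {"secrets_present": True}},
]
result = evaluate(rules, {"tenant": "PROD", "topic": "job.prod.deploy",
                          "actor_type": "service"})
# result == ("DENY", "deny-prod-from-service")
```

Because evaluation stops at the first match, rule order in the YAML file is significant: a broad rule placed early shadows narrower rules below it.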
- tenants: Tenant IDs (case-insensitive)
- topics: Glob patterns (e.g., "job.db.*")
- capabilities: Capability labels (case-insensitive)
- risk_tags: Risk tags; input must contain ANY
- requires: Dependencies; input must contain ALL
- pack_ids: Pack identifiers
- actor_ids: Principal/user IDs
- actor_types: "human" or "service"
- labels: Exact key-value map (all must match)
- secrets_present: Boolean; true if job handles secrets
- mcp: MCP server/tool/resource/action rules
Glob syntax: * matches any characters except /, ? matches a single character.
MCP Filtering
Model Context Protocol integrations are filtered per-tenant. Deny-first semantics: if a value is in the deny list, the job is denied regardless of the allow list.
tenants:
  default:
    mcp:
      allow_servers: [claude, local-tools]
      deny_servers: [untrusted-llm]
      allow_tools: [search, summarize]
      deny_tools: [delete_database]
      allow_resources: [docs://*]
      deny_resources: [secrets://*]
      allow_actions: [read, create]
      deny_actions: [delete]
MCP Labels
MCP context is extracted from job labels (multiple naming conventions supported):
- mcp.server / mcp_server / mcpServer
- mcp.tool / mcp_tool / mcpTool
- mcp.resource / mcp_resource / mcpResource
- mcp.action / mcp_action / mcpAction
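As a rough sketch of how label extraction and deny-first filtering could fit together, the example below reads the MCP context from job labels and checks it against a tenant's `mcp` block. Several details are assumptions rather than documented behavior: the helper names are hypothetical, the alias precedence is simply the order listed above, an empty allow list is treated as "allow everything", and patterns are matched with Python's `fnmatch`, whose `*` also matches `/`.

```python
from fnmatch import fnmatchcase

# Label aliases per MCP field, in the order listed in the document.
LABEL_ALIASES = {
    "server": ("mcp.server", "mcp_server", "mcpServer"),
    "tool": ("mcp.tool", "mcp_tool", "mcpTool"),
    "resource": ("mcp.resource", "mcp_resource", "mcpResource"),
    "action": ("mcp.action", "mcp_action", "mcpAction"),
}


def extract_mcp_context(labels):
    """Collect MCP fields from job labels, accepting any supported spelling."""
    ctx = {}
    for field, aliases in LABEL_ALIASES.items():
        for alias in aliases:
            if alias in labels:
                ctx[field] = labels[alias]
                break  # assumption: the first alias found wins
    return ctx


def mcp_allowed(mcp_cfg, ctx):
    """Deny-first check: a deny match rejects the job regardless of allow lists."""
    for field, value in ctx.items():
        deny = mcp_cfg.get(f"deny_{field}s", [])
        allow = mcp_cfg.get(f"allow_{field}s", [])
        if any(fnmatchcase(value, pattern) for pattern in deny):
            return False
        # Assumption: an empty or missing allow list places no restriction.
        if allow and not any(fnmatchcase(value, pattern) for pattern in allow):
            return False
    return True


mcp_cfg = {
    "allow_servers": ["claude", "local-tools"],
    "deny_servers": ["untrusted-llm"],
    "allow_tools": ["search", "summarize"],
    "deny_tools": ["delete_database"],
    "allow_resources": ["docs://*"],
    "deny_resources": ["secrets://*"],
    "allow_actions": ["read", "create"],
    "deny_actions": ["delete"],
}
ctx = extract_mcp_context({"mcp_server": "claude", "mcpTool": "delete_database"})
allowed = mcp_allowed(mcp_cfg, ctx)  # False: the tool is on the deny list
```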
Constraint Types
When a rule returns ALLOW_WITH_CONSTRAINTS, the Safety Kernel attaches structured limits that the scheduler and workers enforce.
Budget Constraints
- max_runtime_ms
- max_retries
- max_artifact_bytes
- max_concurrent_jobs
Sandbox Constraints
- isolated
- network_allowlist
- fs_read_only
- fs_read_write
Toolchain Constraints
- allowed_tools
- allowed_commands
Diff Constraints
- max_files
- max_lines
- deny_path_globs
# Response from Safety Kernel:
decision: ALLOW_WITH_CONSTRAINTS
rule_id: constrain-heavy-compute
reason: "Heavy compute job constrained"
policy_snapshot: "v1:a3f8c2|cfg:9d1e4b"
constraints:
  budgets:
    max_runtime_ms: 3600000 # 1 hour max
    max_retries: 3
    max_artifact_bytes: 1073741824 # 1 GB
    max_concurrent_jobs: 5
  sandbox:
    isolated: true
    network_allowlist:
      - github.com
      - api.example.com
    fs_read_write:
      - /tmp/work
  diff:
    max_lines: 500
    deny_path_globs:
      - "/etc/*"
      - "/var/secrets/*"
Snapshots & Hot Reload
Policies are versioned using SHA256 hashes. The kernel reloads policy every 30 seconds and falls back to the last-known-good snapshot on failure.
Hot Reload
- Periodic reload every 30s (configurable via SAFETY_POLICY_RELOAD_INTERVAL)
- Loads from file (SAFETY_POLICY_PATH) or URL (SAFETY_POLICY_URL)
- Merges with config service bundles from Redis
- Compares SHA256 hash; only replaces if changed
- Atomic swap under RWMutex for zero-downtime reloads
- Maintains last 10 snapshots for rollback/audit
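The reload steps above can be sketched roughly as below. This is an illustrative model only: `PolicyStore`, `load`, and `parse` are hypothetical names, a plain `threading.Lock` stands in for the RWMutex (Python's standard library has no reader-writer lock), and a real deployment would call `reload` from a timer on the configured interval.

```python
import hashlib
import threading
from collections import deque


class PolicyStore:
    """Holds the active policy plus a bounded history of snapshot hashes."""

    def __init__(self, history_size=10):
        self._lock = threading.Lock()  # stand-in for the RWMutex
        self._hash = None
        self._policy = None
        self._history = deque(maxlen=history_size)

    def snapshot(self):
        """Return the (hash, policy) pair currently in effect."""
        with self._lock:
            return self._hash, self._policy

    def history(self):
        with self._lock:
            return list(self._history)

    def reload(self, load, parse):
        """Try to load a new policy; return True only if it replaced the old one."""
        try:
            raw = load()         # bytes from a file or URL
            policy = parse(raw)  # may raise on malformed input
        except Exception:
            # Last-known-good: keep serving the current policy.
            return False
        digest = hashlib.sha256(raw).hexdigest()
        with self._lock:
            if digest == self._hash:
                return False     # unchanged content, nothing to swap
            self._hash, self._policy = digest, policy
            self._history.append(digest)
            return True


store = PolicyStore()
store.reload(lambda: b"rules: []", bytes.decode)
```

Parsing happens before the lock is taken, so readers are only blocked for the duration of the pointer swap.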
Snapshot Format
v1:a3f8c2e9... (Format: {version}:{sha256})
cfg:9d1e4b7f... (Hash of all bundle keys + content, sorted)
v1:a3f8c2|cfg:9d1e4b (Base + fragments separated by |)
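A snapshot identifier in this shape could be derived along the following lines. The exact byte layout fed into the hash is not specified here, so the separator bytes, the function names, and the use of full-length hex digests are all assumptions made for illustration.

```python
import hashlib


def base_snapshot_id(version, policy_bytes):
    """{version}:{sha256} over the raw base policy content."""
    return f"{version}:{hashlib.sha256(policy_bytes).hexdigest()}"


def bundles_snapshot_id(bundles):
    """cfg:{sha256} over all bundle keys and contents, visited in sorted key order."""
    h = hashlib.sha256()
    for key in sorted(bundles):
        # NUL separators (an assumption) keep key/content boundaries unambiguous.
        h.update(key.encode("utf-8") + b"\0")
        h.update(bundles[key].encode("utf-8") + b"\0")
    return "cfg:" + h.hexdigest()


def combined_snapshot_id(base_id, cfg_id):
    """Base and fragment identifiers joined by '|'."""
    return f"{base_id}|{cfg_id}"


snapshot = combined_snapshot_id(
    base_snapshot_id("v1", b"rules: []"),
    bundles_snapshot_id({"secops~default": "rules: []"}),
)
```

Sorting the keys before hashing makes the identifier independent of the order in which bundles happen to be read.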
Last-Known-Good Fallback
If a policy reload fails (parse error, network issue, signature mismatch), the kernel continues with the current policy. The error is logged, and decision behavior does not change.
Approval Binding
Approvals are bound to the policy snapshot hash + job request hash. If policy changes between approval and execution, the approval is invalidated and re-evaluation occurs.
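One way to picture this binding is an approval record that stores both hashes and is only honored when both still match. The sketch below is hypothetical: the `Approval` type, the JSON canonicalization used for the job hash, and the function names are assumptions, not the kernel's actual data model.

```python
import hashlib
import json
from dataclasses import dataclass


def job_hash(job):
    """Hash a job request deterministically via sorted-key JSON (an assumption)."""
    canonical = json.dumps(job, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Approval:
    snapshot: str   # policy snapshot in effect when the approval was granted
    job_hash: str   # hash of the exact request that was approved
    approver: str


def approve(job, snapshot, approver):
    return Approval(snapshot=snapshot, job_hash=job_hash(job), approver=approver)


def approval_valid(approval, job, current_snapshot):
    """Valid only if neither the policy nor the job changed since approval."""
    return (approval.snapshot == current_snapshot
            and approval.job_hash == job_hash(job))


job = {"topic": "job.delete.users", "tenant": "prod", "risk_tags": ["destructive"]}
approval = approve(job, "v1:a3f8c2|cfg:9d1e4b", approver="alice")
still_valid = approval_valid(approval, job, "v1:a3f8c2|cfg:9d1e4b")   # True
after_change = approval_valid(approval, job, "v1:ffffff|cfg:9d1e4b")  # False
```

When the check fails, the caller would discard the approval and send the job back through evaluation under the new snapshot.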
Policy Signature Verification
Ed25519 signatures ensure policy integrity. Required in production unless explicitly disabled.
- Public key: SAFETY_POLICY_PUBLIC_KEY (base64 or hex)
- Signature from: env var, file path, or adjacent .sig file
- Enforcement: SAFETY_POLICY_SIGNATURE_REQUIRED=true (default in production)
- URL safety: private IP addresses blocked unless SAFETY_POLICY_URL_ALLOW_PRIVATE=1
Simulate & Explain
Test policy changes before deployment with Simulate. Debug unexpected decisions with Explain. Both use the same evaluation pipeline as Check — no state changes are made.
# Test a policy against a sample job (API-only)
curl -X POST http://localhost:8081/api/v1/policy/simulate \
  -H "X-API-Key: $CORDUM_API_KEY" \
  -H "X-Tenant-ID: default" \
  -H "Content-Type: application/json" \
  -d '{
    "topic": "job.db.delete",
    "tenant": "prod",
    "risk_tags": ["destructive","write"],
    "actor_type": "service"
  }'
# Response:
# { "decision": "DENY",
#   "rule_id": "deny-prod-from-service",
#   "reason": "Only humans can modify production" }

# Explain why a job received a decision (API-only)
curl -X POST http://localhost:8081/api/v1/policy/explain \
  -H "X-API-Key: $CORDUM_API_KEY" \
  -H "X-Tenant-ID: default" \
  -H "Content-Type: application/json" \
  -d '{
    "topic": "job.prod.deploy",
    "tenant": "prod",
    "actor_type": "human",
    "risk_tags": ["write"]
  }'
# Response:
# { "decision": "REQUIRE_APPROVAL",
#   "rule_id": "require-approval-destructive",
#   "reason": "Destructive operations require human approval" }

curl -X POST http://localhost:8081/api/v1/policy/simulate \
  -H "X-API-Key: <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "topic": "job.db.delete",
    "tenant": "prod",
    "meta": {
      "actor_type": "SERVICE",
      "risk_tags": ["destructive", "write"],
      "capability": "db.table.drop"
    }
  }'
# Response:
# {
#   "decision": "DENY",
#   "reason": "Only humans can modify production",
#   "rule_id": "deny-prod-from-service",
#   "policy_snapshot": "v1:a3f8c2|cfg:9d1e4b"
# }
Policy Bundles
Compose policies from multiple sources: local files, URLs, and config service fragments. Packs can inject policy fragments that are merged with the base policy.
Merge Strategy
- Rules: Appended from all bundles (all rules apply)
- Tenants: Deep-merged (topic lists appended, MCP lists appended)
- max_concurrent: Takes minimum of non-zero values
- default_tenant: Later bundle overrides only if base is empty
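The merge strategy above might look roughly like this in code. It is a sketch under stated assumptions: `merge_bundles` is a hypothetical name, bundles are modeled as already-parsed dictionaries, `max_concurrent` is assumed to live inside each tenant block, and any other scalar is assumed to be overridden by the later bundle.

```python
import copy


def _merge_tenant(dst, src):
    """Deep-merge one tenant block: lists append, max_concurrent takes the minimum."""
    for key, value in src.items():
        if key == "max_concurrent":
            candidates = [v for v in (dst.get(key, 0), value) if v]
            dst[key] = min(candidates) if candidates else 0
        elif isinstance(value, list):
            dst[key] = dst.get(key, []) + value
        elif isinstance(value, dict):
            _merge_tenant(dst.setdefault(key, {}), value)  # e.g. the mcp block
        else:
            dst[key] = value  # assumption: later scalar wins


def merge_bundles(base, *fragments):
    """Merge policy fragments into a copy of the base policy."""
    merged = copy.deepcopy(base)
    merged.setdefault("rules", [])
    merged.setdefault("tenants", {})
    for frag in fragments:
        merged["rules"].extend(copy.deepcopy(frag.get("rules", [])))
        for name, cfg in frag.get("tenants", {}).items():
            _merge_tenant(merged["tenants"].setdefault(name, {}), copy.deepcopy(cfg))
        # default_tenant is only taken from a fragment while the merged value is empty.
        if not merged.get("default_tenant") and frag.get("default_tenant"):
            merged["default_tenant"] = frag["default_tenant"]
    return merged


base = {"default_tenant": "", "rules": [{"id": "a"}],
        "tenants": {"default": {"allow_topics": ["job.read.*"], "max_concurrent": 0,
                                "mcp": {"deny_tools": ["drop_table"]}}}}
pack = {"default_tenant": "default", "rules": [{"id": "b"}],
        "tenants": {"default": {"allow_topics": ["job.incident.*"], "max_concurrent": 5,
                                "mcp": {"deny_tools": ["delete_database"]}}}}
extra = {"default_tenant": "other", "tenants": {"default": {"max_concurrent": 3}}}
merged = merge_bundles(base, pack, extra)
```

Since rules are appended rather than replaced and the first matching rule wins, the order in which bundles are merged affects which rule decides a job.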
Bundle Management
# Upsert bundle content (admin)
curl -X PUT http://localhost:8081/api/v1/policy/bundles/secops~default \
  -H "X-API-Key: $CORDUM_API_KEY" \
  -H "X-Tenant-ID: default" \
  -H "Content-Type: application/json" \
  -d '{"content":"rules: []","enabled":true,"author":"secops"}'

# List active bundles
curl http://localhost:8081/api/v1/policy/bundles \
  -H "X-API-Key: $CORDUM_API_KEY" \
  -H "X-Tenant-ID: default"

# List snapshots
curl http://localhost:8081/api/v1/policy/bundles/snapshots \
  -H "X-API-Key: $CORDUM_API_KEY" \
  -H "X-Tenant-ID: default"

# Rollback to previous snapshot
curl -X POST http://localhost:8081/api/v1/policy/rollback \
  -H "X-API-Key: $CORDUM_API_KEY" \
  -H "X-Tenant-ID: default" \
  -H "Content-Type: application/json" \
  -d '{"snapshot_id": "<snapshot_id>","author":"secops","message":"rollback"}'
Config Service Integration
The Safety Kernel pulls policy bundles from the config service in Redis. Each bundle can be a YAML string or an object with an enabled flag.
- SAFETY_POLICY_CONFIG_SCOPE: Config scope (default: system)
- SAFETY_POLICY_CONFIG_ID: Document ID (default: policy)
- SAFETY_POLICY_CONFIG_KEY: Bundles key (default: bundles)
- SAFETY_POLICY_CONFIG_DISABLE: Disable config service (default: false)
Remediations
Rules can include remediation suggestions — safer alternatives returned alongside a DENY decision. Clients can offer these as one-click alternatives.
rules:
  - id: deny-uncontrolled-delete
    decision: deny
    reason: "Uncontrolled deletion is dangerous"
    match:
      topics: ["job.db.delete"]
    remediations:
      - id: use-archive
        title: "Archive instead of delete"
        summary: "Mark records as archived"
        replacement_topic: job.db.archive
      - id: use-soft-delete
        title: "Soft delete with recovery"
        summary: "Reversible soft-delete with 30-day window"
        replacement_topic: job.db.soft_delete
Decision Caching
Identical policy checks can be cached to reduce gRPC round-trips. Cache keys are deterministically generated from the request (excluding job_id) and scoped to the current policy snapshot.
- TTL: SAFETY_DECISION_CACHE_TTL (default: 0 = disabled, e.g. "5s", "1m")
- Key format: {snapshot}:{sha256_of_request}
- Invalidation: New snapshot invalidates all cached entries
- Exclusions: job_id excluded from key; approval_ref added per-job on retrieval
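A minimal sketch of such a cache is shown below. The `DecisionCache` class, the JSON canonicalization of the request, and the injectable clock are illustrative assumptions; the only behaviors taken from the description above are the key format, the `job_id` exclusion, the TTL of 0 meaning disabled, and the scoping to a snapshot.

```python
import hashlib
import json
import time


def cache_key(snapshot, request):
    """{snapshot}:{sha256_of_request}, with job_id left out of the hashed payload."""
    payload = {k: v for k, v in request.items() if k != "job_id"}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return f"{snapshot}:{hashlib.sha256(canonical.encode('utf-8')).hexdigest()}"


class DecisionCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, decision)

    def get(self, snapshot, request):
        if self.ttl <= 0:
            return None  # caching disabled
        entry = self._entries.get(cache_key(snapshot, request))
        if entry is None or entry[0] <= self._clock():
            return None
        return entry[1]

    def put(self, snapshot, request, decision):
        if self.ttl > 0:
            key = cache_key(snapshot, request)
            self._entries[key] = (self._clock() + self.ttl, decision)


first = {"job_id": "1", "topic": "job.read.report", "tenant": "default"}
second = {"job_id": "2", "topic": "job.read.report", "tenant": "default"}
same_key = cache_key("v1:a3f8c2", first) == cache_key("v1:a3f8c2", second)  # True
```

Because the snapshot is part of the key, entries written under an old snapshot are never returned once the policy changes; they simply age out.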
Evaluation Flow
Every policy check follows a deterministic pipeline from request to decision.
PolicyCheckRequest
│
├─ Extract tenant, topic, labels, metadata
│
├─ Check cache (if TTL > 0) ──→ return cached
│
├─ Validate: topic required, must start with "job."
│  └─ Tenant defaults to default_tenant or "default"
│
├─ Rule evaluation (first match wins)
│  └─ No rules? Fall back to legacy tenant config
│  └─ No match? Default ALLOW
│
├─ MCP validation (deny-first)
│  └─ May override rule decision to DENY
│
├─ Effective config override (per-job restrictions)
│  └─ May override to DENY
│
├─ Attach constraints (if decision != DENY)
│
└─ Cache & return PolicyCheckResponse
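The ordering of this pipeline can be sketched as a single function. Everything here is a simplified model: `check` and its callable parameters are hypothetical, the cache lookup and the legacy tenant-config fallback are omitted, and raising `ValueError` on an invalid topic is an assumption, since the behavior on validation failure is not stated.

```python
def check(request, policy, rule_matches,
          mcp_denied=lambda req: False,
          override_denied=lambda req: False):
    """Run validation, rule evaluation, overrides, and constraint attachment in order."""
    # Validate: topic is required and must start with "job."
    topic = request.get("topic", "")
    if not topic.startswith("job."):
        raise ValueError('topic is required and must start with "job."')

    # Tenant defaults to the policy's default_tenant, then to "default".
    tenant = request.get("tenant") or policy.get("default_tenant") or "default"
    request = {**request, "tenant": tenant}

    # Rule evaluation: first match wins, no match defaults to ALLOW.
    decision, rule_id, constraints = "ALLOW", None, None
    for rule in policy.get("rules", []):
        if rule_matches(rule.get("match", {}), request):
            decision = rule["decision"].upper()
            rule_id = rule["id"]
            constraints = rule.get("constraints")
            break

    # MCP validation and per-job effective config can only tighten the result.
    if mcp_denied(request):
        decision = "DENY"
    if override_denied(request):
        decision = "DENY"

    response = {"decision": decision, "rule_id": rule_id, "tenant": tenant}
    if decision != "DENY" and constraints:
        response["constraints"] = constraints
    return response


policy = {"rules": [{
    "id": "constrain-heavy-compute",
    "decision": "allow_with_constraints",
    "match": {"kind": "heavy"},
    "constraints": {"budgets": {"max_retries": 3}},
}]}
# Toy matcher for the demo: every key in `match` must equal the request's value.
simple_match = lambda match, req: all(req.get(k) == v for k, v in match.items())
response = check({"topic": "job.render", "kind": "heavy"}, policy, simple_match)
```

Both override steps run after rule evaluation, so an ALLOW from a rule can still end up as DENY, but a DENY is never relaxed.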
