The production reliability problem
Teams often pick one runtime and expect it to solve everything: retries, approvals, policy, observability, and rollback. That design looks efficient until the first high-impact failure.
Durable execution and governance are separate concerns. One keeps workflows alive. The other decides whether actions are permitted. Production agents need both concerns represented explicitly.
Common failure mode
Retry logic is centralized, but policy checks are scattered. During incidents, operators cannot explain why an action executed.
What top sources cover vs miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Temporal Durable Execution Technical Guide | Excellent breakdown of consistency problems, retries, compensation, and long-running workflow behavior in distributed systems. | No native policy decision layer for pre-execution ALLOW or DENY decisions on AI agent actions. |
| LangGraph Durable Execution Docs | Strong details on checkpointers, durability modes, task boundaries, and resume behavior after exceptions. | No cross-system governance model for approvals and immutable policy snapshot binding before side effects. |
| LangChain Frameworks, Runtimes, and Harnesses | Clear categorization of frameworks vs runtimes and explicit mention of Temporal as a runtime option. | Does not provide actionable reliability and governance integration guidance for regulated production environments. |
Mental model
- Execution reliability engine: completion guarantees, retries, and workflow state progression.
- Governance control plane: pre-execution policy decisions, approvals, and auditable action contracts.
Failure semantics comparison
| Dimension | Temporal | Cordum |
|---|---|---|
| Primary objective | Complete workflows despite failures | Enforce governance before execution |
| Retry model | Activity retries with policy and backoff | Protocol-level result classes (`FAILED_RETRYABLE` vs `FAILED_FATAL`) |
| Approval semantics | Custom signals and workflow logic | First-class `REQUIRE_APPROVAL` state in policy flow |
| Rollback model | Compensations in workflow code | Saga rollback triggered on fatal outcomes |
| Policy location | Application-defined | Safety Kernel decisions at submit and dispatch |
| Best at | Durability and orchestration correctness | Operational guardrails, approvals, and auditability |
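The rollback row above refers to compensations written in workflow code. The pattern itself is framework-agnostic, so here is a minimal sketch that uses no Temporal APIs (all names are illustrative): each completed step registers a compensation, and a fatal failure unwinds the completed steps in reverse order.

```typescript
// Minimal saga runner: each completed step registers a compensation,
// executed in reverse order when a later step fails fatally.
type Step = {
  name: string;
  run: () => Promise<void>;
  compensate: () => Promise<void>;
};

export async function runSaga(steps: Step[]): Promise<string[]> {
  const done: Step[] = [];
  const log: string[] = [];
  for (const step of steps) {
    try {
      await step.run();
      done.push(step);
      log.push(`ran:${step.name}`);
    } catch {
      // Fatal outcome: unwind already-completed steps, newest first.
      for (const prev of done.reverse()) {
        await prev.compensate();
        log.push(`compensated:${prev.name}`);
      }
      log.push(`aborted:${step.name}`);
      return log;
    }
  }
  log.push("completed");
  return log;
}
```

In a real Temporal workflow the same shape appears as a try/catch around activity calls, with compensations invoked as activities; the sketch only shows the ordering guarantee.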
How to combine both
A practical pattern is straightforward: Temporal orchestrates workflow lifetime, while Cordum evaluates policy before high-impact steps.
- Temporal controls retries, timers, and long waits.
- Cordum returns `ALLOW`, `DENY`, `REQUIRE_APPROVAL`, or `ALLOW_WITH_CONSTRAINTS` before dispatch.
- Fatal outcomes trigger rollback and compensation paths deterministically.
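The combination amounts to a guard around each high-impact dispatch. The sketch below is illustrative only: `checkPolicy` stands in for a call to the Cordum Safety Kernel and is not the real SDK, and `guardedDispatch` is a hypothetical orchestrator-side helper.

```typescript
// Decision vocabulary from the Cordum policy flow.
type PolicyDecision = "ALLOW" | "DENY" | "REQUIRE_APPROVAL" | "ALLOW_WITH_CONSTRAINTS";

type PolicyRequest = { topic: string; labels: Record<string, string> };

// Stand-in for the Safety Kernel call; a real deployment would
// evaluate the policy rules shown later in this article.
async function checkPolicy(req: PolicyRequest): Promise<PolicyDecision> {
  if (req.topic === "job.deploy.apply" && req.labels.env === "prod") {
    return "REQUIRE_APPROVAL";
  }
  return "ALLOW";
}

// Evaluate policy before the side-effecting step runs.
export async function guardedDispatch(
  req: PolicyRequest,
  dispatch: () => Promise<void>,
  awaitApproval: () => Promise<boolean>,
): Promise<string> {
  const decision = await checkPolicy(req);
  switch (decision) {
    case "DENY":
      return "denied";
    case "REQUIRE_APPROVAL":
      // In Temporal this wait would typically be a signal or timer.
      if (!(await awaitApproval())) return "approval-rejected";
      break;
    // ALLOW and ALLOW_WITH_CONSTRAINTS proceed; constraints would
    // be attached to the dispatch in a fuller implementation.
  }
  await dispatch();
  return "dispatched";
}
```

The key property is ordering: the policy decision is bound before the side effect, so the audit trail can always explain why an action executed.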
Working code patterns
```typescript
// Temporal workflow (TypeScript)
import { proxyActivities } from "@temporalio/workflow";

const { runPlan, applyChange } = proxyActivities<{
  runPlan(input: unknown): Promise<{ step: string }>;
  applyChange(input: unknown): Promise<void>;
}>({ startToCloseTimeout: "2 minutes" });

export async function DeployWorkflow(input: { service: string }) {
  const plan = await runPlan(input);
  await applyChange(plan);
  return { status: "completed" };
}
```

```yaml
# Cordum input policy
version: v1
rules:
  - id: require-approval-prod-write
    match:
      topic: "job.deploy.apply"
      labels:
        env: prod
    decision: REQUIRE_APPROVAL
  - id: deny-destructive-shell
    match:
      topic: "job.exec.shell"
      labels:
        command_class: destructive
        env: prod
    decision: DENY
  - id: allow-readonly-observe
    match:
      topic: "job.observe.read"
    decision: ALLOW
```

```
// Result contract used by orchestration layer
JobResult.status:
  JOB_STATUS_SUCCEEDED
  JOB_STATUS_FAILED_RETRYABLE
  JOB_STATUS_FAILED_FATAL
  JOB_STATUS_DENIED

// Operational behavior
FAILED_RETRYABLE -> retry path
FAILED_FATAL     -> compensation path
DENIED           -> stop and report policy violation
```
For a broader architecture view, see LangGraph vs Temporal vs Cordum.
Limitations and tradeoffs
Integration overhead
Combining orchestration and governance adds initial complexity, but reduces long-term incident ambiguity.
Policy tuning work
Overly strict rules can block healthy automation. Teams need iterative tuning with real incident data.
Operator discipline
Clear ownership boundaries are mandatory. Without them, retry and approval logic drifts across systems.