The production problem
Partial failure is the default shape of distributed agent workflows. Payment may succeed while inventory rollback fails. Access may be granted while notification delivery times out.
Teams often discover this too late: the forward workflow is automated, but compensation is manual, slow, and undocumented.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| AWS Saga patterns | Clear continuation vs compensation framing and choreography/orchestration split. | No concrete governance model for high-risk autonomous actions before dispatch. |
| AWS serverless saga with Step Functions | Solid end-to-end orchestration example with explicit compensation branches. | Limited guidance on compensation safety-deny semantics and operator evidence trails. |
| Azure Compensating Transaction pattern | Strong treatment of idempotency, non-reverse ordering, and manual intervention paths. | Not agent-control-plane specific; no pre-dispatch policy gate for autonomous workflows. |
Compensation model for agents
| Layer | Required design | Failure if missing |
|---|---|---|
| Action contract | Every side-effecting action declares compensation or explicit no-compensation reason. | Partial success with no deterministic recovery path. |
| Idempotency | Compensation requests carry stable keys and can be retried safely. | Duplicate refunds, repeated reversals, inconsistent external state. |
| Governance | High-risk actions without valid compensation are denied or require approval. | Agent executes irreversible actions under degraded conditions. |
| Forensics | Timeline records include trigger status, compensation dispatches, skips, and reasons. | Post-incident analysis turns into manual log archaeology. |
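The action-contract and governance rows can be read as a single pre-dispatch check. The sketch below is one possible shape for that gate, not Cordum's actual schema: the `Action` struct, the `Risk` tiers, and the `Gate` function are hypothetical names introduced for illustration. It assumes a high-risk action with a documented no-compensation reason goes to approval, while an action with neither a compensation nor a reason is denied.

```go
package main

import "fmt"

// Risk is a hypothetical risk tier assigned to an action.
type Risk int

const (
	LowRisk Risk = iota
	HighRisk
)

// Action is an illustrative action contract (assumed fields, not a real schema).
type Action struct {
	Topic                string
	Risk                 Risk
	CompensationTopic    string // topic that reverses or mitigates this action
	NoCompensationReason string // explicit justification when no compensation exists
}

// Decision is the outcome of the pre-dispatch gate.
type Decision string

const (
	Allow           Decision = "ALLOW"
	RequireApproval Decision = "REQUIRE_APPROVAL"
	Deny            Decision = "DENY"
)

// Gate applies the contract and governance layers before dispatch.
func Gate(a Action) Decision {
	hasComp := a.CompensationTopic != ""
	hasReason := a.NoCompensationReason != ""
	switch {
	case hasComp:
		return Allow // a deterministic recovery path exists
	case !hasReason:
		return Deny // contract violated: neither compensation nor a reason
	case a.Risk == HighRisk:
		return RequireApproval // irreversible and high-risk: a human decides
	default:
		return Allow // low-risk action with a documented reason
	}
}

func main() {
	actions := []Action{
		{Topic: "payments.charge", Risk: HighRisk, CompensationTopic: "payments.refund"},
		{Topic: "email.send", Risk: LowRisk, NoCompensationReason: "delivery cannot be recalled"},
		{Topic: "db.drop_table", Risk: HighRisk, NoCompensationReason: "no backup path"},
		{Topic: "access.grant", Risk: HighRisk},
	}
	for _, a := range actions {
		fmt.Printf("%-16s %s\n", a.Topic, Gate(a))
	}
}
```

Keeping the gate a pure function of the declared contract makes it easy to unit-test and to run identically in CI and at dispatch time.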
Cordum runtime evidence
These behaviors were verified against the current scheduler and saga code paths.
| Control | Current behavior | Evidence | Why it matters |
|---|---|---|---|
| Rollback trigger and timeout | `FAILED_FATAL` with workflow ID starts saga rollback goroutine (30s timeout). | engine.go | Recovery is explicit and time-bounded. |
| LIFO compensation storage | Successful forward steps push compensation templates to `saga:<workflow_id>:stack`. | saga.go + scheduler-internals.md | Rollback replays newest side effects first. |
| Rollback lock | Per-workflow lock `saga:<workflow_id>:lock` with 2-minute TTL via `SETNX`. | saga.go | Prevents concurrent rollback races on the same workflow. |
| Compensation dispatch metadata | Compensation jobs are tagged (`saga_compensation=true`, `is_compensation=true`) and forced to CRITICAL priority. | saga.go | Improves routing and observability for recovery traffic. |
| Compensation idempotency | If no explicit key exists, scheduler derives one with `saga:<hash>` format. | saga.go | Enables safe replay of compensation paths. |
| Safety behavior during rollback | Safety deny skips compensation; safety unavailable proceeds with warning. | saga.go | Avoids hard blocking rollback, but can leave residual manual work. |
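The safety row in the table is the least intuitive one, so a small simulation may help. The sketch below models only the policy described there (deny skips the step, unavailable proceeds with a warning). The `SafetyCheck` function type, the `Outcome` struct, and the topic names are assumptions made for illustration and do not mirror the real `saga.go` code.

```go
package main

import (
	"errors"
	"fmt"
)

// SafetyCheck is a hypothetical stand-in for a call to the safety layer.
// It returns allowed=false for a deny and a non-nil error when unreachable.
type SafetyCheck func(topic string) (allowed bool, err error)

// Outcome tallies what happened to each compensation during rollback.
type Outcome struct {
	Dispatched    []string
	SkippedDenied []string
	Warnings      []string
}

// RollbackWithSafety walks compensations in the given order and applies the
// policy: deny skips the step, unavailable proceeds with a warning.
func RollbackWithSafety(compensations []string, check SafetyCheck) Outcome {
	var out Outcome
	for _, topic := range compensations {
		allowed, err := check(topic)
		switch {
		case err != nil:
			// Fail open: a safety outage must not hard-block rollback.
			out.Warnings = append(out.Warnings, fmt.Sprintf("%s: safety unavailable: %v", topic, err))
			out.Dispatched = append(out.Dispatched, topic)
		case !allowed:
			// Denied steps are skipped and left for an operator.
			out.SkippedDenied = append(out.SkippedDenied, topic)
		default:
			out.Dispatched = append(out.Dispatched, topic)
		}
	}
	return out
}

var errUnavailable = errors.New("safety service unreachable")

// demoCheck denies one topic and simulates an outage for another.
func demoCheck(topic string) (bool, error) {
	switch topic {
	case "access.revoke":
		return false, nil
	case "inventory.release":
		return false, errUnavailable
	default:
		return true, nil
	}
}

// demoOutcome runs the policy over four illustrative compensations.
func demoOutcome() Outcome {
	topics := []string{"notify.retract", "access.revoke", "inventory.release", "payments.refund"}
	return RollbackWithSafety(topics, demoCheck)
}

func main() {
	out := demoOutcome()
	fmt.Println("dispatched:", out.Dispatched)
	fmt.Println("skipped (denied):", out.SkippedDenied)
	fmt.Println("warnings:", out.Warnings)
}
```

The asymmetry is deliberate in this model: a deny is an explicit decision and is honored, while an outage is treated as missing information and does not stop recovery.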
Implementation examples
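Rollback loop with lock and LIFO stack (Go)

    func (s *SagaManager) Rollback(ctx context.Context, workflowID string) error {
        lockKey := fmt.Sprintf("saga:%s:lock", workflowID)
        ok, err := s.redis.SetNX(ctx, lockKey, "1", 2*time.Minute).Result()
        if err != nil {
            return err // could not reach Redis to take the lock
        }
        if !ok {
            return nil // rollback already active
        }
        for {
            // LPOP is newest-first only if forward steps LPUSH their templates.
            data, err := s.redis.LPop(ctx, fmt.Sprintf("saga:%s:stack", workflowID)).Bytes()
            if err == redis.Nil {
                break // stack drained
            }
            if err != nil {
                return err // surface Redis errors instead of looping forever
            }
            _ = data // unmarshal + dispatch compensation request (elided)
        }
        return nil
    }

Compensation idempotency key derivation (Go)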

    func compensationIdempotencyKey(base *pb.JobRequest, comp *pb.Compensation) string {
        // Capability comes from the compensation, falling back to the forward job.
        capability := comp.Meta.GetCapability()
        if capability == "" {
            capability = base.Meta.GetCapability()
        }
        step := ""
        if base.StepIndex != 0 {
            step = fmt.Sprintf("%d", base.StepIndex)
        }
        seed := strings.Trim(strings.Join([]string{
            base.WorkflowId, base.JobId, comp.Topic, capability, step,
        }, "|"), "|")
        sum := sha256.Sum256([]byte(seed))
        return "saga:" + hex.EncodeToString(sum[:16]) // stable key → safe replay
    }

Rollback evidence record (JSON)

    {
      "workflow_id": "wf_19c1",
      "trigger_status": "FAILED_FATAL",
      "rollback_started": true,
      "lock_key": "saga:wf_19c1:lock",
      "compensation_dispatched": 3,
      "compensation_skipped_denied": 1,
      "compensation_failed_publish": 0
    }

Limitations and tradeoffs
- Some actions are not truly reversible. Compensation then becomes mitigation, not restore.
- Safety-denied compensation steps can leave residual side effects that require manual operations.
- Rollback is time-bounded (30s in scheduler flow); long compensations may need separate recovery playbooks.
- Compensation logic must be tested as hard as forward execution or incident recovery will fail under load.
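The time-bound limitation is easier to reason about with a concrete loop. The following sketch assumes a rollback budget enforced with `context.WithTimeout` and returns the compensations that did not complete, so they can be handed to a separate recovery playbook. `Compensation`, `BoundedRollback`, and `sleeper` are hypothetical names, and the budget is shortened to milliseconds so the demo finishes quickly.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Compensation is an illustrative unit of rollback work.
type Compensation struct {
	Topic string
	Run   func(ctx context.Context) error
}

// BoundedRollback runs compensations in order until the budget expires and
// returns the topics that failed or were never attempted.
func BoundedRollback(parent context.Context, budget time.Duration, comps []Compensation) (leftover []string) {
	ctx, cancel := context.WithTimeout(parent, budget)
	defer cancel()
	for i, c := range comps {
		if ctx.Err() != nil {
			// Budget exhausted: everything from here on is unattempted.
			for _, rest := range comps[i:] {
				leftover = append(leftover, rest.Topic)
			}
			return leftover
		}
		if err := c.Run(ctx); err != nil {
			leftover = append(leftover, c.Topic)
		}
	}
	return leftover
}

// sleeper returns a Run func that takes d to finish unless the context ends first.
func sleeper(d time.Duration) func(context.Context) error {
	return func(ctx context.Context) error {
		select {
		case <-time.After(d):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// runDemo uses a 100ms budget where the second compensation needs 10s.
func runDemo() []string {
	comps := []Compensation{
		{Topic: "payments.refund", Run: sleeper(time.Millisecond)},
		{Topic: "inventory.release", Run: sleeper(10 * time.Second)}, // exceeds the budget
		{Topic: "access.revoke", Run: sleeper(time.Millisecond)},
	}
	return BoundedRollback(context.Background(), 100*time.Millisecond, comps)
}

func main() {
	fmt.Println("needs manual recovery:", runDemo())
}
```

In this run the slow compensation is cut off when the budget expires and the one queued behind it is never attempted, so both should appear in the leftover list. That list is the input a manual playbook needs.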
FAQ
How do you roll back an AI agent action?
You do not undo it — you run a second, compensating workflow. Each side-effecting step declares a compensation action up front; when a later step fails fatally, the orchestrator replays those compensations in reverse (LIFO) order. In Cordum, a successful step pushes its compensation template onto a per-workflow Redis stack, and a FAILED_FATAL result triggers the saga manager to pop and dispatch each compensation as a CRITICAL-priority job.
What is the saga pattern for AI agents?
The saga pattern models a long-running, multi-step workflow as a sequence of local transactions, each paired with a compensating transaction that semantically reverses it. For AI agents it means every action an agent takes against an external system (a refund, an access grant, a ticket) carries a defined way to undo or mitigate it, so partial failures resolve to a consistent state instead of leaving orphaned side effects.
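As a rough illustration of that definition, here is a minimal in-memory saga in Go. It is a teaching sketch under simplifying assumptions (no persistence, no retries, no concurrency), and the `Step` type, `RunSaga` function, and step names are hypothetical rather than part of any framework.

```go
package main

import (
	"errors"
	"fmt"
)

// Step pairs a local transaction with the compensation that semantically reverses it.
type Step struct {
	Name       string
	Do         func() error
	Compensate func() error
}

// RunSaga executes steps in order. On the first failure it runs the
// compensations of already-completed steps newest-first and returns the error.
// The returned log records what happened, in order.
func RunSaga(steps []Step) (log []string, err error) {
	var done []Step
	for _, s := range steps {
		if doErr := s.Do(); doErr != nil {
			log = append(log, "failed:"+s.Name)
			for i := len(done) - 1; i >= 0; i-- {
				if cErr := done[i].Compensate(); cErr != nil {
					log = append(log, "compensation-failed:"+done[i].Name)
					continue // keep unwinding; this one needs manual follow-up
				}
				log = append(log, "compensated:"+done[i].Name)
			}
			return log, fmt.Errorf("saga aborted at %s: %w", s.Name, doErr)
		}
		log = append(log, "done:"+s.Name)
		done = append(done, s)
	}
	return log, nil
}

func ok() error { return nil }

// demoSaga charges, grants access, then fails on notification.
func demoSaga() ([]string, error) {
	return RunSaga([]Step{
		{Name: "charge", Do: ok, Compensate: ok},
		{Name: "grant-access", Do: ok, Compensate: ok},
		{Name: "notify", Do: func() error { return errors.New("delivery timed out") }, Compensate: ok},
	})
}

func main() {
	log, err := demoSaga()
	for _, line := range log {
		fmt.Println(line)
	}
	fmt.Println("error:", err)
}
```

The failed step itself is not compensated in this sketch, because it never completed. Only steps that committed a side effect are unwound.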
When does Cordum trigger a saga rollback?
Cordum's scheduler starts a rollback when a job in a workflow returns JOB_STATUS_FAILED_FATAL and the workflow ID is known. It launches a bounded rollback goroutine (30-second context), acquires a per-workflow Redis lock (saga:<workflow_id>:lock, 2-minute TTL via SETNX) so two rollbacks cannot race, then pops compensation templates off saga:<workflow_id>:stack and dispatches them newest-first.
What happens if a compensation step is itself unsafe?
Compensation dispatch is gated by the Safety Kernel. If the kernel returns DENY for a compensation job, that step is skipped and logged rather than forced through — which can leave residual side effects for an operator to resolve. If the kernel is unavailable, the compensation proceeds with a warning so rollback is not hard-blocked by a control-plane outage.
How is compensation made idempotent?
If a compensation does not carry an explicit idempotency key, Cordum derives a stable one by hashing the workflow ID, originating job ID, compensation topic, capability, and step index (saga:<sha256-prefix>). Replays of the same compensation therefore collapse to one effect, preventing duplicate refunds or repeated reversals.
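A stable key only helps if the receiving side deduplicates on it. The sketch below shows one way a worker might do that, assuming an in-memory map in place of a durable store. `IdempotentExecutor` and the key value are hypothetical, and a production version would also need to handle effects that fail partway through.

```go
package main

import (
	"fmt"
	"sync"
)

// IdempotentExecutor applies each idempotency key at most once.
// An in-memory map stands in for a durable store here.
type IdempotentExecutor struct {
	mu   sync.Mutex
	seen map[string]bool
}

// NewIdempotentExecutor returns an executor with an empty key set.
func NewIdempotentExecutor() *IdempotentExecutor {
	return &IdempotentExecutor{seen: make(map[string]bool)}
}

// Execute runs effect only the first time key is seen and reports whether it ran.
func (e *IdempotentExecutor) Execute(key string, effect func()) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.seen[key] {
		return false // replay: collapse to the original effect
	}
	e.seen[key] = true
	effect()
	return true
}

// demoReplay delivers the same compensation three times and counts refunds issued.
func demoReplay() int {
	exec := NewIdempotentExecutor()
	refunds := 0
	for i := 0; i < 3; i++ {
		exec.Execute("saga:example-key", func() { refunds++ })
	}
	return refunds
}

func main() {
	fmt.Println("refunds issued:", demoReplay())
}
```

Three deliveries of the same key result in a single refund, which is the property the derived key is meant to enable.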
Next step
Run this rollout in one sprint:
1. Classify side-effecting topics as compensatable, mitigatable, or irreversible.
2. Block production dispatch when compensation metadata is missing.
3. Run weekly `FAILED_FATAL` drills and verify rollback evidence completeness.
4. Add manual incident branch for compensation-denied and compensation-timeout outcomes.
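For the drill in step 3, a small checker can make "evidence completeness" concrete. The sketch below parses a record with the same field names as the evidence example in this article and applies a few checks. The checks themselves are assumptions chosen for illustration (for instance, that dispatched, skipped, and failed counts should add up to the number of compensations the drill expected), and `Verify` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Evidence mirrors the rollback evidence record fields.
type Evidence struct {
	WorkflowID      string `json:"workflow_id"`
	TriggerStatus   string `json:"trigger_status"`
	RollbackStarted bool   `json:"rollback_started"`
	LockKey         string `json:"lock_key"`
	Dispatched      int    `json:"compensation_dispatched"`
	SkippedDenied   int    `json:"compensation_skipped_denied"`
	FailedPublish   int    `json:"compensation_failed_publish"`
}

// Verify returns the problems found in a record; an empty slice means it passes.
// expected is the number of compensations the drill set up (an assumed input).
func Verify(raw []byte, expected int) []string {
	var ev Evidence
	if err := json.Unmarshal(raw, &ev); err != nil {
		return []string{"invalid JSON: " + err.Error()}
	}
	var problems []string
	if ev.WorkflowID == "" {
		problems = append(problems, "missing workflow_id")
	}
	if ev.TriggerStatus != "FAILED_FATAL" {
		problems = append(problems, "trigger_status is not FAILED_FATAL")
	}
	if !ev.RollbackStarted {
		problems = append(problems, "rollback did not start")
	}
	if ev.LockKey != "saga:"+ev.WorkflowID+":lock" {
		problems = append(problems, "lock_key does not match workflow")
	}
	if total := ev.Dispatched + ev.SkippedDenied + ev.FailedPublish; total != expected {
		problems = append(problems, fmt.Sprintf("accounted for %d of %d compensations", total, expected))
	}
	return problems
}

const sampleEvidence = `{
  "workflow_id": "wf_19c1",
  "trigger_status": "FAILED_FATAL",
  "rollback_started": true,
  "lock_key": "saga:wf_19c1:lock",
  "compensation_dispatched": 3,
  "compensation_skipped_denied": 1,
  "compensation_failed_publish": 0
}`

func main() {
	fmt.Println("expected 4:", Verify([]byte(sampleEvidence), 4))
	fmt.Println("expected 5:", Verify([]byte(sampleEvidence), 5))
}
```

Running a check like this after each drill turns missing evidence into a failing test rather than a surprise during a real incident.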
Continue with Approval Workflows for Autonomous AI Agents and AI Agent Incident Report, or see how compensation fits the broader control plane in AI Agent Governance for Production.