The production problem
Partial failure is the default shape of distributed agent workflows. Payment may succeed while inventory rollback fails. Access may be granted while notification delivery times out.
Teams often discover this too late: the forward workflow is automated, but compensation is manual, slow, and undocumented.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| AWS Saga patterns | Clear continuation vs compensation framing and choreography/orchestration split. | No concrete governance model for high-risk autonomous actions before dispatch. |
| AWS serverless saga with Step Functions | Solid end-to-end orchestration example with explicit compensation branches. | Limited guidance on compensation safety-deny semantics and operator evidence trails. |
| Azure Compensating Transaction pattern | Strong treatment of idempotency, non-reverse ordering, and manual intervention paths. | Not agent-control-plane specific; no pre-dispatch policy gate for autonomous workflows. |
Compensation model for agents
| Layer | Required design | Failure if missing |
|---|---|---|
| Action contract | Every side-effecting action declares compensation or explicit no-compensation reason. | Partial success with no deterministic recovery path. |
| Idempotency | Compensation requests carry stable keys and can be retried safely. | Duplicate refunds, repeated reversals, inconsistent external state. |
| Governance | High-risk actions without valid compensation are denied or require approval. | Agent executes irreversible actions under degraded conditions. |
| Forensics | Timeline records include trigger status, compensation dispatches, skips, and reasons. | Post-incident analysis turns into manual log archaeology. |
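
The action-contract and governance rows can be sketched as a small pre-dispatch gate. This is a minimal illustration, not Cordum's implementation: the `Action` type, its fields, the `Risk` tiers, and the `Gate` function are all hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// Risk is a hypothetical risk tier attached to an action.
type Risk int

const (
	RiskLow Risk = iota
	RiskHigh
)

// Action is a hypothetical contract for one side-effecting action.
type Action struct {
	Topic             string
	Risk              Risk
	CompensationTopic string // topic that undoes or mitigates the action
	NoCompensationWhy string // explicit reason when no compensation exists
}

// Decision is the gate's verdict before dispatch.
type Decision string

const (
	Allow           Decision = "allow"
	RequireApproval Decision = "require_approval"
	Deny            Decision = "deny"
)

// ErrNoContract marks an action that declares neither a compensation
// nor an explicit no-compensation reason.
var ErrNoContract = errors.New("action declares neither compensation nor a no-compensation reason")

// Gate applies the contract and governance rules from the table above.
func Gate(a Action) (Decision, error) {
	hasComp := a.CompensationTopic != ""
	hasReason := a.NoCompensationWhy != ""
	switch {
	case !hasComp && !hasReason:
		return Deny, ErrNoContract // contract violated: no recovery path declared
	case hasComp:
		return Allow, nil
	case a.Risk == RiskHigh:
		return RequireApproval, nil // irreversible and high-risk: a human decides
	default:
		return Allow, nil // irreversible but low-risk, with a documented reason
	}
}

func main() {
	actions := []Action{
		{Topic: "payments.charge", Risk: RiskHigh, CompensationTopic: "payments.refund"},
		{Topic: "email.send", Risk: RiskHigh, NoCompensationWhy: "delivered mail cannot be recalled"},
		{Topic: "inventory.reserve", Risk: RiskHigh},
	}
	for _, a := range actions {
		d, err := Gate(a)
		fmt.Println(a.Topic, d, err)
	}
}
```

Whether an undeclared action is denied outright or routed to approval is a policy choice; the sketch denies it so that a missing contract can never reach dispatch.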
Cordum runtime evidence
These behaviors are verified from current scheduler/saga code paths.
| Control | Current behavior | Evidence | Why it matters |
|---|---|---|---|
| Rollback trigger and timeout | `FAILED_FATAL` with workflow ID starts saga rollback goroutine (30s timeout). | engine.go | Recovery is explicit and time-bounded. |
| LIFO compensation storage | Successful forward steps push compensation templates to `saga:<workflow_id>:stack`. | saga.go + scheduler-internals.md | Rollback replays newest side effects first. |
| Rollback lock | Per-workflow lock `saga:<workflow_id>:lock` with 2-minute TTL via `SETNX`. | saga.go | Prevents concurrent rollback races on the same workflow. |
| Compensation dispatch metadata | Compensation jobs are tagged (`saga_compensation=true`, `is_compensation=true`) and forced to CRITICAL priority. | saga.go | Improves routing and observability for recovery traffic. |
| Compensation idempotency | If no explicit key exists, scheduler derives one with `saga:<hash>` format. | saga.go | Enables safe replay of compensation paths. |
| Safety behavior during rollback | A safety deny skips the compensation step; if the safety check is unavailable, compensation proceeds with a warning. | saga.go | Avoids hard-blocking rollback, but skipped steps can leave residual manual work. |
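
The safety row above amounts to a three-way decision per compensation step. The sketch below illustrates that decision and how it could feed the evidence counters. It is not taken from `saga.go`: `SafetyResult`, `Outcome`, `Decide`, and `Summarize` are hypothetical names.

```go
package main

import "fmt"

// SafetyResult is a hypothetical outcome of the safety check for one step.
type SafetyResult int

const (
	SafetyAllow SafetyResult = iota
	SafetyDeny
	SafetyUnavailable
)

// Outcome says whether a compensation step is dispatched, and why not if skipped.
type Outcome struct {
	Dispatch   bool
	Warning    string
	SkipReason string
}

// Decide maps a safety result to a dispatch decision:
// deny skips the step, unavailable proceeds with a warning.
func Decide(r SafetyResult) Outcome {
	switch r {
	case SafetyDeny:
		return Outcome{Dispatch: false, SkipReason: "safety_denied"}
	case SafetyUnavailable:
		return Outcome{Dispatch: true, Warning: "safety_unavailable"}
	default:
		return Outcome{Dispatch: true}
	}
}

// Summary holds counters in the spirit of a rollback evidence record.
type Summary struct {
	Dispatched    int
	SkippedDenied int
	Warnings      int
}

// Summarize tallies decisions across all steps of one rollback.
func Summarize(results []SafetyResult) Summary {
	var s Summary
	for _, r := range results {
		o := Decide(r)
		if o.Dispatch {
			s.Dispatched++
		} else {
			s.SkippedDenied++
		}
		if o.Warning != "" {
			s.Warnings++
		}
	}
	return s
}

func main() {
	s := Summarize([]SafetyResult{SafetyAllow, SafetyDeny, SafetyUnavailable, SafetyAllow})
	fmt.Printf("%+v\n", s) // {Dispatched:3 SkippedDenied:1 Warnings:1}
}
```

Recording the skip reason and warning per step, rather than only the totals, is what lets an operator later tell a denied compensation from one that ran without a safety verdict.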
Implementation examples
Rollback loop with lock and LIFO stack (Go)
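```go
func (s *SagaManager) Rollback(ctx context.Context, workflowID string) error {
	lockKey := fmt.Sprintf("saga:%s:lock", workflowID)
	// SETNX with a 2-minute TTL: only one rollback runs per workflow.
	ok, err := s.redis.SetNX(ctx, lockKey, "1", 2*time.Minute).Result()
	if err != nil {
		return fmt.Errorf("acquire rollback lock: %w", err)
	}
	if !ok {
		return nil // rollback already active
	}
	for {
		// LPOP yields newest-first order only if forward steps push with LPUSH.
		data, err := s.redis.LPop(ctx, fmt.Sprintf("saga:%s:stack", workflowID)).Bytes()
		if err == redis.Nil {
			break // stack drained
		}
		if err != nil {
			return fmt.Errorf("pop compensation: %w", err)
		}
		_ = data // unmarshal + dispatch compensation request
	}
	return nil
}
```
Compensation idempotency key derivation (Go)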
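```go
// compensationIdempotencyKey derives a stable "saga:<hash>" key from the
// fields that identify one compensation. Assumes base.Meta is non-nil.
func compensationIdempotencyKey(base *pb.JobRequest, comp *pb.Compensation) string {
	seed := strings.Join([]string{
		base.WorkflowId, base.JobId, comp.Topic, base.Meta.Capability,
	}, "|")
	sum := sha256.Sum256([]byte(seed))
	// First 16 bytes of the digest, hex-encoded (32 characters).
	return "saga:" + hex.EncodeToString(sum[:16])
}
```
Rollback evidence record (JSON)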
```json
{
  "workflow_id": "wf_19c1",
  "trigger_status": "FAILED_FATAL",
  "rollback_started": true,
  "lock_key": "saga:wf_19c1:lock",
  "compensation_dispatched": 3,
  "compensation_skipped_denied": 1,
  "compensation_failed_publish": 0
}
```
Limitations and tradeoffs
- Some actions are not truly reversible. Compensation then becomes mitigation, not restore.
- Safety-denied compensation steps can leave residual side effects that require manual operations.
- Rollback is time-bounded (30s in the scheduler flow); long compensations may need separate recovery playbooks.
- Compensation logic must be tested as hard as forward execution or incident recovery will fail under load.
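
The time-bound limitation can be made concrete with a sketch that runs compensation steps under a deadline and reports whatever did not finish, so those steps can be handed to a recovery playbook. This assumes nothing about the scheduler's internals: `Step`, `RunBounded`, and `demoPending` are hypothetical, and the 20 ms budget stands in for the 30s bound only to keep the demo fast.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Step is a hypothetical compensation step that honors context cancellation.
type Step struct {
	Name string
	Run  func(ctx context.Context) error
}

// RunBounded runs steps in order under a shared time budget and returns
// the names of steps that failed, timed out, or never started.
func RunBounded(parent context.Context, budget time.Duration, steps []Step) []string {
	ctx, cancel := context.WithTimeout(parent, budget)
	defer cancel()

	var pending []string
	for i, s := range steps {
		if ctx.Err() != nil {
			// Budget exhausted: everything from here on is left for manual recovery.
			for _, rest := range steps[i:] {
				pending = append(pending, rest.Name)
			}
			break
		}
		if err := s.Run(ctx); err != nil {
			pending = append(pending, s.Name)
		}
	}
	return pending
}

// sleepStep simulates a compensation that takes d to complete.
func sleepStep(name string, d time.Duration) Step {
	return Step{Name: name, Run: func(ctx context.Context) error {
		select {
		case <-time.After(d):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}}
}

// demoPending runs one fast, one slow, and one unreached step under a 20 ms budget.
func demoPending() []string {
	steps := []Step{
		sleepStep("refund-payment", 0),
		sleepStep("release-inventory", 5*time.Second),
		sleepStep("revoke-access", 0),
	}
	return RunBounded(context.Background(), 20*time.Millisecond, steps)
}

func main() {
	fmt.Println("needs manual recovery:", demoPending())
}
```

Returning the unfinished step names, instead of only an error, gives the incident branch a concrete work list when the budget runs out.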
Next step
Run this rollout in one sprint:
1. Classify side-effecting topics as compensatable, mitigatable, or irreversible.
2. Block production dispatch when compensation metadata is missing.
3. Run weekly `FAILED_FATAL` drills and verify rollback evidence completeness.
4. Add a manual incident branch for compensation-denied and compensation-timeout outcomes.
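
For the drill in step 3, "evidence completeness" could be checked mechanically. The sketch below parses a record shaped like the JSON example earlier and flags gaps. The `CheckEvidence` function and the `expectedSteps` input are assumptions for illustration; in practice the expected count would come from the workflow's recorded forward steps.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Evidence mirrors the fields of the rollback evidence record shown earlier.
type Evidence struct {
	WorkflowID      string `json:"workflow_id"`
	TriggerStatus   string `json:"trigger_status"`
	RollbackStarted bool   `json:"rollback_started"`
	LockKey         string `json:"lock_key"`
	Dispatched      int    `json:"compensation_dispatched"`
	SkippedDenied   int    `json:"compensation_skipped_denied"`
	FailedPublish   int    `json:"compensation_failed_publish"`
}

const sample = `{
  "workflow_id": "wf_19c1",
  "trigger_status": "FAILED_FATAL",
  "rollback_started": true,
  "lock_key": "saga:wf_19c1:lock",
  "compensation_dispatched": 3,
  "compensation_skipped_denied": 1,
  "compensation_failed_publish": 0
}`

// CheckEvidence returns a list of problems; an empty list means the record
// accounts for every expected compensation step.
func CheckEvidence(raw []byte, expectedSteps int) []string {
	var e Evidence
	if err := json.Unmarshal(raw, &e); err != nil {
		return []string{"invalid JSON: " + err.Error()}
	}
	var problems []string
	if e.WorkflowID == "" {
		problems = append(problems, "missing workflow_id")
	}
	if e.TriggerStatus != "FAILED_FATAL" {
		problems = append(problems, "unexpected trigger_status: "+e.TriggerStatus)
	}
	if !e.RollbackStarted {
		problems = append(problems, "rollback did not start")
	}
	if e.LockKey != "saga:"+e.WorkflowID+":lock" {
		problems = append(problems, "lock_key does not match workflow")
	}
	accounted := e.Dispatched + e.SkippedDenied + e.FailedPublish
	if accounted != expectedSteps {
		problems = append(problems,
			fmt.Sprintf("accounted for %d of %d compensation steps", accounted, expectedSteps))
	}
	return problems
}

func main() {
	fmt.Println(CheckEvidence([]byte(sample), 4)) // every step accounted for
	fmt.Println(CheckEvidence([]byte(sample), 5)) // one step missing from the record
}
```

A non-empty problem list from a drill is a signal to fix the evidence trail before a real incident depends on it.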
Continue with Approval Workflows for Autonomous AI Agents and AI Agent Incident Report.