Name: Cordum
Author: Cordum

The production problem

Error code migrations rarely fail loudly.

They fail by collapsing distinct failure classes into one fallback bucket.

In Cordum terms, that bucket is `ERROR_CODE_UNSPECIFIED`. The system still runs. Your charts still move. Your postmortems get worse.

If your retry and alert policy depends on error class, this is not cosmetic drift. It is control-plane signal loss.

What top results cover and miss

Source	Strong coverage	Missing piece
gRPC Status Codes	A stable status taxonomy with clear client and server semantics.	No migration pattern for mixed payloads that carry both legacy strings and new enums.
Google AIP-193 (Errors)	Canonical API error model and machine-readable error detail guidance.	No concrete runtime strategy for string-to-enum backfill in an existing event bus.
RFC 9457 Problem Details	Interoperable HTTP problem payload structure and extensibility.	Does not address protobuf enum migration and DLQ telemetry impacts inside a distributed control plane.

Cordum runtime mechanics

Boundary	Current behavior	Why it matters
Result normalization	`handleJobResult` maps legacy `ErrorCode` to `ErrorCodeEnum` when enum is unset.	Old producers remain compatible while newer consumers can read enum values.
Mapping scope	Known strings include `approval_rejected`, `policy_denied`, `policy_violation`, `max_scheduling_retries`, `timeout`, and `permission_denied`.	Coverage is explicit and testable, but incomplete mappings degrade to UNSPECIFIED.
Fallback behavior	Unknown strings default to `ERROR_CODE_UNSPECIFIED`.	No runtime breakage, but classification precision drops.
DLQ path	DLQ emissions also populate `ErrorCodeEnum` from reason code mapping.	Terminal failures can keep structured classification even in dead-letter workflows.
Test enforcement	`TestMapStringToErrorCode` verifies expected mappings plus unknown fallback behavior.	Mapping regressions are catchable before release.

Mapping code paths

Normalization in result handler

core/controlplane/scheduler/engine.go

// core/controlplane/scheduler/engine.go (excerpt)
if res.ErrorCodeEnum == pb.ErrorCode_ERROR_CODE_UNSPECIFIED && res.ErrorCode != "" {
  res.ErrorCodeEnum = mapStringToErrorCode(res.ErrorCode)
}
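
The guard above has three outcomes worth separating: a legacy string is backfilled, an enum already set by the producer is left alone, and an unknown string stays unspecified. The following is a minimal sketch of those outcomes, not the Cordum implementation. `ErrorCode`, `JobResult`, `mapLegacy`, and `normalize` are local stand-ins assumed for illustration (the real code uses the generated `pb` types), and `brand_new_code` is a made-up unmapped string.

```go
package main

import "fmt"

// ErrorCode is a local stand-in for the generated pb.ErrorCode enum.
type ErrorCode int

const (
	ErrorCodeUnspecified ErrorCode = iota
	ErrorCodeJobTimeout
	ErrorCodeSafetyDenied
)

// JobResult is a hypothetical, trimmed-down result message.
type JobResult struct {
	ErrorCode     string    // legacy string field
	ErrorCodeEnum ErrorCode // new enum field
}

// mapLegacy is a two-entry stand-in for mapStringToErrorCode.
func mapLegacy(code string) ErrorCode {
	switch code {
	case "timeout":
		return ErrorCodeJobTimeout
	case "policy_denied":
		return ErrorCodeSafetyDenied
	default:
		return ErrorCodeUnspecified
	}
}

// normalize backfills the enum only when the producer left it unset.
func normalize(res *JobResult) {
	if res.ErrorCodeEnum == ErrorCodeUnspecified && res.ErrorCode != "" {
		res.ErrorCodeEnum = mapLegacy(res.ErrorCode)
	}
}

func main() {
	cases := []JobResult{
		{ErrorCode: "timeout"}, // old producer: string only
		{ErrorCode: "timeout", ErrorCodeEnum: ErrorCodeSafetyDenied}, // new producer: enum wins
		{ErrorCode: "brand_new_code"},                                // unmapped string
	}
	for i := range cases {
		normalize(&cases[i])
		fmt.Printf("%q -> %d\n", cases[i].ErrorCode, cases[i].ErrorCodeEnum)
	}
}
```

Note the second case: when a producer sets both fields and they disagree, the enum is trusted and the string is ignored.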

Legacy string mapping table

core/controlplane/scheduler/engine.go

// core/controlplane/scheduler/engine.go (excerpt)
func mapStringToErrorCode(code string) pb.ErrorCode {
  switch code {
  case "approval_rejected", "policy_denied":
    return pb.ErrorCode_ERROR_CODE_SAFETY_DENIED
  case "policy_violation":
    return pb.ErrorCode_ERROR_CODE_SAFETY_POLICY_VIOLATION
  case "max_scheduling_retries":
    return pb.ErrorCode_ERROR_CODE_JOB_RESOURCE_EXHAUSTED
  case "timeout":
    return pb.ErrorCode_ERROR_CODE_JOB_TIMEOUT
  case "permission_denied":
    return pb.ErrorCode_ERROR_CODE_JOB_PERMISSION_DENIED
  default:
    return pb.ErrorCode_ERROR_CODE_UNSPECIFIED
  }
}
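
The switch returns `ERROR_CODE_UNSPECIFIED` for unknown strings, so a caller cannot tell a miss from a deliberate mapping. One possible variant, sketched here under assumptions, keeps the same six mappings in a map and returns a second value that reports whether the string was known. The `ErrorCode` string type, `legacyTable`, `lookup`, and `knownStrings` are hypothetical names, and `quota_exceeded` is an invented example of an unmapped code.

```go
package main

import (
	"fmt"
	"sort"
)

// ErrorCode is a local stand-in for the generated pb.ErrorCode enum.
type ErrorCode string

const (
	Unspecified           ErrorCode = "ERROR_CODE_UNSPECIFIED"
	SafetyDenied          ErrorCode = "ERROR_CODE_SAFETY_DENIED"
	SafetyPolicyViolation ErrorCode = "ERROR_CODE_SAFETY_POLICY_VIOLATION"
	JobResourceExhausted  ErrorCode = "ERROR_CODE_JOB_RESOURCE_EXHAUSTED"
	JobTimeout            ErrorCode = "ERROR_CODE_JOB_TIMEOUT"
	JobPermissionDenied   ErrorCode = "ERROR_CODE_JOB_PERMISSION_DENIED"
)

// legacyTable mirrors the six strings handled by the switch above.
var legacyTable = map[string]ErrorCode{
	"approval_rejected":      SafetyDenied,
	"policy_denied":          SafetyDenied,
	"policy_violation":       SafetyPolicyViolation,
	"max_scheduling_retries": JobResourceExhausted,
	"timeout":                JobTimeout,
	"permission_denied":      JobPermissionDenied,
}

// lookup returns the mapped code and whether the legacy string was known.
func lookup(code string) (ErrorCode, bool) {
	if mapped, ok := legacyTable[code]; ok {
		return mapped, true
	}
	return Unspecified, false
}

// knownStrings lists the mapped legacy strings in sorted order,
// which makes the table easy to diff against production data.
func knownStrings() []string {
	keys := make([]string, 0, len(legacyTable))
	for k := range legacyTable {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	for _, s := range []string{"timeout", "quota_exceeded"} {
		code, known := lookup(s)
		fmt.Printf("%s -> %s (known=%t)\n", s, code, known)
	}
	fmt.Println(knownStrings())
}
```

A data-driven table trades the compile-time clarity of a switch for a structure that tooling can enumerate.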

Regression test coverage

core/controlplane/scheduler/engine_test.go

// core/controlplane/scheduler/engine_test.go (excerpt)
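func TestMapStringToErrorCode(t *testing.T) {
  tests := []struct {
    code string
    want pb.ErrorCode
  }{
    {"timeout", pb.ErrorCode_ERROR_CODE_JOB_TIMEOUT},
    {"max_scheduling_retries", pb.ErrorCode_ERROR_CODE_JOB_RESOURCE_EXHAUSTED},
    {"unknown_code", pb.ErrorCode_ERROR_CODE_UNSPECIFIED},
  }
  // The excerpt omits the assertion loop; a typical table-driven form is assumed here.
  for _, tt := range tests {
    if got := mapStringToErrorCode(tt.code); got != tt.want {
      t.Errorf("mapStringToErrorCode(%q) = %v, want %v", tt.code, got, tt.want)
    }
  }
}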

Validation runbook

Track taxonomy drift as an explicit reliability signal.

error-code-enum-migration-runbook.sh


# 1) Collect top legacy error_code strings in last 24h
# - group by value and count

# 2) Compute UNSPECIFIED ratio
# - count(error_code_enum == ERROR_CODE_UNSPECIFIED) / count(failed results)

# 3) Diff with mapping table
# - identify high-volume strings missing in mapStringToErrorCode

# 4) Add mapping + test in same PR
# - update switch table
# - update TestMapStringToErrorCode

# 5) Add alert threshold
# - alert if UNSPECIFIED ratio exceeds agreed baseline
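
Steps 1 to 3 and the threshold check in step 5 can be sketched as a small offline analysis. This is an illustration under stated assumptions, not Cordum tooling: `FailedResult`, `Drift`, and `analyze` are hypothetical, the input is assumed to be failed results already exported from wherever they are stored, the `0.10` baseline is a placeholder, and `quota_exceeded` and `worker_lost` are invented unmapped codes.

```go
package main

import (
	"fmt"
	"sort"
)

// FailedResult is a hypothetical flattened record of one failed job result.
type FailedResult struct {
	LegacyCode string // legacy error_code string, may be empty
	EnumName   string // name of the error_code_enum value
}

const unspecified = "ERROR_CODE_UNSPECIFIED"

// mapped holds the strings that mapStringToErrorCode already handles.
var mapped = map[string]bool{
	"approval_rejected": true, "policy_denied": true, "policy_violation": true,
	"max_scheduling_retries": true, "timeout": true, "permission_denied": true,
}

// Drift summarizes steps 1 to 3 of the runbook.
type Drift struct {
	UnspecifiedRatio float64
	Unmapped         []string       // unmapped strings, highest volume first
	Counts           map[string]int // legacy string -> occurrences
}

func analyze(results []FailedResult) Drift {
	d := Drift{Counts: map[string]int{}}
	if len(results) == 0 {
		return d // avoid dividing by zero on an empty window
	}
	unspec := 0
	for _, r := range results {
		if r.LegacyCode != "" {
			d.Counts[r.LegacyCode]++ // step 1: group by value and count
		}
		if r.EnumName == unspecified {
			unspec++
		}
	}
	// Step 2: UNSPECIFIED results over all failed results.
	d.UnspecifiedRatio = float64(unspec) / float64(len(results))
	// Step 3: strings seen in production but missing from the mapping table.
	for code := range d.Counts {
		if !mapped[code] {
			d.Unmapped = append(d.Unmapped, code)
		}
	}
	sort.Slice(d.Unmapped, func(i, j int) bool {
		a, b := d.Unmapped[i], d.Unmapped[j]
		if d.Counts[a] != d.Counts[b] {
			return d.Counts[a] > d.Counts[b]
		}
		return a < b // stable order for equal volumes
	})
	return d
}

func main() {
	sample := []FailedResult{
		{"timeout", "ERROR_CODE_JOB_TIMEOUT"},
		{"quota_exceeded", unspecified},
		{"quota_exceeded", unspecified},
		{"worker_lost", unspecified},
	}
	d := analyze(sample)
	fmt.Printf("ratio=%.2f unmapped=%v\n", d.UnspecifiedRatio, d.Unmapped)

	const baseline = 0.10 // assumed alert threshold for step 5
	if d.UnspecifiedRatio > baseline {
		fmt.Println("ALERT: UNSPECIFIED ratio above baseline")
	}
}
```

The sorted `Unmapped` list is the input to step 4: the highest-volume strings are the first candidates for a new mapping and test.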

Limitations and tradeoffs

Approach	Upside	Downside
Fallback to UNSPECIFIED (current)	Safe backward compatibility with no hard failures.	Can hide taxonomy drift and weaken root-cause analytics.
Strict reject unknown strings	Forces taxonomy discipline and immediate fixes.	Can break producers during phased migrations.
Dual-field period with miss metrics	Smooth migration and measurable drift control.	Requires ongoing governance to keep mapping table current.

- Compatibility-first fallback is practical, but it should never be telemetry-silent.
- Mapping tables are operational assets, not one-time code migrations.
- Current tests cover known values, but they do not alert operators when production introduces high-volume unmapped codes.

Next step

Implement this next:

1. Add an `error_code_mapping_miss_total` metric for unknown string codes.
2. Add a CI guard that fails when new reason codes are introduced without enum mapping tests.
3. Define an SLO for `ERROR_CODE_UNSPECIFIED` ratio by topic.
4. Publish a versioned error taxonomy changelog for SDK and operator teams.
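
As a starting point for item 1, the fallback path can record each miss before returning `ERROR_CODE_UNSPECIFIED`. The sketch below uses the standard library `expvar` package purely as a stand-in for whatever metrics library the deployment uses. `mapWithMetric`, `missCount`, and the trimmed `known` table are hypothetical, and `quota_exceeded` is an invented unmapped code.

```go
package main

import (
	"expvar"
	"fmt"
)

const unspecified = "ERROR_CODE_UNSPECIFIED"

// mappingMisses is published under the metric name proposed above.
// expvar.Map is safe for concurrent use.
var mappingMisses = expvar.NewMap("error_code_mapping_miss_total")

// known is a trimmed stand-in for the real mapping table.
var known = map[string]string{
	"timeout": "ERROR_CODE_JOB_TIMEOUT",
}

// mapWithMetric keeps the UNSPECIFIED fallback but counts each miss,
// keyed by the unmapped legacy string.
func mapWithMetric(code string) string {
	if enum, ok := known[code]; ok {
		return enum
	}
	mappingMisses.Add(code, 1)
	return unspecified
}

// missCount reads the counter for one legacy string; 0 if never seen.
func missCount(code string) int64 {
	if v, ok := mappingMisses.Get(code).(*expvar.Int); ok {
		return v.Value()
	}
	return 0
}

func main() {
	for _, c := range []string{"timeout", "quota_exceeded", "quota_exceeded"} {
		mapWithMetric(c)
	}
	fmt.Println("quota_exceeded misses:", missCount("quota_exceeded"))
	fmt.Println(mappingMisses.String()) // JSON view of all miss counters
}
```

Keying the counter by the raw string makes the top offenders obvious, but producers control that value, so a real metric would likely need a cap or an allowlist to bound label cardinality.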

Continue with AI Agent DLQ Emission Reliability and AI Agent Workflow Admission 429 vs 503.

AI Agent Error Code Enum Migration