The production problem
Error code migrations rarely fail loudly.
They fail by collapsing distinct failure classes into one fallback bucket.
In Cordum terms, that bucket is `ERROR_CODE_UNSPECIFIED`. The system still runs. Your charts still move. Your postmortems get worse.
If your retry and alert policy depends on error class, this is not cosmetic drift. It is control-plane signal loss.
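To make the signal loss concrete, here is a minimal sketch of a retry policy keyed on error class. The `ErrorCode` type and its constants are local stand-ins that mirror enum names quoted in this post, and `retryAction` with its action strings is a hypothetical helper, not Cordum's scheduler code.

```go
package main

import "fmt"

// ErrorCode is a local stand-in for the protobuf enum; the values are illustrative.
type ErrorCode int

const (
	ErrorCodeUnspecified ErrorCode = iota
	ErrorCodeSafetyDenied
	ErrorCodeJobTimeout
	ErrorCodeJobResourceExhausted
)

// retryAction is a hypothetical policy whose decision depends on the error class.
func retryAction(code ErrorCode) string {
	switch code {
	case ErrorCodeJobTimeout:
		return "retry_with_backoff"
	case ErrorCodeSafetyDenied:
		return "fail_fast"
	case ErrorCodeJobResourceExhausted:
		return "page_oncall"
	default:
		// Every failure class that collapsed into UNSPECIFIED lands here
		// and receives the same treatment, whatever its real cause was.
		return "generic_retry"
	}
}

func main() {
	codes := []ErrorCode{ErrorCodeJobTimeout, ErrorCodeSafetyDenied, ErrorCodeUnspecified}
	for _, c := range codes {
		fmt.Printf("%d -> %s\n", c, retryAction(c))
	}
}
```

The more traffic that reaches the `default` branch, the less the policy can distinguish a safety denial from a transient timeout.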
What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| gRPC Status Codes | A stable status taxonomy with clear client and server semantics. | No migration pattern for mixed payloads that carry both legacy strings and new enums. |
| Google AIP-193 (Errors) | Canonical API error model and machine-readable error detail guidance. | No concrete runtime strategy for string-to-enum backfill in an existing event bus. |
| RFC 9457 Problem Details | Interoperable HTTP problem payload structure and extensibility. | Does not address protobuf enum migration and DLQ telemetry impacts inside a distributed control plane. |
Cordum runtime mechanics
| Boundary | Current behavior | Why it matters |
|---|---|---|
| Result normalization | `handleJobResult` maps legacy `ErrorCode` to `ErrorCodeEnum` when enum is unset. | Old producers remain compatible while newer consumers can read enum values. |
| Mapping scope | Known strings include `approval_rejected`, `policy_denied`, `policy_violation`, `max_scheduling_retries`, `timeout`, and `permission_denied`. | Coverage is explicit and testable, but any string outside this list degrades to `ERROR_CODE_UNSPECIFIED`. |
| Fallback behavior | Unknown strings default to `ERROR_CODE_UNSPECIFIED`. | No runtime breakage, but classification precision drops. |
| DLQ path | DLQ emissions also populate `ErrorCodeEnum` from reason code mapping. | Terminal failures can keep structured classification even in dead-letter workflows. |
| Test enforcement | `TestMapStringToErrorCode` verifies expected mappings plus unknown fallback behavior. | Mapping regressions are catchable before release. |
Mapping code paths
Normalization in result handler

    // core/controlplane/scheduler/engine.go (excerpt)
    // Backfill the enum from the legacy string only when the producer left the enum unset.
    if res.ErrorCodeEnum == pb.ErrorCode_ERROR_CODE_UNSPECIFIED && res.ErrorCode != "" {
        res.ErrorCodeEnum = mapStringToErrorCode(res.ErrorCode)
    }

Legacy string mapping table

    // core/controlplane/scheduler/engine.go (excerpt)
    // mapStringToErrorCode translates a legacy error string into its enum value.
    // Strings without an explicit case fall back to ERROR_CODE_UNSPECIFIED.
    func mapStringToErrorCode(code string) pb.ErrorCode {
        switch code {
        case "approval_rejected", "policy_denied":
            return pb.ErrorCode_ERROR_CODE_SAFETY_DENIED
        case "policy_violation":
            return pb.ErrorCode_ERROR_CODE_SAFETY_POLICY_VIOLATION
        case "max_scheduling_retries":
            return pb.ErrorCode_ERROR_CODE_JOB_RESOURCE_EXHAUSTED
        case "timeout":
            return pb.ErrorCode_ERROR_CODE_JOB_TIMEOUT
        case "permission_denied":
            return pb.ErrorCode_ERROR_CODE_JOB_PERMISSION_DENIED
        default:
            return pb.ErrorCode_ERROR_CODE_UNSPECIFIED
        }
    }

Regression test coverage
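
    // core/controlplane/scheduler/engine_test.go (excerpt)
    func TestMapStringToErrorCode(t *testing.T) {
        tests := []struct {
            code string
            want pb.ErrorCode
        }{
            {"timeout", pb.ErrorCode_ERROR_CODE_JOB_TIMEOUT},
            {"max_scheduling_retries", pb.ErrorCode_ERROR_CODE_JOB_RESOURCE_EXHAUSTED},
            {"unknown_code", pb.ErrorCode_ERROR_CODE_UNSPECIFIED},
        }
        // The excerpt stops at the table; a conventional table-driven loop
        // is assumed here so that the snippet compiles.
        for _, tt := range tests {
            if got := mapStringToErrorCode(tt.code); got != tt.want {
                t.Errorf("mapStringToErrorCode(%q) = %v, want %v", tt.code, got, tt.want)
            }
        }
    }

Validation runbook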
Track taxonomy drift as an explicit reliability signal.

    # 1) Collect top legacy error_code strings in last 24h
    #    - group by value and count
    # 2) Compute UNSPECIFIED ratio
    #    - error_code_enum == ERROR_CODE_UNSPECIFIED / total failed results
    # 3) Diff with mapping table
    #    - identify high-volume strings missing in mapStringToErrorCode
    # 4) Add mapping + test in same PR
    #    - update switch table
    #    - update TestMapStringToErrorCode
    # 5) Add alert threshold
    #    - alert if UNSPECIFIED ratio exceeds agreed baseline

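Steps 1 through 3 of the runbook can be sketched offline against an export of failed results. This is a minimal illustration, assuming a hypothetical `FailedResult` record with the legacy string and the resolved enum name; the sample strings `quota_exceeded` and `worker_lost` are made up for the example and do not come from Cordum.

```go
package main

import (
	"fmt"
	"sort"
)

// FailedResult is a hypothetical record of one failed job result.
type FailedResult struct {
	LegacyCode string // legacy error_code string
	EnumName   string // name of the resolved error_code_enum
}

const unspecified = "ERROR_CODE_UNSPECIFIED"

// unspecifiedRatio returns UNSPECIFIED results divided by all failed results (step 2).
func unspecifiedRatio(results []FailedResult) float64 {
	if len(results) == 0 {
		return 0
	}
	n := 0
	for _, r := range results {
		if r.EnumName == unspecified {
			n++
		}
	}
	return float64(n) / float64(len(results))
}

// codeCount pairs a legacy string with how often it appeared.
type codeCount struct {
	Code  string
	Count int
}

// topUnmapped counts legacy strings absent from the mapped set and sorts them
// by volume, highest first, with ties broken alphabetically (steps 1 and 3).
func topUnmapped(results []FailedResult, mapped map[string]bool) []codeCount {
	counts := map[string]int{}
	for _, r := range results {
		if r.LegacyCode != "" && !mapped[r.LegacyCode] {
			counts[r.LegacyCode]++
		}
	}
	out := make([]codeCount, 0, len(counts))
	for code, n := range counts {
		out = append(out, codeCount{code, n})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Code < out[j].Code
	})
	return out
}

func main() {
	// The mapped set mirrors the strings handled by mapStringToErrorCode.
	mapped := map[string]bool{
		"approval_rejected": true, "policy_denied": true, "policy_violation": true,
		"max_scheduling_retries": true, "timeout": true, "permission_denied": true,
	}
	results := []FailedResult{
		{"timeout", "ERROR_CODE_JOB_TIMEOUT"},
		{"quota_exceeded", unspecified},
		{"quota_exceeded", unspecified},
		{"worker_lost", unspecified},
	}
	fmt.Printf("UNSPECIFIED ratio: %.2f\n", unspecifiedRatio(results))
	for _, c := range topUnmapped(results, mapped) {
		fmt.Printf("unmapped %s: %d\n", c.Code, c.Count)
	}
}
```

The sorted list is the input to step 4: the highest-volume unmapped strings are the first candidates for a new case in the switch and a new row in the test table.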
Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Fallback to UNSPECIFIED (current) | Safe backward compatibility with no hard failures. | Can hide taxonomy drift and weaken root-cause analytics. |
| Strict reject unknown strings | Forces taxonomy discipline and immediate fixes. | Can break producers during phased migrations. |
| Dual-field period with miss metrics | Smooth migration and measurable drift control. | Requires ongoing governance to keep mapping table current. |
- Compatibility-first fallback is practical, but it should never be telemetry-silent.
- Mapping tables are operational assets, not one-time code migrations.
- Current tests cover known values, but they do not alert operators when production introduces high-volume unmapped codes.
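One way to keep the fallback from being telemetry-silent is to count every miss at the point where the mapping is applied. The sketch below uses only the standard library; `missCounter` is a hypothetical stand-in for whatever metrics client is in use, `mapWithMissMetric` is an assumed wrapper rather than existing Cordum code, and the mapping function is abbreviated to two cases.

```go
package main

import (
	"fmt"
	"sync"
)

// ErrorCode is a local stand-in for the protobuf enum.
type ErrorCode int

const (
	Unspecified ErrorCode = iota
	JobTimeout
	JobPermissionDenied
)

// mapStringToErrorCode is an abbreviated copy of the mapping shown in this post.
func mapStringToErrorCode(code string) ErrorCode {
	switch code {
	case "timeout":
		return JobTimeout
	case "permission_denied":
		return JobPermissionDenied
	default:
		return Unspecified
	}
}

// missCounter is a concurrency-safe counter keyed by legacy string. It plays
// the role of a labeled counter such as error_code_mapping_miss_total.
type missCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

// inc records one miss for the given legacy string.
func (m *missCounter) inc(code string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.counts == nil {
		m.counts = map[string]int{}
	}
	m.counts[code]++
}

// get returns the number of misses recorded for the given legacy string.
func (m *missCounter) get(code string) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.counts[code]
}

// mapWithMissMetric maps the string and records a miss when a non-empty
// string falls through to Unspecified.
func mapWithMissMetric(code string, misses *missCounter) ErrorCode {
	ec := mapStringToErrorCode(code)
	if ec == Unspecified && code != "" {
		misses.inc(code)
	}
	return ec
}

func main() {
	misses := &missCounter{}
	for _, c := range []string{"timeout", "quota_exceeded", "quota_exceeded", ""} {
		mapWithMissMetric(c, misses)
	}
	fmt.Println("misses for quota_exceeded:", misses.get("quota_exceeded"))
}
```

Labeling a real metric with the raw legacy string can inflate label cardinality if producers emit free-form values, so a production version may need an allowlist or a capped "other" bucket.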
Next step
Implement this next:
1. Add `error_code_mapping_miss_total` metric for unknown string codes.
2. Add CI guard that fails when new reason codes are introduced without enum mapping tests.
3. Define an SLO for `ERROR_CODE_UNSPECIFIED` ratio by topic.
4. Publish a versioned error taxonomy changelog for SDK and operator teams.
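The CI guard in item 2 could take the form of a check that walks a registry of reason codes and reports any that still resolve to the fallback. This is a sketch under assumptions: `knownReasonCodes` is a hypothetical registry that producers would have to maintain, `unmappedReasonCodes` is an illustrative helper, and the enum is a local stand-in for the protobuf type.

```go
package main

import (
	"fmt"
	"sort"
)

// ErrorCode is a local stand-in for the protobuf enum.
type ErrorCode int

const (
	Unspecified ErrorCode = iota
	SafetyDenied
	SafetyPolicyViolation
	JobResourceExhausted
	JobTimeout
	JobPermissionDenied
)

// mapStringToErrorCode mirrors the mapping table shown in this post.
func mapStringToErrorCode(code string) ErrorCode {
	switch code {
	case "approval_rejected", "policy_denied":
		return SafetyDenied
	case "policy_violation":
		return SafetyPolicyViolation
	case "max_scheduling_retries":
		return JobResourceExhausted
	case "timeout":
		return JobTimeout
	case "permission_denied":
		return JobPermissionDenied
	default:
		return Unspecified
	}
}

// knownReasonCodes is a hypothetical registry of every reason code producers may emit.
var knownReasonCodes = []string{
	"approval_rejected", "policy_denied", "policy_violation",
	"max_scheduling_retries", "timeout", "permission_denied",
}

// unmappedReasonCodes returns, in sorted order, the codes that still fall
// back to Unspecified.
func unmappedReasonCodes(codes []string) []string {
	var missing []string
	for _, c := range codes {
		if mapStringToErrorCode(c) == Unspecified {
			missing = append(missing, c)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Simulate a PR that registers a new reason code without adding a mapping.
	codes := append(append([]string{}, knownReasonCodes...), "quota_exceeded")
	if missing := unmappedReasonCodes(codes); len(missing) > 0 {
		fmt.Println("unmapped reason codes:", missing)
	}
}
```

Inside a `go test` suite, the same helper would fail the build whenever the returned slice is non-empty, which ties the registry and the switch table together in one PR.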
Continue with AI Agent DLQ Emission Reliability and AI Agent Workflow Admission 429 vs 503.