Deep Dive

AI Agent Error Code Enum Migration

Error taxonomy migrations fail quietly. They do not crash systems. They corrupt observability first.

Deep Dive · 10 min read · Apr 2026
TL;DR
  • Cordum auto-populates `error_code_enum` when legacy `error_code` exists and enum is still `UNSPECIFIED`.
  • Unknown legacy codes map to `ERROR_CODE_UNSPECIFIED`, which keeps compatibility but weakens analytics and routing precision.
  • Scheduler tests lock current mappings for known codes like `timeout`, `policy_denied`, and `max_scheduling_retries`.
  • You need explicit telemetry for mapping misses or `UNSPECIFIED` will hide taxonomy drift for months.
Failure mode

A new string error ships in one component, but enum mapping is not updated. Dashboards collapse into UNSPECIFIED.

Current behavior

Scheduler maps selected strings to enum values and defaults unknown values to `ERROR_CODE_UNSPECIFIED`.

Operational payoff

Backward compatibility is preserved while clients move to enum-first error handling.

Scope

This guide covers scheduler-level error code normalization and downstream observability effects during taxonomy migration.

The production problem

Error code migrations rarely fail loudly.

They fail by collapsing distinct failure classes into one fallback bucket.

In Cordum terms, that bucket is `ERROR_CODE_UNSPECIFIED`. The system still runs. Your charts still move. Your postmortems get worse.

If your retry and alert policy depends on error class, this is not cosmetic drift. It is control-plane signal loss.
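
To make the signal loss concrete, here is a minimal sketch of a class-based retry policy. The `Action` values, the `decide` function, and the reduced `ErrorCode` type are hypothetical stand-ins for illustration, not Cordum's actual policy code.

```go
package main

import "fmt"

// ErrorCode is a local stand-in for the generated pb.ErrorCode enum.
type ErrorCode int

const (
	Unspecified ErrorCode = iota
	JobTimeout
	SafetyDenied
)

// Action is a hypothetical outcome of a retry policy.
type Action string

const (
	Retry  Action = "retry"
	Fail   Action = "fail"
	Review Action = "manual_review"
)

// decide is a hypothetical policy keyed on error class.
// Every failure that was never mapped lands in the default branch.
func decide(code ErrorCode) Action {
	switch code {
	case JobTimeout:
		return Retry
	case SafetyDenied:
		return Fail
	default:
		return Review
	}
}

func main() {
	for _, code := range []ErrorCode{JobTimeout, SafetyDenied, Unspecified} {
		fmt.Printf("code=%d action=%s\n", code, decide(code))
	}
}
```

In a policy shaped like this, a transient failure that arrives as an unmapped string is handled the same way as a permanent one, because both reach the policy as the fallback value.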

What top results cover and miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| gRPC Status Codes | A stable status taxonomy with clear client and server semantics. | No migration pattern for mixed payloads that carry both legacy strings and new enums. |
| Google AIP-193 (Errors) | Canonical API error model and machine-readable error detail guidance. | No concrete runtime strategy for string-to-enum backfill in an existing event bus. |
| RFC 9457 Problem Details | Interoperable HTTP problem payload structure and extensibility. | Does not address protobuf enum migration and DLQ telemetry impacts inside a distributed control plane. |

Cordum runtime mechanics

| Boundary | Current behavior | Why it matters |
| --- | --- | --- |
| Result normalization | `handleJobResult` maps legacy `ErrorCode` to `ErrorCodeEnum` when enum is unset. | Old producers remain compatible while newer consumers can read enum values. |
| Mapping scope | Known strings include `approval_rejected`, `policy_denied`, `policy_violation`, `max_scheduling_retries`, `timeout`, and `permission_denied`. | Coverage is explicit and testable, but incomplete mappings degrade to UNSPECIFIED. |
| Fallback behavior | Unknown strings default to `ERROR_CODE_UNSPECIFIED`. | No runtime breakage, but classification precision drops. |
| DLQ path | DLQ emissions also populate `ErrorCodeEnum` from reason code mapping. | Terminal failures can keep structured classification even in dead-letter workflows. |
| Test enforcement | `TestMapStringToErrorCode` verifies expected mappings plus unknown fallback behavior. | Mapping regressions are catchable before release. |

Mapping code paths

Normalization in result handler

core/controlplane/scheduler/engine.go
go
// core/controlplane/scheduler/engine.go (excerpt)
if res.ErrorCodeEnum == pb.ErrorCode_ERROR_CODE_UNSPECIFIED && res.ErrorCode != "" {
  res.ErrorCodeEnum = mapStringToErrorCode(res.ErrorCode)
}

Legacy string mapping table

core/controlplane/scheduler/engine.go
go
// core/controlplane/scheduler/engine.go (excerpt)
func mapStringToErrorCode(code string) pb.ErrorCode {
  switch code {
  case "approval_rejected", "policy_denied":
    return pb.ErrorCode_ERROR_CODE_SAFETY_DENIED
  case "policy_violation":
    return pb.ErrorCode_ERROR_CODE_SAFETY_POLICY_VIOLATION
  case "max_scheduling_retries":
    return pb.ErrorCode_ERROR_CODE_JOB_RESOURCE_EXHAUSTED
  case "timeout":
    return pb.ErrorCode_ERROR_CODE_JOB_TIMEOUT
  case "permission_denied":
    return pb.ErrorCode_ERROR_CODE_JOB_PERMISSION_DENIED
  default:
    return pb.ErrorCode_ERROR_CODE_UNSPECIFIED
  }
}
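
The guard and the mapping table combine into a simple precedence rule, which the following self-contained sketch walks through. `JobResult`, `normalize`, `mapString`, and the shortened enum names are stand-ins invented for this example so it runs without the generated `pb` package; only the guard condition mirrors the excerpt above.

```go
package main

import "fmt"

// ErrorCode is a local stand-in for the generated pb.ErrorCode enum.
type ErrorCode int

const (
	Unspecified ErrorCode = iota
	JobTimeout
	SafetyDenied
)

// JobResult is a hypothetical, trimmed-down result payload.
type JobResult struct {
	ErrorCode     string    // legacy string field
	ErrorCodeEnum ErrorCode // newer enum field
}

// mapString is a reduced stand-in for mapStringToErrorCode.
func mapString(code string) ErrorCode {
	switch code {
	case "timeout":
		return JobTimeout
	case "policy_denied":
		return SafetyDenied
	default:
		return Unspecified
	}
}

// normalize backfills the enum only when it is unset and a legacy
// string is present, matching the guard in the result handler excerpt.
func normalize(res *JobResult) {
	if res.ErrorCodeEnum == Unspecified && res.ErrorCode != "" {
		res.ErrorCodeEnum = mapString(res.ErrorCode)
	}
}

func main() {
	cases := []JobResult{
		// Legacy producer: string only, enum gets backfilled.
		{ErrorCode: "timeout"},
		// Enum-first producer: the enum it sent is left untouched.
		{ErrorCode: "timeout", ErrorCodeEnum: SafetyDenied},
		// Unmapped string: enum stays at the fallback value.
		{ErrorCode: "brand_new_code"},
	}
	for i := range cases {
		normalize(&cases[i])
		fmt.Printf("%q -> %d\n", cases[i].ErrorCode, cases[i].ErrorCodeEnum)
	}
}
```

The third case is the one to watch: after normalization, an unmapped string is indistinguishable from a result that never carried an error code, unless the legacy string is inspected as well.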

Regression test coverage

core/controlplane/scheduler/engine_test.go
go
// core/controlplane/scheduler/engine_test.go (excerpt)
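func TestMapStringToErrorCode(t *testing.T) {
  tests := []struct {
    code string
    want pb.ErrorCode
  }{
    {"timeout", pb.ErrorCode_ERROR_CODE_JOB_TIMEOUT},
    {"max_scheduling_retries", pb.ErrorCode_ERROR_CODE_JOB_RESOURCE_EXHAUSTED},
    {"unknown_code", pb.ErrorCode_ERROR_CODE_UNSPECIFIED},
  }
  // Loop reconstructed for illustration; the excerpt shows only the table.
  for _, tt := range tests {
    if got := mapStringToErrorCode(tt.code); got != tt.want {
      t.Errorf("mapStringToErrorCode(%q) = %v, want %v", tt.code, got, tt.want)
    }
  }
}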

Validation runbook

Track taxonomy drift as an explicit reliability signal.

error-code-enum-migration-runbook.sh
bash
# 1) Collect top legacy error_code strings in last 24h
# - group by value and count

# 2) Compute UNSPECIFIED ratio
# - error_code_enum == ERROR_CODE_UNSPECIFIED / total failed results

# 3) Diff with mapping table
# - identify high-volume strings missing in mapStringToErrorCode

# 4) Add mapping + test in same PR
# - update switch table
# - update TestMapStringToErrorCode

# 5) Add alert threshold
# - alert if UNSPECIFIED ratio exceeds agreed baseline
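
Steps 1 through 3 of the runbook can be sketched in Go over an in-memory sample. This assumes failed results have already been exported into a flat record; the `FailedResult` type, the helper names, and the `quota_exceeded` string are hypothetical. Only the six known strings come from the mapping table.

```go
package main

import (
	"fmt"
	"sort"
)

// FailedResult is a hypothetical flattened record of one failed job result.
type FailedResult struct {
	ErrorCode   string // legacy string value
	Unspecified bool   // true when error_code_enum is ERROR_CODE_UNSPECIFIED
}

// knownCodes mirrors the strings handled by mapStringToErrorCode.
var knownCodes = map[string]bool{
	"approval_rejected":      true,
	"policy_denied":          true,
	"policy_violation":       true,
	"max_scheduling_retries": true,
	"timeout":                true,
	"permission_denied":      true,
}

// CodeCount pairs a legacy string with how often it was seen.
type CodeCount struct {
	Code  string
	Count int
}

// unspecifiedRatio implements step 2: UNSPECIFIED results over all failed results.
func unspecifiedRatio(results []FailedResult) float64 {
	if len(results) == 0 {
		return 0
	}
	n := 0
	for _, r := range results {
		if r.Unspecified {
			n++
		}
	}
	return float64(n) / float64(len(results))
}

// unmappedByVolume implements steps 1 and 3: count legacy strings that are
// missing from the mapping table, highest volume first.
func unmappedByVolume(results []FailedResult) []CodeCount {
	counts := map[string]int{}
	for _, r := range results {
		if r.ErrorCode != "" && !knownCodes[r.ErrorCode] {
			counts[r.ErrorCode]++
		}
	}
	out := make([]CodeCount, 0, len(counts))
	for code, n := range counts {
		out = append(out, CodeCount{Code: code, Count: n})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Code < out[j].Code
	})
	return out
}

func main() {
	sample := []FailedResult{
		{ErrorCode: "timeout"},
		{ErrorCode: "quota_exceeded", Unspecified: true},
		{ErrorCode: "quota_exceeded", Unspecified: true},
		{ErrorCode: "timeout"},
	}
	fmt.Printf("UNSPECIFIED ratio: %.2f\n", unspecifiedRatio(sample))
	for _, c := range unmappedByVolume(sample) {
		fmt.Printf("unmapped: %s (%d)\n", c.Code, c.Count)
	}
}
```

The sorted output of `unmappedByVolume` is the candidate list for step 4, and the ratio is the value step 5 would compare against the agreed baseline.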

Limitations and tradeoffs

| Approach | Upside | Downside |
| --- | --- | --- |
| Fallback to UNSPECIFIED (current) | Safe backward compatibility with no hard failures. | Can hide taxonomy drift and weaken root-cause analytics. |
| Strict reject unknown strings | Forces taxonomy discipline and immediate fixes. | Can break producers during phased migrations. |
| Dual-field period with miss metrics | Smooth migration and measurable drift control. | Requires ongoing governance to keep mapping table current. |

  • Compatibility-first fallback is practical, but it should never be telemetry-silent.
  • Mapping tables are operational assets, not one-time code migrations.
  • Current tests cover known values, but they do not alert operators when production introduces high-volume unmapped codes.
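
One way to keep the fallback from being telemetry-silent is to count misses at the point of mapping. The sketch below is an assumption about how that could look: `MissCounter` is a hypothetical in-process stand-in for a labeled counter, `mapWithTelemetry` is not an existing Cordum function, and the table is reduced to one entry.

```go
package main

import (
	"fmt"
	"sync"
)

// ErrorCode is a local stand-in for the generated pb.ErrorCode enum.
type ErrorCode int

const (
	Unspecified ErrorCode = iota
	JobTimeout
)

// legacyToEnum is a reduced stand-in for the scheduler's mapping table.
var legacyToEnum = map[string]ErrorCode{
	"timeout": JobTimeout,
}

// MissCounter is a hypothetical stand-in for a labeled metrics counter.
type MissCounter struct {
	mu     sync.Mutex
	byCode map[string]uint64
}

// NewMissCounter returns an empty, ready-to-use counter.
func NewMissCounter() *MissCounter {
	return &MissCounter{byCode: make(map[string]uint64)}
}

// Inc records one mapping miss for the given legacy string.
func (m *MissCounter) Inc(code string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.byCode[code]++
}

// Get returns how many misses were recorded for the given legacy string.
func (m *MissCounter) Get(code string) uint64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.byCode[code]
}

// mapWithTelemetry keeps the compatibility fallback but records every
// non-empty string that has no mapping.
func mapWithTelemetry(code string, misses *MissCounter) ErrorCode {
	if enum, ok := legacyToEnum[code]; ok {
		return enum
	}
	if code != "" {
		misses.Inc(code)
	}
	return Unspecified
}

func main() {
	misses := NewMissCounter()
	for _, code := range []string{"timeout", "quota_exceeded", "quota_exceeded"} {
		mapWithTelemetry(code, misses)
	}
	fmt.Println("misses for quota_exceeded:", misses.Get("quota_exceeded"))
}
```

Keying the counter by the raw string makes the miss actionable, but in a real metrics backend unbounded label values can be costly, so a cap or an allowlist on label values is worth considering.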

Next step

Implement this next:

  1. Add `error_code_mapping_miss_total` metric for unknown string codes.
  2. Add CI guard that fails when new reason codes are introduced without enum mapping tests.
  3. Define an SLO for `ERROR_CODE_UNSPECIFIED` ratio by topic.
  4. Publish a versioned error taxonomy changelog for SDK and operator teams.
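
The CI guard could be as small as a set difference between declared reason codes and mapped strings. This sketch assumes a central registry of reason codes exists or can be generated; `declaredReasonCodes`, `missingMappings`, and the strings `quota_exceeded` and `agent_crashed` are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// ErrorCode is a local stand-in for the generated pb.ErrorCode enum.
type ErrorCode int

const (
	Unspecified ErrorCode = iota
	JobTimeout
	SafetyDenied
)

// declaredReasonCodes is a hypothetical registry of every legacy string
// that producers are allowed to emit.
var declaredReasonCodes = []string{"timeout", "quota_exceeded", "policy_denied", "agent_crashed"}

// legacyToEnum is a reduced stand-in for the scheduler's mapping table.
var legacyToEnum = map[string]ErrorCode{
	"timeout":       JobTimeout,
	"policy_denied": SafetyDenied,
}

// missingMappings returns declared codes that have no entry in the
// mapping table, sorted for stable CI output.
func missingMappings(declared []string, mapped map[string]ErrorCode) []string {
	var missing []string
	for _, code := range declared {
		if _, ok := mapped[code]; !ok {
			missing = append(missing, code)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// A CI test would fail the build when this list is non-empty.
	if missing := missingMappings(declaredReasonCodes, legacyToEnum); len(missing) > 0 {
		fmt.Println("reason codes without enum mapping:", missing)
	}
}
```

A guard like this only catches codes that are declared somewhere the check can see, so it complements the miss metric rather than replacing it.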

Continue with AI Agent DLQ Emission Reliability and AI Agent Workflow Admission 429 vs 503.

Taxonomy drift is an incident precursor

If `ERROR_CODE_UNSPECIFIED` keeps rising, your control plane is telling you the contract changed before the docs did.