The production gap
Local MCP demos optimize for speed: one machine, one user, minimal controls, fast feedback. Production has different constraints: multi-tenant access, regulated actions, incident response, and audit obligations.
The protocol can be correct while the deployment is unsafe. The missing pieces are usually identity scope, action policy, and operational gates.
Security reality
A successful handshake tells you the server is reachable. It does not tell you the requested action should execute.
What top sources cover vs miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| MCP Security Best Practices (official) | Excellent attack-level detail (confused deputy, token passthrough, SSRF, session hijacking) with normative requirements. | No operational launch matrix for readiness gates, reviewer load, and production rollback criteria. |
| Understanding Authorization in MCP (official) | Clear OAuth-centered authorization flow, metadata discovery sequence, and token verification patterns. | Limited guidance on policy-driven action classes (read vs write vs high-risk) and approval queue operations. |
| Apollo MCP Dev-to-Prod Workflows | Practical environment progression and deployment workflow from local testing to monitored production. | No unified governance and output-safety model for cross-server action control and incident response. |
12 practices that matter
| Area | Practice | Why it matters | Validation check |
|---|---|---|---|
| Identity | OAuth 2.x auth for remote MCP servers | Static keys become long-lived breach tokens | 100% remote servers reject unauthenticated requests |
| Identity | Scope tokens by server and action class | Compromised read tokens should not mutate systems | Token scopes map to tool classes |
| Network | SSRF guardrails on metadata and redirects | OAuth discovery can hit internal targets if unchecked | Private IP ranges blocked in production |
| Policy | Pre-execution policy decision per tool call | Protocol validity is not risk validity | Every call logs a decision: ALLOW, DENY, REQUIRE_APPROVAL, or CONSTRAINTS |
| Policy | Approval gate for write and high-impact actions | Human checkpoint for irreversible side effects | 0 unapproved high-risk writes |
| Policy | Policy snapshot + request hash binding | Auditors need proof of what was approved | Approval payload stores policy hash |
| Output | Output safety before model ingestion | Tool output can leak secrets or unsafe text | REDACT or QUARANTINE paths active |
| Output | Sensitive field classifier for common PII/secrets | Generic regex alone misses context | Weekly sampled review for false negatives |
| Observability | Tool-call metrics by server and action | Need trend visibility for abuse or drift | Dashboards include p95 latency and denial rates |
| Observability | Approval queue SLO monitoring | Governance bottlenecks silently kill operations | Queue p50 and timeout rate alerts configured |
| Operations | Environment-separated server catalogs | Cross-environment bleed widens the blast radius | Prod agents cannot invoke dev tools |
| Operations | Incident runbook and revocation drill | Token theft response must be deterministic | Quarterly revoke-and-restore exercise passed |
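The SSRF guardrail row can be made concrete with a small check on OAuth discovery and redirect targets. This is a minimal sketch, not a complete defense: the function names `is_blocked_address` and `is_safe_discovery_url` are hypothetical, and it assumes the caller has already resolved the hostname and passes in the resulting addresses.

```python
import ipaddress
from urllib.parse import urlsplit


def is_blocked_address(addr: str) -> bool:
    """Return True for any address that is not publicly routable."""
    ip = ipaddress.ip_address(addr)
    # Unwrap IPv4-mapped IPv6 (e.g. ::ffff:10.0.0.1) so it cannot bypass the check.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    # Covers private, loopback, link-local, and shared ranges, plus multicast.
    return (not ip.is_global) or ip.is_multicast


def is_safe_discovery_url(url: str, resolved_addrs: list[str]) -> bool:
    """Check a metadata or redirect URL before fetching it (hypothetical helper).

    Assumption: resolved_addrs holds the addresses the hostname resolved to.
    """
    parts = urlsplit(url)
    if parts.scheme != "https" or not parts.hostname:
        return False
    if not resolved_addrs:
        return False
    # Reject the URL if any resolved address is internal.
    return not any(is_blocked_address(a) for a in resolved_addrs)
```

In practice the same check would need to run again on every redirect hop, and the connection should use the address that was validated, so that a second DNS lookup cannot return a different target.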
Policy and auth implementation
Keep auth and policy independent. Auth verifies identity and token validity. Policy decides whether that identity may execute a given action in the current context.
```yaml
version: v1
rules:
  - id: allow-readonly-registered-servers
    match:
      labels:
        mcp.server: ["github", "jira", "snowflake", "slack"]
        mcp.action: read
    decision: ALLOW
  - id: approval-required-write-actions
    match:
      labels:
        mcp.action: write
    decision: REQUIRE_APPROVAL
    constraints:
      max_runtime_sec: 60
      max_retries: 1
  - id: deny-unregistered-server
    match:
      labels:
        mcp.server: "*"
      mcp:
        deny_servers_not_in: ["github", "jira", "snowflake", "slack"]
    decision: DENY
```

```python
# Example token verification guard (pseudo-code; helper names are placeholders)
if request.transport == "remote_http":
    bearer = extract_bearer_token(request)
    assert bearer is not None
    token = introspect_or_verify_jwt(bearer)
    assert token.active
    assert token.audience == "mcp-server"
    # OAuth scopes are a space-delimited list, so check each granted scope
    assert any(s in allowed_scopes_for_tool(tool_id) for s in token.scope.split())
else:
    # local stdio mode can use env-based credentials for dev only
    assert environment == "development"
```

If you need deeper attack taxonomy coverage, pair this with MCP Security Risks.
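To illustrate keeping the two concerns separate, here is a minimal Python sketch with one function for auth and one for policy. It is an illustration under assumptions, not the policy engine's actual evaluator: the source does not state how rules are ordered, so this sketch assumes the unregistered-server DENY is checked first and that anything unmatched is denied. The names `authenticate`, `decide`, `Decision`, and the `default-deny` rule id are hypothetical, as is the `server:action` scope format in the usage note.

```python
from dataclasses import dataclass, field

# Server list taken from the policy example above.
REGISTERED_SERVERS = frozenset({"github", "jira", "snowflake", "slack"})


@dataclass
class Decision:
    verdict: str                      # ALLOW, DENY, or REQUIRE_APPROVAL
    rule_id: str
    constraints: dict = field(default_factory=dict)


def authenticate(claims: dict, required_scope: str) -> bool:
    """Auth only: is the token active, meant for this server, and scoped for the tool?

    Assumption: claims is an already-verified token payload with a string "aud".
    """
    scopes = claims.get("scope", "").split()  # OAuth scopes are space-delimited
    return (
        claims.get("active") is True
        and claims.get("aud") == "mcp-server"
        and required_scope in scopes
    )


def decide(server: str, action: str) -> Decision:
    """Policy only: what may happen to this action, regardless of who asked."""
    # Assumed precedence: unregistered servers are denied before anything else.
    if server not in REGISTERED_SERVERS:
        return Decision("DENY", "deny-unregistered-server")
    if action == "read":
        return Decision("ALLOW", "allow-readonly-registered-servers")
    if action == "write":
        return Decision(
            "REQUIRE_APPROVAL",
            "approval-required-write-actions",
            {"max_runtime_sec": 60, "max_retries": 1},
        )
    # Assumed default: unknown action classes are denied.
    return Decision("DENY", "default-deny")
```

A caller would run both checks and proceed only when `authenticate(claims, "github:read")` is true and `decide("github", "read").verdict` is `ALLOW`. A valid token alone never authorizes a write.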
Operational go/no-go gates
| Gate | Target | Block condition | Owner |
|---|---|---|---|
| Auth failure rate | < 0.5% | > 2% over 15m | Platform Security |
| Unapproved high-risk writes | 0 | > 0 (immediate stop) | Governance |
| Approval queue median wait | <= 10m | > 20m for 30m | Ops Lead |
| Tool call p95 latency | <= 2s | > 5s for 15m | MCP Platform |
| Output QUARANTINE ratio | < 1% | > 3% for 15m | Safety Team |
| Policy DENY anomaly | baseline ± 20% | > 2x baseline | Security Operations |
```bash
#!/usr/bin/env bash
# mcp-go-no-go.sh
# Blocks a production rollout when a gate metric breaches its threshold.
# Assumes $API is set and each metrics endpoint returns a bare integer.
set -euo pipefail

# -f makes curl fail on HTTP errors, so a broken endpoint cannot pass the gate.
UNAPPROVED_WRITES=$(curl -fsS "$API/metrics/unapproved-high-risk-writes?window=10m")
APPROVAL_P50_MIN=$(curl -fsS "$API/metrics/approval-queue-p50-minutes?window=30m")
TOOL_P95_MS=$(curl -fsS "$API/metrics/tool-call-p95-ms?window=15m")

if [ "$UNAPPROVED_WRITES" -gt 0 ]; then
  echo "BLOCK: unapproved high-risk write detected"
  exit 1
fi

if [ "$APPROVAL_P50_MIN" -gt 20 ]; then
  echo "BLOCK: approval queue latency exceeded"
  exit 1
fi

if [ "$TOOL_P95_MS" -gt 5000 ]; then
  echo "BLOCK: tool latency SLO breach"
  exit 1
fi

echo "PASS: production gate clear"
```
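The script checks three of the six gates. A sketch that evaluates the full table in one pass might look like the following. The metric key names are hypothetical, and the sketch assumes each value has already been aggregated over the window listed in the table, so the "for 15m" and "for 30m" conditions are handled upstream.

```python
from typing import Callable, NamedTuple


class Gate(NamedTuple):
    name: str
    owner: str
    blocked: Callable[[dict], bool]  # returns True when the block condition holds


# Block conditions copied from the go/no-go table; metric keys are assumptions.
GATES = [
    Gate("Auth failure rate", "Platform Security",
         lambda m: m["auth_failure_rate"] > 0.02),
    Gate("Unapproved high-risk writes", "Governance",
         lambda m: m["unapproved_high_risk_writes"] > 0),
    Gate("Approval queue median wait", "Ops Lead",
         lambda m: m["approval_p50_min"] > 20),
    Gate("Tool call p95 latency", "MCP Platform",
         lambda m: m["tool_p95_ms"] > 5000),
    Gate("Output QUARANTINE ratio", "Safety Team",
         lambda m: m["quarantine_ratio"] > 0.03),
    Gate("Policy DENY anomaly", "Security Operations",
         lambda m: m["deny_rate"] > 2 * m["deny_baseline"]),
]


def evaluate_gates(metrics: dict) -> list[str]:
    """Return one BLOCK message per breached gate; an empty list means go."""
    return [
        f"BLOCK: {gate.name} (owner: {gate.owner})"
        for gate in GATES
        if gate.blocked(metrics)
    ]
```

Naming the owner in each message routes a failed gate straight to the team responsible for it. A missing metric raises `KeyError` instead of passing silently, which fits a gate that should fail closed.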
Limitations and tradeoffs
More control-plane work
Strong governance increases initial setup time, but reduces costly incident recovery later.
Approval friction risk
Over-classifying actions as high-risk creates queue buildup. Risk taxonomy tuning is mandatory.
Operational discipline required
Thresholds, owners, and drills must be maintained. Controls decay when not rehearsed.