The Model Context Protocol is the connective tissue between LLMs and external systems — filesystems, databases, APIs, browsers, internal tooling. In a well-functioning agentic deployment, MCP lets a model read documents, query databases, and take actions in the world. In an adversarial deployment, MCP becomes the mechanism by which attacker-controlled content reaches a model's context window and convinces it to take actions the user never authorized.
We previously covered the RCE class of MCP vulnerability and AEGIS detection at the network layer (Anthropic MCP RCE: 7,000 Servers Exposed). This piece is about a different and subtler problem: the confused deputy — the provenance gap in the protocol design itself — and the specific attack paths that flow from it.
What Is the Confused Deputy Problem in MCP?
The confused deputy is a classic access-control concept: an entity with legitimate authority is manipulated by an untrusted party into exercising that authority on the untrusted party's behalf. In MCP, the confused deputy is the model itself.
Here is the structural problem: when an MCP tool call returns a result, the protocol carries the content of that result into the model's context window. It does not carry cryptographically attestable information about where that content came from, who produced it, or whether it has been tampered with in transit. The model receives a tool_result and must decide how to act on it based on its content — but the content could come from a trusted server, a compromised server, a DNS-rebinded endpoint, or a webpage that an attacker controls and has stuffed with instruction text designed to override the model's behavior.
The MCP specification includes a ToolAnnotations object — fields like readOnlyHint, destructiveHint, idempotentHint — intended to communicate properties of tools to clients. The specification is explicit that these annotations are advisory and not security boundaries. They can be set by the server and have no verifiable binding to actual tool behavior. A malicious or compromised server can annotate a destructive write operation as readOnly. An instruction-injected tool result can include fabricated annotations in its text. Neither the client nor the model can verify them cryptographically.
The result is a protocol where the model is asked to make trust decisions about content for which the protocol provides no trustworthy provenance signal.
The Official Fetch Server: Attacker-Controlled Markdown Straight to Context
The reference MCP fetch server — the official Anthropic-maintained server for web browsing tool calls — accepts a URL, fetches the content, converts the HTML to Markdown, and returns it as a tool_result. The model receives this Markdown and processes it as part of its context.
Consider what happens when the URL being fetched is attacker-controlled. The attacker can place arbitrary text in the page, including:
- Instructions formatted to look like system-prompt or tool-definition content
- Text that references other tool names in the model's context and instructs the model to call them with specific arguments
- Content that mimics the format of legitimate tool results to build false context
- Instructions to exfiltrate information from context by encoding it in a follow-up URL fetch
This is not a hypothetical. It is the standard indirect prompt injection pattern applied to the MCP fetch surface. The fetch server converts web content to Markdown without sanitizing or labeling it as untrusted external content — it lands in the model's context with the same structural weight as system instructions or legitimate tool results. The model has no reliable mechanism to distinguish "this is content I fetched from an external page" from "this is an instruction from my operator."
The attack does not require compromising an MCP server. It requires only the ability to serve content at a URL the model will be asked to fetch — which in many agentic workflows is directly controlled by user input or by attacker-controlled content in previously fetched documents.
The DNS Rebinding Footgun
DNS rebinding attacks against MCP create a distinct and particularly dangerous attack path. The mechanism:
- An attacker registers a domain and configures a DNS server that initially resolves it to a public IP.
- The MCP fetch server checks the URL — it resolves to a public IP, passes any allowlist or block check, and proceeds.
- The attacker's DNS server drops the TTL to zero and rebinds the domain to an internal IP address — 192.168.x.x, 10.x.x.x, or 172.16-31.x.x — typically the IP of an internal service that the MCP server can reach from its network position.
- When the MCP server makes the HTTP request, the DNS lookup now resolves to the internal IP. The fetch server connects to the internal service and returns its content to the model.
This bypasses most URL-based filtering because the filtering happens at allowlist check time (public IP), not at fetch time (internal IP). The MCP server becomes an involuntary proxy to internal services, and the attacker now has the model reading the output of those internal services.
The footgun is particularly sharp in self-hosted agentic deployments where the MCP server runs inside the network perimeter and has access to internal APIs, documentation systems, internal dashboards, or configuration management endpoints that are not exposed publicly.
Detection Rules
The following are concrete, loggable signals for MCP deployments. These should be implemented at the MCP proxy layer, in SIEM rules, or in the agent monitoring infrastructure.
Instruction injection detection:
- Rule: Tool result content contains system-prompt-like structure. Alert when a tool_result payload contains strings that match system-prompt formatting patterns: XML-like role tags (
<system>,<instructions>,[INST]), markdown headers framing instructions (## New Instructions,## Override), or explicit references to model roles. These patterns appearing in tool results, not in system prompts, indicate injection attempts. - Rule: Tool result references other tool names with arguments. Alert when the text content of a tool_result includes strings that match tool names registered in the current MCP session, especially when followed by argument-like structures. A webpage result that says "now call the write_file tool with the following content" is an injection attempt, not legitimate page content.
- Rule: Tool result size exceeds threshold relative to request. Large tool results — particularly from fetch operations on relatively simple URLs — may indicate content stuffed with injection text. Alert and human-review above a configurable size threshold for external-URL fetch results.
DNS rebinding detection:
- Rule: Destination IP of MCP fetch resolves to RFC1918 space. Log the resolved IP address at the time of the HTTP connection, not at URL parse time. Alert when a fetch to a non-private hostname (not .local, not explicit RFC1918) results in a TCP connection to 10.0.0.0/8, 172.16.0.0/12, or 192.168.0.0/16. This is the rebinding signal.
- Rule: DNS TTL below threshold on MCP-fetched domains. Log TTL values for all DNS resolutions made by the MCP fetch server. A TTL of zero or below 30 seconds on a non-CDN domain is a rebinding preparation signal. Alert for manual review before the fetch proceeds.
- Rule: HTTP redirect chain ending at internal IP. Detect redirect chains where an initial fetch to a public IP eventually redirects to an internal endpoint. Log and alert — legitimate content delivery does not redirect to RFC1918 addresses.
Behavioral anomaly detection:
- Rule: Tool call sequence diverges from established session pattern. If an agentic session typically calls a defined set of tools in a consistent sequence, alert when a new tool call type appears immediately after a fetch_url result. This pattern — fetch external content, then call a privileged tool — is consistent with successful injection.
- Rule: Tool call to privileged write/execute tools without prior user message. If a tool with destructive or write capability is called without a corresponding user instruction in the conversation turn that triggered it, flag for review. The model should not spontaneously decide to write files or execute code — that should trace back to explicit user intent.
- Rule: Data exfiltration via chained fetch calls. Alert when fetch_url tool calls include URL query parameters or fragments that contain strings also present in earlier tool results or context. This pattern — encoding context data into a URL and fetching it — is a common exfiltration channel through fetch-capable MCP servers.
Mitigations Beyond Detection
Detection is necessary but not sufficient. The structural mitigations:
- Label untrusted content explicitly in context. The model prompt architecture should clearly delineate tool results from external sources as untrusted. Wrapping external fetch results in a structured tag —
<external_content source="untrusted">— and training or prompting the model to treat such content as data, not instructions, reduces but does not eliminate injection risk. - Restrict fetch server access to a URL allowlist. For agentic deployments where the set of external resources is known, replace unrestricted fetch with a strict allowlist of domains. This eliminates the attacker-controlled fetch surface almost entirely.
- Run MCP servers in network-isolated environments. A fetch server that cannot reach internal RFC1918 addresses cannot be weaponized for DNS rebinding. Network segmentation is the most reliable DNS rebinding mitigation.
- Require per-call authorization for privileged tools. Tools with destructive or write capability should require an explicit human-in-the-loop confirmation rather than executing on model decision alone. This breaks the injection-to-execution chain even when injection succeeds.
The Provenance Problem Is Not Solved Yet
The root issue — that MCP tool results carry no cryptographically attestable provenance — is not patched by any of the above mitigations. They reduce the attack surface and increase detection probability, but they do not give the model a reliable way to distinguish trustworthy from untrusted content in its context window. That requires a protocol-level solution: a provenance field on tool results with a verifiable binding to the server's identity, analogous to TLS certificate verification at the transport layer.
Until such a mechanism exists and is widely deployed, the confused-deputy risk in MCP is structural. Defenders cannot patch it away — they can only reduce exposure through layered detection, strict network controls, and architectures that treat all external tool results as untrusted by default.
For the network-layer RCE class of MCP vulnerability and AEGIS fast-path detection, see Anthropic MCP RCE: 7,000 Servers Exposed and Why L1 Fast-Path Matters. For Sable's agentic security assessment offering, see our engagement options.