Prompt Injection Cannot Be Patched. Design Around It.

RRogue AI·2026-06-21·10 min read

An AI agent reading a malicious instruction hidden inside a web page on the same stream as its real task, unable to tell command from data

You cannot patch prompt injection, and the sooner that lands the sooner you start building agents that survive it. It is not a defect in one model that a point release will close. It is a property of how language models work: instructions and the data they read arrive as one stream of tokens, and the model has no reliable way to tell your command apart from a sentence it just lifted out of a web page, an email, or a PDF. A 2026 study that ran 3,168 attack simulations against browser agents on GPT-5 and Gemini found direct injection succeeded more than 79 percent of the time. The title of the paper says the quiet part out loud: agents may always fall for prompt injection.

So stop trying to win the argument with the attacker inside the prompt. The durable move is architectural: design the system so that a hijacked agent reaches nothing that matters. This is what that actually looks like, why filtering is a losing game, and the two or three structural patterns that hold up when the injection inevitably lands.

Why a smarter model will not save you

Every few weeks someone argues the next model will be robust enough to ignore injected instructions. It will not, and the reason is structural rather than a matter of scale. An agent has to read untrusted content to do its job, summarise this inbox, research that page, parse this invoice, and the moment it reads attacker-controlled text, that text is in the same context window as your instructions, wearing the same clothes. There is no field on a token that says “this part is trusted.” The model is doing exactly what it was built to do: predict a continuation from everything in front of it. The injected sentence is just more of the everything.

That is why the success rates stay stubborn across frontier models. We put a number on the agent version of this in why the prompt-injection number never reaches zero: safeguards push it down, they do not push it to zero, and “rarely” is not a security property when the agent can move money or delete files. A better model lowers the odds on any single attempt. It does not change the fact that the attacker gets unlimited attempts and only has to win once.

The lethal trifecta: when an agent is unconditionally exposed

Simon Willison gave the dangerous configuration a name that is worth memorising: the lethal trifecta. An agent is unconditionally exposed to data exfiltration the moment it has all three of these at once.

Access to private data. It can read your mailbox, your documents, your database, your secrets.
Exposure to untrusted content. It processes text from sources an attacker can influence, a web page, an email, a shared file, a tool result.
An exfiltration path. It can send data out, a request, a link it renders, a message it posts, a tool it calls.

Hold all three and you are not “mostly safe with good prompts.” You are one cleverly worded paragraph away from the agent reading a secret and shipping it somewhere on the attacker’s behalf, with no exploit, no malware, just text. The design lesson is blunt: never let one agent hold all three at full strength. Break the triangle and the unconditional exposure goes away.

Stop filtering. Separate control from data.

Input filters, jailbreak classifiers, and “ignore instructions in retrieved content” system prompts are all the same bet: that you can spot the bad text. It is whack-a-mole against an attacker with a thesaurus and infinite tries. The patterns that actually hold do something different, they stop the untrusted text from ever steering what the agent does.

The cleanest expression of this is Google DeepMind’s CaMeL design, which borrows two ideas straight out of classic systems security: control-flow integrity and capabilities. A privileged model sees only your trusted request and writes the plan. A separate quarantined model handles the untrusted content and is stripped of the ability to call tools. Because the plan is fixed before any untrusted data is touched, that data can change what the agent knows but never what it does. On the AgentDojo benchmark this blocked close to 100 percent of attacks. The injected text still arrives, it just has nowhere to go.

The Rule of Two, and controls that live outside the model

You do not always need a full dual-model split. Meta’s “Agents Rule of Two” is the cheap version of the same idea: an unsupervised agent should hold at most two of the trifecta’s three capabilities, and the moment it needs all three, a human approves the step. Reading untrusted web pages and summarising them for you is fine. Reading untrusted web pages, holding your credentials, and being able to post outbound is not, until a person signs off.

Notice where the control sits in every one of these patterns: outside the model, not inside the prompt. The agent is assumed to be fooled. The container it runs in, the credentials it is handed, the egress it is allowed, and the actions that require confirmation are decided by code the injected text can never reach. This is the same containment argument we make for coding agents in why your coding agent needs a leash, and the same access-control discipline behind securing RAG pipelines against prompt injection.

What stops a determined injection, and what does not

Defense	Stops a determined injection?	Why
System-prompt hardening	No	Untrusted text shares the context window; the model cannot reliably obey one part and ignore another
Input filters and jailbreak classifiers	No	Pattern matching against an attacker with unlimited rewordings and unlimited tries
Control and data separation (dual model)	Yes	The plan is fixed before untrusted data is read, so injection changes knowledge, not actions
Capability scoping and Rule of Two	Yes	A hijacked agent never holds private data, untrusted input, and egress at the same time
Egress control and human gates outside the model	Yes	The injected text cannot reach the code that decides what may leave or run

A build pattern that survives

Designing for the injection rather than against it comes down to a short, boring checklist that you decide once, in code, not per prompt.

Break the trifecta by default. Map every agent against private data, untrusted input, and egress. If one agent has all three, split it or insert a human gate before the third.
Keep untrusted content away from tool-calling. The component that reads the web, the inbox, or the document should not be the component that holds your credentials and calls tools.
Scope capabilities, not just permissions. Hand the agent the narrowest, most disposable credentials that still let it work, the same instinct behind securing self-hosted AI infrastructure.
Control egress, not just input. Most injection payoff is exfiltration. An allowlist on where the agent can send data defangs the attack even when the prompt wins.
Gate the irreversible. Money, deletes, deployments, and outbound messages get an explicit human confirmation per call, never an implicit yes.
Red-team the architecture, not the wording. Test whether an injected instruction can reach a tool, not whether one particular phrasing slips through, the way we treat evaluation in testing AI systems before production.

Design for the mistake

The teams that ship agents safely are not the ones with the cleverest system prompt. They are the ones who assumed the agent would be fooled and made sure it did not matter. Prompt injection stops being an emergency the day you stop treating it as a bug to be patched and start treating it as the weather, a condition you build for. The model will read the malicious sentence. Whether that sentence can do anything is your decision, and it is one you make in the architecture, not in the prompt. We make the same case for putting agents into production deliberately in what breaks in agent orchestration at production scale and building LLM features that survive production.