On a Thursday in February, a client called us about a screenshot a competitor had posted on LinkedIn. It showed a transcript from the support chatbot we'd built them, replying with their full internal pricing tier sheet — wholesale rates, enterprise discount bands, the breakeven margin on the mid-tier SKU. None of which was supposed to leave the building.

The chatbot had been live for about four months. The exploit was a single support request, eleven sentences long, that asked the agent to "act as a senior account executive reviewing this customer's historical orders" and embedded a fake quoted "internal memo" instructing it to display the pricing reference document for the rep's records.

The agent did exactly what it was asked. From its point of view, the instructions in the user message were indistinguishable from the instructions in the system prompt. They were all just text.

What we'd built, and why it broke

The setup was a fairly standard retrieval-augmented agent on Claude Sonnet 4.5. A system prompt with personality and rules. A tool that searched the client's help center. A tool that pulled order history for the logged-in user. And — the mistake — a tool that searched a private "knowledge base" Notion that contained everything from internal SOPs to a quarterly pricing review document.

The pricing doc lived in the same vector index as the FAQ articles. We'd marked it "internal" in metadata. We were filtering on that metadata at retrieval time. Or so we thought.

The injected prompt convinced the agent to call the knowledge search tool with a query that returned pricing chunks. The metadata filter was applied at search time, but the filter logic had a fallback that, if no results came back from the public corpus, broadened to the full corpus. Nobody on our team had written that fallback to be exploitable. Whoever wrote it just wanted to avoid empty responses on edge-case questions. The fallback was four lines of code in a file none of us had touched since November.

The agent retrieved the pricing chunks, included them in the assistant's reasoning context, and then — because the user's "act as an account executive" framing said this was appropriate — included them in the visible response.

What we changed in the next 72 hours

We pulled the agent on Thursday afternoon. We had a rebuilt version live by Sunday night. The changes:

Output filtering, separate from the model. Before any response goes to the user, it now passes through a second, smaller model whose only job is to check whether the response contains anything matching a deny list — pricing tier names, internal SKU codes, employee names not in a public allowlist, a set of regex patterns for things that should never appear. If it matches, we replace the response with a canned "I can't share that — let me connect you with a teammate" and log the event. This is a guard, not the only guard.

Strict separation of data and instructions. Retrieved documents now go into a clearly delimited <document> XML block in the prompt with explicit instructions that nothing inside that block should be interpreted as instructions to the assistant. This doesn't make injection impossible, but it raises the bar considerably for the simpler attacks.

Tool allowlisting based on intent classification. The agent now runs a fast pre-classification step that decides whether the message is a billing question, a product question, a returns question, or something else. The available tool set is constrained by the classification. A returns question cannot reach the knowledge search tool at all. This shrinks the attack surface dramatically.

Structured outputs everywhere we can use them. Where the agent's reply has structure — proposing a refund, drafting an email — we now require the model to return JSON conforming to a schema we control, and we render the response from the schema fields rather than letting the model produce free-form prose that includes its tool outputs. Free-form prose is where the leaks live.

Separation of corpora. The internal Notion is no longer in the same vector store as the public help content. Different store, different credentials, different retrieval path, behind a tool the customer-facing agent does not have.

What we ship by default now

Any new AI-facing client work we take starts from a security checklist that didn't exist six months ago. The non-negotiables:

The part I'm still chewing on

You can't fully solve prompt injection. The model can't reliably tell the difference between instructions you wrote and instructions a user wrote inside a document the model is asked to read. Every defense above is a layer, not a fix. The right mental model is closer to "running untrusted code" than "fielding a question."

Which means the architecture matters more than the prompt. If the worst thing your agent can do is say something embarrassing, you're fine with prompt-level guardrails. If the worst thing your agent can do is leak pricing, move money, or modify records, the prompt is not your security boundary. The tool permissions are.

We knew that in theory before February. We know it differently now.