Prompt injection & jailbreaks
An AI assistant can't reliably tell the difference between the instructions you trust and the text it just read. That single gap is the root of the most common attack on language-model apps. Learn what prompt injection is, how the indirect kind hides in content the model ingests, why it works, and the layered defenses that reduce — but never fully eliminate — the risk.
01What prompt injection is — and how it differs from a jailbreak
The AI Governance Charter — establish ownership, scope, and accountability for AI.
Get the charter Browse all templatesYour purchase helps keep our hubs free to read.
Prompt injection is an attack where adversarial text is slipped into a language model's input so the model follows the attacker's instructions instead of, or on top of, the ones the developer intended. The OWASP Top 10 for LLM Applications ranks it as LLM01 — the number-one risk for LLM apps, and the NIST adversarial-machine-learning taxonomy treats it as a core attack class for generative AI.
It's easy to confuse with a jailbreak, but the two target different things. A jailbreak goes after the model's safety training — trying to coax it into producing content it was aligned to refuse. Prompt injection goes after the application's instruction boundary — overriding what the developer told the system to do (for example, getting it to leak data or misuse a connected tool). They overlap and a single attack can do both, but keeping them straight matters: a jailbreak defeats guardrails on output; an injection defeats the line between trusted instructions and untrusted data.
- Prompt injection = attacker text overrides the developer's intended instructions (OWASP LLM01; NIST AI 100-2e2025).
- Jailbreak = defeating the model's safety/alignment training to elicit prohibited output.
- They overlap, but injection attacks the instruction boundary; jailbreaks attack the safety layer.
02Direct vs. indirect: where the malicious text comes from
Prompt injection splits into two families, and the indirect one is what makes this a serious systems problem rather than a chat-window curiosity. MITRE ATLAS catalogs both under a single technique for LLM prompt injection, with direct and indirect sub-techniques.
The user types the attack
Malicious instructions are entered straight into the prompt by someone interacting with the app. The attacker and the user are the same person — the classic "ignore your instructions and do X instead" typed into a chat box.
The attack hides in content the model reads
Malicious instructions are planted in external data the model later ingests — a web page, document, email, support ticket, or retrieved knowledge-base snippet. The attacker never touches the app directly; the model encounters the payload while doing its job.
The indirect class was demonstrated in research by Greshake and colleagues, who showed that retrieved or external content can act as injected instructions in real LLM-integrated applications — enabling remote influence over the assistant and even payloads that propagate from one document to another. As soon as a model reads anything an attacker can influence, that content becomes a potential instruction channel.
- Direct injection — the person using the app supplies the malicious text themselves.
- Indirect injection — the payload is embedded in content the model fetches or is fed (web, docs, email, RAG results).
- Indirect injection means an attacker can reach the model without ever using the app (Greshake et al.; MITRE ATLAS).
03Why it works: instructions and data look the same
The root cause is structural, not a bug a patch can close. A language model receives trusted instructions and untrusted data as the same kind of thing — natural-language text in one shared context window — and there is no reliable internal boundary marking which is which. The UK's National Cyber Security Centre describes the model as a "confusable deputy": it acts on whatever instruction-shaped text it sees, with no robust way to know the difference between its developer's orders and a sentence buried in a document. OWASP, Google's Secure AI Framework, and NIST all attribute the vulnerability to this same instruction/data conflation.
That's also why jailbreaks persist even after extensive safety training. The "Jailbroken" research identifies two underlying failure modes: competing objectives (the model's drive to be helpful and capable pulls against its safety goals) and mismatched generalization (safety training doesn't cover every domain where the model still has capability). Separately, optimization-based methods such as the GCG (Greedy Coordinate Gradient) attack have shown that an appended adversarial suffix can be tuned to make an aligned model start with an affirmative reply — and that suffixes found on open models can transfer to commercial ones. Treat those reported success rates as historical, model- and benchmark-specific evidence of the attack class, not a current per-model score; defenses have since adapted.
- Models read trusted instructions and untrusted data in one undifferentiated text stream — no built-in security boundary (NCSC; OWASP; SAIF; NIST).
- Jailbreaks endure because of competing objectives and mismatched generalization in safety training (Wei et al.).
- Adversarial-suffix research (GCG) showed jailbreak suffixes can be universal and transferable — cite as historical evidence, not a fixed rate (Zou et al.).
04See it: guardrails off vs. on
Here a simulated assistant is given a trusted system instruction ("summarize the document, never reveal secrets") and then handed an untrusted document that contains a hidden line trying to hijack it. This is an illustration only — there is no real model and no working exploit. Flip the guardrail switch to watch the difference: with guardrails off, the simulated assistant treats the buried line as a real instruction; with guardrails on, that text is isolated as data and ignored.
Trusted system instruction
Untrusted document the model reads
Simulated model output
Illustrative simulation. Real systems use layered controls (below); no single switch makes a model immune, and authoritative sources agree there is currently no complete defense.
05Defenses: layers, not a silver bullet
Because the weakness is structural, the accepted answer is defense in depth — several independent controls so that when one is bypassed, others still contain the damage. The recurring recommendations across OWASP, NCSC, Google's SAIF, NIST, and vendor guidance fall into a few groups:
- Privilege separation & least privilege — give the model and any agent the narrowest possible access to tools, data, and actions, so a successful injection has a small blast radius (OWASP LLM01; NCSC).
- Input & output filtering — sanitize and screen untrusted content going in, and check responses going out; classifier-based detection can flag likely injected instructions in retrieved context (SAIF; Anthropic prompt-injection defenses).
- Human-in-the-loop approval — require explicit human confirmation before the system takes high-risk or irreversible actions (sending data, executing code, making purchases) (OWASP; OpenAI agent safety).
- Instruction hierarchy — train and prompt the model to rank instructions by trust (system > developer > user > tool output) so lower-trust text can't override higher-trust orders (Wallace et al.; OpenAI).
- Deterministic safeguards & adversarial testing — wrap the model in conventional, non-AI controls that constrain what it can do, and red-team continuously with deliberately injected documents and tool outputs (NCSC; Anthropic; NIST AI 600-1).
One caveat runs through every authoritative source: these controls reduce risk, they do not eliminate it. NCSC states mitigations cannot remove the likelihood of attack; OpenAI frames prompt injection as an open "frontier security challenge"; and as the term's originator notes, even a defense that works ~95% of the time still fails against an adaptive attacker. The agentic shift makes this sharper: once a model can browse, call tools, run code, or query a knowledge base, a single successful injection can turn into data exfiltration or unauthorized actions. Treat any vendor or tool claim of "fully prevents prompt injection" as marketing, not fact.
06Check your understanding
07Take it with you & go deeper
AI security — the essentials
The broader picture: the main risks to AI systems and the controls that defend them.
Read →AI agents, explained
Why tools, browsing, and RAG widen the injection attack surface — and how agents amplify the risk.
Read →Securing AI agents & tool use
Least privilege, human-in-the-loop, and deterministic safeguards for tool-using assistants.
Coming soonRed-teaming language models
How teams stress-test models with injected documents and adversarial prompts — defensively.
Coming soon⊕Concept map
The whole lesson at a glance — expand each branch to see the key ideas it covers.
Prompt injection vs. jailbreak
- Prompt injection = adversarial text makes the model follow the attacker's instructions instead of, or on top of, the developer's (OWASP LLM01).
- It is OWASP's #1 LLM application risk (LLM01:2025) and a core attack class in the NIST AI 100-2e2025 adversarial-ML taxonomy.
- A jailbreak defeats the model's safety/alignment training; an injection defeats the application's instruction boundary.
Direct vs. indirect injection
- Direct (MITRE AML.T0051.000): the person using the app types the malicious text themselves.
- Indirect (AML.T0051.001; Greshake et al.): the payload hides in external content the model later ingests — web pages, docs, email, tickets, RAG results.
- Indirect injection lets an attacker reach the model without ever using the app directly.
Why it works
- Models read trusted instructions and untrusted data as one undifferentiated text stream — no built-in security boundary (NCSC's "confusable deputy"; OWASP; SAIF; NIST).
- Jailbreaks persist because of competing objectives and mismatched generalization in safety training (Wei et al.).
- Adversarial-suffix research (GCG) showed jailbreak suffixes can be universal and transferable — historical evidence of the attack class, not a fixed rate (Zou et al.).
Defenses: layers, not a silver bullet
- Privilege separation & least privilege shrink the blast radius of a successful injection (OWASP LLM01; NCSC).
- Input/output filtering, human-in-the-loop approval, and instruction hierarchy (system > developer > user > tool output) all reduce risk (SAIF; OpenAI; Anthropic).
- Deterministic safeguards plus continuous adversarial red-teaming wrap the model in non-AI controls (NCSC; NIST AI 600-1).
No complete fix & agentic amplification
- Authoritative sources agree mitigations reduce risk but do not eliminate it; treat any "fully prevents" claim as marketing (NCSC; OpenAI; Simon Willison).
- Once a model can browse, call tools, run code, or query a knowledge base, one successful injection can become data exfiltration or unauthorized actions (OWASP LLM01; NIST; MITRE ATLAS).
→Related lessons
Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established security concepts for defensive, educational purposes and is grounded in the authoritative references below. The interactive demo is a simulation, labelled as such; it contains no working exploit. Prompt injection is an actively evolving threat with no known complete defense — treat any "solved" claim, vendor or otherwise, as risk reduction rather than elimination.
- LLM01:2025 Prompt Injection — OWASP Top 10 for LLM Applications — OWASP Gen AI Security Project
- MITRE ATLAS — Adversarial Threat Landscape for AI Systems — The MITRE Corporation
- NIST AI 100-2e2025 — Adversarial Machine Learning: A Taxonomy of Attacks and Mitigations — NIST
- NIST AI 600-1 — Generative AI Profile (AI RMF) — NIST
- Not what you've signed up for: Indirect Prompt Injection in real LLM apps — Greshake et al. (arXiv 2302.12173)
- Jailbroken: How Does LLM Safety Training Fail? — Wei, Haghtalab & Steinhardt (arXiv 2307.02483)
- Universal and Transferable Adversarial Attacks on Aligned Language Models (GCG) — Zou et al. (arXiv 2307.15043)
- AI and cyber security: what you need to know — UK NCSC
- Secure AI Framework (SAIF) — Google
- Mitigating the risk of prompt injections in browser use — Anthropic
- Understanding prompt injections: a frontier security challenge — OpenAI
- Prompt injection writeups (tagged archive) — Simon Willison
Prompt injection & jailbreaks — the essentials
Tech Jacks Solutions · AI Knowledge Hub · educational summary (defensive)
What it is
Prompt injection slips adversarial text into a model's input so it follows the attacker's instructions instead of the developer's. It is OWASP's #1 LLM application risk (LLM01). A jailbreak defeats safety training; an injection defeats the trusted-instruction boundary.
Direct vs. indirect
Direct: the user types the malicious text. Indirect: the payload is hidden in content the model later reads — web pages, documents, email, support tickets, retrieved knowledge-base results — so the attacker never touches the app directly.
Why it works
The model reads trusted instructions and untrusted data as the same kind of text, with no reliable internal boundary (NCSC: a "confusable deputy"). Jailbreaks persist due to competing objectives and mismatched generalization in safety training.
Defenses (layers, not a silver bullet)
Least privilege; input/output filtering and classifier detection; human-in-the-loop approval for high-risk actions; instruction hierarchy (system > developer > user > tool output); deterministic safeguards and continuous adversarial testing. Mitigations reduce — not eliminate — the risk.