Mastering Llama Safety and Guardrails: Llama Guard, Prompt Guard, and Code Shield (2026)
Last verified: June 2026 · Format: Guide · Est. time: 18 to 22 min
Meta ships its Llama models as open weights, which means the safety of any deployment is largely your responsibility. To help, Meta publishes a set of system-level safety tools called Llama Protections (formerly branded Purple Llama). These are reference implementations you wrap around the model: a content classifier, a prompt-attack detector, and an insecure-code scanner. None of them are part of the base model. You add them, and you decide how they are layered.
This guide explains what each tool does, where it sits in the request flow, and how to assemble them into a defense-in-depth stack. It covers Llama Guard for input and output moderation, Prompt Guard for catching prompt injection and jailbreak attempts before they reach the model, and Code Shield for filtering insecure code at the output stage. It also covers the official limits Meta documents, and what independent researchers found when they tried to evade these guardrails in early 2026. The honest takeaway up front: these are layers that reduce risk, not switches that guarantee safety.
The Llama Protections Stack at a Glance
Llama Protections is Meta's umbrella name for a group of system-level safety reference implementations. The brand was previously known as Purple Llama. The important mental model is that these tools sit around the model rather than inside it. A user request flows through them on the way in, the model generates a response, and the response flows through them on the way out.
Three components do most of the work, and each guards a different point in that flow:
- Prompt Guard inspects the incoming prompt for injection and jailbreak attempts before the main model ever sees it.
- Llama Guard classifies both the incoming prompt and the model's response against a hazard taxonomy, labeling each as safe or unsafe.
- Code Shield inspects generated output for insecure code patterns, catching vulnerable or malicious snippets before they are returned.
Meta also recommends pairing these classifiers with a strong system prompt, and notes that its own safety training uses a technique called safety context distillation: adversarial prompts are prefixed with a safe preamble during training, and the model is then fine-tuned on the safer outputs that result. With Llama 4, Meta used system prompts in part to reduce overly cautious or preachy refusals while keeping genuinely harmful requests blocked.
Guardrails are layers, not guarantees. Meta itself recommends combining classifiers, system prompts, and monitoring, and documents residual risk for every component.
Llama Guard: The Content Classifier
Llama Guard is a safety classifier. Given a piece of text, it outputs whether the content is safe or unsafe, and if unsafe, which hazard categories were violated. Unlike a simple keyword filter, it is itself a fine-tuned language model, so it reasons about context rather than matching strings.
What It Classifies Against
Llama Guard scores content against the MLCommons hazard taxonomy, a standardized set of 14 categories labeled S1-S14. These run from Violent Crimes (S1) through to categories such as Code Interpreter Abuse, giving deployers a common vocabulary for what counts as a violation. When Llama Guard flags content as unsafe, it names the specific categories that were triggered, so you can decide how to respond per category rather than applying a single blanket block.
Input and Output Moderation
A key strength of Llama Guard is that it works on both sides of a conversation. You can run it on the user's prompt (input moderation) to catch a harmful request before generation, and you can run it again on the model's response (output moderation) to catch a harmful completion before it reaches the user. Meta notes a tradeoff here: input filtering tends to raise the false-refusal rate more than output filtering does, because the classifier sometimes flags borderline prompts that would have produced a perfectly safe answer.
Versions and Sizes
Llama Guard 3 shipped in two forms: an 8B text-only classifier and an 11B vision-capable variant for moderating image plus text content. Llama Guard 4, released in April 2025, consolidates these into a single 12B natively multimodal model. Rather than training from scratch, Meta pruned Llama Guard 4 from the Llama 4 Scout Mixture-of-Experts model down to a dense, shared-expert network, so one model now handles both text and image moderation. As of mid-2026, Llama Guard 4 is the current release.
How to Deploy It
Llama Guard 4 is distributed on Hugging Face as Llama-Guard-4-12B under the meta-llama organization. You can run it locally through standard inference stacks such as the Transformers library or vLLM, call it through Meta's hosted Moderations API, or use it from the Azure model catalog. The deployment choice depends on whether you want full data isolation (run it yourself) or lower operational overhead (use a hosted endpoint).
Practical pattern: Run Llama Guard on input to block clearly harmful requests, and run it again on output to catch anything the model generated that slipped past. Use the per-category labels to tune responses, since not every category warrants the same action.
Prompt Guard: The Injection Detector
Prompt Guard solves a different problem from Llama Guard. Where Llama Guard asks "is this content harmful?", Prompt Guard asks "is this an attack on the system itself?" It classifies an incoming prompt into one of three buckets: benign, prompt injection, or jailbreak. It is built to catch attempts to override your instructions, smuggle in adversarial directives, or trick the model into ignoring its guardrails.
Where It Sits
Prompt Guard is input-only. It runs before the main model, screening the prompt as it arrives. Because it only inspects input, it is fast and cheap to run on every request, making it a natural first line of defense in front of a larger model.
Versions and Sizes
The first release of Prompt Guard was built on mDeBERTa, specifically the mDeBERTa-v3-base architecture, with an 86M-parameter backbone. Prompt Guard 2, released in April 2025 alongside Llama 4, comes in two sizes: an 86M version and a smaller 22M version. The 22M model is designed for latency-sensitive deployments where you want injection screening on every request without adding meaningful overhead. As of mid-2026, Prompt Guard 2 is the current release.
Why pair it with Llama Guard: Meta notes that Llama Guard, being a language model itself, is susceptible to prompt injection. Putting Prompt Guard 2 in front of the stack helps catch injection attempts before they can reach and potentially manipulate the content classifier.
Code Shield: The Insecure-Code Filter
Code Shield targets a risk specific to code-generating assistants. When a model writes code, it can produce snippets that compile and run but contain security vulnerabilities, or in the worst case, malicious patterns. Code Shield is an inference-time filter that inspects generated code and flags insecure output before it is returned to the user or written to a file.
Where It Sits
Code Shield operates on the output side, after the model has generated a response. It is most relevant for coding copilots, agent workflows that execute generated code, and any pipeline where model output flows into a build or runtime. Meta distributes it as part of its open trust-and-safety tooling on GitHub, so you integrate it into your serving pipeline rather than calling a hosted endpoint.
Scope note: Code Shield reduces the chance of shipping insecure generated code, but it is not a replacement for human code review, static analysis, or your existing application security pipeline. Treat it as one more gate, not the only one.
How the Three Tools Compare
The three components are easiest to reason about side by side. Each guards a different stage of the request, and together they cover input attacks, harmful content on both sides, and insecure code on the way out.
| Tool | Role | Size | Input / Output | How to Deploy |
|---|---|---|---|---|
| Llama Guard | Safety classifier: labels content safe or unsafe plus violated categories | v3: 8B text, 11B vision. v4: 12B multimodal | Both input and output | Hugging Face (Llama-Guard-4-12B), Moderations API, Azure |
| Prompt Guard | Classifies prompts as benign, prompt injection, or jailbreak | v1: 86M (mDeBERTa). v2: 86M and 22M | Input only | Run before the main model in your serving stack |
| Code Shield | Detects insecure or malicious generated code at inference | Tooling, not a single model size | Output only | Meta open trust-and-safety tools on GitHub |
Tool specs: Meta Llama documentation and model cards, as of mid-2026.
Plan Your Layers Before You Build
Before wiring anything together, make a few deliberate decisions. A guardrail stack works best when each layer is chosen on purpose rather than bolted on after an incident. Work through the checklist below to scope your deployment.
Build the Stack: A Layered Approach
Assemble the guardrails in the order a request actually travels. Each step adds a layer of defense at a different point in the flow. Use the tracker below to mark your progress.
- ✓Step 1: Add Prompt Guard 2 at the front
- ✓Step 2: Add Llama Guard on input and output
- ✓Step 3: Add Code Shield for code output
- ✓Step 4: Set a strong system prompt
- ✓Step 5: Monitor and red-team
Step 1: Add Prompt Guard 2 at the Front
Place Prompt Guard 2 first, before the request reaches your main Llama model. Because it screens input only and the 22M variant is small, you can run it on every request with minimal latency. Catching an injection or jailbreak here means it never reaches the model or, importantly, your content classifier.
Step 2: Add Llama Guard on Input and Output
Wrap the model with Llama Guard on both sides. On input, it blocks clearly harmful requests that are not attacks per se but still violate your policy. On output, it catches harmful completions before they return to the user. Remember that input filtering raises false refusals more, so tune your thresholds to your tolerance for over-blocking.
Step 3: Add Code Shield for Code Output
If your application generates code, add Code Shield to the output path. It inspects generated snippets for insecure patterns at inference time. This is most valuable for coding assistants and agent workflows that may execute what the model writes.
Step 4: Set a Strong System Prompt
Pair the classifiers with a clear system prompt that defines allowed and disallowed behavior for your specific use case. Meta found that strong system prompts can reduce unnecessary refusals while keeping genuinely harmful requests blocked, so this is both a safety and a usability lever.
Step 5: Monitor and Red-Team
No stack is finished at deployment. Log flagged requests, watch for drift in false-refusal and miss rates, and run an ongoing red-team process to probe for bypasses. The independent findings later in this guide show why continuous testing matters: attackers adapt, and a fixed set of guardrails will eventually be probed.
Known Limitations and Independent Findings
Meta is unusually direct about what these tools do not do, and independent researchers have published evasion results. Treat the cards below as the honest counterweight to the capability story above.
Separately, in February 2026, security firm Trail of Bits demonstrated prompt-injection techniques that bypassed AI browser assistants by disguising attacker instructions as fake security mechanisms, system instructions, or ordinary user requests. The finding was about the broader pattern of injecting AI assistants, and it reinforces the same lesson: input-screening guardrails reduce risk but determined attackers keep finding new framings.
Troubleshooting and Common Questions
Securing an open-weight Llama deployment is an exercise in layering. Prompt Guard 2 screens incoming requests for attacks, Llama Guard classifies content on both sides against the MLCommons hazard taxonomy, Code Shield catches insecure generated code, and a strong system prompt ties the policy together. Each tool is a reference implementation you control, which is the point of open weights: the safety posture is yours to define.
What the documentation and the independent findings make clear is that none of these layers is a guarantee. Meta flags weak spots on Defamation, Intellectual Property, and Elections, and notes that the classifiers can themselves be targeted. Independent tests in early 2026 showed real attacks getting through. The mature approach is to combine the layers, tune for your false-refusal tolerance, and treat monitoring and red-teaming as permanent parts of the system rather than a launch-day checkbox.
Llama, Meta Llama, Llama Guard, Prompt Guard, and Code Shield are trademarks of Meta Platforms, Inc. MLCommons is a trademark of its respective owner. Tech Jacks Solutions is an independent publisher and is not affiliated with, endorsed by, or sponsored by Meta. All product names, logos, and brands are property of their respective owners and are used for identification purposes only.