Why do my Llama guardrails block too many legitimate requests?

This is the false-refusal problem. Meta notes input filtering raises the false-refusal rate more than output filtering does. Shifting some checks to the output side, tuning thresholds, and refining the system prompt can reduce over-blocking.

Meta Llama

Mastering Llama Safety and Guardrails: Llama Guard, Prompt Guard, and Code Shield (2026)

Last verified: June 2026 · Format: Guide · Est. time: 18 to 22 min

Meta ships its Llama models as open weights, which means the safety of any deployment is largely your responsibility. To help, Meta publishes a set of system-level safety tools called Llama Protections (formerly branded Purple Llama). These are reference implementations you wrap around the model: a content classifier, a prompt-attack detector, and an insecure-code scanner. None of them are part of the base model. You add them, and you decide how they are layered.

This guide explains what each tool does, where it sits in the request flow, and how to assemble them into a defense-in-depth stack. It covers Llama Guard for input and output moderation, Prompt Guard for catching prompt injection and jailbreak attempts before they reach the model, and Code Shield for filtering insecure code at the output stage. It also covers the official limits Meta documents, and what independent researchers found when they tried to evade these guardrails in early 2026. The honest takeaway up front: these are layers that reduce risk, not switches that guarantee safety.

Core tools in the Llama Protections stack: Llama Guard, Prompt Guard, Code Shield

Source: Meta Llama documentation (2026)

Hazard categories (S1 to S14) in the MLCommons taxonomy Llama Guard classifies against

Source: Meta Llama Guard 4 model card (April 2025)

12B

Parameters in Llama Guard 4, a natively multimodal classifier pruned from Llama 4 Scout

Source: Meta Llama Guard 4 model card (April 2025)

66.2%

Of attack prompts blocked by Llama Guard in one independent test

Source: MindStudio (February 2026)

The Llama Protections Stack at a Glance

Llama Protections is Meta's umbrella name for a group of system-level safety reference implementations. The brand was previously known as Purple Llama. The important mental model is that these tools sit around the model rather than inside it. A user request flows through them on the way in, the model generates a response, and the response flows through them on the way out.

Three components do most of the work, and each guards a different point in that flow:

Prompt Guard inspects the incoming prompt for injection and jailbreak attempts before the main model ever sees it.
Llama Guard classifies both the incoming prompt and the model's response against a hazard taxonomy, labeling each as safe or unsafe.
Code Shield inspects generated output for insecure code patterns, catching vulnerable or malicious snippets before they are returned.

Meta also recommends pairing these classifiers with a strong system prompt, and notes that its own safety training uses a technique called safety context distillation: adversarial prompts are prefixed with a safe preamble during training, and the model is then fine-tuned on the safer outputs that result. With Llama 4, Meta used system prompts in part to reduce overly cautious or preachy refusals while keeping genuinely harmful requests blocked.

Guardrails are layers, not guarantees. Meta itself recommends combining classifiers, system prompts, and monitoring, and documents residual risk for every component.

Llama Guard: The Content Classifier

Llama Guard is a safety classifier. Given a piece of text, it outputs whether the content is safe or unsafe, and if unsafe, which hazard categories were violated. Unlike a simple keyword filter, it is itself a fine-tuned language model, so it reasons about context rather than matching strings.

What It Classifies Against

Llama Guard scores content against the MLCommons hazard taxonomy, a standardized set of 14 categories labeled S1-S14. These run from Violent Crimes (S1) through to categories such as Code Interpreter Abuse, giving deployers a common vocabulary for what counts as a violation. When Llama Guard flags content as unsafe, it names the specific categories that were triggered, so you can decide how to respond per category rather than applying a single blanket block.

Input and Output Moderation

A key strength of Llama Guard is that it works on both sides of a conversation. You can run it on the user's prompt (input moderation) to catch a harmful request before generation, and you can run it again on the model's response (output moderation) to catch a harmful completion before it reaches the user. Meta notes a tradeoff here: input filtering tends to raise the false-refusal rate more than output filtering does, because the classifier sometimes flags borderline prompts that would have produced a perfectly safe answer.

Versions and Sizes

Llama Guard 3 shipped in two forms: an 8B text-only classifier and an 11B vision-capable variant for moderating image plus text content. Llama Guard 4, released in April 2025, consolidates these into a single 12B natively multimodal model. Rather than training from scratch, Meta pruned Llama Guard 4 from the Llama 4 Scout Mixture-of-Experts model down to a dense, shared-expert network, so one model now handles both text and image moderation. As of mid-2026, Llama Guard 4 is the current release.

How to Deploy It

Llama Guard 4 is distributed on Hugging Face as Llama-Guard-4-12B under the meta-llama organization. You can run it locally through standard inference stacks such as the Transformers library or vLLM, call it through Meta's hosted Moderations API, or use it from the Azure model catalog. The deployment choice depends on whether you want full data isolation (run it yourself) or lower operational overhead (use a hosted endpoint).

Practical pattern: Run Llama Guard on input to block clearly harmful requests, and run it again on output to catch anything the model generated that slipped past. Use the per-category labels to tune responses, since not every category warrants the same action.

Prompt Guard: The Injection Detector

Prompt Guard solves a different problem from Llama Guard. Where Llama Guard asks "is this content harmful?", Prompt Guard asks "is this an attack on the system itself?" It classifies an incoming prompt into one of three buckets: benign, prompt injection, or jailbreak. It is built to catch attempts to override your instructions, smuggle in adversarial directives, or trick the model into ignoring its guardrails.

Where It Sits

Prompt Guard is input-only. It runs before the main model, screening the prompt as it arrives. Because it only inspects input, it is fast and cheap to run on every request, making it a natural first line of defense in front of a larger model.

Versions and Sizes

The first release of Prompt Guard was built on mDeBERTa, specifically the mDeBERTa-v3-base architecture, with an 86M-parameter backbone. Prompt Guard 2, released in April 2025 alongside Llama 4, comes in two sizes: an 86M version and a smaller 22M version. The 22M model is designed for latency-sensitive deployments where you want injection screening on every request without adding meaningful overhead. As of mid-2026, Prompt Guard 2 is the current release.

Why pair it with Llama Guard: Meta notes that Llama Guard, being a language model itself, is susceptible to prompt injection. Putting Prompt Guard 2 in front of the stack helps catch injection attempts before they can reach and potentially manipulate the content classifier.

Code Shield: The Insecure-Code Filter

Code Shield targets a risk specific to code-generating assistants. When a model writes code, it can produce snippets that compile and run but contain security vulnerabilities, or in the worst case, malicious patterns. Code Shield is an inference-time filter that inspects generated code and flags insecure output before it is returned to the user or written to a file.

Where It Sits

Code Shield operates on the output side, after the model has generated a response. It is most relevant for coding copilots, agent workflows that execute generated code, and any pipeline where model output flows into a build or runtime. Meta distributes it as part of its open trust-and-safety tooling on GitHub, so you integrate it into your serving pipeline rather than calling a hosted endpoint.

Scope note: Code Shield reduces the chance of shipping insecure generated code, but it is not a replacement for human code review, static analysis, or your existing application security pipeline. Treat it as one more gate, not the only one.

How the Three Tools Compare

The three components are easiest to reason about side by side. Each guards a different stage of the request, and together they cover input attacks, harmful content on both sides, and insecure code on the way out.

Tool	Role	Size	Input / Output	How to Deploy
Llama Guard	Safety classifier: labels content safe or unsafe plus violated categories	v3: 8B text, 11B vision. v4: 12B multimodal	Both input and output	Hugging Face (Llama-Guard-4-12B), Moderations API, Azure
Prompt Guard	Classifies prompts as benign, prompt injection, or jailbreak	v1: 86M (mDeBERTa). v2: 86M and 22M	Input only	Run before the main model in your serving stack
Code Shield	Detects insecure or malicious generated code at inference	Tooling, not a single model size	Output only	Meta open trust-and-safety tools on GitHub

Tool specs: Meta Llama documentation and model cards, as of mid-2026.

Plan Your Layers Before You Build

Before wiring anything together, make a few deliberate decisions. A guardrail stack works best when each layer is chosen on purpose rather than bolted on after an incident. Work through the checklist below to scope your deployment.

Pre-Deployment Checklist

✓

Pick your Llama Guard tier: 8B or 11B (v3) for text or vision, or 12B (v4) for unified multimodal moderation

✓

Decide whether to add Prompt Guard 2 (86M or 22M) in front of the model to catch injection and jailbreak attempts

✓

Decide input vs output filtering, knowing input filtering raises the false-refusal rate more than output filtering

✓

Write a strong system prompt that states allowed and disallowed behavior for your use case

✓

If your app generates code, plan to add Code Shield on the output side alongside human review

✓

Plan for residual risk: monitoring, logging, and a red-team process, because no layer is complete on its own

0 of 6 complete

Build the Stack: A Layered Approach

Assemble the guardrails in the order a request actually travels. Each step adds a layer of defense at a different point in the flow. Use the tracker below to mark your progress.

Step 1: Add Prompt Guard 2 at the Front

Place Prompt Guard 2 first, before the request reaches your main Llama model. Because it screens input only and the 22M variant is small, you can run it on every request with minimal latency. Catching an injection or jailbreak here means it never reaches the model or, importantly, your content classifier.

Step 2: Add Llama Guard on Input and Output

Wrap the model with Llama Guard on both sides. On input, it blocks clearly harmful requests that are not attacks per se but still violate your policy. On output, it catches harmful completions before they return to the user. Remember that input filtering raises false refusals more, so tune your thresholds to your tolerance for over-blocking.

Step 3: Add Code Shield for Code Output

If your application generates code, add Code Shield to the output path. It inspects generated snippets for insecure patterns at inference time. This is most valuable for coding assistants and agent workflows that may execute what the model writes.

Step 4: Set a Strong System Prompt

Pair the classifiers with a clear system prompt that defines allowed and disallowed behavior for your specific use case. Meta found that strong system prompts can reduce unnecessary refusals while keeping genuinely harmful requests blocked, so this is both a safety and a usability lever.

Step 5: Monitor and Red-Team

No stack is finished at deployment. Log flagged requests, watch for drift in false-refusal and miss rates, and run an ongoing red-team process to probe for bypasses. The independent findings later in this guide show why continuous testing matters: attackers adapt, and a fixed set of guardrails will eventually be probed.

Known Limitations and Independent Findings

Meta is unusually direct about what these tools do not do, and independent researchers have published evasion results. Treat the cards below as the honest counterweight to the capability story above.

⚠Weak on S5, S8, and S13

The Llama Guard 4 model card (April 2025) notes the classifier may lack current knowledge for Defamation (S5), Intellectual Property (S8), and Elections (S13). These categories depend on facts that change over time, so do not rely on Llama Guard alone for them.

⚠Susceptible to injection itself

Because Llama Guard is a language model, Meta notes it is itself susceptible to prompt injection. The documented mitigation is to layer Prompt Guard 2 in front of it so injection attempts are caught before they reach the classifier.

⚠About a third of attacks still get through

In a February 2026 test, MindStudio reported that Llama Guard blocked 66.2% of attack prompts, meaning roughly one in three still got through. This is one firm's test result on its prompt set, not a universal pass rate, but it underscores that no classifier is airtight.

⚠Input filtering raises false refusals

Meta documents that input filtering increases the false-refusal rate more than output filtering does. Over-block too aggressively and you frustrate legitimate users, so the input vs output decision is a real tradeoff to tune, not a default to accept.

Separately, in February 2026, security firm Trail of Bits demonstrated prompt-injection techniques that bypassed AI browser assistants by disguising attacker instructions as fake security mechanisms, system instructions, or ordinary user requests. The finding was about the broader pattern of injecting AI assistants, and it reinforces the same lesson: input-screening guardrails reduce risk but determined attackers keep finding new framings.

Fact-checked against vendor documentation and independent research, June 2026. Model versions move quickly. Llama Guard 4 and Prompt Guard 2 are the April 2025 releases and current as of mid-2026. Verify the latest tooling at llama.com before deploying.

Troubleshooting and Common Questions

Common Questions

Do I need all three tools?+

No. The right mix depends on your application. A general chat assistant benefits most from Prompt Guard plus Llama Guard. Code Shield matters specifically when your model generates code that may be executed or shipped. Start with the layers that match your actual risk surface.

My guardrails block too many legitimate requests+

This is the false-refusal problem. Meta notes input filtering raises the false-refusal rate more than output filtering does. Try shifting some checks to the output side, tune your thresholds, and refine the system prompt so it states clearly what is allowed for your use case.

Can an attacker bypass Llama Guard?+

Yes, no classifier is airtight. In a February 2026 test, MindStudio reported Llama Guard blocked 66.2% of attack prompts. Because Llama Guard is itself a language model, Meta notes it can be susceptible to prompt injection, which is why pairing it with Prompt Guard 2 and ongoing red-teaming is recommended.

Where do I get Llama Guard 4?+

Llama Guard 4 is published on Hugging Face as Llama-Guard-4-12B under the meta-llama organization. You can run it locally via the Transformers library or vLLM, call Meta's hosted Moderations API, or use the Azure model catalog.

What is the difference between Llama Guard and Prompt Guard?+

Llama Guard classifies content as safe or unsafe against a 14-category hazard taxonomy and runs on both input and output. Prompt Guard classifies an incoming prompt as benign, prompt injection, or jailbreak, and runs on input only. They solve different problems and are designed to be used together.

Securing an open-weight Llama deployment is an exercise in layering. Prompt Guard 2 screens incoming requests for attacks, Llama Guard classifies content on both sides against the MLCommons hazard taxonomy, Code Shield catches insecure generated code, and a strong system prompt ties the policy together. Each tool is a reference implementation you control, which is the point of open weights: the safety posture is yours to define.

What the documentation and the independent findings make clear is that none of these layers is a guarantee. Meta flags weak spots on Defamation, Intellectual Property, and Elections, and notes that the classifiers can themselves be targeted. Independent tests in early 2026 showed real attacks getting through. The mature approach is to combine the layers, tune for your false-refusal tolerance, and treat monitoring and red-teaming as permanent parts of the system rather than a launch-day checkbox.

Video Resources

▶

Llama Guard Content Moderation Tutorial

YouTube Search

▶

Prompt Guard and Prompt Injection Detection

YouTube Search

▶

Llama Protections Guardrails Guide

YouTube Search

Gallery

Contacts

Mastering Llama Safety and Guardrails: Llama Guard, Prompt Guard, and Code Shield (2026)

The Llama Protections Stack at a Glance

Llama Guard: The Content Classifier

What It Classifies Against

Input and Output Moderation

Versions and Sizes

How to Deploy It

Prompt Guard: The Injection Detector

Where It Sits

Versions and Sizes

Code Shield: The Insecure-Code Filter

Where It Sits

How the Three Tools Compare

Plan Your Layers Before You Build

Build the Stack: A Layered Approach

Step 1: Add Prompt Guard 2 at the Front

Step 2: Add Llama Guard on Input and Output

Step 3: Add Code Shield for Code Output

Step 4: Set a Strong System Prompt

Step 5: Monitor and Red-Team

Known Limitations and Independent Findings

Troubleshooting and Common Questions

Services

Learn

Company