Guardrails & content moderation
A capable model is not a safe product. Guardrails are the checks that sit around a model at runtime — inspecting what goes in and what comes out, and blocking or rewriting anything that breaks policy. Learn the input/output split, rule-based vs. classifier-based filtering, and watch a request flow through a guardrail pipeline right here on the page.
01What a guardrail actually is
The AI Governance Charter — establish ownership, scope, and accountability for AI.
Get the charter Browse all templatesYour purchase helps keep our hubs free to read.
A guardrail is a control that constrains an AI application's behavior at runtime — it checks, filters, transforms, or blocks the text going into a model and the text coming back out, measuring each against rules you define. NVIDIA's NeMo Guardrails frames these as "programmable rails" added around a conversational system, not baked into the model itself. That distinction matters: the model is the engine, and guardrails are the safety equipment bolted on around it.
Guardrails are not the same thing as training the model to be safe. Safety can be applied at training time — Anthropic's Constitutional AI, for example, encodes normative principles into how the model behaves — and at runtime, through the input/output checks this lesson is about. Vendor and standards guidance treat these as complementary layers, not substitutes: a well-aligned model still needs runtime checks, because no single layer catches everything.
- Runtime, not retraining: guardrails act while the app is running — on each request and response — so you can change policy without touching the model.
- Around the model: NeMo Guardrails calls them programmable rails placed around a conversational LLM system.
- One layer of many: runtime guardrails sit alongside training-time alignment, human review, and policy — layered, because each catches different failures.
02Two places to check: the way in and the way out
Guardrails fall into two families depending on where in the request they act. Llama Guard, Meta's safeguard model, names this split directly: it runs "prompt classification" on the way in and "response classification" on the way out. Get this mental model and most of the field falls into place.
Input guardrails
Inspect the user's prompt — and any retrieved or document content — before it reaches the model. Typical jobs: detecting prompt-injection or jailbreak attempts, stripping or refusing personal data (PII), and enforcing topic or policy limits. Azure AI Content Safety's Prompt Shields are an input guardrail aimed specifically at jailbreak and indirect-injection attacks.
Output guardrails
Inspect the model's generated response before it is returned or passed to another system. Typical jobs: scoring for toxicity or policy violations, validating format or schema, and checking whether the answer is grounded. This layer is the defense against OWASP's insecure output handling — treating model text as trusted and passing it on unchecked.
- Input guardrails protect the model from bad or malicious input — including indirect injection hidden in retrieved documents.
- Output guardrails protect everyone downstream from bad model output — the answer, the user, and any system that consumes it.
- Real systems run both: an input-only filter can be bypassed by content the model itself produces, which is why output and retrieval checks also matter.
03Two ways to decide: rules vs. classifiers
Whichever side of the model a guardrail sits on, it has to make a decision — allow, block, or rewrite. There are two broad ways to make that call, and mature systems layer both.
A useful detail about classifiers: their scores are likelihoods, not verdicts. Perspective API states plainly that its TOXICITY score is a probability between 0 and 1, and you choose the threshold at which to flag content. Llama Guard works differently again — it is an instruction-tuned model that takes a safety taxonomy as part of its prompt, so you can adapt the categories to your own policy rather than retraining. Either way, the lesson is the same: a classifier gives you a number, and someone still has to decide what number means "block".
- Rule-based = explicit and auditable, but brittle. Classifier-based = flexible, but probabilistic.
- Classifier scores are probabilities; the threshold is a human choice with real false-positive / false-negative trade-offs.
- Llama Guard supplies the taxonomy in the prompt, so its categories can be customised to your policy without retraining.
04See it work: a guardrail pipeline
Here is the whole idea in one place. A request flows left to right: through input guardrails (PII detection, prompt-injection filter, topic/policy check), into the model, then through output guardrails (moderation, format/schema validation, groundedness check). Pick an example request, then toggle each guardrail on or off and watch what gets through. This is a teaching model of the concept — the decisions are illustrative, not a real classifier.
- Turn a layer off and watch unsafe content slip through — each guardrail only catches the thing it is built to catch.
- Block vs. redact: some failures stop the request entirely; others (like PII) can be rewritten so the request continues safely.
- This is why guidance from OWASP and the vendors recommends layering: no single rail is complete on its own.
05Check your understanding
06Take it with you & go deeper
AI Security
Where guardrails fit in the bigger defensive picture — threats, controls, and the OWASP LLM risks.
Read →AI Governance basics
The policies and frameworks — NIST AI RMF, ISO 42001 — that content moderation operationalizes.
Read →Prompt injection & jailbreaks
The attacks input guardrails are built to stop — direct and indirect injection, up close.
Coming soonAI red teaming
Stress-testing a system's guardrails on purpose — how teams probe for the gaps before attackers do.
Coming soon⊕Concept map
The whole lesson at a glance — expand each branch to see the key ideas it covers.
What a guardrail actually is
- A control that constrains an AI app's behavior at runtime — checking, filtering, transforming, or blocking inputs and outputs against defined policy.
- NeMo Guardrails frames these as "programmable rails" added around a conversational system, not baked into the model.
- Runtime guardrails complement training-time alignment (e.g. Anthropic's Constitutional AI), human review, and policy — layers, not substitutes.
Input vs. output guardrails
- Input guardrails inspect the prompt (and retrieved content) before it reaches the model — jailbreak/injection detection, PII stripping, topic limits (e.g. Azure Prompt Shields).
- Output guardrails inspect the generated response before it is returned — toxicity, policy, grounding, and format checks; the defense against OWASP insecure output handling.
- Llama Guard performs both as "prompt classification" and "response classification"; real systems run both because input-only filters can be bypassed.
Rule-based vs. classifier-based
- Rule-based = explicit patterns, allow/deny lists, regexes, programmable flows (Colang dialog rails) — precise and auditable, but brittle.
- Classifier-based = ML models scoring content for harm (OpenAI Moderation, Llama Guard, Perspective API, Azure Content Safety) — flexible, but probabilistic.
- Classifier scores are likelihoods, not verdicts — Perspective API returns a 0–1 probability and the human chooses the block threshold; mature systems layer both approaches.
What guardrails enforce
- They mitigate OWASP LLM01 (prompt injection) and LLM05 (insecure output handling) — the threats they are built to catch.
- They operationalize higher-level policy: NIST AI 600-1 MANAGE controls, the AI RMF functions, Google SAIF, and vendor usage policies define what input/output filters enforce.
→Related lessons
Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts and is grounded in the references below; the pipeline simulator is an illustrative teaching model, labeled as such. Tools and model versions change — confirm current details with the vendor before relying on specifics.
- Overview of NVIDIA NeMo Guardrails — NVIDIA (programmable rails; five rail categories)
- Llama Guard: LLM-based Input-Output Safeguard (arXiv:2312.06674) — Inan et al., Meta
- Validators — Guardrails AI documentation — Guardrails AI
- Moderation — OpenAI API guide — OpenAI
- What is Azure AI Content Safety? — Microsoft (incl. Prompt Shields)
- About the API Score — Perspective API — Jigsaw / Google
- OWASP Top 10 for LLM Applications — OWASP GenAI Security Project
- NIST AI 600-1: Generative AI Profile — NIST
- Claude's Constitution — Anthropic (training-time safety layer)
Guardrails & content moderation — in 8 minutes
Tech Jacks Solutions · AI Knowledge Hub · educational summary
What a guardrail is
A runtime control that checks, filters, transforms, or blocks a model's inputs and outputs against defined policies. NeMo Guardrails calls these "programmable rails" added around the system. It is one layer alongside training-time alignment and human review — not a substitute for them.
Input vs. output guardrails
Input guardrails inspect the prompt (and retrieved content) before the model: injection/jailbreak detection, PII handling, topic limits. Output guardrails inspect the response before it's returned: toxicity/policy scoring, format/schema validation, groundedness. Llama Guard names this split: prompt classification vs. response classification.
Rules vs. classifiers
Rule-based = explicit patterns, allow/deny lists, dialog flows — precise but brittle. Classifier-based = an ML model scores content for harm; the score is a probability, and you choose the threshold. Mature systems layer both. Llama Guard supplies its taxonomy in the prompt, so categories are customizable without retraining.
The pipeline & why it's layered
A request flows through input guardrails → the model → output guardrails. Some failures block the request; others (like PII) are redacted so it continues. No single rail is complete — OWASP and vendor guidance recommend layering input, output, classifier, and human-in-the-loop.