Governance lesson

Track 04 · Governance Intermediate ~8 min

Guardrails & content moderation

A capable model is not a safe product. Guardrails are the checks that sit around a model at runtime — inspecting what goes in and what comes out, and blocking or rewriting anything that breaks policy. Learn the input/output split, rule-based vs. classifier-based filtering, and watch a request flow through a guardrail pipeline right here on the page.

Module progress

01What a guardrail actually is

A guardrail is a control that constrains an AI application's behavior at runtime — it checks, filters, transforms, or blocks the text going into a model and the text coming back out, measuring each against rules you define. NVIDIA's NeMo Guardrails frames these as "programmable rails" added around a conversational system, not baked into the model itself. That distinction matters: the model is the engine, and guardrails are the safety equipment bolted on around it.

Guardrails are not the same thing as training the model to be safe. Safety can be applied at training time — Anthropic's Constitutional AI, for example, encodes normative principles into how the model behaves — and at runtime, through the input/output checks this lesson is about. Vendor and standards guidance treat these as complementary layers, not substitutes: a well-aligned model still needs runtime checks, because no single layer catches everything.

Runtime, not retraining: guardrails act while the app is running — on each request and response — so you can change policy without touching the model.
Around the model: NeMo Guardrails calls them programmable rails placed around a conversational LLM system.
One layer of many: runtime guardrails sit alongside training-time alignment, human review, and policy — layered, because each catches different failures.

02Two places to check: the way in and the way out

Guardrails fall into two families depending on where in the request they act. Llama Guard, Meta's safeguard model, names this split directly: it runs "prompt classification" on the way in and "response classification" on the way out. Get this mental model and most of the field falls into place.

Before the model

Input guardrails

Inspect the user's prompt — and any retrieved or document content — before it reaches the model. Typical jobs: detecting prompt-injection or jailbreak attempts, stripping or refusing personal data (PII), and enforcing topic or policy limits. Azure AI Content Safety's Prompt Shields are an input guardrail aimed specifically at jailbreak and indirect-injection attacks.

Catches: "ignore your instructions and…", a customer's credit-card number pasted into a prompt, an off-policy request.

After the model

Output guardrails

Inspect the model's generated response before it is returned or passed to another system. Typical jobs: scoring for toxicity or policy violations, validating format or schema, and checking whether the answer is grounded. This layer is the defense against OWASP's insecure output handling — treating model text as trusted and passing it on unchecked.

Catches: toxic or disallowed content, a reply that leaks data, malformed JSON, output later run as code or a database query.

Input guardrails protect the model from bad or malicious input — including indirect injection hidden in retrieved documents.
Output guardrails protect everyone downstream from bad model output — the answer, the user, and any system that consumes it.
Real systems run both: an input-only filter can be bypassed by content the model itself produces, which is why output and retrieval checks also matter.

03Two ways to decide: rules vs. classifiers

Whichever side of the model a guardrail sits on, it has to make a decision — allow, block, or rewrite. There are two broad ways to make that call, and mature systems layer both.

Rule-based

Classifier-based

How it decides

Explicit patterns: regexes, allow/deny lists, keyword filters, programmable dialog flows.

A machine-learning model scores content for harm, then a threshold turns the score into a decision.

Examples

NeMo Guardrails' Colang dialog rails; allow/deny lists in NIST AI 600-1's MANAGE controls.

OpenAI Moderation, Llama Guard, Perspective API, Azure AI Content Safety severity levels.

Strength

Precise, predictable, auditable — does exactly what's written.

Generalises to wording you never anticipated; handles nuance and paraphrase.

Weakness

Brittle — misses anything not explicitly listed; easy to evade with rephrasing.

Probabilistic — has false positives and false negatives; quality varies by language.

A useful detail about classifiers: their scores are likelihoods, not verdicts. Perspective API states plainly that its TOXICITY score is a probability between 0 and 1, and you choose the threshold at which to flag content. Llama Guard works differently again — it is an instruction-tuned model that takes a safety taxonomy as part of its prompt, so you can adapt the categories to your own policy rather than retraining. Either way, the lesson is the same: a classifier gives you a number, and someone still has to decide what number means "block".

Rule-based = explicit and auditable, but brittle. Classifier-based = flexible, but probabilistic.
Classifier scores are probabilities; the threshold is a human choice with real false-positive / false-negative trade-offs.
Llama Guard supplies the taxonomy in the prompt, so its categories can be customised to your policy without retraining.

04See it work: a guardrail pipeline

Here is the whole idea in one place. A request flows left to right: through input guardrails (PII detection, prompt-injection filter, topic/policy check), into the model, then through output guardrails (moderation, format/schema validation, groundedness check). Pick an example request, then toggle each guardrail on or off and watch what gets through. This is a teaching model of the concept — the decisions are illustrative, not a real classifier.

InteractivePick a request, toggle the layers

1 · Choose an example request

2 · Turn guardrail layers on or off

Allowed Pick a request and toggle layers to see what happens.

Turn a layer off and watch unsafe content slip through — each guardrail only catches the thing it is built to catch.
Block vs. redact: some failures stop the request entirely; others (like PII) can be rewritten so the request continues safely.
This is why guidance from OWASP and the vendors recommends layering: no single rail is complete on its own.

05Check your understanding

TJS Quiz

06Take it with you & go deeper

"Guardrails & content moderation" — one-page summary

The whole lesson distilled to a printable cheat-sheet.

▸ Already on the site — go deeper

Live lesson

AI Security

Where guardrails fit in the bigger defensive picture — threats, controls, and the OWASP LLM risks.

Read →

Live lesson

AI Governance basics

The policies and frameworks — NIST AI RMF, ISO 42001 — that content moderation operationalizes.

Read →

▸ Coming next — deeper progression

Coming soon

Prompt injection & jailbreaks

The attacks input guardrails are built to stop — direct and indirect injection, up close.

Coming soon

AI red teaming

Stress-testing a system's guardrails on purpose — how teams probe for the gaps before attackers do.

Coming soon

⊕Concept map

The whole lesson at a glance — expand each branch to see the key ideas it covers.

What a guardrail actually is

A control that constrains an AI app's behavior at runtime — checking, filtering, transforming, or blocking inputs and outputs against defined policy.
NeMo Guardrails frames these as "programmable rails" added around a conversational system, not baked into the model.
Runtime guardrails complement training-time alignment (e.g. Anthropic's Constitutional AI), human review, and policy — layers, not substitutes.

Input vs. output guardrails

Input guardrails inspect the prompt (and retrieved content) before it reaches the model — jailbreak/injection detection, PII stripping, topic limits (e.g. Azure Prompt Shields).
Output guardrails inspect the generated response before it is returned — toxicity, policy, grounding, and format checks; the defense against OWASP insecure output handling.
Llama Guard performs both as "prompt classification" and "response classification"; real systems run both because input-only filters can be bypassed.

Rule-based vs. classifier-based

Rule-based = explicit patterns, allow/deny lists, regexes, programmable flows (Colang dialog rails) — precise and auditable, but brittle.
Classifier-based = ML models scoring content for harm (OpenAI Moderation, Llama Guard, Perspective API, Azure Content Safety) — flexible, but probabilistic.
Classifier scores are likelihoods, not verdicts — Perspective API returns a 0–1 probability and the human chooses the block threshold; mature systems layer both approaches.

What guardrails enforce

They mitigate OWASP LLM01 (prompt injection) and LLM05 (insecure output handling) — the threats they are built to catch.
They operationalize higher-level policy: NIST AI 600-1 MANAGE controls, the AI RMF functions, Google SAIF, and vendor usage policies define what input/output filters enforce.

Sources & further reading

Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts and is grounded in the references below; the pipeline simulator is an illustrative teaching model, labeled as such. Tools and model versions change — confirm current details with the vendor before relying on specifics.

Overview of NVIDIA NeMo Guardrails — NVIDIA (programmable rails; five rail categories)
Llama Guard: LLM-based Input-Output Safeguard (arXiv:2312.06674) — Inan et al., Meta
Validators — Guardrails AI documentation — Guardrails AI
Moderation — OpenAI API guide — OpenAI
What is Azure AI Content Safety? — Microsoft (incl. Prompt Shields)
About the API Score — Perspective API — Jigsaw / Google
OWASP Top 10 for LLM Applications — OWASP GenAI Security Project
NIST AI 600-1: Generative AI Profile — NIST
Claude's Constitution — Anthropic (training-time safety layer)

Guardrails & content moderation — in 8 minutes

Tech Jacks Solutions · AI Knowledge Hub · educational summary

What a guardrail is

A runtime control that checks, filters, transforms, or blocks a model's inputs and outputs against defined policies. NeMo Guardrails calls these "programmable rails" added around the system. It is one layer alongside training-time alignment and human review — not a substitute for them.

Input vs. output guardrails

Input guardrails inspect the prompt (and retrieved content) before the model: injection/jailbreak detection, PII handling, topic limits. Output guardrails inspect the response before it's returned: toxicity/policy scoring, format/schema validation, groundedness. Llama Guard names this split: prompt classification vs. response classification.

Rules vs. classifiers

Rule-based = explicit patterns, allow/deny lists, dialog flows — precise but brittle. Classifier-based = an ML model scores content for harm; the score is a probability, and you choose the threshold. Mature systems layer both. Llama Guard supplies its taxonomy in the prompt, so categories are customizable without retraining.

The pipeline & why it's layered

A request flows through input guardrails → the model → output guardrails. Some failures block the request; others (like PII) are redacted so it continues. No single rail is complete — OWASP and vendor guidance recommend layering input, output, classifier, and human-in-the-loop.

Gallery

Contacts

Guardrails & content moderation

01What a guardrail actually is