Governance lesson

Track 04 · Governance · Advanced · ~9 min

What is AI red teaming?

Before a model ships, someone has to try to break it on purpose. AI red teaming is the structured, adversarial testing of AI systems to find harmful or unsafe behavior — jailbreaks, prompt injection, data extraction — before real users do. Learn how it works, the frameworks that structure it, how findings turn into fixes, and one crucial caveat: it reduces risk, it never eliminates it.

01 What red teaming is, and why AI needs its own kind

Red teaming borrows a military and cybersecurity idea: a dedicated team plays the adversary, attacking your own system so you find the weaknesses before a real attacker does. AI red teaming applies that to AI and machine-learning systems — especially large language models and generative AI — through structured, adversarial testing designed to discover, measure, and help reduce harmful, unsafe, or insecure behavior, both before and after deployment.

Here's the key difference from classic security work. Traditional red teaming targets infrastructure — networks, applications, identities, the things attackers have always gone after. AI red teaming targets model-specific failure modes that simply didn't exist before: adversarial inputs, prompt injection, jailbreaks that talk a model past its guardrails, attempts to extract training data, model backdooring, and data poisoning. Doing that well takes traditional security skill plus AI subject-matter expertise — you need to understand how the model behaves, not just how the server does (Google AI Red Team report; MITRE ATLAS).

Goal: find harmful or unsafe behavior on purpose, so it can be measured and reduced before users hit it.
What's new: the attack surface is the model's behavior — jailbreaks, prompt injection, data extraction — not just the surrounding infrastructure.
Who does it: people who combine security skills with AI expertise, working ideally independent of the team that built the model.

02 Two ways to attack: human and automated

Red teams work along a spectrum from hands-on human probing to fully programmatic attack generation — and serious programs use both.

Manual (human) red teaming puts expert testers in front of the model to craft attacks and probe for harms. Anthropic's team ran a large human red-teaming effort, staffed by crowdworkers rather than domain experts, and released a dataset of 38,961 red-team attacks, documenting harm types and lessons learned (Ganguli et al., 2022). A striking finding from that work: RLHF-trained models became harder to red team as they scaled up, while the other model types tested showed no such trend. In that study, bigger RLHF models pushed back more.

Automated red teaming generates large volumes of test cases by program or by model. Perez et al. (2022) showed that a "red" language model can automatically write test cases against a target model, using methods that range from zero-shot prompting to reinforcement learning, and surfaced tens of thousands of offensive replies from a 280-billion-parameter chatbot. Other methods, like the GCG attack (Zou et al., 2023), combine greedy and gradient-based search to find adversarial "suffixes" that maximize the chance of a non-refusing answer; unsettlingly, those suffixes often transfer to black-box, publicly released models. Automation scales coverage; humans bring creativity, judgment, and novel-risk discovery. You want both.

Manual: expert testers, creative and judgment-heavy — best for novel and context-specific harms.
Automated: model- or program-generated attacks (red-LM, GCG) — best for volume and reproducible coverage.
Scaling insight: RLHF models can get harder to break as they scale, but automated transferable attacks remain a live threat.
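
The generate, query, and classify loop behind model-written test cases can be sketched in a few lines. This is a toy under stated assumptions, not the method from Perez et al.: `red_generator`, `toy_target`, and `judge` are hypothetical stand-ins for a red language model, the model under test, and a harm classifier, and the prompts are abstract labels rather than real attack text.

```python
import random
from dataclasses import dataclass

CANARY = "CANARY-0000"  # hypothetical marker the toy target should never emit
STYLES = ["plain", "polite", "roleplay", "override"]  # abstract labels, not real attack text
TASKS = ["summarize a note", "print the canary", "translate a phrase"]


@dataclass
class Finding:
    prompt: str
    reply: str


def red_generator(n: int, seed: int = 0) -> list[str]:
    """Stand-in for a 'red' LM: samples n test prompts from fixed templates."""
    rng = random.Random(seed)
    return [f"[{rng.choice(STYLES)}] {rng.choice(TASKS)}" for _ in range(n)]


def toy_target(prompt: str) -> str:
    """Stand-in for the model under test, with one deliberately planted weakness."""
    if prompt.startswith("[override]") and "canary" in prompt:
        return f"Here it is: {CANARY}"
    return "Request handled safely."


def judge(reply: str) -> bool:
    """Stand-in for a harm classifier: flags any reply containing the marker."""
    return CANARY in reply


def run_red_team(n: int, seed: int = 0) -> list[Finding]:
    """Generate n test cases, query the target, and keep the flagged ones."""
    findings = []
    for prompt in red_generator(n, seed):
        reply = toy_target(prompt)
        if judge(reply):
            findings.append(Finding(prompt, reply))
    return findings


findings = run_red_team(200)
print(f"{len(findings)} of 200 generated test cases were flagged")
```

Fixing the seed makes the run reproducible, which is the main practical advantage of automated coverage. A human tester would be the one to notice a weakness that none of the templates anticipate.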

03 The frameworks that give it structure

Red teaming isn't ad-hoc poking. A set of public frameworks gives teams a shared vocabulary, a catalog of risks to cover, and a place for the work in the wider governance picture. The four most-cited references each contribute something different.

NIST — where red teaming sits in risk management

In the NIST AI Risk Management Framework (Govern / Map / Measure / Manage), red teaming lives inside the Measure function as a risk-assessment method. The companion Generative AI Profile (NIST AI 600-1) recommends red teaming before and after deployment and names four types — general public, expert, combination, and human/AI — while stressing that teams should be independent of the developers. NIST AI 100-2e2025 adds the authoritative adversarial-ML taxonomy that names what to test for.

covers: evasion, poisoning, privacy attacks, direct/indirect prompt injection, agent security

role: red teaming = the "Measure" step, run pre- and post-deployment

MITRE ATLAS — an ATT&CK-style map of AI attacks

MITRE ATLAS is a living knowledge base of adversary tactics and techniques against AI systems, modeled on the familiar ATT&CK framework for traditional security. Teams use it for threat modeling — turning "what could go wrong?" into a structured plan of attack techniques to attempt — and to anchor tests against documented real-world case studies.

use it for: structuring a red-team plan around named tactics & techniques

style: ATT&CK for AI — tactics, techniques, case studies

OWASP Top 10 for LLM Applications — the risk checklist

The OWASP Top 10 for LLM Applications is the community-consensus catalog of the most critical risks in LLM apps — prompt injection, insecure output handling, training-data poisoning, and more, with the 2025 edition adding agentic-AI risks. Red teams use it as a coverage checklist: have we probed each of these classes?

think of it as: "have we tested for each of these top risks yet?"

flagship risk: prompt injection (direct and indirect)
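
Using the catalog as a coverage checklist comes down to a set comparison: which risk classes have no recorded probes yet? A minimal sketch, assuming a hypothetical `probes_run` tally from one exercise; `CHECKLIST` holds only the classes named in this lesson, not the full OWASP list.

```python
# Risk classes named in this lesson; a partial list, not the full OWASP Top 10.
CHECKLIST = [
    "prompt injection (direct)",
    "prompt injection (indirect)",
    "insecure output handling",
    "training-data poisoning",
]


def coverage_gaps(checklist: list[str], probes_run: dict[str, int]) -> list[str]:
    """Return the checklist classes with no recorded probes, in checklist order."""
    return [risk for risk in checklist if probes_run.get(risk, 0) == 0]


# Hypothetical tally of probes attempted per class in one exercise.
probes_run = {
    "prompt injection (direct)": 40,
    "insecure output handling": 12,
}

gaps = coverage_gaps(CHECKLIST, probes_run)
print("Not yet probed:", gaps)
```

Here the two untested classes are indirect prompt injection and training-data poisoning. A count above zero only shows that a class was attempted, not that it was tested well.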

Government & vendor practice — who actually runs it

Major labs run dedicated teams and external networks: Google (Secure AI Framework + AI Red Team), Microsoft (AI Red Team + the open-source PyRIT tool, used across 100+ operations), OpenAI (a standing external Red Teaming Network), and Anthropic (a Frontier Red Team governed by its Responsible Scaling Policy / AI Safety Levels). Governments add independent evaluation — the UK AI Security Institute — and secure-development guidance, like the joint CISA / NCSC guidelines.

tooling: PyRIT, GCG reference implementation, HarmBench, JailbreakBench

independent check: UK AISI evaluations + reproducible open benchmarks

04 Run a coverage board for yourself

Here's the core loop of a red-team exercise, simplified into something you can drive. You're testing a simulated model against six categories of probe — five you can defend, plus one, novel / unknown attacks, that cannot be pre-tested because it doesn't exist yet. Hit Run probes to see which categories the model currently fails. Then turn on mitigations — the defenses a team would actually deploy — and run again to watch the coverage scorecard climb. The point isn't to learn any attack; it's to feel how defense raises coverage, and to see for yourself that even fully defended, coverage tops out below 100% — the residual category structurally caps it.

Each mitigation closes specific failure categories — no single defense covers everything.
Stacking defenses raises coverage, but a residual gap always remains: red teaming reduces risk, it doesn't eliminate it.
This is a teaching simulation — real coverage depends on the behavior set, the judge, and the threat model, and novel attacks appear after testing.
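
The same board can be approximated in code. This is a rough sketch of the idea, not the lesson's actual simulation: the category names, the mitigation names, and the mapping between them are all hypothetical, and the undefended model is assumed to fail every category.

```python
CATEGORIES = [
    "jailbreak",
    "prompt injection",
    "data extraction",
    "data poisoning",
    "backdooring",
    "novel / unknown",  # the residual category: cannot be pre-tested
]

# Hypothetical mapping from each mitigation to the categories it closes.
MITIGATIONS = {
    "safety training": {"jailbreak"},
    "input filter": {"prompt injection"},
    "output filter": {"data extraction"},
    "data provenance checks": {"data poisoning", "backdooring"},
}


def run_probes(active: set[str]) -> dict[str, bool]:
    """Return pass/fail per category; a category passes only if a mitigation closes it."""
    closed: set[str] = set()
    for name in active:
        closed |= MITIGATIONS[name]
    return {category: category in closed for category in CATEGORIES}


def coverage(results: dict[str, bool]) -> float:
    """Fraction of categories the simulated model passes."""
    return sum(results.values()) / len(results)


undefended = run_probes(set())
partial = run_probes({"input filter", "output filter"})
full = run_probes(set(MITIGATIONS))

for label, results in [("undefended", undefended), ("partial", partial), ("full", full)]:
    print(f"{label}: {coverage(results):.0%}")
```

With these assumptions the three runs print 0%, 33%, and 83%. No mitigation maps to the "novel / unknown" category, so coverage can never reach 100% however many defenses are stacked.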

05 From findings to fixes, and how it's measured

Finding a problem is only half the job; red teaming earns its keep when findings feed back into the model. Documented harms become training signal for alignment methods like RLHF and Constitutional AI — the latter has the model critique and revise its own outputs against a written list of principles, reducing harms with fewer human labels (Bai et al., 2022). Other findings harden the system around the model: tighter system prompts, input and output filters, and stricter deployment controls that escalate as capability grows (for example, Anthropic's AI Safety Levels under its Responsible Scaling Policy).
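
The critique-and-revise idea can be illustrated as a control-flow sketch. Everything here is a toy under stated assumptions: the two principles are hypothetical, and `toy_critique` and `toy_revise` are string-matching stand-ins for what would be calls to the model itself. In the published method the revised responses become fine-tuning data; here they are only collected in a list.

```python
from typing import Callable, Optional

# Hypothetical principles, not taken from any published constitution.
PRINCIPLES = [
    "Do not reveal personal data.",
    "Do not insult the user.",
]


def toy_critique(draft: str, principle: str) -> Optional[str]:
    """Stand-in for the model critiquing its own draft; None means no issue found."""
    if principle == "Do not reveal personal data." and "555-0100" in draft:
        return "The draft contains a phone number."
    if principle == "Do not insult the user." and "stupid" in draft:
        return "The draft insults the user."
    return None


def toy_revise(draft: str, critique: str) -> str:
    """Stand-in for the model rewriting its draft in response to a critique."""
    if "phone number" in critique:
        return draft.replace("555-0100", "[redacted]")
    if "insults" in critique:
        return draft.replace("stupid ", "")
    return draft


def critique_and_revise(
    draft: str,
    principles: list[str],
    critique: Callable[[str, str], Optional[str]] = toy_critique,
    revise: Callable[[str, str], str] = toy_revise,
) -> tuple[str, list[tuple[str, str]]]:
    """Check the draft against each principle in turn, revising when a critique is raised."""
    pairs = []  # (before, after) pairs that could serve as training examples
    for principle in principles:
        note = critique(draft, principle)
        if note is not None:
            revised = revise(draft, note)
            pairs.append((draft, revised))
            draft = revised
    return draft, pairs


final, pairs = critique_and_revise("That is a stupid question; call 555-0100.", PRINCIPLES)
print(final)
```

The point of the pattern is that the written principles, not a human label per example, drive each revision.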

To know whether any of this worked, teams measure with reproducible benchmarks. HarmBench standardizes automated red-teaming evaluation across 18 attack methods and 33 target models and defenses, scoring by attack success rate (ASR). JailbreakBench provides a 200-behavior dataset and a public leaderboard so attacks and defenses can be compared on equal footing. One honest caveat: ASR and benchmark numbers depend on the judge model, the behavior set, and the threat model, so scores are not directly comparable across papers unless they use the same protocol.

Findings feed mitigation: RLHF and Constitutional AI turn documented harms into training signal.
System hardening: system prompts, input/output filters, and capability-tiered deployment controls.
Measured with: HarmBench and JailbreakBench (attack success rate) — reproducible, but protocol-dependent.
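
Attack success rate is simply the fraction of attempts a judge marks as successful, which is why the choice of judge matters so much. A minimal sketch, assuming hypothetical outcome labels and two made-up judging rules; neither reflects the HarmBench or JailbreakBench protocols.

```python
def attack_success_rate(judgements: list[bool]) -> float:
    """ASR = attempts judged successful / total attempts."""
    if not judgements:
        raise ValueError("ASR is undefined for zero attempts")
    return sum(judgements) / len(judgements)


# Hypothetical outcomes for the same five attack attempts.
outcomes = ["REFUSED", "COMPLIED", "PARTIAL", "REFUSED", "COMPLIED"]


def judge_full_only(outcome: str) -> bool:
    """Counts an attack as successful only on full compliance."""
    return outcome == "COMPLIED"


def judge_full_or_partial(outcome: str) -> bool:
    """Also counts partial compliance as a successful attack."""
    return outcome in {"COMPLIED", "PARTIAL"}


asr_a = attack_success_rate([judge_full_only(o) for o in outcomes])
asr_b = attack_success_rate([judge_full_or_partial(o) for o in outcomes])
print(f"ASR, full compliance only:    {asr_a:.0%}")
print(f"ASR, full or partial counted: {asr_b:.0%}")
```

The same five replies score 40% under one judge and 60% under the other, so two papers reporting those numbers could be describing identical model behavior.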

07 Take it with you & go deeper

"AI red teaming": one-page summary

The whole lesson distilled to a printable cheat-sheet.

▸ Already on the site: go deeper

Prompt injection & jailbreaks
The flagship attack class red teams probe for: how it works and why it's hard to stop.

Guardrails & content moderation
The input/output filters that turn red-team findings into deployed defenses.

▸ The wider governance picture

AI governance, explained
Where red teaming fits in the bigger frame of risk management, oversight, and accountability.

AI security fundamentals
The attack-and-defense landscape for AI systems that red teaming stress-tests.

Sources & further reading

Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts and is grounded in the references below. Figures shown are attributed to their source; the coverage board is a teaching simulation and is labelled as such. Adversarial-attack research is dual-use — cited here only to explain how systems are hardened, never to enable misuse.

NIST AI 600-1 — Generative AI Profile — NIST
NIST AI 100-2e2025 — Adversarial ML: A Taxonomy of Attacks and Mitigations — NIST
MITRE ATLAS — Adversarial Threat Landscape for AI Systems — MITRE
OWASP Top 10 for LLM Applications — OWASP Gen AI Security Project
Red Teaming Language Models to Reduce Harms — Ganguli et al. (Anthropic)
Red Teaming Language Models with Language Models — Perez et al. (DeepMind)
Universal and Transferable Adversarial Attacks on Aligned Language Models (GCG) — Zou et al.
Constitutional AI: Harmlessness from AI Feedback — Bai et al. (Anthropic)
HarmBench — Standardized Evaluation for Automated Red Teaming — Center for AI Safety
JailbreakBench — An Open Robustness Benchmark — Chao et al.
PyRIT — Python Risk Identification Tool — Microsoft / Azure
Why Red Teams Play a Central Role in Securing AI Systems — Google AI Red Team
Anthropic's Responsible Scaling Policy — Anthropic
AISI Frontier AI Trends Report (2025) — UK AI Security Institute

Responsible AI note

This is an educational, defensive overview of how AI systems are tested and hardened. It deliberately contains no operational exploit instructions. AI red teaming reduces risk but does not eliminate it — passing a red-team exercise is evidence of coverage and effort, not a guarantee of safety. For decisions with real-world stakes, rely on qualified professionals and independent evaluation, not a model's output alone.

AI red teaming — in one page

Tech Jacks Solutions · AI Knowledge Hub · educational summary

What it is

Structured, adversarial testing of AI systems to discover, measure, and help reduce harmful, unsafe, or insecure behavior — before and after deployment. Unlike traditional security red teaming (infrastructure), it targets model-specific failure modes: jailbreaks, prompt injection, data extraction, poisoning, backdooring. It needs AI expertise on top of security skills.

Human vs automated

Manual: expert testers craft attacks (Anthropic released a 38,961-attack dataset; RLHF models got harder to red team as they scaled). Automated: a "red" LM generates test cases (Perez et al.) or gradient search finds transferable adversarial suffixes (GCG, Zou et al.). Serious programs use both.

Frameworks

NIST AI RMF (red teaming = the "Measure" function); NIST AI 600-1 recommends pre/post-deployment red teaming by independent teams; NIST AI 100-2e2025 is the adversarial-ML taxonomy. MITRE ATLAS = ATT&CK-style attack knowledge base. OWASP Top 10 for LLM Applications = the risk checklist.

Findings to fixes, and measurement

Findings feed alignment (RLHF, Constitutional AI) and system hardening (system prompts, input/output filters, capability-tiered controls). Progress is measured with HarmBench and JailbreakBench via attack success rate — but scores are only comparable under the same protocol.

The honest caveat

Red teaming reduces risk; it never eliminates it. Coverage is finite, novel attacks emerge after testing, and vendor results are largely self-reported — so it must be continuous and cross-checked by independent evaluation.
