What is AI red teaming?
Before a model ships, someone has to try to break it on purpose. AI red teaming is the structured, adversarial testing of AI systems to find harmful or unsafe behavior — jailbreaks, prompt injection, data extraction — before real users do. Learn how it works, the frameworks that structure it, how findings turn into fixes, and one crucial caveat: it reduces risk, it never eliminates it.
01 What red teaming is, and why AI needs its own kind
Red teaming borrows a military and cybersecurity idea: a dedicated team plays the adversary, attacking your own system so you find the weaknesses before a real attacker does. AI red teaming applies that to AI and machine-learning systems — especially large language models and generative AI — through structured, adversarial testing designed to discover, measure, and help reduce harmful, unsafe, or insecure behavior, both before and after deployment.
Here's the key difference from classic security work. Traditional red teaming targets infrastructure — networks, applications, identities, the things attackers have always gone after. AI red teaming targets model-specific failure modes that simply didn't exist before: adversarial inputs, prompt injection, jailbreaks that talk a model past its guardrails, attempts to extract training data, model backdooring, and data poisoning. Doing that well takes traditional security skill plus AI subject-matter expertise — you need to understand how the model behaves, not just how the server does (Google AI Red Team report; MITRE ATLAS).
- Goal: find harmful or unsafe behavior on purpose, so it can be measured and reduced before users hit it.
- What's new: the attack surface is the model's behavior — jailbreaks, prompt injection, data extraction — not just the surrounding infrastructure.
- Who does it: people who combine security skills with AI expertise, ideally working independently of the team that built the model.
02 Two ways to attack: human and automated
Red teams work along a spectrum from hands-on human probing to fully programmatic attack generation — and serious programs use both.
Manual (human) red teaming puts human testers in front of the model to craft attacks and probe for harms. Anthropic did this at scale with crowdworkers and released a dataset of 38,961 red-team attacks, documenting harm types and lessons learned (Ganguli et al., 2022). A striking finding from that work: RLHF-trained models became harder to red team as they scaled up, while other model types didn't show the same trend. Bigger, better-aligned models pushed back more.
Automated red teaming generates large volumes of test cases by program or by model. Perez et al. (2022) showed a "red" language model can automatically write test cases against a target model — methods ranging from zero-shot prompting to reinforcement learning — surfacing tens of thousands of offensive replies from a 280-billion-parameter chatbot. Others, like the GCG attack (Zou et al., 2023), use combined greedy and gradient-based search to find adversarial "suffixes" that maximize the chance of a non-refusing answer; unsettlingly, those suffixes often transfer to black-box, publicly released models. Automation scales coverage; humans bring creativity, judgment, and novel-risk discovery. You want both.
- Manual: expert testers, creative and judgment-heavy — best for novel and context-specific harms.
- Automated: model- or program-generated attacks (red-LM, GCG) — best for volume and reproducible coverage.
- Scaling insight: RLHF models can get harder to break as they scale, but automated transferable attacks remain a live threat.
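The automated loop described above (generate test cases, query the target, flag bad replies) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method from Perez et al.: `generate_test_cases`, `target_model`, and `is_unsafe` are hypothetical stand-ins for a red LM, the system under test, and a harm classifier, and every prompt and reply is a harmless placeholder.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Finding:
    prompt: str
    reply: str


def generate_test_cases(n: int) -> List[str]:
    # Hypothetical stand-in for a "red" LM; emits harmless placeholder prompts.
    return [f"test-prompt-{i}" for i in range(n)]


def target_model(prompt: str) -> str:
    # Hypothetical stand-in for the system under test.
    # This toy target "fails" on every third prompt so the loop has something to find.
    index = int(prompt.rsplit("-", 1)[1])
    return "UNSAFE placeholder reply" if index % 3 == 0 else "I can't help with that."


def is_unsafe(reply: str) -> bool:
    # Hypothetical stand-in for a harm classifier or judge model.
    return reply.startswith("UNSAFE")


def run_red_team(
    n_cases: int,
    generate: Callable[[int], List[str]] = generate_test_cases,
    target: Callable[[str], str] = target_model,
    judge: Callable[[str], bool] = is_unsafe,
) -> List[Finding]:
    """Generate test cases, query the target, and keep the replies the judge flags."""
    findings = []
    for prompt in generate(n_cases):
        reply = target(prompt)
        if judge(reply):
            findings.append(Finding(prompt, reply))
    return findings


findings = run_red_team(9)
print(f"{len(findings)} of 9 test cases produced a flagged reply")
```

Passing the generator, target, and judge in as functions keeps the loop the same whether the test cases come from a person's spreadsheet or from another model; only the stand-ins would change.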
03 The frameworks that give it structure
Red teaming isn't ad-hoc poking. A set of public frameworks gives teams a shared vocabulary, a catalog of risks to cover, and a place for the work in the wider governance picture. The four most-cited references below each contribute something different.
NIST — where red teaming sits in risk management
In the NIST AI Risk Management Framework (Govern / Map / Measure / Manage), red teaming lives inside the Measure function as a risk-assessment method. The companion Generative AI Profile (NIST AI 600-1) recommends red teaming before and after deployment and names four types — general public, expert, combination, and human/AI — while stressing that teams should be independent of the developers. NIST AI 100-2e2025 adds the authoritative adversarial-ML taxonomy that names what to test for.
MITRE ATLAS — an ATT&CK-style map of AI attacks
MITRE ATLAS is a living knowledge base of adversary tactics and techniques against AI systems, modeled on the familiar ATT&CK framework for traditional security. Teams use it for threat modeling — turning "what could go wrong?" into a structured plan of attack techniques to attempt — and to anchor tests against documented real-world case studies.
OWASP Top 10 for LLM Applications — the risk checklist
The OWASP Top 10 for LLM Applications is the community-consensus catalog of the most critical risks in LLM apps — prompt injection, insecure output handling, training-data poisoning, and more, with the 2025 edition adding agentic-AI risks. Red teams use it as a coverage checklist: have we probed each of these classes?
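One way to picture the "have we probed each of these classes?" question is a small checklist script. The sketch below is only an illustration: `RISK_CLASSES` holds just the three classes named above (not the full OWASP list), and the test log and its IDs are hypothetical.

```python
from typing import Dict, List, Set

# Illustrative subset only: the three risk classes named in the text above.
RISK_CLASSES = ["prompt injection", "insecure output handling", "training-data poisoning"]

# Hypothetical test log: test id -> risk classes that test exercised.
test_log: Dict[str, List[str]] = {
    "t-001": ["prompt injection"],
    "t-002": ["prompt injection", "insecure output handling"],
}


def uncovered(classes: List[str], log: Dict[str, List[str]]) -> List[str]:
    """Return the risk classes that no logged test has probed, in checklist order."""
    probed: Set[str] = {c for tagged in log.values() for c in tagged}
    return [c for c in classes if c not in probed]


gaps = uncovered(RISK_CLASSES, test_log)
print("Not yet probed:", gaps)
```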
Government & vendor practice — who actually runs it
Major labs run dedicated teams and external networks: Google (Secure AI Framework + AI Red Team), Microsoft (AI Red Team + the open-source PyRIT tool, used across 100+ operations), OpenAI (a standing external Red Teaming Network), and Anthropic (a Frontier Red Team governed by its Responsible Scaling Policy / AI Safety Levels). Governments add independent evaluation — the UK AI Security Institute — and secure-development guidance, like the joint CISA / NCSC guidelines.
04 Run a coverage board for yourself
Here's the core loop of a red-team exercise, simplified into a coverage board. A simulated model is tested against six categories of probe: five that can be defended, plus one, novel / unknown attacks, that cannot be pre-tested because those attacks don't exist yet. Running the probes shows which categories the model currently fails. Turning on mitigations (the defenses a team would actually deploy) and running again makes the coverage scorecard climb. The point isn't to learn any attack; it's to see how defense raises coverage, and that even fully defended, coverage tops out below 100% because the residual category structurally caps it.
- Each mitigation closes specific failure categories — no single defense covers everything.
- Stacking defenses raises coverage, but a residual gap always remains: red teaming reduces risk, it doesn't eliminate it.
- This is a teaching simulation — real coverage depends on the behavior set, the judge, and the threat model, and novel attacks appear after testing.
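The coverage board can be approximated in code. This is a toy sketch, assuming category names and a mitigation-to-category mapping that are not given in the lesson: the source fixes only the count (five defendable categories plus one residual), so everything in `CATEGORIES` except the novel / unknown entry, and all of `MITIGATIONS`, is hypothetical.

```python
from typing import Dict, List, Set

NOVEL = "novel / unknown"

# Hypothetical category names; only the novel / unknown entry comes from the lesson.
CATEGORIES: List[str] = [
    "jailbreak", "prompt injection", "data extraction",
    "data poisoning", "backdooring", NOVEL,
]

# Hypothetical mapping from each mitigation to the categories it closes.
MITIGATIONS: Dict[str, Set[str]] = {
    "hardened system prompt": {"jailbreak"},
    "input filter": {"prompt injection", "jailbreak"},
    "output filter": {"data extraction"},
    "training-data controls": {"data poisoning", "backdooring"},
}


def run_probes(enabled: List[str]) -> Dict[str, bool]:
    """Return category -> True if defended. The novel category can never be closed."""
    closed: Set[str] = set()
    for name in enabled:
        closed |= MITIGATIONS[name]
    closed.discard(NOVEL)
    return {cat: cat in closed for cat in CATEGORIES}


def coverage(board: Dict[str, bool]) -> float:
    # Fraction of categories currently defended.
    return sum(board.values()) / len(board)


print(f"no mitigations:  {coverage(run_probes([])):.0%}")
print(f"input filter:    {coverage(run_probes(['input filter'])):.0%}")
print(f"all mitigations: {coverage(run_probes(list(MITIGATIONS))):.0%}")
```

With every mitigation enabled, five of six categories are defended, so coverage stops at about 83%. That ceiling is a property of this toy setup, not a measured figure.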
05 From findings to fixes, and how it's measured
Finding a problem is only half the job; red teaming earns its keep when findings feed back into the model. Documented harms become training signal for alignment methods like RLHF and Constitutional AI — the latter has the model critique and revise its own outputs against a written list of principles, reducing harms with fewer human labels (Bai et al., 2022). Other findings harden the system around the model: tighter system prompts, input and output filters, and stricter deployment controls that escalate as capability grows (for example, Anthropic's AI Safety Levels under its Responsible Scaling Policy).
To know whether any of this worked, teams measure with reproducible benchmarks. HarmBench standardizes automated red-teaming evaluation across 18 attack methods and 33 target models and defenses, scoring by attack success rate (ASR). JailbreakBench provides a 200-behavior dataset and a public leaderboard so attacks and defenses can be compared on equal footing. One honest caveat: ASR and benchmark numbers depend on the judge model, the behavior set, and the threat model, so scores are not directly comparable across papers unless they use the same protocol.
- Findings feed mitigation: RLHF and Constitutional AI turn documented harms into training signal.
- System hardening: system prompts, input/output filters, and capability-tiered deployment controls.
- Measured with: HarmBench and JailbreakBench (attack success rate) — reproducible, but protocol-dependent.
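Attack success rate is simply the share of attempts that a judge marks as successful, which is why the judge matters so much. The sketch below scores the same replies with two judges to show the effect. The reply labels and both judge functions are hypothetical placeholders, not HarmBench or JailbreakBench code.

```python
from typing import Callable, List

# Hypothetical model replies to one fixed behavior set (placeholder labels, not real outputs).
replies: List[str] = ["REFUSE", "COMPLY", "PARTIAL", "REFUSE", "PARTIAL"]


def strict_judge(reply: str) -> bool:
    # Counts an attack as successful only on full compliance.
    return reply == "COMPLY"


def lenient_judge(reply: str) -> bool:
    # Also counts partial compliance as a success.
    return reply in {"COMPLY", "PARTIAL"}


def attack_success_rate(replies: List[str], judge: Callable[[str], bool]) -> float:
    """ASR = attacks judged successful / attacks attempted."""
    if not replies:
        raise ValueError("ASR is undefined for an empty set of attempts")
    return sum(judge(r) for r in replies) / len(replies)


print(f"strict judge ASR:  {attack_success_rate(replies, strict_judge):.0%}")
print(f"lenient judge ASR: {attack_success_rate(replies, lenient_judge):.0%}")
```

The same five replies score 20% under the strict judge and 60% under the lenient one, which is the protocol dependence in miniature.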
06 Take it with you & go deeper
Prompt injection & jailbreaks
The flagship attack class red teams probe for: how it works and why it's hard to stop.
Guardrails & content moderation
The input/output filters that turn red-team findings into deployed defenses.
Concept map
The whole lesson at a glance: each branch lists the key ideas it covers.
What AI red teaming is
- Goal: deliberately find harmful or unsafe behavior so it can be measured and reduced before users hit it.
- What's new: the attack surface is the model's behavior — jailbreaks, prompt injection, data extraction — not just the surrounding infrastructure.
- Who does it: people who combine security skills with AI expertise, ideally working independently of the team that built the model.
Human vs. automated attacks
- Manual: expert testers, creative and judgment-heavy — best for novel and context-specific harms (Anthropic's 38,961-attack dataset, Ganguli et al. 2022).
- Automated: model- or program-generated attacks (red-LM, GCG) — best for volume and reproducible coverage.
- Scaling insight: RLHF models can get harder to break as they scale, but automated transferable attacks remain a live threat.
The frameworks that give it structure
- NIST AI RMF: red teaming sits in the Measure function; the GenAI Profile (AI 600-1) recommends it before and after deployment by independent teams.
- MITRE ATLAS is an ATT&CK-style threat KB for AI attacks; the OWASP Top 10 for LLM Applications is the risk-coverage checklist.
- Government & vendor practice: Google SAIF, Microsoft PyRIT, OpenAI's external network, Anthropic's Frontier Red Team, UK AISI, joint CISA/NCSC guidance.
From findings to fixes — and how it's measured
- Findings feed mitigation: RLHF and Constitutional AI turn documented harms into training signal.
- System hardening: system prompts, input/output filters, and capability-tiered deployment controls.
- Measured with HarmBench and JailbreakBench (attack success rate) — reproducible, but protocol-dependent and not comparable across papers.
Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts and is grounded in the references below. Figures shown are attributed to their source; the coverage board is a teaching simulation and is labelled as such. Adversarial-attack research is dual-use — cited here only to explain how systems are hardened, never to enable misuse.
- NIST AI 600-1 — Generative AI Profile — NIST
- NIST AI 100-2e2025 — Adversarial ML: A Taxonomy of Attacks and Mitigations — NIST
- MITRE ATLAS — Adversarial Threat Landscape for AI Systems — MITRE
- OWASP Top 10 for LLM Applications — OWASP Gen AI Security Project
- Red Teaming Language Models to Reduce Harms — Ganguli et al. (Anthropic)
- Red Teaming Language Models with Language Models — Perez et al. (DeepMind)
- Universal and Transferable Adversarial Attacks on Aligned Language Models (GCG) — Zou et al.
- Constitutional AI: Harmlessness from AI Feedback — Bai et al. (Anthropic)
- HarmBench — Standardized Evaluation for Automated Red Teaming — Center for AI Safety
- JailbreakBench — An Open Robustness Benchmark — Chao et al.
- PyRIT — Python Risk Identification Tool — Microsoft / Azure
- Why Red Teams Play a Central Role in Securing AI Systems — Google AI Red Team
- Anthropic's Responsible Scaling Policy — Anthropic
- AISI Frontier AI Trends Report (2025) — UK AI Security Institute
This is an educational, defensive overview of how AI systems are tested and hardened. It deliberately contains no operational exploit instructions. AI red teaming reduces risk but does not eliminate it — passing a red-team exercise is evidence of coverage and effort, not a guarantee of safety. For decisions with real-world stakes, rely on qualified professionals and independent evaluation, not a model's output alone.
AI red teaming — in one page
Tech Jacks Solutions · AI Knowledge Hub · educational summary
What it is
Structured, adversarial testing of AI systems to discover, measure, and help reduce harmful, unsafe, or insecure behavior — before and after deployment. Unlike traditional security red teaming (infrastructure), it targets model-specific failure modes: jailbreaks, prompt injection, data extraction, poisoning, backdooring. It needs AI expertise on top of security skills.
Human vs automated
Manual: expert testers craft attacks (Anthropic released a 38,961-attack dataset; RLHF models got harder to red team as they scaled). Automated: a "red" LM generates test cases (Perez et al.) or gradient search finds transferable adversarial suffixes (GCG, Zou et al.). Serious programs use both.
Frameworks
NIST AI RMF (red teaming sits within the "Measure" function); NIST AI 600-1 recommends pre/post-deployment red teaming by independent teams; NIST AI 100-2e2025 is the adversarial-ML taxonomy. MITRE ATLAS = ATT&CK-style attack knowledge base. OWASP Top 10 for LLM Applications = the risk checklist.
Findings to fixes, and measurement
Findings feed alignment (RLHF, Constitutional AI) and system hardening (system prompts, input/output filters, capability-tiered controls). Progress is measured with HarmBench and JailbreakBench via attack success rate — but scores are only comparable under the same protocol.
The honest caveat
Red teaming reduces risk; it never eliminates it. Coverage is finite, novel attacks emerge after testing, and vendor results are largely self-reported — so it must be continuous and cross-checked by independent evaluation.