Governance lesson

Track 05 · Governance · Intermediate · ~9 min

Why did the model decide that?

Modern AI can make a decision without telling you why. Explainability and interpretability are the tools we use to ask "why?" of an otherwise opaque model — from highlighting which inputs pushed a prediction, to reverse-engineering the circuits inside the network. Learn how the methods work, what each one can and cannot tell you, and the one trap to watch for: an explanation can be convincing and still be wrong.


01 Explainability vs. interpretability

These two words are used almost interchangeably, and the field has never fully agreed on a single definition — Lipton's "The Mythos of Model Interpretability" documents how researchers mean different, sometimes conflicting, things by them. Still, a useful working distinction has settled in. Interpretability is usually about how far a human can understand the model's mechanism directly — easiest when the model is simple and transparent, like a short decision tree or a linear formula. Explainability (XAI) is about producing a human-understandable account of why a model — even an opaque one — produced a given output, often using methods bolted on after training. Government framing reflects the stakes: NIST's Four Principles of Explainable AI and DARPA's XAI program both push for systems that can justify their outputs so people can trust and manage them appropriately.

Interpretability ≈ you can read the mechanism itself (a transparent model you can follow by hand).
Explainability ≈ you generate an after-the-fact account of why an opaque model decided what it did.
The distinction is a convention, not a standard — don't treat any one definition as universally accepted (Lipton, 2016).

02 Four questions that classify any method

Almost every explainability technique can be placed by asking four questions. Each axis below is described with examples of where common methods land.


Intrinsic vs. post-hoc — when the explanation comes from

Intrinsic interpretability comes from choosing an inherently transparent model: linear or logistic regression, a shallow decision tree, a rule list. Post-hoc methods are applied after training to a fixed, possibly opaque model — LIME, SHAP, Integrated Gradients, Grad-CAM, counterfactuals. The catch: post-hoc methods approximate the model, so they are not guaranteed to be faithful to its real computation.

intrinsic: a logistic-regression credit score you can read coefficient by coefficient

post-hoc: SHAP values explaining a gradient-boosted model after it is trained
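To make "read coefficient by coefficient" concrete, here is a minimal sketch of an intrinsically interpretable scorer. The feature names, coefficients, intercept, and applicant values are all hypothetical and hand-set for illustration; nothing here is fitted to real data.

```python
import math

# Hypothetical, hand-set coefficients for an illustrative credit model.
# Nothing here is fitted to real data.
INTERCEPT = -1.0
COEFFICIENTS = {
    "income_10k": 0.40,       # per $10k of annual income
    "debt_to_income": -3.00,  # per unit of debt-to-income ratio
    "late_payments": -0.50,   # per late payment on file
}


def explain(applicant):
    """Return (approval probability, per-feature contribution to the log-odds)."""
    contributions = {
        name: COEFFICIENTS[name] * applicant[name] for name in COEFFICIENTS
    }
    log_odds = INTERCEPT + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-log_odds))
    return probability, contributions


applicant = {"income_10k": 5.0, "debt_to_income": 0.45, "late_payments": 1}
probability, contributions = explain(applicant)

# Print contributions from most negative to most positive.
for name, value in sorted(contributions.items(), key=lambda kv: kv[1]):
    print(f"{name:>15}: {value:+.2f} log-odds")
print(f"approval probability: {probability:.2f}")
```

Because the model is a weighted sum passed through a sigmoid, the explanation is the model itself: each term is one feature's exact contribution to the log-odds, with no approximation step in between.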

Local vs. global — how much it covers

A local explanation accounts for one prediction — a single SHAP value vector, one LIME surrogate, a saliency map for one image. A global explanation describes the model's overall behaviour across the whole dataset — permutation feature importance, partial dependence, or SP-LIME picking representative instances.

local: "for this applicant, debt-to-income pushed the decision toward deny"

global: "across all applicants, income is the most important feature overall"
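The local and global views can be sketched on a toy linear scorer. The weights and the four-row dataset below are made up, and averaging absolute local attributions is only one simple way to build a global summary, not the definition of one.

```python
# Hypothetical linear scorer and a tiny made-up dataset (illustrative only).
WEIGHTS = {"income": 0.5, "debt_to_income": -2.0}
ROWS = [
    {"income": 4.0, "debt_to_income": 0.50},
    {"income": 6.0, "debt_to_income": 0.20},
    {"income": 5.0, "debt_to_income": 0.35},
    {"income": 9.0, "debt_to_income": 0.15},
]

# Dataset means act as the reference point for "no contribution".
MEANS = {f: sum(r[f] for r in ROWS) / len(ROWS) for f in WEIGHTS}


def local_attribution(row):
    """Contribution of each feature to this row's score, relative to the mean row."""
    return {f: WEIGHTS[f] * (row[f] - MEANS[f]) for f in WEIGHTS}


def global_importance(rows):
    """Mean absolute local attribution per feature across all rows."""
    totals = {f: 0.0 for f in WEIGHTS}
    for row in rows:
        for f, value in local_attribution(row).items():
            totals[f] += abs(value)
    return {f: totals[f] / len(rows) for f in WEIGHTS}


local_view = local_attribution(ROWS[1])
global_view = global_importance(ROWS)
print("local (row 1):", local_view)
print("global:", global_view)
```

In this toy data income is the larger driver globally, yet for the second applicant income sits exactly at the mean and debt-to-income is the only thing moving the score. A global ranking does not tell you what drove any one decision.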

Model-agnostic vs. model-specific — what access it needs

Model-agnostic methods treat the model as a black box, querying only inputs and outputs — LIME, KernelSHAP, permutation importance, counterfactuals. Model-specific methods exploit internal structure such as gradients or activations — Integrated Gradients, Grad-CAM, saliency maps, and TreeSHAP for tree ensembles.

agnostic: permutation importance shuffles a column and watches the score drop

specific: Grad-CAM reads gradients at the final convolutional layer of a CNN
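Permutation importance needs nothing but the ability to call the model, which a short sketch makes visible. The stand-in `model`, the random data, and the helper names are assumptions for illustration. The sketch measures the increase in mean squared error, which is the error-metric equivalent of "watching the score drop".

```python
import random


def model(row):
    """Stand-in black box: only the first two features affect the output."""
    return 3.0 * row[0] - 2.0 * row[1] + 0.0 * row[2]


def mean_squared_error(rows, targets):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(rows, targets, column, rng, repeats=20):
    """Average increase in error after shuffling one column."""
    baseline = mean_squared_error(rows, targets)
    increases = []
    for _ in range(repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value in the chosen column.
        permuted = [
            r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)
        ]
        increases.append(mean_squared_error(permuted, targets) - baseline)
    return sum(increases) / repeats


rng = random.Random(0)
rows = [[rng.random(), rng.random(), rng.random()] for _ in range(200)]
targets = [model(r) for r in rows]  # the toy model matches its targets exactly

importances = [permutation_importance(rows, targets, c, rng) for c in range(3)]
for column, importance in enumerate(importances):
    print(f"feature {column}: error increase {importance:.3f}")
```

The function never inspects weights or gradients, so the same code would work for any callable model. scikit-learn ships a production version of this idea.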

Attribution vs. counterfactual — the shape of the answer

An attribution assigns credit to inputs: "these features mattered, and by this much." A counterfactual (Wachter et al., 2017) instead describes the smallest change to the inputs that would flip the decision, such as "approve this loan if income were $6k higher", without exposing internal logic. Counterfactuals serve understanding, contestability, and recourse, and Wachter et al. frame them in the context of automated decisions under the GDPR.

attribution: SHAP says income contributed +0.3 to the score

counterfactual: "raise income by $6,000 and the loan flips to approve"
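A counterfactual search can be sketched as repeated queries to the model. The decision rule `approve`, its thresholds, and the applicant figures are hypothetical, and the fixed-step search over a single feature is a simplification: Wachter et al. pose the problem as an optimisation over a distance measure, which this sketch does not implement.

```python
def approve(income, debt):
    """Hypothetical decision rule standing in for an opaque model."""
    return 2 * income - 5 * debt >= 100_000


def smallest_income_raise(income, debt, step=1_000, limit=50_000):
    """Search upward in fixed steps for the first income that flips a denial."""
    for raise_by in range(0, limit + step, step):
        if approve(income + raise_by, debt):
            return raise_by
    return None  # no counterfactual found within the search limit


income, debt = 49_000, 2_000
needed = smallest_income_raise(income, debt)
print(f"currently approved: {approve(income, debt)}")
print(f"smallest income raise that flips the decision: {needed}")
```

The search only ever calls `approve`, so it never exposes how the rule works internally. It reports what would have to change, not why the model weighs income the way it does.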

03 The methods that do the explaining

A handful of post-hoc methods do most of the heavy lifting in practice. LIME (Ribeiro et al., 2016) explains one prediction by sampling perturbed versions of the input, weighting them by closeness to the original, and fitting a simple interpretable surrogate (usually a sparse linear model) whose coefficients become the explanation. SHAP (Lundberg & Lee, 2017) assigns each feature a Shapley value from cooperative game theory: its weighted average marginal contribution across all coalitions of the other features. SHAP unifies a whole family of additive feature attribution methods, and within that family Shapley values are the unique solution satisfying local accuracy, missingness, and consistency; TreeSHAP computes them exactly and in polynomial time for tree ensembles.
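The Shapley definition can be written out directly for a small model. This sketch enumerates every coalition, so it is only practical for a handful of features. It also assumes that an "absent" feature is replaced by a fixed baseline value, which is one common convention rather than the only one (the SHAP package typically averages over background data instead). The `predict` function and all numbers are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values by enumerating all coalitions (exponential cost)."""
    n = len(x)

    def value(coalition):
        # Features outside the coalition are replaced by their baseline value.
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        phi.append(total)
    return phi


def predict(features):
    """Hypothetical score with one interaction term (income x debt ratio)."""
    income, debt_ratio, late = features
    return 0.2 * income - 1.0 * debt_ratio - 0.1 * late + 0.05 * income * debt_ratio


x = [5.0, 0.4, 2.0]
baseline = [3.0, 0.3, 0.0]
phi = shapley_values(predict, x, baseline)
print("attributions:", [round(p, 4) for p in phi])
print("sum of attributions:", round(sum(phi), 4))
print("prediction minus baseline:", round(predict(x) - predict(baseline), 4))
```

The last two printed numbers agree, which is the local accuracy property: the attributions add up to the gap between this prediction and the baseline prediction.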

For neural networks, gradient-based methods dominate. Saliency maps (Simonyan et al., 2013) use the gradient of the output with respect to input pixels to highlight influential regions. Integrated Gradients (Sundararajan et al., 2017) integrates gradients along a straight path from a baseline to the actual input, deliberately satisfying two axioms that earlier methods broke: Sensitivity (if the input and the baseline differ in one feature and produce different outputs, that feature gets nonzero attribution) and Implementation Invariance (functionally identical networks get identical attributions). Grad-CAM (Selvaraju et al., 2016/2017) uses gradients flowing into the last convolutional layer to produce a coarse heatmap of where a CNN "looked." Libraries like Captum (PyTorch) and the SHAP package put these in reach of practitioners.
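The straight-path integral behind Integrated Gradients can be approximated with a simple Riemann sum. The sketch below assumes a toy two-input function `f` with a hand-written gradient in place of a real network, and an all-zeros baseline; in practice a library such as Captum computes the gradients automatically.

```python
import math


def f(x):
    """Toy differentiable 'network': a sigmoid over a small nonlinear score."""
    z = 2.0 * x[0] - 1.0 * x[1] + 0.5 * x[0] * x[1]
    return 1.0 / (1.0 + math.exp(-z))


def grad_f(x):
    """Analytic gradient of f with respect to each input."""
    s = f(x)
    ds = s * (1.0 - s)
    return [ds * (2.0 + 0.5 * x[1]), ds * (-1.0 + 0.5 * x[0])]


def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of the straight-path integral."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        gradient = grad_f(point)
        for i in range(len(x)):
            total[i] += gradient[i]
    # Scale the averaged gradient by how far each input moved from the baseline.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


x = [1.0, 2.0]
baseline = [0.0, 0.0]
attributions = integrated_gradients(x, baseline)
print("attributions:", [round(a, 4) for a in attributions])
print("sum of attributions:", round(sum(attributions), 4))
print("f(x) - f(baseline):", round(f(x) - f(baseline), 4))
```

The attributions sum (up to numerical error) to the difference between the output at the input and at the baseline, a property the Integrated Gradients paper calls completeness. The choice of baseline is a modelling decision and changes the attributions.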

LIME — perturb around one input, fit a local surrogate; its coefficients are the explanation (model-agnostic, local).
SHAP — game-theoretic Shapley values; additive, consistent feature attributions; TreeSHAP is exact for trees.
Saliency / Integrated Gradients / Grad-CAM — gradient-based, model-specific attributions for neural nets and CNNs.
Permutation importance — shuffle a feature, measure the score drop; a model-agnostic global view (scikit-learn).
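The LIME recipe summarised above (perturb, weight by closeness, fit a surrogate) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the LIME library: it uses a dense weighted linear fit instead of a sparse one, Gaussian perturbations, an exponential kernel, and a made-up `black_box`. The helper names and parameter values are hypothetical.

```python
import math
import random


def black_box(x):
    """Stand-in opaque model: nonlinear in both inputs."""
    return 1.0 / (1.0 + math.exp(-(3.0 * x[0] - x[1] ** 2)))


def solve(a, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (m[r][n] - sum(m[r][c] * w[c] for c in range(r + 1, n))) / m[r][r]
    return w


def lime_explain(predict, x0, rng, samples=500, scale=0.3, kernel_width=0.5):
    """Fit a proximity-weighted linear surrogate around x0.

    Returns [intercept, coefficient per feature].
    """
    d = len(x0) + 1
    xtwx = [[0.0] * d for _ in range(d)]
    xtwy = [0.0] * d
    for _ in range(samples):
        z = [xi + rng.gauss(0.0, scale) for xi in x0]       # perturbed input
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x0))
        weight = math.exp(-dist2 / kernel_width ** 2)        # closer = heavier
        row = [1.0] + [zi - xi for zi, xi in zip(z, x0)]     # centred on x0
        y = predict(z)
        for i in range(d):
            xtwy[i] += weight * row[i] * y
            for j in range(d):
                xtwx[i][j] += weight * row[i] * row[j]
    return solve(xtwx, xtwy)  # weighted least squares via normal equations


x0 = [0.5, 1.0]
intercept, coef_first, coef_second = lime_explain(black_box, x0, random.Random(0))
print(f"local surrogate: {intercept:.3f} + {coef_first:.3f}*dx0 + {coef_second:.3f}*dx1")
```

The coefficients describe the black box only near `x0`. Changing `scale` or `kernel_width` changes the neighbourhood and can change the explanation, which is one reason a surrogate should be read as an approximation.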

04 See it: which features drove this decision?

Below is a simulated loan-approval model that has denied one applicant. There is no real model here — the contributions are illustrative, hand-set to show how an attribution explanation reads. Each bar shows how much a feature pushed the decision toward approve (right) or deny (left). Switch the method to watch the same prediction get re-weighted: different techniques can emphasise different features and even disagree — which is exactly why a single explanation is evidence, not proof.

[Interactive visualizer, simulated: the applicant requested $28,000 and the simulated model's output is DENY · 0.38. Bars to the left push toward deny, bars to the right push toward approve; bar length is the strength of the contribution (illustrative).]

Post-hoc, not mechanistic. This kind of attribution explains the model's input–output relationship — which inputs the model leaned on for this prediction. It does not reveal the internal computation that produced the answer, and it does not prove real-world causation: the pattern it surfaces may just be a dataset bias the model learned. Reading the actual machinery is the job of mechanistic interpretability, covered next.

05 Opening the box: mechanistic interpretability

Attribution tells you which inputs mattered. Mechanistic interpretability — a program associated with Chris Olah and the Distill Circuits thread — goes further: it tries to reverse-engineer the network's internal computation into human-understandable parts. In this framing, features are directions in activation space that correspond to meaningful concepts, and circuits are the weighted connections between features that implement a computation.

The hard obstacle is superposition (Anthropic, Toy Models of Superposition, 2022): a network represents more features than it has dimensions by packing them into overlapping, non-orthogonal directions. That produces polysemantic neurons — single neurons that fire for several unrelated concepts — so you can't just read one neuron and know what it means. Sparse autoencoders / dictionary learning (Towards Monosemanticity, 2023; Scaling Monosemanticity, 2024) partly undo this by decomposing activations into a much larger set of sparsely active, mostly monosemantic features — and the 2024 work applied it to a production model, Claude 3 Sonnet, extracting millions of interpretable features. More recently, attribution graphs / circuit tracing (Anthropic, 2025) use cross-layer transcoders to map how information flows from input tokens through intermediate features to outputs, revealing multi-step internal reasoning. These are first-party research results and an active, fast-moving area — partial and model-specific, not a finished account of any model.

Features & circuits — meaningful directions in activation space, wired together to implement computations.
Superposition makes neurons polysemantic, which is why per-neuron reading fails; sparse autoencoders recover cleaner features.
Attribution graphs trace internal multi-step reasoning — but findings are partial, model-specific, and self-reported.
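Superposition and polysemantic neurons can be illustrated with a toy layout in the spirit of the Toy Models of Superposition setup: three features packed into two neurons, read back by projecting onto each feature's direction and applying a ReLU. The angles and feature names are arbitrary choices for illustration, and nothing here is learned.

```python
import math

# Three hypothetical features share a 2-neuron space, spaced 120 degrees apart.
ANGLES_DEG = {"A": 30, "B": 150, "C": 270}
DIRECTIONS = {
    name: (math.cos(math.radians(a)), math.sin(math.radians(a)))
    for name, a in ANGLES_DEG.items()
}


def encode(features):
    """Pack feature values into a 2-neuron activation vector."""
    return tuple(
        sum(features.get(name, 0.0) * DIRECTIONS[name][k] for name in DIRECTIONS)
        for k in range(2)
    )


def decode(activation):
    """Read each feature back: project onto its direction, then apply ReLU."""
    return {
        name: max(0.0, activation[0] * d[0] + activation[1] * d[1])
        for name, d in DIRECTIONS.items()
    }


only_a = encode({"A": 1.0})
only_b = encode({"B": 1.0})
print("neuron 1 when only A is active:", round(only_a[1], 3))
print("neuron 1 when only B is active:", round(only_b[1], 3))

recovered_single = decode(only_b)
recovered_pair = decode(encode({"A": 1.0, "B": 1.0}))
print("decode with only B active:", {k: round(v, 3) for k, v in recovered_single.items()})
print("decode with A and B active:", {k: round(v, 3) for k, v in recovered_pair.items()})
```

Neuron 1 responds identically to two unrelated features, so reading that neuron alone is ambiguous. When a single feature is active the directional readout recovers it cleanly, but when two are active at once they interfere. That is why the approach relies on features being sparse, and why methods look for directions rather than individual neurons.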

06 The one trap: a good explanation can still be wrong

This is the most important thing to carry away. NIST's four principles include Explanation Accuracy precisely because an explanation can be plausible yet unfaithful: it can sound convincing while not reflecting what the model actually did. Different attribution methods can disagree on the very same prediction. Saliency maps can be visually compelling and still fail to reveal the spurious correlation a model is really exploiting (Google PAIR's saliency explorable shows exactly this). And attention weights are not a reliable explanation: Jain & Wallace (2019) found that attention is often only weakly correlated with gradient-based importance, and that very different attention patterns can yield effectively the same predictions. That finding is itself contested by later rebuttals, so treat attention-as-explanation as an open debate. The practical rule: treat any single explanation as one piece of evidence, not a verdict, and remember that a feature mattering to the model is not the same as it mattering in the world.

Faithfulness is not guaranteed — post-hoc explanations approximate the model (NIST "Explanation Accuracy" exists for this reason).
Methods disagree — cross-check, and don't trust a lone saliency map to expose a hidden bias.
Attention ≠ explanation (Jain & Wallace, 2019) — and even that claim is debated; present it as contested.
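Disagreement between methods is easy to reproduce on a toy model. The sketch below assumes a two-input model that returns the larger input, an all-zeros baseline, and three simplified attribution routines written for this example. It is a constructed case, not a benchmark.

```python
def model(x):
    """Toy model: the output is whichever input is larger."""
    return max(x[0], x[1])


def gradient_saliency(x, eps=1e-6):
    """Finite-difference gradient of the output with respect to each input."""
    base = model(x)
    result = []
    for i in range(len(x)):
        bumped = list(x)
        bumped[i] += eps
        result.append((model(bumped) - base) / eps)
    return result


def occlusion(x, baseline):
    """Output drop when one input at a time is replaced by its baseline."""
    base = model(x)
    result = []
    for i in range(len(x)):
        occluded = list(x)
        occluded[i] = baseline[i]
        result.append(base - model(occluded))
    return result


def shapley_two_features(x, baseline):
    """Exact Shapley values for two features (average over both orderings)."""
    def v(use_first, use_second):
        return model([
            x[0] if use_first else baseline[0],
            x[1] if use_second else baseline[1],
        ])

    phi_first = 0.5 * (v(True, False) - v(False, False)) + 0.5 * (v(True, True) - v(False, True))
    phi_second = 0.5 * (v(False, True) - v(False, False)) + 0.5 * (v(True, True) - v(True, False))
    return [phi_first, phi_second]


x = [3.0, 2.0]
baseline = [0.0, 0.0]
saliency = gradient_saliency(x)
occluded = occlusion(x, baseline)
shapley = shapley_two_features(x, baseline)
print("gradient saliency:", [round(v, 3) for v in saliency])
print("occlusion:", occluded)
print("Shapley:", shapley)
```

Gradient saliency and occlusion both report that the second input does not matter at all, while the Shapley computation assigns it a third of the credit, because it would carry the output if the first input were absent. All three are defensible answers to slightly different questions, which is why a single explanation is evidence rather than a verdict.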




Sources & further reading

Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts and is grounded in the references below; the feature-attribution visualizer is a simulation with illustrative, hand-set values, labelled as such. Post-hoc explanations approximate a model and can be unfaithful, and mechanistic-interpretability findings are partial, model-specific, and (where noted) reported by the model's own developers — treat a single explanation as evidence, not proof.

"Why Should I Trust You?": Explaining the Predictions of Any Classifier (LIME) — Ribeiro, Singh & Guestrin (KDD 2016)
A Unified Approach to Interpreting Model Predictions (SHAP) — Lundberg & Lee (NeurIPS 2017)
Axiomatic Attribution for Deep Networks (Integrated Gradients) — Sundararajan, Taly & Yan (ICML 2017)
Grad-CAM: Visual Explanations via Gradient-based Localization — Selvaraju et al. (ICCV 2017)
Deep Inside Convolutional Networks: Saliency Maps — Simonyan, Vedaldi & Zisserman (2013)
Counterfactual Explanations without Opening the Black Box — Wachter, Mittelstadt & Russell (2017)
The Mythos of Model Interpretability — Zachary C. Lipton (2016)
Attention is not Explanation — Jain & Wallace (NAACL 2019)
Toy Models of Superposition — Elhage et al., Anthropic (2022)
Scaling Monosemanticity: Features from Claude 3 Sonnet — Templeton et al., Anthropic (2024)
Zoom In: An Introduction to Circuits — Olah et al. (Distill, 2020)
On the Biology of a Large Language Model (attribution graphs) — Anthropic (2025)
NISTIR 8312 — Four Principles of Explainable AI — NIST (2021)
Explainable Artificial Intelligence (XAI) program — DARPA
Captum — Model Interpretability for PyTorch — Meta / PyTorch
Permutation feature importance — scikit-learn
Searching for Unintended Biases With Saliency — Google PAIR

Educational content for general understanding. Explanations of AI behaviour are not guarantees of fairness or correctness; for high-stakes decisions (credit, employment, health, legal), use AI outputs as one input alongside qualified human review and applicable law.

Explainability & interpretability — in 5 minutes

Tech Jacks Solutions · AI Knowledge Hub · educational summary

Two words, one goal

Interpretability ≈ understanding the model's mechanism directly (transparent models). Explainability (XAI) ≈ producing an account of why an opaque model decided something. The distinction is conventional, not standardized (Lipton).

Four classifying axes

Intrinsic vs. post-hoc (when) · local vs. global (scope) · model-agnostic vs. model-specific (access) · attribution vs. counterfactual (shape of the answer).

Workhorse methods

LIME — local surrogate around one input. SHAP — Shapley-value additive attributions. Saliency / Integrated Gradients / Grad-CAM — gradient-based, model-specific. Permutation importance — model-agnostic, global.

Mechanistic interpretability

Reverse-engineers internal features and circuits. Superposition makes neurons polysemantic; sparse autoencoders recover cleaner features; attribution graphs trace internal reasoning (partial, model-specific, often first-party).

The one trap

A plausible explanation can be unfaithful (NIST "Explanation Accuracy"). Methods disagree; attention ≠ explanation (contested); attribution shows the model's input–output relationship, not real-world causation. Treat any one explanation as evidence, not proof.
