Integrated Gradients
Description
Integrated Gradients is an attribution technique that explains a model's prediction by quantifying the contribution of each input feature. It works by accumulating gradients along a straight-line path from a user-defined baseline input (e.g., a black image or an all-zero vector) to the actual input. This path integral ensures that the attributions satisfy fundamental axioms such as completeness (the attributions sum to the difference between the model's prediction at the input and at the baseline) and sensitivity (a feature that differs from the baseline and changes the prediction receives a non-zero attribution). The output is a set of importance scores, often visualised as heatmaps, indicating which parts of the input were most influential for the model's decision.
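The sketch below illustrates the computation in PyTorch, approximating the path integral with a Riemann sum over a fixed number of interpolation steps. The names `model`, `x`, `baseline` and `target_class` are illustrative assumptions (a differentiable classifier, an unbatched input tensor, a same-shaped baseline, and a class index), not part of any particular library.

```python
import torch

def integrated_gradients(model, x, baseline, target_class, steps=50):
    """Approximate Integrated Gradients for one input via a Riemann sum
    along the straight-line path from `baseline` to `x`."""
    total_grads = torch.zeros_like(x)
    # Interpolation coefficients alpha in (0, 1] (right-endpoint rule)
    for alpha in torch.linspace(0.0, 1.0, steps + 1)[1:]:
        # Point on the straight-line path between baseline and input
        point = baseline + alpha * (x - baseline)
        point.requires_grad_(True)
        # Gradient of the target-class score w.r.t. the interpolated point
        score = model(point.unsqueeze(0))[0, target_class]
        grad, = torch.autograd.grad(score, point)
        total_grads += grad
    # Average gradient scaled by (input - baseline); by completeness the
    # result sums approximately to F(x) - F(baseline) for the target class
    return (x - baseline) * (total_grads / steps)
```

Increasing `steps` tightens the Riemann-sum approximation at the cost of more gradient evaluations.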
Example Use Cases
Explainability
Analysing a medical image classification model to understand which specific pixels or regions in an X-ray image contribute most to a diagnosis of pneumonia, ensuring the model focuses on relevant pathological features rather than artifacts.
Explaining the sentiment prediction of a natural language processing model by highlighting which words or phrases in a review most strongly influenced its classification as positive or negative, revealing the model's interpretative focus.
Limitations
- Requires a carefully chosen and meaningful baseline input; an inappropriate baseline can lead to misleading or uninformative attributions (see the baseline-comparison sketch after this list).
- The model must be differentiable, which limits its direct application to models with non-differentiable components or discrete inputs without workarounds.
- Computationally more expensive than simple gradient-based methods, as it requires gradient computations at many interpolated points (typically tens to hundreds) along the integration path.
- While satisfying completeness, the attributions can sometimes be visually noisy or difficult for humans to interpret intuitively, especially for complex inputs.
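Relating to the first and last limitations above, the hedged sketch below shows how one might compare candidate baselines and sanity-check completeness in practice. It reuses the `integrated_gradients` function sketched in the Description; the tensor shape, `model`, and target index are illustrative assumptions.

```python
import torch

x = torch.rand(3, 224, 224)                  # illustrative image input
baselines = {
    "black": torch.zeros_like(x),            # all-zero ("black image") baseline
    "grey": torch.full_like(x, 0.5),         # constant mid-grey alternative
}

for name, baseline in baselines.items():
    attr = integrated_gradients(model, x, baseline, target_class=0, steps=200)
    with torch.no_grad():
        # Completeness check: attributions should sum to roughly
        # F(x) - F(baseline) for the target class; a large gap hints at
        # too few integration steps or a numerically awkward baseline
        gap = (model(x.unsqueeze(0))[0, 0]
               - model(baseline.unsqueeze(0))[0, 0]
               - attr.sum())
    print(f"{name}: completeness gap = {gap.item():.4f}")
```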