Counterfactual Fairness Assessment

Description

Counterfactual Fairness Assessment evaluates whether a model's predictions would remain unchanged if an individual's protected attributes (race, gender, age) were different, whilst keeping all other causally legitimate factors constant. The technique requires constructing a causal graph that maps relationships between variables, then using do-calculus or structural causal models to simulate counterfactual scenarios. For example, it asks: 'Would this loan application still be approved if the applicant were a different race, holding constant their actual qualifications and economic circumstances?' This individual-level fairness criterion helps identify when decisions depend improperly on protected characteristics.

Example Use Cases

Fairness

Evaluating a hiring algorithm by testing whether qualified candidates would receive the same evaluation scores if their gender were different, whilst controlling for actual skills, experience, and education, revealing whether gender bias affects recruitment decisions.

Assessing a criminal sentencing model by examining whether defendants with identical criminal histories and case circumstances would receive the same sentence recommendations regardless of their race, identifying potential discriminatory patterns in judicial AI systems.

Limitations

Requires explicit specification of causal relationships between variables, which involves subjective assumptions about what constitutes legitimate versus illegitimate causal pathways.
May be mathematically impossible to satisfy simultaneously with other fairness criteria (like statistical parity), forcing practitioners to choose between competing fairness definitions.
Implementation complexity is high, requiring sophisticated causal inference techniques and structural causal models that are difficult to construct and validate.
Depends heavily on the quality and completeness of the causal graph, which may be incorrect or missing important confounding variables.

Resources

Counterfactual Fairness

Research Paper•Matt J. Kusner et al.•Dec 4, 2017

fairlearn/fairlearn

Software Package

Counterfactual Fairness in Text Classification through Robustness

Research Paper•Sahaj Garg et al.•Sep 27, 2018

Related Techniques

Name	Description	Assurance Goals
Saliency Maps	Saliency maps are visual explanations for image classification models that highlight which pixels in an image most strongly influence the model's prediction. Computed by calculating gradients of the model's output with respect to input pixels, saliency maps produce heatmaps where brighter regions indicate pixels that, when changed, would most significantly affect the prediction. This technique helps users understand which parts of an image the model is 'looking at' when making decisions.	Explainability Fairness
Preferential Sampling	A preprocessing fairness technique developed by Kamiran and Calders that addresses dataset imbalances by re-sampling training data with preference for underrepresented groups to achieve discrimination-free classification. This method modifies the training distribution by prioritising borderline objects (instances near decision boundaries) from underrepresented groups for duplication whilst potentially removing instances from overrepresented groups. Unlike relabelling approaches, preferential sampling maintains original class labels whilst creating a more balanced dataset that prevents models from learning biased patterns due to skewed group representation.	Fairness Reliability Transparency
Jackknife Resampling	Jackknife resampling (also called leave-one-out resampling) assesses model stability and uncertainty by systematically removing one data point at a time and retraining the model on the remaining data. Unlike bootstrapping which samples with replacement, jackknife creates n different models by excluding each of the n data points once. This systematic approach reveals how individual points influence results, provides robust estimates of prediction variance, and identifies unusually influential observations that may be outliers or leverage points affecting model reliability.	Reliability Transparency Fairness
Temperature Scaling	Temperature scaling adjusts a model's confidence by applying a single parameter (temperature) to its predictions. When a model is too confident in its wrong answers, temperature scaling can fix this by making the predictions more realistic. It works by dividing the model's outputs by the temperature value before converting them to probabilities. Higher temperatures make the model less confident, whilst lower temperatures increase confidence. The technique maintains the model's accuracy whilst ensuring that when it says it's 90% confident, it's actually right about 90% of the time.	Reliability Transparency Fairness
Disparate Impact Remover	Disparate Impact Remover is a preprocessing technique that transforms feature values in a dataset to reduce statistical dependence between features and protected attributes (like race or gender). The method modifies non-protected features through mathematical transformations that preserve the utility of the data whilst reducing correlations that could lead to discriminatory outcomes. This approach specifically targets the '80% rule' disparate impact threshold by adjusting feature distributions to ensure more equitable treatment across demographic groups in downstream model predictions.	Fairness Transparency Reliability
Attention Visualisation in Transformers	Attention Visualisation in Transformers analyses the multi-head self-attention mechanisms that enable transformers to process sequences by attending to different positions simultaneously. The technique visualises attention weights as heatmaps showing how strongly each token attends to every other token across different heads and layers. By examining these attention patterns, practitioners can understand how models like BERT, GPT, and T5 build contextual representations, identify which tokens influence predictions most strongly, and detect potential biases in how the model processes different types of input. This provides insights into positional encoding effects, head specialisation patterns, and the evolution of attention from local to global context across layers.	Explainability Fairness Transparency