Permutation Tests

Description

Permutation tests assess the statistical significance of an observed result (such as model accuracy, feature importance, or a group difference) by comparing it with what would occur by chance alone. The technique randomly shuffles labels or data many times (typically thousands), recalculating the metric of interest each time to build an empirical null distribution. The p-value is the proportion of shuffled results that are at least as extreme as the observed one. If the observed result falls in the extreme tail of this distribution (typically beyond the 95th or 99th percentile), this is evidence that the relationship is unlikely to be due to chance alone, and the conclusion does not rely on parametric assumptions about the data distribution.
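The procedure described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function name `permutation_test`, the choice of a difference in means as the metric, and the sample values are all assumptions made for the example.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # break any link between a value and its group
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1

    # The +1 counts the observed arrangement itself, so p is never exactly 0.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical data: two small samples with clearly different centres.
a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4, 5.7, 5.9]
b = [4.1, 4.4, 3.9, 4.6, 4.2, 4.0, 4.5, 4.3]
observed, p = permutation_test(a, b)
print(f"observed difference = {observed:.3f}, p = {p:.4f}")
```

Any other metric can be substituted for the difference in means, provided it is recomputed in exactly the same way on every shuffled arrangement.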

Example Use Cases

Reliability

Validating feature importance in medical diagnosis models by permuting each feature 10,000 times to ensure that identified risk factors (e.g., blood pressure, cholesterol) have statistically significant predictive power beyond random chance.
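One way to sketch this use case is to shuffle a single feature column while keeping the model fixed, then ask how often the shuffled data scores at least as well as the real data. Everything named here is hypothetical: the toy `model`, the synthetic two-feature data, and the helper `feature_permutation_p_value` stand in for a real diagnostic model and real risk factors.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def feature_permutation_p_value(model, rows, labels, feature,
                                n_permutations=1000, seed=0):
    """Permutation p-value for one feature of an already-fitted model."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    column = [r[feature] for r in rows]

    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(column)  # destroy the feature's link to the labels
        shuffled_rows = [r[:feature] + [v] + r[feature + 1:]
                         for r, v in zip(rows, column)]
        if accuracy(model, shuffled_rows, labels) >= baseline:
            at_least_as_good += 1

    return baseline, (at_least_as_good + 1) / (n_permutations + 1)


# Hypothetical setup: feature 0 drives the label, feature 1 is pure noise.
data_rng = random.Random(1)
rows = [[data_rng.random(), data_rng.random()] for _ in range(200)]
labels = [int(r[0] > 0.5) for r in rows]


def model(row):
    """Toy stand-in for a fitted classifier; it only looks at feature 0."""
    return int(row[0] > 0.5)


_, p_signal = feature_permutation_p_value(model, rows, labels, feature=0)
_, p_noise = feature_permutation_p_value(model, rows, labels, feature=1)
print(f"feature 0: p = {p_signal:.4f}; feature 1: p = {p_noise:.4f}")
```

Because the model is not retrained, this sketch tests whether the fitted model relies on the feature, which is not quite the same question as whether the feature is predictive in general.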

Verifying that a model's claimed 95% accuracy on test data is genuinely better than random guessing by permuting labels 5,000 times and confirming the actual accuracy falls beyond the 99th percentile of the null distribution.
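A simplified version of this check shuffles the true labels against a fixed set of predictions and records the accuracy each time. The data below is invented for illustration (100 balanced test cases with 95 correct predictions), and `label_permutation_test` is a hypothetical helper. Variants that retrain the model on each shuffled label set are more expensive and are not shown here.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def label_permutation_test(y_true, y_pred, n_permutations=5000, seed=0):
    """Compare observed accuracy with accuracies under shuffled labels."""
    rng = random.Random(seed)
    observed = accuracy(y_true, y_pred)
    shuffled = list(y_true)

    null_scores = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # labels no longer correspond to predictions
        null_scores.append(accuracy(shuffled, y_pred))

    at_least_as_good = sum(score >= observed for score in null_scores)
    p_value = (at_least_as_good + 1) / (n_permutations + 1)
    return observed, null_scores, p_value


# Hypothetical test set: 100 balanced cases, 5 of them misclassified.
y_true = [0] * 50 + [1] * 50
y_pred = list(y_true)
for i in (0, 1, 2, 50, 51):
    y_pred[i] = 1 - y_pred[i]

observed, null_scores, p = label_permutation_test(y_true, y_pred)
print(f"accuracy = {observed:.2f}, "
      f"best shuffled accuracy = {max(null_scores):.2f}, p = {p:.4f}")
```

Shuffling preserves the class balance of the labels, so the null distribution reflects chance performance for this particular test set rather than an assumed 50% baseline.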

Explainability

Testing whether observed differences in loan approval rates between demographic groups are statistically significant by permuting group labels and calculating the approval rate difference distribution under the null hypothesis of no discrimination.

Limitations

  • Computationally expensive as it requires thousands of model evaluations or metric calculations, scaling poorly with dataset size and model complexity.
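  • Requires many permutations (typically 5,000-10,000) to achieve reliable p-values for strict significance thresholds like p < 0.01, because the smallest p-value obtainable from B random permutations is 1/(B + 1) and estimates close to the threshold are noisy when B is small.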
  • Assumes exchangeability of observations under the null hypothesis, which may be violated in time series or hierarchical data structures.
  • Although individual permutations are independent and can be run in parallel, metrics that require full model retraining on every permutation remain costly, limiting scalability for complex machine learning pipelines.
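For the exchangeability limitation above, one commonly used workaround for clustered data is to shuffle labels only within each cluster, so that observations are swapped only with others that are plausibly exchangeable. The sketch below shows that restricted shuffle. The helper `permute_within_blocks` and the clinic identifiers are hypothetical, and whether within-block shuffling is appropriate depends on the study design.

```python
import random
from collections import defaultdict


def permute_within_blocks(labels, blocks, rng):
    """Shuffle labels only among observations that share a block id."""
    indices_by_block = defaultdict(list)
    for i, block in enumerate(blocks):
        indices_by_block[block].append(i)

    permuted = list(labels)
    for indices in indices_by_block.values():
        values = [labels[i] for i in indices]
        rng.shuffle(values)
        for i, value in zip(indices, values):
            permuted[i] = value
    return permuted


# Hypothetical clustered data: patients grouped by clinic.
labels = ["treated", "control", "treated", "control", "treated", "control"]
blocks = ["clinic_a", "clinic_a", "clinic_b", "clinic_b", "clinic_b", "clinic_b"]

permuted = permute_within_blocks(labels, blocks, random.Random(0))
print(permuted)
```

Replacing the plain shuffle in a permutation test with a call like this keeps each cluster's label counts fixed in every permuted dataset.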

Resources

Research Papers

Permutation Tests for Classification
Polina Golland, Sayan Mukherjee, and Dmitry Panchenko (Jan 1, 2003)
The Exchangeability Assumption for Permutation Tests of Multiple Regression Models: Implications for Statistics and Data Science Educators
Johanna Hardin et al. (Jun 11, 2024)

Permutation tests are a powerful and flexible approach to inference via resampling. As computational methods become more ubiquitous in the statistics curriculum, use of permutation tests has become more tractable. At the heart of the permutation approach is the exchangeability assumption, which determines the appropriate null sampling distribution. We explore the exchangeability assumption in the context of permutation tests for multiple linear regression models, including settings where the assumption is not tenable. Various permutation schemes for the multiple linear regression setting have been proposed and assessed in the literature. As has been demonstrated previously, in most settings, the choice of how to permute a multiple linear regression model does not materially change inferential conclusions with respect to Type I errors. However, some violations (e.g., when clustering is not appropriately accounted for) lead to issues with Type I error rates. Regardless, we believe that understanding (1) exchangeability in the multiple linear regression setting and also (2) how it relates to the null hypothesis of interest is valuable. We close with pedagogical recommendations for instructors who want to bring multiple linear regression permutation inference into their classroom as a way to deepen student understanding of resampling-based inference.

Tutorials

How to use Permutation Tests | Towards Data Science
Michael Berk (Sep 21, 2021)
Permutation test in R | Towards Data Science
Serafim Petrov (Mar 15, 2021)

Documentation

scikit-learn permutation_importance
Scikit-learn Developers (Jan 1, 2007)
