Model Extraction Defence Testing

Description

Model extraction defence testing evaluates protections against attackers who attempt to steal model functionality by querying it and training surrogate models. This technique assesses defences like query limiting, output perturbation, watermarking, and fingerprinting by simulating extraction attacks and measuring how much model functionality can be replicated. Testing evaluates both the effectiveness of defences in preventing extraction and their impact on legitimate use cases, ensuring security measures don't excessively degrade user experience.
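The core of such a test can be sketched as a simulated extraction run: query the deployed model, train a surrogate on the returned outputs, and measure how closely the surrogate replicates the victim. The sketch below is a minimal illustration using scikit-learn and synthetic data; the names `victim`, `X_probe`, and the split sizes are assumptions for the example, not part of any specific tool.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the protected model and its data.
X, y = make_classification(n_samples=6000, n_features=20, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.5, random_state=0)
X_probe, X_eval, _, y_eval = train_test_split(X_rest, y_rest, train_size=0.5, random_state=0)

# Victim: the deployed model an attacker can only query through the API.
victim = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Simulated attacker: send probe queries, collect the returned labels,
# and train a surrogate model on them.
stolen_labels = victim.predict(X_probe)
surrogate = LogisticRegression(max_iter=1000).fit(X_probe, stolen_labels)

# Defence-test metric: agreement (fidelity) between surrogate and victim
# on held-out data that appears in neither the probes nor the victim's training set.
agreement = np.mean(surrogate.predict(X_eval) == victim.predict(X_eval))
print(f"surrogate-victim agreement: {agreement:.2%}")
```

A defence is then judged by how far it pushes this agreement down at a given query budget, and by how much it costs legitimate users.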

Example Use Cases

Security

Testing protections for a proprietary fraud detection API to ensure competitors cannot recreate the model's decision boundaries through systematic querying, by simulating extraction attacks using query budgets, active learning strategies, and substitute model training.
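One way to operationalise this use case is an attack simulation that respects an explicit query budget and chooses probes with a simple active-learning heuristic (uncertainty sampling). The sketch below assumes a generic scikit-learn style classifier as the victim and a pool of candidate inputs the attacker can draw from; the function name, budget, and batch size are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def simulate_budgeted_extraction(victim, candidate_pool, budget=500, batch=50, seed=0):
    """Train a surrogate within a fixed query budget using uncertainty sampling."""
    rng = np.random.default_rng(seed)
    remaining = np.ones(len(candidate_pool), dtype=bool)

    # Seed the surrogate with a small random batch of queries.
    idx = rng.choice(len(candidate_pool), size=batch, replace=False)
    remaining[idx] = False
    queried_X = candidate_pool[idx]
    queried_y = victim.predict(queried_X)          # each call spends query budget
    surrogate = LogisticRegression(max_iter=1000).fit(queried_X, queried_y)

    spent = batch
    while spent + batch <= budget and remaining.any():
        # Query the candidates the current surrogate is least certain about.
        pool_idx = np.flatnonzero(remaining)
        proba = surrogate.predict_proba(candidate_pool[pool_idx])
        uncertainty = 1.0 - proba.max(axis=1)
        pick = pool_idx[np.argsort(uncertainty)[-batch:]]
        remaining[pick] = False

        queried_X = np.vstack([queried_X, candidate_pool[pick]])
        queried_y = np.concatenate([queried_y, victim.predict(candidate_pool[pick])])
        surrogate.fit(queried_X, queried_y)
        spent += len(pick)
    return surrogate, spent
```

The resulting surrogate can be scored with the same agreement metric as in the earlier sketch, and the simulation repeated at several budgets to estimate how many queries the defence must tolerate before meaningful extraction occurs.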

Evaluating whether rate limiting and output obfuscation for an automated essay grading API prevent competitors from extracting the scoring model through systematic submission of probe essays designed to reverse-engineer grading criteria.
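A concrete harness for this use case wraps the scoring model in the defences under test, so that a scripted probe campaign can be replayed against them. The wrapper below is a hedged illustration of per-client query caps plus score rounding; the class name, thresholds, and model interface are assumptions, not a reference implementation.

```python
from collections import defaultdict

class DefendedScoringAPI:
    """Toy wrapper combining a per-client query cap with coarse output rounding."""

    def __init__(self, model, max_queries_per_day=200, score_precision=1):
        self.model = model
        self.max_queries = max_queries_per_day
        self.precision = score_precision          # obfuscation: round returned scores
        self.counts = defaultdict(int)

    def score(self, client_id, essay_features):
        self.counts[client_id] += 1
        if self.counts[client_id] > self.max_queries:
            raise RuntimeError("rate limit exceeded")    # defence triggers
        raw = float(self.model.predict([essay_features])[0])
        return round(raw, self.precision)                # coarse-grained output

# A defence test replays a scripted probe campaign against this wrapper and
# records how many queries succeed, and how accurate a surrogate trained on
# the rounded scores becomes, before the limit halts the attack.
```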

Privacy

Evaluating whether a medical diagnosis model's query limits and output perturbations prevent extraction while protecting patient privacy embedded in the model's learned patterns.
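For this kind of assessment it helps to measure utility for legitimate users and surrogate fidelity under the same perturbation, since the defence only works if the first stays high while the second drops. The sketch below assumes the `victim`, `X_probe`, `X_eval`, and `y_eval` objects from the earlier example and a simple additive-noise perturbation; the noise scale is a parameter to sweep, not a recommended setting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def perturbed_output(victim, X, noise_scale, rng):
    # Add noise to the probabilities the API would return, then renormalise.
    proba = victim.predict_proba(X)
    noisy = proba + rng.normal(scale=noise_scale, size=proba.shape)
    noisy = np.clip(noisy, 1e-6, None)
    return noisy / noisy.sum(axis=1, keepdims=True)

def defence_tradeoff(victim, X_probe, X_eval, y_eval, noise_scale, seed=0):
    rng = np.random.default_rng(seed)
    # Utility for legitimate users: top-label accuracy after perturbation.
    utility = np.mean(perturbed_output(victim, X_eval, noise_scale, rng).argmax(1) == y_eval)
    # Extraction risk: surrogate trained on the perturbed labels an attacker would see.
    stolen = perturbed_output(victim, X_probe, noise_scale, rng).argmax(1)
    surrogate = LogisticRegression(max_iter=1000).fit(X_probe, stolen)
    fidelity = np.mean(surrogate.predict(X_eval) == victim.predict(X_eval))
    return utility, fidelity
```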

Transparency

Assessing watermarking techniques that enable model owners to prove when competitors have extracted their model, providing transparent evidence for intellectual property claims.
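A common watermarking approach such an assessment can exercise is a secret trigger set: inputs the owner labels in a deliberately unusual way during training, so that a suspect model reproducing those labels well above chance is evidence of copying. The verification sketch below is illustrative; the threshold and the comparison to chance are assumptions that would need proper statistical treatment in practice.

```python
import numpy as np

def watermark_match_rate(suspect_model, trigger_inputs, trigger_labels):
    # Fraction of secret trigger inputs on which the suspect model
    # reproduces the owner-assigned watermark labels.
    preds = suspect_model.predict(trigger_inputs)
    return float(np.mean(preds == trigger_labels))

def verify_watermark(suspect_model, trigger_inputs, trigger_labels,
                     chance_rate, threshold=0.8):
    rate = watermark_match_rate(suspect_model, trigger_inputs, trigger_labels)
    # Flag as likely extracted only if the match rate is well above chance;
    # both the threshold and the margin over chance are illustrative choices.
    return rate >= max(threshold, 3 * chance_rate), rate
```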

Reliability

Testing whether a traffic prediction API's defensive perturbations prevent extraction of the underlying routing optimization model whilst maintaining sufficient accuracy for legitimate urban planning applications.
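Using the `defence_tradeoff` sketch from the privacy example above, this trade-off can be mapped by sweeping the perturbation strength and checking where legitimate accuracy remains acceptable while surrogate fidelity falls; the noise scales below are illustrative.

```python
# Sweep assumed noise scales to map the utility/extraction frontier,
# reusing victim, X_probe, X_eval, y_eval from the earlier sketches.
for scale in (0.0, 0.05, 0.1, 0.2, 0.4):
    utility, fidelity = defence_tradeoff(victim, X_probe, X_eval, y_eval, scale)
    print(f"noise={scale:.2f}  legitimate accuracy={utility:.2%}  surrogate fidelity={fidelity:.2%}")
```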

Limitations

  • Sophisticated attackers may use transfer learning, active learning, or knowledge distillation to extract models with 10-50x fewer queries than static defences anticipate, and can adapt their strategies as they probe defences, requiring dynamic rather than static protection mechanisms.
  • Defensive measures like output perturbation can degrade model utility for legitimate users, creating tension between security and usability.
  • Difficult to distinguish between legitimate high-volume use and malicious extraction attempts, potentially blocking valid users.
  • Watermarking and fingerprinting techniques may be removed or obscured by attackers who post-process extracted models.
  • Difficult to validate defence effectiveness without exposing the model to actual extraction attempts, and limited public benchmarks make it challenging to compare defence strategies objectively across different model types and threat scenarios.
  • Requires specialised expertise in adversarial machine learning and attack simulation to design realistic extraction scenarios, making it challenging for organisations without dedicated security teams to implement comprehensive testing.

Resources

Research Papers

Hypothesis Testing and Beyond: a Mini Survey on Membership Inference Attacks
Jiajie Liu et al., Jan 1, 2025

Membership Inference Attacks (MIA) have received significant attention from academia as a crucial means of evaluating privacy risks in machine learning models. With the introduction of formal modeling based on hypothesis testing, MIA research has entered a new phase of development. However, there is currently a lack of systematic review of recent technical innovations and evaluation frameworks in MIA. This paper focuses on the latest developments in MIA research since 2022. Building upon the classification framework proposed by Hu et al., we systematically examine the development trajectory of two main attack approaches: metric-based attacks and neural network-based attacks. For metric-based attacks, starting with LiRA, we provide a detailed analysis of likelihood ratio-based attack methods (such as Enhanced MIA and RMIA) and their technical innovations under the hypothesis testing framework. For neural network-based attacks, we concentrate on breakthroughs in feature extraction and temporal modeling achieved by novel attack strategies (such as QMIA and SeqMIA). Furthermore, this paper thoroughly examines the applicability of evaluation metrics and analyzes the challenges of MIA in emerging scenarios such as time-series data and large language models. Through this systematic review, we aim to provide theoretical guidance for improving MIA techniques and promote their standardized application in model privacy auditing.

Tutorials

Adversarial Machine Learning: Defense Strategies
Michał Oleszak, Jul 11, 2024

Documentation

Welcome to the Adversarial Robustness Toolbox
Adversarial Robustness Toolbox Developers

Tags

Applicable Models:
Data Type:
Data Requirements:
Technique Type:
Evidence Type: