Adversarial Robustness Testing

Description

Adversarial robustness testing evaluates model resilience against intentionally crafted inputs designed to cause misclassification or malfunction. The technique generates adversarial examples using methods such as the Fast Gradient Sign Method (FGSM), Projected Gradient Descent (PGD), Carlini & Wagner (C&W) attacks, and AutoAttack to measure model vulnerability. Testing covers both white-box scenarios (where attackers have full access to the model's parameters and gradients) and black-box scenarios (where attackers can only observe inputs and outputs), and reports metrics such as robust accuracy, attack success rate, and the perturbation budget required for a successful attack.
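
A minimal white-box sketch of the idea in plain PyTorch: generate FGSM perturbations within an L∞ budget and measure robust accuracy. The model, data loader, and epsilon value are placeholders; stronger evaluations would typically use PGD or AutoAttack, for example via the Adversarial Robustness Toolbox listed under Resources.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon):
    """Fast Gradient Sign Method: one signed-gradient step within an L-inf ball."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Perturb each input dimension by +/- epsilon in the direction that increases the loss.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()  # keep inputs in the valid range

def robust_accuracy(model, loader, epsilon, device="cpu"):
    """Fraction of examples still classified correctly after the attack."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = fgsm_attack(model, x, y, epsilon)
        with torch.no_grad():
            pred = model(x_adv).argmax(dim=1)
        correct += (pred == y).sum().item()
        total += y.numel()
    return correct / total

# Hypothetical usage (model and test_loader are assumed to exist):
# acc_robust = robust_accuracy(model, test_loader, epsilon=8 / 255)
```

An epsilon of 8/255 is a common L∞ budget for 8-bit images; reporting robust accuracy over several budgets gives a fuller picture than a single operating point.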

Example Use Cases

Safety

Testing an autonomous vehicle's perception system against adversarial perturbations to road signs or traffic signals, ensuring the vehicle cannot be fooled by maliciously modified visual inputs.
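
For bounded image perturbations of this kind, a multi-step attack such as PGD gives a stronger test than single-step FGSM. The sketch below is a generic PyTorch implementation; `sign_model` and the batch of sign images are hypothetical stand-ins for the perception component under test.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon, alpha, steps):
    """Projected Gradient Descent: iterated gradient-sign steps projected back into the L-inf ball."""
    x_orig = x.clone().detach()
    x_adv = (x_orig + torch.empty_like(x_orig).uniform_(-epsilon, epsilon)).clamp(0.0, 1.0)  # random start
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            # Project back into the epsilon-ball around the original image.
            x_adv = torch.min(torch.max(x_adv, x_orig - epsilon), x_orig + epsilon)
            x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()

# Hypothetical usage: sign_model, signs, and labels are assumed.
# x_adv = pgd_attack(sign_model, signs, labels, epsilon=4 / 255, alpha=1 / 255, steps=10)
# success_rate = (sign_model(x_adv).argmax(1) != labels).float().mean()  # attack success rate
```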

Security

Evaluating a spam filter's robustness to adversarial text manipulations such as character substitutions, synonym replacements, and formatting tricks that attackers use to craft phishing emails that evade detection whilst remaining readable to humans.
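
A hedged sketch of such an evaluation using the TextAttack library (listed under Resources): the TextFooler recipe applies semantics-preserving synonym substitutions, broadly analogous to the manipulations described above. The model checkpoint and dataset here are placeholders rather than an actual spam or phishing classifier.

```python
from textattack import Attacker, AttackArgs
from textattack.attack_recipes import TextFoolerJin2019
from textattack.datasets import HuggingFaceDataset
from textattack.models.wrappers import HuggingFaceModelWrapper
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder text classifier; a real evaluation would load the deployed spam filter.
model_name = "textattack/bert-base-uncased-imdb"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
wrapper = HuggingFaceModelWrapper(model, tokenizer)

# TextFooler replaces words with constrained synonyms while trying to preserve meaning.
attack = TextFoolerJin2019.build(wrapper)
dataset = HuggingFaceDataset("imdb", split="test")  # stand-in for an email/phishing dataset
attacker = Attacker(attack, dataset, AttackArgs(num_examples=100))
results = attacker.attack_dataset()  # per-example results include attack success/failure
```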

Testing a credit risk assessment model against adversarial perturbations in loan application data, ensuring attackers cannot manipulate minor input features like stated income or employment history to obtain fraudulent credit approvals.
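
One lightweight way to probe this without gradient access is a bounded search over attacker-controllable features, checking whether small, plausible changes flip the decision. The sketch below assumes a scikit-learn style `credit_model` with a `predict` method; the feature indices and bounds are illustrative.

```python
import numpy as np

def decision_flips(model, applicant, feature_bounds, steps=20):
    """Grid-search small perturbations of attacker-controllable features and report
    any that change the model's decision.

    feature_bounds maps a feature index to (min_delta, max_delta), e.g. the range of
    income over-statement an applicant could plausibly get away with.
    """
    base_pred = model.predict(applicant.reshape(1, -1))[0]
    flips = []
    for idx, (lo, hi) in feature_bounds.items():
        for delta in np.linspace(lo, hi, steps):
            perturbed = applicant.copy()
            perturbed[idx] += delta
            if model.predict(perturbed.reshape(1, -1))[0] != base_pred:
                flips.append((idx, delta))
                break  # record the first (smallest) delta in the grid that flips the decision
    return flips

# Hypothetical usage; INCOME and YEARS_EMPLOYED are illustrative feature indices:
# flips = decision_flips(credit_model, applicant_features,
#                        feature_bounds={INCOME: (0, 5000), YEARS_EMPLOYED: (0, 2)})
```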

Reliability

Assessing whether a medical imaging classifier maintains reliable performance when images are slightly corrupted or manipulated, preventing misdiagnosis from low-quality or tampered inputs.
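
Alongside optimised attacks, a simple corruption sweep can expose this kind of fragility. The sketch below adds Gaussian noise of increasing strength and records accuracy at each level; `model` and `loader` are assumed to be the classifier and evaluation data under test.

```python
import torch

def accuracy_under_noise(model, loader, sigmas=(0.0, 0.02, 0.05, 0.1), device="cpu"):
    """Accuracy under additive Gaussian noise of increasing strength,
    a simple proxy for low-quality or lightly tampered inputs."""
    model.eval()
    results = {}
    for sigma in sigmas:
        correct, total = 0, 0
        with torch.no_grad():
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                x_noisy = (x + sigma * torch.randn_like(x)).clamp(0.0, 1.0)
                pred = model(x_noisy).argmax(dim=1)
                correct += (pred == y).sum().item()
                total += y.numel()
        results[sigma] = correct / total
    # results maps noise level -> accuracy; a sharp drop signals fragility to corrupted inputs
    return results
```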

Limitations

  • Adversarial examples often transfer poorly between different models or architectures, so results obtained with one attack or surrogate model may not generalise, making it difficult to comprehensively test against all possible attacks.
  • Focus on worst-case perturbations may not reflect realistic threat models, as some adversarial examples require unrealistic levels of control over inputs.
  • Computationally expensive to generate strong adversarial examples, often requiring 10-100x more computation than standard inference, especially for large models or high-dimensional inputs (see the sketch after this list).
  • Trade-off between robustness and accuracy means improving adversarial robustness often reduces performance on clean, unperturbed data.
  • Requires representative test data covering expected deployment scenarios, as adversarial testing on narrow datasets may miss vulnerabilities that emerge in real-world conditions.
  • Requires expertise in both attack methodology and domain knowledge to design meaningful adversarial perturbations that reflect genuine threats rather than artificial edge cases.
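
To make the perturbation-budget and cost points above concrete, the following sketch sweeps the L∞ budget using the `pgd_attack` helper from the earlier sketch and records robust accuracy at each point; `model` and `test_loader` are assumed.

```python
import torch

def epsilon_sweep(model, loader, epsilons=(1/255, 2/255, 4/255, 8/255),
                  steps=10, device="cpu"):
    """Robust accuracy as a function of the L-inf perturbation budget."""
    model.eval()
    curve = {}
    for eps in epsilons:
        correct, total = 0, 0
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            x_adv = pgd_attack(model, x, y, epsilon=eps, alpha=eps / 4, steps=steps)
            with torch.no_grad():
                correct += (model(x_adv).argmax(dim=1) == y).sum().item()
            total += y.numel()
        curve[eps] = correct / total
    # Each budget costs roughly `steps` forward/backward passes per example,
    # which is where the 10-100x overhead noted above comes from.
    return curve
```

Reporting the whole curve rather than a single epsilon makes explicit how quickly accuracy degrades as the attacker's budget grows.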

Resources

Research Papers

A Survey of Neural Network Robustness Assessment in Image Recognition
Jie Wang et al., Jan 1, 2024

In recent years, there has been significant attention given to the robustness assessment of neural networks. Robustness plays a critical role in ensuring reliable operation of artificial intelligence (AI) systems in complex and uncertain environments. Deep learning's robustness problem is particularly significant, highlighted by the discovery of adversarial attacks on image classification models. Researchers have dedicated efforts to evaluate robustness in diverse perturbation conditions for image recognition tasks. Robustness assessment encompasses two main techniques: robustness verification/certification for deliberate adversarial attacks and robustness testing for random data corruptions. In this survey, we present a detailed examination of both adversarial robustness (AR) and corruption robustness (CR) in neural network assessment. Analyzing current research papers and standards, we provide an extensive overview of robustness assessment in image recognition. Three essential aspects are analyzed: concepts, metrics, and assessment methods. We investigate the perturbation metrics and range representations used to measure the degree of perturbations on images, as well as the robustness metrics specifically for the robustness conditions of classification models. The strengths and limitations of the existing methods are also discussed, and some potential directions for future research are provided.

Adversarial Training: A Survey
Mengnan Zhao et al., Jan 1, 2024

Adversarial training (AT) refers to integrating adversarial examples -- inputs altered with imperceptible perturbations that can significantly impact model predictions -- into the training process. Recent studies have demonstrated the effectiveness of AT in improving the robustness of deep neural networks against diverse adversarial attacks. However, a comprehensive overview of these developments is still missing. This survey addresses this gap by reviewing a broad range of recent and representative studies. Specifically, we first describe the implementation procedures and practical applications of AT, followed by a comprehensive review of AT techniques from three perspectives: data enhancement, network design, and training configurations. Lastly, we discuss common challenges in AT and propose several promising directions for future research.

Documentations

Welcome to the Adversarial Robustness Toolbox — Adversarial ...
Adversarial Robustness Toolbox Developers

SecML-Torch: A Library for Robustness Evaluation of Deep ...
Jan 1, 2022

What is an adversarial attack in NLP? — TextAttack 0.3.10 ...
TextAttack Developers, Jan 1, 2020
