Adversarial Training Evaluation
Description
Adversarial training evaluation assesses whether models trained with adversarial examples have genuinely improved robustness rather than merely overfitting to specific attack methods. This technique tests robustness against diverse attack algorithms, including those not used during training, measures certified robustness bounds, and checks whether adversarial training introduces unacceptable trade-offs in clean accuracy or new vulnerabilities. Evaluation ensures that adversarial training provides genuine security benefits rather than superficial improvements.
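In practice, this means reporting clean accuracy alongside robust accuracy under attacks that differ from the one used during training. The sketch below is a minimal PyTorch illustration under the assumption of an image classifier with inputs scaled to [0, 1]; the model, data loader, perturbation budget `eps` and step counts are placeholders, and a thorough evaluation would also include attacks from other families (e.g. black-box or AutoAttack-style ensembles).

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """L-infinity projected gradient descent with a random start;
    steps=1 with alpha=eps gives a single-step (FGSM-style) attack."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv + alpha * grad.sign()
        # Project back into the eps-ball around x and the valid pixel range.
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

@torch.no_grad()
def accuracy(model, x, y):
    return (model(x).argmax(dim=1) == y).float().mean().item()

def evaluate_robustness(model, loader, eps=8/255):
    """Compare clean accuracy with accuracy under attacks of increasing
    strength on an (assumed) adversarially-trained classifier."""
    model.eval()
    totals = {"clean": 0.0, "fgsm": 0.0, "pgd10": 0.0}
    batches = 0
    for x, y in loader:
        totals["clean"] += accuracy(model, x, y)
        totals["fgsm"] += accuracy(model, pgd_attack(model, x, y, eps, alpha=eps, steps=1), y)
        totals["pgd10"] += accuracy(model, pgd_attack(model, x, y, eps, alpha=eps / 4, steps=10), y)
        batches += 1
    return {name: total / batches for name, total in totals.items()}
```

A large gap between the single-step and ten-step results, or between clean and robust accuracy at modest budgets, is a common warning sign that robustness has not generalised beyond the training-time attack.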
Example Use Cases
Security
Verifying that an adversarially-trained facial recognition system demonstrates genuine robustness against diverse attack types beyond those used in training, preventing false confidence in security.
Verifying that adversarial training of a medical imaging classifier for tumour detection maintains diagnostic accuracy on routine cases whilst improving robustness to image quality variations and potential adversarial attacks.
Reliability
Ensuring adversarial training of a spam filter improves reliable detection of adversarial emails without significantly degrading performance on normal messages.
Evaluating whether adversarial training of a loan approval model maintains fair lending decisions whilst improving robustness against applicants attempting to game the system through strategic feature manipulation.
Assessing whether adversarially-trained automated essay grading systems remain reliable on standard student submissions whilst becoming more robust to attempts at deliberately confusing the model with adversarial writing patterns.
Limitations
- Models may overfit to adversarial examples in training data without generalising to fundamentally different attack strategies.
- Adversarial training typically reduces clean accuracy by 2-10 percentage points, requiring careful evaluation of security-accuracy trade-offs for each application.
- Difficult to achieve certified robustness guarantees that hold against all possible attacks within a specified threat model (a certification sketch follows this list).
- Adversarial training increases training time by 2-10x compared to standard training, as each batch requires generating adversarial examples, making it resource-intensive for large models or datasets.
- Requires diverse attack types for comprehensive evaluation beyond the attacks seen during training, necessitating ongoing research into emerging attack methods and regular re-evaluation cycles.
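On the certification point, one widely used approach is randomised smoothing, which certifies an L2 radius for a classifier trained with Gaussian noise augmentation. The sketch below is a simplified, illustrative version of the Monte-Carlo certification procedure of Cohen et al. (2019), assuming a PyTorch classifier and a noise level `sigma` matching the one used in training; the sample count and confidence level `alpha` are placeholders, and the resulting bound says nothing about perturbations outside the certified L2 ball.

```python
import torch
from scipy.stats import beta, norm

def certify_smoothed(model, x, sigma=0.25, n=1000, alpha=0.001, batch=100):
    """Estimate the smoothed classifier's top class under Gaussian noise
    and a certified L2 radius around x. The full procedure uses separate
    selection and estimation samples; this sketch merges them for brevity."""
    model.eval()
    counts, remaining = None, n
    with torch.no_grad():
        while remaining > 0:
            b = min(batch, remaining)
            noise = torch.randn(b, *x.shape) * sigma
            # Broadcasting adds independent noise to b copies of x.
            logits = model(x.unsqueeze(0) + noise)
            votes = torch.bincount(logits.argmax(dim=1), minlength=logits.shape[1])
            counts = votes if counts is None else counts + votes
            remaining -= b
    top = int(counts.argmax())
    k = int(counts[top])
    # One-sided Clopper-Pearson lower confidence bound on the top-class probability.
    p_lower = beta.ppf(alpha, k, n - k + 1)
    if p_lower <= 0.5:
        return None, 0.0  # abstain: cannot certify at this confidence level
    return top, sigma * norm.ppf(p_lower)  # certified L2 radius
```

Certified accuracy at a given radius is then the fraction of correctly classified test points whose certified radius exceeds that value; it is typically far below empirical robust accuracy, which is part of why certification remains difficult in practice.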
Resources
Research Papers
Adversarial Training: A Survey
Adversarial training (AT) refers to integrating adversarial examples -- inputs altered with imperceptible perturbations that can significantly impact model predictions -- into the training process. Recent studies have demonstrated the effectiveness of AT in improving the robustness of deep neural networks against diverse adversarial attacks. However, a comprehensive overview of these developments is still missing. This survey addresses this gap by reviewing a broad range of recent and representative studies. Specifically, we first describe the implementation procedures and practical applications of AT, followed by a comprehensive review of AT techniques from three perspectives: data enhancement, network design, and training configurations. Lastly, we discuss common challenges in AT and propose several promising directions for future research.
ApaNet: adversarial perturbations alleviation network for face verification
Although deep neural networks (DNNs) are widely used in computer vision, natural language processing and speech recognition, they have been found to be fragile to adversarial attacks. Specifically, in computer vision, an attacker can easily deceive DNNs by contaminating an input image with perturbations imperceptible to humans. As an important vision task, face verification is also subject to adversarial attack. Thus, in this paper, we focus on defending face verification against adversarial attacks to mitigate the potential risk. We learn a network built from stacked residual blocks, the adversarial perturbations alleviation network (ApaNet), to alleviate latent adversarial perturbations hidden in the input facial image. During the supervised learning of ApaNet, only Labeled Faces in the Wild (LFW) is used as the training set; legitimate examples serve as supervision and the corresponding adversarial examples produced by the projected gradient descent algorithm serve as inputs. The discrepancy between an image output by ApaNet and its supervision, measured through the middle- and high-layer activations of FaceNet, is used as the loss function to optimize ApaNet. Empirical results on LFW, YouTube Faces DB and CASIA-FaceV5 confirm the effectiveness of the proposed defence against representative white-box and black-box adversarial attacks. Experiments also show that ApaNet outperforms several currently available techniques.