Adversarial Training Evaluation
Description
Adversarial training evaluation assesses whether models trained with adversarial examples have genuinely improved robustness rather than merely overfitting to specific attack methods. This technique tests robustness against diverse attack algorithms, including those not used during training, measures certified robustness bounds, and checks whether adversarial training introduces unacceptable trade-offs in clean accuracy or new vulnerabilities. Evaluation ensures that adversarial training provides genuine security benefits rather than superficial improvements.
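In practice, this means reporting clean accuracy alongside robust accuracy under attacks that differ from the one used during training. The sketch below is a minimal PyTorch illustration under the assumption of an image classifier with inputs scaled to [0, 1]; the model, data loader, perturbation budget `eps` and step counts are placeholders, and a thorough evaluation would also include attacks from other families (e.g. black-box or AutoAttack-style ensembles).

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """L-infinity projected gradient descent with a random start;
    steps=1 with alpha=eps gives a single-step (FGSM-style) attack."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv + alpha * grad.sign()
        # Project back into the eps-ball around x and the valid pixel range.
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

@torch.no_grad()
def accuracy(model, x, y):
    return (model(x).argmax(dim=1) == y).float().mean().item()

def evaluate_robustness(model, loader, eps=8/255):
    """Compare clean accuracy with accuracy under attacks of increasing
    strength on an (assumed) adversarially-trained classifier."""
    model.eval()
    totals = {"clean": 0.0, "fgsm": 0.0, "pgd10": 0.0}
    batches = 0
    for x, y in loader:
        totals["clean"] += accuracy(model, x, y)
        totals["fgsm"] += accuracy(model, pgd_attack(model, x, y, eps, alpha=eps, steps=1), y)
        totals["pgd10"] += accuracy(model, pgd_attack(model, x, y, eps, alpha=eps / 4, steps=10), y)
        batches += 1
    return {name: total / batches for name, total in totals.items()}
```

A large gap between the single-step and ten-step results, or between clean and robust accuracy at modest budgets, is a common warning sign that robustness has not generalised beyond the training-time attack.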
Example Use Cases
Security
Verifying that an adversarially-trained facial recognition system demonstrates genuine robustness against diverse attack types beyond those used in training, preventing false confidence in security.
Verifying that adversarial training of a medical imaging classifier for tumour detection maintains diagnostic accuracy on routine cases whilst improving robustness to image quality variations and potential adversarial attacks.
Reliability
Ensuring adversarial training of a spam filter improves reliable detection of adversarial emails without significantly degrading performance on normal messages.
Evaluating whether adversarial training of a loan approval model maintains fair lending decisions whilst improving robustness against applicants attempting to game the system through strategic feature manipulation.
Assessing whether adversarially-trained automated essay grading systems remain reliable on standard student submissions whilst becoming more robust to attempts at deliberately confusing the model with adversarial writing patterns.
Limitations
- Models may overfit to adversarial examples in training data without generalising to fundamentally different attack strategies.
- Adversarial training typically reduces clean accuracy by 2-10 percentage points, requiring careful evaluation of security-accuracy trade-offs for each application.
- Difficult to achieve certified robustness guarantees that hold against all possible attacks within a specified threat model (a certification sketch follows this list).
- Adversarial training increases training time by 2-10x compared to standard training, as each batch requires generating adversarial examples, making it resource-intensive for large models or datasets.
- Requires diverse attack types for comprehensive evaluation beyond the attacks seen during training, necessitating ongoing research into emerging attack methods and regular re-evaluation cycles.
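On the certification point, one widely used approach is randomised smoothing, which certifies an L2 radius for a classifier trained with Gaussian noise augmentation. The sketch below is a simplified, illustrative version of the Monte-Carlo certification procedure of Cohen et al. (2019), assuming a PyTorch classifier and a noise level `sigma` matching the one used in training; the sample count and confidence level `alpha` are placeholders, and the resulting bound says nothing about perturbations outside the certified L2 ball.

```python
import torch
from scipy.stats import beta, norm

def certify_smoothed(model, x, sigma=0.25, n=1000, alpha=0.001, batch=100):
    """Estimate the smoothed classifier's top class under Gaussian noise
    and a certified L2 radius around x. The full procedure uses separate
    selection and estimation samples; this sketch merges them for brevity."""
    model.eval()
    counts, remaining = None, n
    with torch.no_grad():
        while remaining > 0:
            b = min(batch, remaining)
            noise = torch.randn(b, *x.shape) * sigma
            # Broadcasting adds independent noise to b copies of x.
            logits = model(x.unsqueeze(0) + noise)
            votes = torch.bincount(logits.argmax(dim=1), minlength=logits.shape[1])
            counts = votes if counts is None else counts + votes
            remaining -= b
    top = int(counts.argmax())
    k = int(counts[top])
    # One-sided Clopper-Pearson lower confidence bound on the top-class probability.
    p_lower = beta.ppf(alpha, k, n - k + 1)
    if p_lower <= 0.5:
        return None, 0.0  # abstain: cannot certify at this confidence level
    return top, sigma * norm.ppf(p_lower)  # certified L2 radius
```

Certified accuracy at a given radius is then the fraction of correctly classified test points whose certified radius exceeds that value; it is typically far below empirical robust accuracy, which is part of why certification remains difficult in practice.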
Resources
Research Papers
Adversarial Training: A Survey
Adversarial training (AT) refers to integrating adversarial examples -- inputs altered with imperceptible perturbations that can significantly impact model predictions -- into the training process. Recent studies have demonstrated the effectiveness of AT in improving the robustness of deep neural networks against diverse adversarial attacks. However, a comprehensive overview of these developments is still missing. This survey addresses this gap by reviewing a broad range of recent and representative studies. Specifically, we first describe the implementation procedures and practical applications of AT, followed by a comprehensive review of AT techniques from three perspectives: data enhancement, network design, and training configurations. Lastly, we discuss common challenges in AT and propose several promising directions for future research.
ApaNet: adversarial perturbations alleviation network for face verification
Although deep neural networks (DNNs) are widely used in computer vision, natural language processing and speech recognition, they have been found to be fragile to adversarial attacks. Specifically, in computer vision, an attacker can easily deceive DNNs by contaminating an input image with perturbations imperceptible to humans. As an important vision task, face verification is also subject to adversarial attack. Thus, in this paper, we focus on defending face verification against adversarial attacks to mitigate the potential risk. We learn a network built from stacked residual blocks, the adversarial perturbations alleviation network (ApaNet), to alleviate latent adversarial perturbations hidden in the input facial image. During the supervised learning of ApaNet, only Labeled Faces in the Wild (LFW) is used as the training set; legitimate examples serve as supervision and the corresponding adversarial examples produced by the projected gradient descent algorithm serve as inputs. The discrepancy between an image output by ApaNet and its supervision, measured through the middle- and high-layer activations of FaceNet, is used as the loss function to optimize ApaNet. Empirical results on LFW, YouTube Faces DB and CASIA-FaceV5 confirm the effectiveness of the proposed defence against representative white-box and black-box adversarial attacks. Experiments also show that ApaNet outperforms several currently available techniques.