Data Poisoning Detection

Description

Data poisoning detection identifies malicious training data designed to compromise model behavior. This technique detects a range of poisoning attacks, including backdoor injection (triggers that cause specific behaviors), availability attacks (degrading overall performance), and targeted attacks (causing errors on specific inputs). Detection methods include statistical outlier analysis, influence function analysis to identify high-impact training points, and validation-based approaches that test for suspicious model behaviors indicative of poisoning.
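
As a concrete illustration of the statistical outlier analysis mentioned above, the following sketch flags training points that sit unusually far from their class centroid in feature space. It assumes each example has already been mapped to a fixed-length feature vector (for example, a model embedding), and the median-absolute-deviation threshold is an illustrative choice rather than a standard:

```python
# Minimal sketch of statistical outlier analysis for poison detection.
# Assumes `features` holds a fixed-length feature vector per training
# example and `labels` holds the class labels; the 3-MAD threshold is
# illustrative only.
import numpy as np

def flag_outliers(features: np.ndarray, labels: np.ndarray, k: float = 3.0) -> np.ndarray:
    """Return a boolean mask marking training points suspected of being poisoned."""
    suspicious = np.zeros(len(features), dtype=bool)
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        centroid = features[idx].mean(axis=0)
        dists = np.linalg.norm(features[idx] - centroid, axis=1)
        median = np.median(dists)
        mad = np.median(np.abs(dists - median)) + 1e-12
        # Flag points whose distance from the class centroid exceeds the
        # class median by more than k median absolute deviations.
        suspicious[idx] = dists > median + k * mad
    return suspicious

# Example with random stand-in data: 500 points, 32-dim features, 5 classes.
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 32))
labs = rng.integers(0, 5, size=500)
print(f"{flag_outliers(feats, labs).sum()} points flagged for manual review")
```

Flagged points are candidates for human review or removal rather than automatic deletion, which helps limit the false-positive cost noted in the limitations below.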

Example Use Cases

Safety

Scanning training data for an autonomous driving system to detect images that might contain backdoor triggers designed to cause unsafe behavior when specific objects appear in the environment.

Detecting poisoned training data in recidivism prediction models used for sentencing recommendations, where malicious actors might inject manipulated records to systematically bias predictions for specific demographic groups.

Security

Protecting a federated learning system for hospital patient diagnosis models from malicious participants who might inject poisoned medical records designed to create backdoors that misclassify specific patient profiles or degrade overall diagnostic reliability.

Reliability

Ensuring a content moderation model hasn't been poisoned with data designed to make it fail on specific types of harmful content, maintaining reliable safety filtering.

Scanning training data for algorithmic trading models to identify poisoned market data designed to create exploitable patterns, ensuring reliable trading decisions and preventing market manipulation vulnerabilities.

Limitations

  • Sophisticated poisoning attacks can be designed to evade statistical detection by closely mimicking benign data distributions.
  • High false positive rates may lead to incorrectly filtering legitimate but unusual training examples, reducing training data quality and diversity.
  • Influence function analysis and other deep-inspection methods are computationally expensive on very large training datasets; gradient-based detection over datasets with millions of samples can take hours to days.
  • Difficult to detect poisoning in scenarios where attackers have knowledge of detection methods and can adapt attacks accordingly.
  • Establishing ground truth for what constitutes 'poisoned' versus legitimate but unusual data is challenging, particularly when dealing with naturally occurring outliers or edge cases in the data distribution.

Resources

Research Papers

Adaptive and Robust Data Poisoning Detection and Sanitization in Wearable IoT Systems using Large Language Models
W. K. M Mithsara et al., Nov 4, 2025

The widespread integration of wearable sensing devices in Internet of Things (IoT) ecosystems, particularly in healthcare, smart homes, and industrial applications, has required robust human activity recognition (HAR) techniques to improve functionality and user experience. Although machine learning models have advanced HAR, they are increasingly susceptible to data poisoning attacks that compromise the data integrity and reliability of these systems. Conventional approaches to defending against such attacks often require extensive task-specific training with large, labeled datasets, which limits adaptability in dynamic IoT environments. This work proposes a novel framework that uses large language models (LLMs) to perform poisoning detection and sanitization in HAR systems, utilizing zero-shot, one-shot, and few-shot learning paradigms. Our approach incorporates "role play" prompting, whereby the LLM assumes the role of expert to contextualize and evaluate sensor anomalies, and "think step-by-step" reasoning, guiding the LLM to infer poisoning indicators in the raw sensor data and plausible clean alternatives. These strategies minimize reliance on curation of extensive datasets and enable robust, adaptable defense mechanisms in real-time. We perform an extensive evaluation of the framework, quantifying detection accuracy, sanitization quality, latency, and communication cost, thus demonstrating the practicality and effectiveness of LLMs in improving the security and reliability of wearable IoT systems.

SAFELOC: Overcoming Data Poisoning Attacks in Heterogeneous Federated Machine Learning for Indoor Localization
Akhil Singampalli, Danish Gufran, and Sudeep Pasricha, Nov 13, 2024

Machine learning (ML) based indoor localization solutions are critical for many emerging applications, yet their efficacy is often compromised by hardware/software variations across mobile devices (i.e., device heterogeneity) and the threat of ML data poisoning attacks. Conventional methods aimed at countering these challenges show limited resilience to the uncertainties created by these phenomena. In response, in this paper, we introduce SAFELOC, a novel framework that not only minimizes localization errors under these challenging conditions but also ensures model compactness for efficient mobile device deployment. Our framework targets a distributed and co-operative learning environment that uses federated learning (FL) to preserve user data privacy and assumes heterogeneous mobile devices carried by users (just like in most real-world scenarios). Within this heterogeneous FL context, SAFELOC introduces a novel fused neural network architecture that performs data poisoning detection and localization, with a low model footprint. Additionally, a dynamic saliency map-based aggregation strategy is designed to adapt based on the severity of the detected data poisoning scenario. Experimental evaluations demonstrate that SAFELOC achieves improvements of up to 5.9x in mean localization error, 7.8x in worst-case localization error, and a 2.1x reduction in model inference latency compared to state-of-the-art indoor localization frameworks, across diverse building floorplans, mobile devices, and ML data poisoning attack scenarios.

Understanding Influence Functions and Datamodels via Harmonic Analysis
Nikunj Saunshi et al., Oct 3, 2022

Influence functions estimate effect of individual data points on predictions of the model on test data and were adapted to deep learning in Koh and Liang [2017]. They have been used for detecting data poisoning, detecting helpful and harmful examples, influence of groups of datapoints, etc. Recently, Ilyas et al. [2022] introduced a linear regression method they termed datamodels to predict the effect of training points on outputs on test data. The current paper seeks to provide a better theoretical understanding of such interesting empirical phenomena. The primary tool is harmonic analysis and the idea of noise stability. Contributions include: (a) Exact characterization of the learnt datamodel in terms of Fourier coefficients. (b) An efficient method to estimate the residual error and quality of the optimum linear datamodel without having to train the datamodel. (c) New insights into when influences of groups of datapoints may or may not add up linearly.

Documentations

art.defences.detector.poison — Adversarial Robustness Toolbox ...
Adversarial-robustness-toolbox Developers
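
A minimal usage sketch based on this module: ART's ActivationDefence clusters hidden-layer activations per class and flags anomalous clusters as likely poison. The tiny model and random data below are stand-ins, and keyword arguments such as nb_clusters, nb_dims, and reduce follow the ART documentation but may differ across ART versions:

```python
# Sketch of activation-clustering poison detection with the Adversarial
# Robustness Toolbox (ART); requires ART and PyTorch. Data and model here
# are placeholders for the training set and classifier under inspection.
import numpy as np
import torch
import torch.nn as nn

from art.estimators.classification import PyTorchClassifier
from art.defences.detector.poison import ActivationDefence

# Tiny stand-in network; replace with the model trained on the suspect data.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))
classifier = PyTorchClassifier(
    model=model,
    loss=nn.CrossEntropyLoss(),
    optimizer=torch.optim.Adam(model.parameters(), lr=1e-3),
    input_shape=(1, 28, 28),
    nb_classes=10,
)

# Random stand-in data; in practice, pass the training set being screened.
x_train = np.random.rand(256, 1, 28, 28).astype(np.float32)
y_train = np.eye(10)[np.random.randint(0, 10, size=256)].astype(np.float32)
classifier.fit(x_train, y_train, nb_epochs=1, batch_size=64)

# Cluster per-class activations; points in anomalous clusters are marked
# as not clean in the returned list.
defence = ActivationDefence(classifier, x_train, y_train)
report, is_clean_list = defence.detect_poison(nb_clusters=2, nb_dims=10, reduce="PCA")
suspect = [i for i, clean in enumerate(is_clean_list) if clean == 0]
print(f"Flagged {len(suspect)} of {len(x_train)} training points as suspicious")
```
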
Introduction — trojai 0.2.22 documentation
Trojai Developers, Jan 1, 2021
