Model Watermarking and Theft Detection
Description
Model watermarking and theft detection techniques protect AI systems from unauthorised replication by embedding detectable signatures in a model's parameters or behaviour and by identifying suspiciously similar prediction patterns. This includes watermarking schemes that survive knowledge distillation, fingerprinting methods that create unique statistical signatures, and detection methods that identify when a model has been stolen or replicated through model extraction, distillation, or imitation. These techniques enable model owners to prove intellectual property theft and to protect proprietary AI systems.
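To make the embedding step concrete, below is a minimal sketch of one common family of schemes, trigger-set ("backdoor") watermarking, written in PyTorch. The function names, trigger design, shapes, and hyperparameters are illustrative assumptions, not a prescribed implementation.

```python
# Minimal sketch: trigger-set ("backdoor") watermark embedding.
# The owner trains on normal data plus a small secret set of trigger inputs
# mapped to pre-specified labels; the trigger set later serves as a key for
# ownership verification. All names and values here are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_trigger_set(n: int, image_shape=(1, 28, 28), target_label: int = 7,
                     seed: int = 0) -> tuple[torch.Tensor, torch.Tensor]:
    """Generate a secret set of trigger inputs, all mapped to one target label.

    Here the triggers are random-noise images stamped with a fixed corner
    patch; real schemes use abstract images, crafted patterns, or
    out-of-distribution samples.
    """
    g = torch.Generator().manual_seed(seed)
    triggers = torch.rand((n, *image_shape), generator=g)
    triggers[:, :, :4, :4] = 1.0  # fixed "stamp" in the top-left corner
    labels = torch.full((n,), target_label, dtype=torch.long)
    return triggers, labels


def train_step(model: nn.Module, optimiser, clean_x, clean_y,
               trigger_x, trigger_y, wm_weight: float = 0.1) -> float:
    """One training step that mixes the watermark loss into the task loss."""
    optimiser.zero_grad()
    task_loss = F.cross_entropy(model(clean_x), clean_y)
    wm_loss = F.cross_entropy(model(trigger_x), trigger_y)
    loss = task_loss + wm_weight * wm_loss
    loss.backward()
    optimiser.step()
    return loss.item()
```

In practice the trigger set and target labels are kept secret by the owner, and the watermark weight is tuned so that accuracy on normal inputs is not noticeably degraded.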
Example Use Cases
Security
Protecting a proprietary medical imaging diagnostic model from theft by embedding watermarks that survive if competitors attempt to distil or extract the model, enabling hospitals to verify that they are using legitimately licensed versions.
Transparency
Providing forensic evidence in intellectual property litigation by demonstrating through watermark extraction and statistical fingerprinting that a competitor's fraud detection system was derived from a bank's proprietary model.
Fairness
Protecting an autonomous vehicle perception model from unauthorised replication so that safety-critical models undergo proper validation rather than being deployed through model theft, helping to maintain fair safety standards across the industry.
Limitations
- Watermarks may be removed or degraded through post-processing, fine-tuning, or adversarial training by sophisticated attackers.
- Difficult to distinguish between independent development of similar capabilities and actual behavioural cloning, especially for simple tasks.
- Detection methods may produce false positives when models trained on similar data naturally develop comparable behaviours (see the sketch after this list).
- Watermarking can slightly degrade model performance or be detectable by attackers, creating trade-offs between protection strength and model quality.
- Effectiveness varies significantly by model type and task, with some architectures (like transformers) and domains (like natural language) being more amenable to watermarking than others (like small computer vision models).
- Legal frameworks for using watermarking evidence in intellectual property cases are still evolving, and successful theft claims may require complementary evidence beyond watermark detection alone.
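To illustrate why such false positives arise, the following is a minimal sketch of a simple agreement-rate fingerprinting check, assuming black-box access to predicted labels on a shared probe set. The function names and the three-standard-deviation threshold are illustrative choices rather than an established standard.

```python
# Minimal sketch: flagging a suspect model by prediction-agreement fingerprinting.
# Assumes black-box access: only each model's predicted labels on a shared
# probe set are needed. All names and thresholds here are illustrative.
import numpy as np


def agreement_rate(preds_a: np.ndarray, preds_b: np.ndarray) -> float:
    """Fraction of probe inputs on which two models predict the same label."""
    return float(np.mean(preds_a == preds_b))


def is_suspicious(owner_preds: np.ndarray,
                  suspect_preds: np.ndarray,
                  independent_preds: list[np.ndarray],
                  n_std: float = 3.0) -> bool:
    """Flag the suspect if its agreement with the owner's model is far above
    the agreement typical of independently trained models.

    Note the limitation discussed above: models trained on similar data can
    legitimately agree often, so a high rate is evidence, not proof, of theft.
    """
    baseline = np.array([agreement_rate(owner_preds, p) for p in independent_preds])
    threshold = baseline.mean() + n_std * baseline.std()
    return agreement_rate(owner_preds, suspect_preds) > threshold
```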
Resources
Research Papers
A Systematic Review on Model Watermarking for Neural Networks
Machine learning (ML) models are applied in an increasing variety of domains. The availability of large amounts of data and computational resources encourages the development of ever more complex and valuable models. These models are considered the intellectual property of the legitimate parties who have trained them, which makes their protection against stealing, illegitimate redistribution, and unauthorized application an urgent need. Digital watermarking presents a strong mechanism for marking model ownership and, thereby, offers protection against those threats. This work presents a taxonomy identifying and analyzing different classes of watermarking schemes for ML models. It introduces a unified threat model to allow structured reasoning on and comparison of the effectiveness of watermarking methods in different scenarios. Furthermore, it systematizes desired security requirements and attacks against ML model watermarking. Based on that framework, representative literature from the field is surveyed to illustrate the taxonomy. Finally, shortcomings and general limitations of existing approaches are discussed, and an outlook on future research directions is given.
Protecting Intellectual Property of Deep Neural Networks with Watermarking
Deep learning technologies, which are the key components of state-of-the-art Artificial Intelligence (AI) services, have shown great success in providing human-level capabilities for a variety of tasks such as visual analysis, speech recognition, and natural language processing. Building a production-level deep learning model is a non-trivial task that requires a large amount of training data, powerful computing resources, and human expertise. Illegitimate reproduction, distribution, and derivation of proprietary deep learning models can therefore lead to copyright infringement and economic harm to model creators, so it is essential to devise techniques that protect the intellectual property of deep learning models and enable external verification of model ownership. In this paper, we generalize the "digital watermarking" concept from multimedia ownership verification to deep neural network (DNN) models. We investigate three DNN-applicable watermark generation algorithms, propose a watermark implanting approach to infuse watermarks into deep learning models, and design a remote verification mechanism to determine model ownership. By extending the intrinsic generalization and memorization capabilities of deep neural networks, we enable models to learn specially crafted watermarks during training and to activate pre-specified predictions when observing the watermark patterns at inference. We evaluate our approach with two image recognition benchmark datasets. Our framework accurately (100%) and quickly verifies the ownership of all remotely deployed deep learning models without affecting model accuracy for normal input data. In addition, the embedded watermarks in DNN models are robust and resilient to different counter-watermark mechanisms, such as fine-tuning, parameter pruning, and model inversion attacks.
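As a rough illustration of the remote verification idea described in the abstract above, the sketch below checks whether a deployed model reproduces the pre-specified trigger predictions. The `query_model` callable, the trigger set, and the 90% match threshold are hypothetical placeholders, not the paper's actual protocol.

```python
# Minimal sketch: black-box ownership verification with a secret trigger set.
# `query_model` is a hypothetical function that sends one input to the
# remotely deployed model and returns its predicted label.
from typing import Callable, Sequence


def verify_ownership(query_model: Callable[[object], int],
                     triggers: Sequence[object],
                     expected_labels: Sequence[int],
                     match_threshold: float = 0.9) -> bool:
    """Claim ownership if the remote model reproduces the pre-specified
    trigger predictions on far more inputs than an unrelated model plausibly
    would by chance."""
    matches = sum(int(query_model(x) == y)
                  for x, y in zip(triggers, expected_labels))
    return matches / len(triggers) >= match_threshold
```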