Detecting and preventing distillation attacks: why the AI industry is finally waking up
A late night of API logs, a 20,000-account proxy farm, and the sudden realization that capability copying is no longer theoretical but industrialized.
A security engineer at a mid-sized AI shop notices thousands of near-identical prompts arriving from accounts registered with fake universities and burner emails. On the surface it looks like a botnet doing sentiment analysis, but the pattern points to something else: systematic harvesting of model outputs to train a rival system. That is the moment a technical headache becomes an existential business problem.
The mainstream interpretation treats these incidents as a policing problem for cloud providers and compliance teams. The overlooked angle is that illicit distillation rewrites the economics of model development and national policy: it turns public inference into a cheap shortcut to a competing frontier model. That changes what product owners must defend and how the industry should coordinate.
Why this matters more than a handful of leaked weights
Anthropic has publicly described industrial-scale campaigns that produced millions of exchanges with its Claude models through tens of thousands of fraudulent accounts, arguing that adversaries used those outputs to train competing systems. (anthropic.com) This is no longer academic theory; it is a documented commercial vector aimed at the single most valuable asset modern AI companies possess: the mapping from inputs to high-quality outputs.
Google’s Threat Intelligence Group reported similar behavior against Gemini, including one campaign that issued more than 100,000 prompts with the apparent goal of cloning its reasoning logic. (arstechnica.com) That level of repetition is noisy, but it is exactly what a distillation pipeline needs. The operational scale is now measurable and repeatable.
Who else is being targeted and why now
Large cloud-hosted models are a global target because they provide cheap, scalable inference and the raw material for knowledge distillation. Companies that sell API access, offer freemium tiers, or enable broad developer ecosystems are especially exposed. Smaller teams building proprietary models for regulated industries are newly high-value targets, because a cloned model can be retrained to evade safety controls and then monetized in gray markets.
Competitors and state-linked labs both have incentives: a clone cuts years off a research timeline while avoiding the capital cost of ground-up pretraining. That is why the threat blends corporate espionage with geopolitical risk.
How distillation attacks work in practice
An attacker registers or rents thousands of accounts and routes queries through proxy services to evade rate limits and detection. Prompts are engineered to elicit chain-of-thought reasoning, tool use, or policy handling that a student model must reproduce. The outputs are stitched into a training set, then used to fine-tune or supervise a smaller student model until it mimics the teacher’s behavior at a fraction of the original cost.
In some cases the attacker focuses queries on narrow capabilities, such as coding or simulated reasoning, to reach capability parity faster without copying everything. This surgical approach can be surprisingly effective and embarrassingly cheap compared with training from raw web scrapes.
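To make the pipeline concrete, here is a minimal sketch of the final stitching step: turning harvested prompt-response exchanges into a deduplicated supervised fine-tuning set in a chat-style JSONL format. The field names and sample data are illustrative, not drawn from any documented campaign.

```python
import json

# Hypothetical harvested exchanges; a real campaign would hold millions
# of prompt/response pairs pulled through many accounts and proxies.
harvested = [
    {"prompt": "Explain binary search step by step.", "response": "1. Pick the midpoint..."},
    {"prompt": "Explain binary search step by step.", "response": "1. Pick the midpoint..."},
    {"prompt": "Reverse a string in Python.", "response": "def rev(s): return s[::-1]"},
]

def build_student_dataset(exchanges):
    """Deduplicate harvested teacher outputs and emit supervised
    fine-tuning records in a common chat-style format."""
    seen, records = set(), []
    for ex in exchanges:
        key = (ex["prompt"], ex["response"])
        if key in seen:
            continue  # repeated queries are common in harvesting; keep one copy
        seen.add(key)
        records.append({
            "messages": [
                {"role": "user", "content": ex["prompt"]},
                {"role": "assistant", "content": ex["response"]},
            ]
        })
    return records

dataset = build_student_dataset(harvested)
jsonl = "\n".join(json.dumps(r) for r in dataset)  # ready for a fine-tuning API
print(len(dataset))
```

Everything downstream of this step is ordinary supervised fine-tuning, which is precisely why the harvesting stage is where defenders have leverage.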
Detection strategies that actually work
Behavioral fingerprinting across accounts, classifiers trained to recognize distillation-style prompts, and cross-vendor intelligence sharing are practical first lines of defense. Anthropic describes combining coordinated detection with stricter verification for academic and startup accounts to raise the bar on fraudulent access. (anthropic.com) These measures are necessary but not sufficient, because proxy farms adapt quickly and attackers can blend distillation traffic with legitimate use.
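As an illustration of behavioral fingerprinting, one simple approach (an assumption for exposition, not any vendor's documented method) hashes a normalized form of each prompt so that variants generated from the same template collide, then flags templates issued by suspiciously many distinct accounts:

```python
import hashlib
import re
from collections import defaultdict

def fingerprint(prompt: str) -> str:
    """Collapse variable content so prompts generated from the same
    template map to the same fingerprint. Production systems would use
    richer similarity measures (embeddings, shingling, etc.)."""
    norm = prompt.lower()
    norm = re.sub(r"\d+", "<num>", norm)          # mask numbers
    norm = re.sub(r'"[^"]*"', "<quoted>", norm)   # mask quoted payloads
    return hashlib.sha256(norm.encode()).hexdigest()[:16]

def flag_coordinated(requests, min_accounts=3):
    """Return prompt templates issued by at least `min_accounts`
    distinct accounts, a crude signal of a coordinated account farm."""
    accounts_by_fp = defaultdict(set)
    for account_id, prompt in requests:
        accounts_by_fp[fingerprint(prompt)].add(account_id)
    return {fp: accts for fp, accts in accounts_by_fp.items()
            if len(accts) >= min_accounts}

requests = [
    ("acct1", 'Explain step 12 of "task A" in detail.'),
    ("acct2", 'Explain step 47 of "task B" in detail.'),
    ("acct3", 'Explain step 3 of "task C" in detail.'),
    ("acct4", "What is the weather like today?"),
]
print(flag_coordinated(requests))  # the three templated prompts collide
```

The point of the sketch is the shape of the signal: distillation traffic is templated and spread across identities, while organic traffic is diverse per account.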
Watermarking generated outputs by embedding traceable signatures into model responses is another defensive category that academic teams are actively improving. One recent paper, ModelShield, proposes adaptive watermarking that aims to be robust while preserving output quality. (arxiv.org) Like putting a fluorescent dye into a perfume bottle, it helps with attribution but is not invulnerable to determined scrubbing.
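This is not ModelShield's scheme, but a minimal sketch of the green-list idea behind many LLM watermarks: a hash of the preceding token pseudorandomly partitions the vocabulary, generation is biased toward the "green" half, and detection tests whether a text contains statistically too many green tokens.

```python
import hashlib
import math

GREEN_FRACTION = 0.5  # fraction of the vocabulary favored at each step

def is_green(prev_token: str, token: str) -> bool:
    """Pseudorandomly assign `token` to the green list, seeded by the
    previous token, mirroring how a watermarking sampler biases
    generation at each position."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] < 256 * GREEN_FRACTION

def green_z_score(tokens) -> float:
    """z-score of the observed green count against the null hypothesis
    of unwatermarked text (each token green with prob. GREEN_FRACTION)."""
    n = len(tokens) - 1
    greens = sum(is_green(tokens[i], tokens[i + 1]) for i in range(n))
    expected = GREEN_FRACTION * n
    std = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (greens - expected) / std

# Emulate a watermarking sampler over a toy vocabulary: always pick a
# green-listed continuation, then confirm detection fires.
vocab = [f"tok{i}" for i in range(20)]
prev, watermarked = "start", ["start"]
for _ in range(50):
    choice = next((t for t in vocab if is_green(prev, t)), vocab[0])
    watermarked.append(choice)
    prev = choice

print(round(green_z_score(watermarked), 2))  # well above common detection thresholds
```

Distillation matters here because a student trained on watermarked outputs may inherit a diluted version of this bias, which is exactly the signal the scrubbing attacks in the next section try to erase.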
Why watermarks are not a silver bullet
A spate of research shows that watermarks can be spoofed or scrubbed during unauthorized distillation, and that adaptive attacks can remove or overwrite watermark signals without severely degrading the student model. A study of unified attacks against LLM watermarks demonstrates techniques that both erase and forge watermark traces in distilled models, casting doubt on watermarks as a standalone defense. (arxiv.org) Ownership verification will increasingly be a contested forensic exercise in court and in the lab, not a click-to-confirm check.
Practical implications for businesses with numbers
If an attacker wants a plausible student model, it may require on the order of 1 to 10 million high-quality prompt-response pairs, depending on target complexity. Running those queries at an average cloud inference cost of $0.0005 per request implies a bill of roughly $500 to $5,000, plus proxy and compute costs to train the student. If the student yields a product generating $100,000 in revenue, the return on illicit investment is obvious. Companies must therefore count their losses as stolen revenue plus the accelerated time-to-market of a cloned competitor.
Operational controls buy time and make attacks more expensive. Rate limiting that raises marginal cost tenfold turns that $500-to-$5,000 bill into $5,000 to $50,000. That does not stop industrial players, but it changes which actors can viably steal capabilities.
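The arithmetic above is easy to check, and putting it in one place makes the defender's lever explicit: the only variable a provider directly controls is the effective cost per query.

```python
def extraction_cost(queries: int, cost_per_query: float,
                    overhead_multiplier: float = 1.0) -> float:
    """Direct inference bill for a harvesting campaign. Proxy rental and
    student training compute are excluded, as in the article's estimate."""
    return queries * cost_per_query * overhead_multiplier

# Figures from the article: 1-10M prompt-response pairs at $0.0005/request.
low = extraction_cost(1_000_000, 0.0005)       # $500
high = extraction_cost(10_000_000, 0.0005)     # $5,000

# Rate limiting that raises marginal cost tenfold shifts the bill accordingly.
low_defended = extraction_cost(1_000_000, 0.0005, overhead_multiplier=10)    # $5,000
high_defended = extraction_cost(10_000_000, 0.0005, overhead_multiplier=10)  # $50,000

print(low, high, low_defended, high_defended)
```

Even the defended figures are small next to a $100,000 product, which is why friction alone only filters out opportunists.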
Distillation attacks turn a public API into a conveyor belt for someone else’s model, and the conveyor belt looks shockingly cheap.
Small teams should take this personally, because their defensibility rests not just on model performance but on the cost and friction of extracting that performance. If extraction becomes cheap, product differentiation collapses.
The cost nobody is calculating
Legal fights over intellectual property and export controls are slow and expensive, and they do not retroactively prevent capability diffusion. Companies should budget for three parallel investments: detection and account hygiene, robust watermarking and forensic capability, and legal and policy advocacy. Each is a recurring operating cost that scales with API surface area, and each matters most for companies selling regulated access or handling sensitive corporate data.
A final pragmatic tool is designing models whose high-value behaviors degrade gracefully when there is proxy evidence of distillation, for example rate-limited tool calls or time-limited high-fidelity reasoning outputs.
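One way to sketch such graceful degradation: a per-account suspicion score, produced by whatever behavioral analytics the provider runs, maps to a response tier. The names and thresholds here are hypothetical, chosen only to show the shape of the policy.

```python
from dataclasses import dataclass

@dataclass
class ResponsePolicy:
    allow_tool_calls: bool
    max_reasoning_tokens: int
    description: str

def policy_for(suspicion_score: float) -> ResponsePolicy:
    """Map a distillation-suspicion score in [0, 1] to a degraded
    response tier. Thresholds are illustrative, not recommendations."""
    if suspicion_score < 0.3:
        return ResponsePolicy(True, 4096, "full fidelity")
    if suspicion_score < 0.7:
        return ResponsePolicy(True, 1024, "reduced reasoning depth")
    return ResponsePolicy(False, 256, "tool calls disabled, terse output")

print(policy_for(0.85).description)  # prints "tool calls disabled, terse output"
```

The design choice worth noting is that degradation is continuous and reversible, so a false positive costs a user some fidelity rather than an account.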
Risks and open questions that need stress testing
The most important unknowns are attacker adaptation speed and the arms race in watermark removal. Watermarks that look robust today may be brittle against adaptive fine-tuning and noise injection. Academic papers show promising defenses on paper but also outline concrete bypasses. (arxiv.org) There is also regulatory risk: export controls designed to limit model diffusion create incentives for clandestine extraction, and a diplomatic headache when attribution is contested.
What the industry should do next
Companies must treat distillation as a multidisciplinary problem combining engineering, legal, and threat intelligence. Operational collaboration between providers, shared indicators of compromise, and investment in provable defenses will be needed to push the economic cost of distillation beyond what opportunistic actors can bear. Policymakers should craft rules that make cross-border misuse trackable without choking legitimate research.
Looking ahead
The next two years will likely see a burst of defensive innovation followed by adaptive attacks, which means businesses must build detection and response into product roadmaps as standard overhead, not optional insurance.
Key Takeaways
- Distillation attacks convert high quality model outputs into training data that can produce cheap clones, threatening both revenue and safety.
- Detection requires behavioral analytics, account verification, and cross vendor intelligence to identify industrial scale extraction.
- Watermarking and forensic methods help with attribution but are not yet reliable standalone defenses.
- Firms must budget for technical defenses, legal readiness, and active policy engagement to raise the cost of illicit distillation.
Frequently Asked Questions
How can a small AI startup detect if someone is trying to distill its model?
Monitor for high-volume, narrowly focused queries spread across many accounts and IPs, and look for repeated prompt patterns that solicit chain-of-thought reasoning. Combining behavioral analytics with stricter account verification for unusual usage pathways reduces false negatives.
Can watermarking prove a model was stolen?
Watermarks can provide evidence but are not incontrovertible because adaptive attacks can erase or spoof marks. Watermarks are best used as one piece of forensic evidence alongside logs, metadata, and infrastructure indicators.
What immediate steps should a product manager take after spotting suspicious activity?
Throttle and isolate suspicious traffic, capture and preserve full request metadata, notify cloud vendors and peers, and consult legal counsel about preservation orders. Rapidly sharing indicators can help block the same proxy services before they are reused elsewhere.
Does export control policy help stop distillation attacks?
Export controls can limit hardware and some direct model transfers but do not stop remote extraction via public inference. Policy needs to be paired with industry cooperation and technical controls to be effective.
How expensive is it to mount a distillation campaign at scale?
Costs vary widely by model complexity and target fidelity but can be surprisingly low relative to full model training. Defensive measures that raise per query cost and reduce signal quality are the most effective way to increase attacker expense.
Related Coverage
Readers may want to explore how model watermarking techniques are evolving, what legal options companies have for intellectual property protection in AI, and how cloud providers are adapting detection tooling for API abuse. These adjacent topics explain the toolbox companies must use to defend both revenue and public safety.
SOURCES:
- https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks
- https://arstechnica.com/ai/2026/02/attackers-prompted-gemini-over-100000-times-while-trying-to-clone-it-google-says/
- https://arxiv.org/abs/2405.02365
- https://arxiv.org/abs/2504.17480
- https://arxiv.org/abs/2402.17938