Surely You Just Remove the Data?

What the research actually says about keeping dangerous knowledge out of AI models

Mar 23, 2026

TLDR

Removing dangerous knowledge from AI models after training does not work reliably. Filtering before training is stronger, but open weights undo it. The labs doing the filtering set their own standards, publish results voluntarily, and answer to no external auditor. That is the governance problem.

Contents

The premise and why it fails
Why machine unlearning does not remove knowledge
The stronger case for pre-training filtration
Three limits of pre-training filtration: aggregation, boundaries, open weights
What the research means for governance

If an AI model can produce dangerous content, the obvious fix seems simple: remove the data it learned from, and it can no longer produce that content. It is a reasonable assumption. It is also, in the main, wrong. The reasons why matter more than most public discussion of AI safety acknowledges.

There are two points at which the intervention can happen. You can filter dangerous content from the training data before the model ever ingests it. Or you can attempt to remove it after the fact, through a process called machine unlearning. Both approaches are active areas of serious research. Neither currently works to the standard the language around them implies.

Why machine unlearning does not remove knowledge

Machine unlearning is the technical term for attempts to remove specific knowledge from a model that has already been trained. The intuition behind it is straightforward. The reality of how neural networks store information makes it significantly harder than it sounds.

Knowledge in a large language model is distributed across billions of parameters, entangled with everything else the model has learned. When researchers attempt to make a model unlearn something, what they are typically doing is fine-tuning it to stop producing certain outputs. The underlying weights encoding that knowledge remain largely intact. The result is a filter over a vault. The vault stays full.

The evidence for this is consistent and accumulating. Adversarial relearning attacks, which fine-tune a model on a small subset of the data it was supposed to have forgotten, reliably restore the suppressed capabilities. Deeb and Roger (2025) demonstrated that this holds even when the fine-tuning data is informationally distinct from the original forget set, indicating the failure goes beyond simple rote memorisation.

A separate failure mode received less attention. Research presented at ICLR 2025 found that applying standard model quantisation, a common compression technique used in deployment, to a model that had undergone unlearning recovered approximately 89% of the nominally removed knowledge. Unlearning can be undone by routine engineering decisions made downstream of the safety intervention, with no adversarial intent required.

There is also the sandbagging problem. Models can be induced to strategically underperform on safety benchmarks while retaining full capability, with performance unlocked by a password or specific prompting pattern. A model that appears to have forgotten something under evaluation conditions may retain it entirely.

The WMDP benchmark, developed by the Center for AI Safety in 2024 to measure hazardous knowledge across biosecurity, cybersecurity and chemical security, was built in part because refusal training was already understood to be insufficient. Refusal training teaches a model to decline a question. Unlearning is meant to go further, removing the knowledge that would underpin an answer. The benchmark exists because the gap between those two things matters.

The stronger case for pretraining filtration

The stronger argument, and the one that has gained the most traction in the safety research community, is that the intervention needs to happen before training. A model that never ingests dangerous content cannot reconstruct it. A fully jailbroken model that lacks the underlying knowledge cannot be made to produce what it does not know.

The most comprehensive study of this approach to date is the Deep Ignorance paper from EleutherAI, the UK AI Security Institute and the University of Oxford, presented at the NeurIPS 2025 Workshop on Biosecurity Safeguards for Generative AI. The researchers pretrained multiple 6.9 billion parameter models from scratch, with training data filtered to remove biothreat-proxy content using a multi-stage pipeline combining a blocklist and a machine learning classifier. The filtered models showed substantial resistance to adversarial fine-tuning attacks involving up to 10,000 steps and 300 million tokens of biothreat-related text. Existing post-training safety techniques are typically undone within a few hundred fine-tuning steps. On that measure, pretraining filtration is more than an order of magnitude more effective than post-training removal.

Anthropic has published its own experiment in the same direction. Their alignment team filtered CBRN-related content from pretraining data using a classifier, retrained models from scratch on the filtered dataset, and measured the effect on performance across harmful and benign capability evaluations. The result was a 33% relative reduction in performance on a harmful capabilities benchmark, with no significant degradation to general capabilities. It is a meaningful number. It is not elimination.

How to Explain the Article to a 14 year old:
AI companies say they can scrub dangerous knowledge out of their models. The research says they basically cannot.
When you try to remove something after the model has already learned it, the knowledge stays. You have just told it to stop bringing it up. Someone who knows what they are doing can get it talking again pretty easily.
Teaching it only safe stuff from the start works better. But here is the problem. Dangerous knowledge does not come with a warning label. A biology textbook looks the same as a bioweapons manual to a filter. And sometimes the dangerous stuff is not in any single document at all. It is what you can figure out when you have read a million normal things and start joining the dots.
So even the better approach has gaps.
And nobody outside the companies is actually checking whether any of this works. The labs test themselves and publish what they want to publish.

Three limits of pretraining filtration

Three problems constrain what pretraining filtration can achieve, even when executed well.

The first is the aggregation problem. Filtering individual documents identified as hazardous does not address the capacity of large language models to synthesise dangerous knowledge from the combination of documents that are individually benign. Kevin Esvelt and colleagues at MIT have described this as an unmasking risk: LLMs can surface patterns across millions of sources that no individual source discloses. The dangerous content is in the adjacencies between documents at scale. A filtering pipeline that operates at the document level cannot address a risk that is emergent at the corpus level.

The second is the boundary problem. Dual-use knowledge has both hazardous and beneficial applications. The same microbiology that describes how a pathogen causes disease is the knowledge required to model how it might be engineered. Filtering for dangerous biology unavoidably removes content needed for vaccine development, biosurveillance and public health modelling. The line between what to keep and what to remove has no clean technical solution.

The third is the open-weights problem, and it is the most concrete. The Evo 2 genomic AI model excluded viral genomes that infect eukaryotic hosts from its training data as a deliberate biosecurity measure. The researchers then published the model weights, in line with standard academic practice. Within months, a team presented findings at the NeurIPS 2025 biosecurity workshop demonstrating they had fine-tuned the model on sequences from 110 human-infecting viruses, restoring capabilities the original developers had worked to prevent. The case demonstrates the ceiling on pretraining filtration as a safety strategy. It holds only under closed-weight deployment. The moment weights are public, any downstream actor with relevant data and modest compute can undo the curation decision made before release.

The Deep Ignorance paper is explicit on a further limit. Even filtered models retain the ability to use dangerous knowledge when it is provided in their context window, for example via a search tool or retrieved document. Pretraining filtration prevents a model from having internalised knowledge. It does not prevent a model from acting on knowledge it is given.

What the research means for governance

Both interventions are the right places to look. Pretraining filtration is meaningfully stronger, and the recent research makes a credible case for it as a tamper-resistant layer of defence that post-training techniques cannot match. Neither currently reaches the standard implied when companies describe their models as having had dangerous content removed.

OpenAI updated its own risk assessment in April 2025, concluding it expects its models to demonstrate what it classified as high risks in the near future, meaning the capability to substantially increase the likelihood of bioterrorist attacks by providing meaningful assistance to novices. That assessment came from the same organisation whose earlier evaluations had found only mild uplift.

The filtering decisions are currently made by individual labs, using proprietary methodologies, with no external verification. Where numbers are published at all, they appear voluntarily on alignment blogs. No independent auditor reviews them. No regulator sets the standard for what a sufficient reduction means.

The Evo case is a demonstration of what happens when safety decisions made without external oversight can be reversed by any downstream actor who disagrees with them.

Keeping dangerous knowledge out at the point of training is the stronger intervention. It needs to become a verifiable, externally audited standard, not a voluntary experiment. That is a governance problem, not a technical one.

Notes

Che et al. (2024), cited in: Layered Unlearning for Adversarial Relearning (2025). https://arxiv.org/abs/2505.09500
Deeb and Roger (2025), cited in: Open Problems in Machine Unlearning for AI Safety.
Catastrophic Failure of LLM Unlearning via Quantization, ICLR 2025. https://proceedings.iclr.cc/paper_files/paper/2025/file/ba79fb5c4fe70050752f20c90c5f07ca-Paper-Conference.pdf
Weij et al. (2024), sandbagging via password-locked models, cited in WMDP benchmark coverage.
Li et al. (2024). The WMDP Benchmark. Center for AI Safety. https://arxiv.org/abs/2403.03218
O’Brien et al. (2025). Deep Ignorance. EleutherAI / NeurIPS 2025. https://arxiv.org/abs/2508.06601
Anthropic Alignment Team (2025). Enhancing Model Safety through Pretraining Data Filtering. https://alignment.anthropic.com/2025/pretraining-data-filtering/
Kevin Esvelt, cited in: LLMs and Information Hazards, The Biosecurity Handbook.
Black et al. (2025). Open-weight genome language model safeguards. NeurIPS 2025 Workshop on Biosecurity Safeguards for Generative AI.
O’Brien et al. (2025), op. cit.
OpenAI (April 2025), cited in CSIS (2025).

Discussion about this post

Ready for more?