<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Discarded.AI: Learning]]></title><description><![CDATA[My collation of the books I have read, the people I have followed, and the notes from the courses I complete]]></description><link>https://www.discarded.ai/s/learning</link><image><url>https://www.discarded.ai/img/substack.png</url><title>Discarded.AI: Learning</title><link>https://www.discarded.ai/s/learning</link></image><generator>Substack</generator><lastBuildDate>Thu, 30 Apr 2026 06:10:53 GMT</lastBuildDate><atom:link href="https://www.discarded.ai/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Alan Robertson]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[alanai@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[alanai@substack.com]]></itunes:email><itunes:name><![CDATA[Alan Robertson]]></itunes:name></itunes:owner><itunes:author><![CDATA[Alan Robertson]]></itunes:author><googleplay:owner><![CDATA[alanai@substack.com]]></googleplay:owner><googleplay:email><![CDATA[alanai@substack.com]]></googleplay:email><googleplay:author><![CDATA[Alan Robertson]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Surely You Just Remove the Data?]]></title><description><![CDATA[What the research actually says about keeping dangerous knowledge out of AI models]]></description><link>https://www.discarded.ai/p/surely-you-just-remove-the-data</link><guid isPermaLink="false">https://www.discarded.ai/p/surely-you-just-remove-the-data</guid><dc:creator><![CDATA[Alan Robertson]]></dc:creator><pubDate>Mon, 23 Mar 2026 21:00:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!CCOZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F782ddd9a-1fd9-4bf4-839d-b685260f513a_1024x1024.png" length="0" type="image/png"/><content:encoded><![CDATA[<p><strong>TLDR</strong></p><p>Removing dangerous knowledge from AI models after training does not work reliably. Filtering before training is stronger, but open weights undo it. The labs doing the filtering set their own standards, publish results voluntarily, and answer to no external auditor.
That is the governance problem.</p><div><hr></div><p><strong>Contents</strong></p><ol><li><p>The premise and why it fails</p></li><li><p>Why machine unlearning does not remove knowledge</p></li><li><p>The stronger case for pretraining filtration</p></li><li><p>Three limits of pretraining filtration: aggregation, boundaries, open weights</p></li><li><p>What the research means for governance</p></li></ol><div><hr></div><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!CCOZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F782ddd9a-1fd9-4bf4-839d-b685260f513a_1024x1024.png" width="470" height="470" class="sizing-normal" alt=""></figure></div><p>If an AI model can produce dangerous content, the obvious fix seems simple: remove the data it learned from, and it can no longer produce that content. It is a reasonable assumption. It is also, in the main, wrong. The reasons why matter more than most public discussion of AI safety acknowledges.</p><p>There are two points at which the intervention can happen. You can filter dangerous content from the training data before the model ever ingests it. Or you can attempt to remove it after the fact, through a process called machine unlearning. Both approaches are active areas of serious research. Neither currently works to the standard the language around them implies.</p><p><strong>Why machine unlearning does not remove knowledge</strong></p><p>Machine unlearning is the technical term for attempts to remove specific knowledge from a model that has already been trained. The intuition behind it is straightforward. The reality of how neural networks store information makes it significantly harder than it sounds.</p><p>Knowledge in a large language model is distributed across billions of parameters, entangled with everything else the model has learned. When researchers attempt to make a model unlearn something, what they are typically doing is fine-tuning it to stop producing certain outputs. The underlying weights encoding that knowledge remain largely intact. The result is a filter over a vault. The vault stays full.</p><p>The evidence for this is consistent and accumulating. Adversarial relearning attacks, which fine-tune a model on a small subset of the data it was supposed to have forgotten, reliably restore the suppressed capabilities. Deeb and Roger (2025) demonstrated that this holds even when the fine-tuning data is informationally distinct from the original forget set, indicating the failure goes beyond simple rote memorisation.</p>
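<p>To make the mechanics concrete, here is a minimal sketch of what a relearning attack involves, written against the Hugging Face transformers API. This is not the protocol of the papers cited: the model name and the corpus file are hypothetical placeholders, and real attacks tune hyperparameters with far more care. The point it illustrates is that nothing more exotic than ordinary fine-tuning is required.</p><pre><code># Minimal sketch of an adversarial relearning attack (assumed names throughout).
# "some-lab/unlearned-model" and related_corpus.txt are placeholders, not real artefacts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("some-lab/unlearned-model")
tokenizer = AutoTokenizer.from_pretrained("some-lab/unlearned-model")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Texts adjacent to, but not identical with, the original forget set.
relearn_texts = open("related_corpus.txt").read().split("\n\n")

model.train()
for text in relearn_texts[:300]:  # a few hundred steps is often enough
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    loss = model(**batch, labels=batch["input_ids"]).loss  # standard LM loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Re-running a forget-set evaluation at this point typically shows the
# suppressed knowledge returning, which is the failure mode described above.</code></pre>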
<p>A separate failure mode received less attention. Research presented at ICLR 2025 found that applying standard model quantisation, a common compression technique used in deployment, to a model that had undergone unlearning recovered approximately 89% of the nominally removed knowledge. Unlearning can be undone by routine engineering decisions made downstream of the safety intervention, with no adversarial intent required.</p><p>There is also the sandbagging problem. Models can be induced to strategically underperform on safety benchmarks while retaining full capability, with performance unlocked by a password or specific prompting pattern. A model that appears to have forgotten something under evaluation conditions may retain it entirely.</p><p>The WMDP benchmark, developed by the Center for AI Safety in 2024 to measure hazardous knowledge across biosecurity, cybersecurity and chemical security, was built in part because refusal training was already understood to be insufficient. Refusal training teaches a model to decline a question. Unlearning is meant to go further, removing the knowledge that would underpin an answer. The benchmark exists because the gap between those two things matters.</p><p><strong>The stronger case for pretraining filtration</strong></p><p>The stronger argument, and the one that has gained the most traction in the safety research community, is that the intervention needs to happen before training. A model that never ingests dangerous content cannot reconstruct it. A fully jailbroken model that lacks the underlying knowledge cannot be made to produce what it does not know.</p><p>The most comprehensive study of this approach to date is the Deep Ignorance paper from EleutherAI, the UK AI Security Institute and the University of Oxford, presented at the NeurIPS 2025 Workshop on Biosecurity Safeguards for Generative AI. The researchers pretrained multiple 6.9-billion-parameter models from scratch, with training data filtered to remove biothreat-proxy content using a multi-stage pipeline combining a blocklist and a machine learning classifier. The filtered models showed substantial resistance to adversarial fine-tuning attacks involving up to 10,000 steps and 300 million tokens of biothreat-related text. Existing post-training safety techniques are typically undone within a few hundred fine-tuning steps. On that measure, pretraining filtration is more than an order of magnitude more effective than post-training removal.</p><p>Anthropic has published its own experiment in the same direction. Its alignment team filtered CBRN-related content from pretraining data using a classifier, retrained models from scratch on the filtered dataset, and measured the effect on performance across harmful and benign capability evaluations. The result was a 33% relative reduction in performance on a harmful capabilities benchmark, with no significant degradation to general capabilities. It is a meaningful number. It is not elimination.</p>
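<p>For a sense of what a multi-stage filter looks like in practice, here is a toy sketch in Python. Every name and term in it is illustrative rather than drawn from either paper, and the classifier is a crude stand-in for a model trained on labelled data; the published pipelines are far more elaborate. What the sketch does show is that each stage judges one document at a time, which matters for the limits discussed next.</p><pre><code># Toy two-stage pretraining data filter (illustrative names and terms only).
BLOCKLIST = {"toxin synthesis", "enhanced transmissibility"}  # placeholder terms

def classifier_score(document: str) -> float:
    """Stand-in for a trained hazard classifier returning P(hazardous).
    A real pipeline trains this on labelled data; a crude keyword
    density keeps the sketch runnable."""
    hits = sum(term in document.lower() for term in ("pathogen", "aerosol"))
    return hits / 2

def keep_document(document: str, threshold: float = 0.5) -> bool:
    text = document.lower()
    # Stage 1: blocklist. Cheap and high precision, applied to everything.
    if any(term in text for term in BLOCKLIST):
        return False
    # Stage 2: classifier. Catches paraphrases the blocklist misses.
    return classifier_score(document) < threshold

# Both stages operate per document, so hazards that only emerge from
# combining many individually benign documents pass straight through.
corpus = ["a benign microbiology lecture", "notes on toxin synthesis routes"]
filtered = [doc for doc in corpus if keep_document(doc)]</code></pre>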
<div><hr></div><blockquote><p><strong>How to Explain the Article to a 14-year-old:</strong></p><p>AI companies say they can scrub dangerous knowledge out of their models. The research says they basically cannot.</p><p>When you try to remove something after the model has already learned it, the knowledge stays. You have just told it to stop bringing it up. Someone who knows what they are doing can get it talking again pretty easily.</p><p>Teaching it only safe stuff from the start works better. But here is the problem. Dangerous knowledge does not come with a warning label. A biology textbook looks the same as a bioweapons manual to a filter. And sometimes the dangerous stuff is not in any single document at all. It is what you can figure out when you have read a million normal things and start joining the dots.</p><p>So even the better approach has gaps.</p><p>And nobody outside the companies is actually checking whether any of this works. The labs test themselves and publish what they want to publish.</p></blockquote><div><hr></div><p><strong>Three limits of pretraining filtration</strong></p><p>Three problems constrain what pretraining filtration can achieve, even when executed well.</p><p>The first is the aggregation problem. Filtering individual documents identified as hazardous does not address the capacity of large language models to synthesise dangerous knowledge from the combination of documents that are individually benign. Kevin Esvelt and colleagues at MIT have described this as an unmasking risk: LLMs can surface patterns across millions of sources that no individual source discloses. The dangerous content lives in the adjacencies between documents at scale. A filtering pipeline that operates at the document level cannot address a risk that is emergent at the corpus level.</p><p>The second is the boundary problem. Dual-use knowledge has both hazardous and beneficial applications. The same microbiology that describes how a pathogen causes disease is the knowledge required to model how it might be engineered. Filtering for dangerous biology unavoidably removes content needed for vaccine development, biosurveillance and public health modelling. The line between what to keep and what to remove has no clean technical solution.</p><p>The third is the open-weights problem, and it is the most concrete. The Evo 2 genomic AI model excluded viral genomes that infect eukaryotic hosts from its training data as a deliberate biosecurity measure. The researchers then published the model weights, in line with standard academic practice. Within months, a team presented findings at the NeurIPS 2025 biosecurity workshop demonstrating that they had fine-tuned the model on sequences from 110 human-infecting viruses, restoring capabilities the original developers had worked to prevent. The case demonstrates the ceiling on pretraining filtration as a safety strategy. It holds only under closed-weight deployment. The moment weights are public, any downstream actor with relevant data and modest compute can undo the curation decision made before release.</p><p>The Deep Ignorance paper is explicit on a further limit. Even filtered models retain the ability to use dangerous knowledge when it is provided in their context window, for example via a search tool or retrieved document. Pretraining filtration prevents a model from having internalised knowledge. It does not prevent a model from acting on knowledge it is given.</p><p><strong>What the research means for governance</strong></p><p>Both interventions are the right places to look. Pretraining filtration is meaningfully stronger, and the recent research makes a credible case for it as a tamper-resistant layer of defence that post-training techniques cannot match.
Neither currently reaches the standard implied when companies describe their models as having had dangerous content removed.</p><p>OpenAI updated its own risk assessment in April 2025, concluding that it expects its models to demonstrate what it classifies as high risks in the near future, meaning the capability to substantially increase the likelihood of bioterrorist attacks by providing meaningful assistance to novices. That assessment came from the same organisation whose earlier evaluations had found only mild uplift.</p><p>The filtering decisions are currently made by individual labs, using proprietary methodologies, with no external verification. Where numbers are published at all, they appear voluntarily on alignment blogs. No independent auditor reviews them. No regulator sets the standard for what a sufficient reduction means.</p><p>The Evo 2 case is a demonstration of what happens when safety decisions are made without external oversight: they can be reversed by any downstream actor who disagrees with them.</p><p>Keeping dangerous knowledge out at the point of training is the stronger intervention. It needs to become a verifiable, externally audited standard, not a voluntary experiment. That is a governance problem, not a technical one.</p><div><hr></div><p><em>Notes</em></p><ol><li><p>Che et al. (2024), cited in: Layered Unlearning for Adversarial Relearning (2025). <a href="https://arxiv.org/abs/2505.09500">https://arxiv.org/abs/2505.09500</a></p></li><li><p>Deeb and Roger (2025), cited in: Open Problems in Machine Unlearning for AI Safety.</p></li><li><p>Catastrophic Failure of LLM Unlearning via Quantization, ICLR 2025. <a href="https://proceedings.iclr.cc/paper_files/paper/2025/file/ba79fb5c4fe70050752f20c90c5f07ca-Paper-Conference.pdf">https://proceedings.iclr.cc/paper_files/paper/2025/file/ba79fb5c4fe70050752f20c90c5f07ca-Paper-Conference.pdf</a></p></li><li><p>van der Weij et al. (2024), sandbagging via password-locked models, cited in WMDP benchmark coverage.</p></li><li><p>Li et al. (2024). The WMDP Benchmark. Center for AI Safety. <a href="https://arxiv.org/abs/2403.03218">https://arxiv.org/abs/2403.03218</a></p></li><li><p>O&#8217;Brien et al. (2025). Deep Ignorance. EleutherAI / NeurIPS 2025. <a href="https://arxiv.org/abs/2508.06601">https://arxiv.org/abs/2508.06601</a></p></li><li><p>Anthropic Alignment Team (2025). Enhancing Model Safety through Pretraining Data Filtering. <a href="https://alignment.anthropic.com/2025/pretraining-data-filtering/">https://alignment.anthropic.com/2025/pretraining-data-filtering/</a></p></li><li><p>Kevin Esvelt, cited in: LLMs and Information Hazards, The Biosecurity Handbook.</p></li><li><p>Black et al. (2025). Open-weight genome language model safeguards. NeurIPS 2025 Workshop on Biosecurity Safeguards for Generative AI.</p></li><li><p>O&#8217;Brien et al. (2025), op. cit.</p></li><li><p>OpenAI (April 2025), cited in CSIS (2025).</p></li></ol>]]></content:encoded></item><item><title><![CDATA[Your Privacy is a Compromise.]]></title><description><![CDATA[Every platform you use makes you a deal. Most people never read the terms.
This guide does it for you.]]></description><link>https://www.discarded.ai/p/your-privacy-is-a-compromise</link><guid isPermaLink="false">https://www.discarded.ai/p/your-privacy-is-a-compromise</guid><dc:creator><![CDATA[Alan Robertson]]></dc:creator><pubDate>Tue, 17 Mar 2026 13:03:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!i6uK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30de8557-1aa1-45bb-831c-e62631d252dc_1440x1440.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>TLDR: Every platform in this guide has already made privacy decisions on your behalf. Most of those decisions benefit them, not you. This guide names the decision, the honest reward for accepting it, and the honest cost of changing it. 20 platforms. Verified March 2026. Download at the bottom of this piece.</p><h2>The platforms covered span six categories.</h2><p>Social: LinkedIn, Facebook, Instagram, TikTok, X.</p><p>Creator: Substack, Medium.</p><p>AI Tools: ChatGPT, ChatGPT Edu (Institutional), Claude, Gemini, Microsoft Copilot.</p><p>Workspace: Google Workspace, Slack, Zoom, Notion.</p><p>Research and Academic: ResearchGate, Academia.edu.</p><p>Devices: Meta Ray-Ban Smart Glasses, Apple Ecosystem, Voice Assistants (Alexa and Google Home).</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://drive.google.com/file/d/1bKInApTxvKIzCc2k-egfekr2RXJRly8S/view?usp=sharing&quot;,&quot;text&quot;:&quot;FREE GUIDE HERE&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://drive.google.com/file/d/1bKInApTxvKIzCc2k-egfekr2RXJRly8S/view?usp=sharing"><span>FREE GUIDE HERE</span></a></p><div class="pullquote"><h2><em>The default setting is never chosen with your interests in mind.</em></h2></div><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!i6uK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30de8557-1aa1-45bb-831c-e62631d252dc_1440x1440.png" width="463" height="463" class="sizing-normal" alt=""></figure></div>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/30de8557-1aa1-45bb-831c-e62631d252dc_1440x1440.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1440,&quot;width&quot;:1440,&quot;resizeWidth&quot;:463,&quot;bytes&quot;:3760988,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.discarded.ai/i/191065608?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30de8557-1aa1-45bb-831c-e62631d252dc_1440x1440.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!i6uK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30de8557-1aa1-45bb-831c-e62631d252dc_1440x1440.png 424w, https://substackcdn.com/image/fetch/$s_!i6uK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30de8557-1aa1-45bb-831c-e62631d252dc_1440x1440.png 848w, https://substackcdn.com/image/fetch/$s_!i6uK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30de8557-1aa1-45bb-831c-e62631d252dc_1440x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!i6uK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30de8557-1aa1-45bb-831c-e62631d252dc_1440x1440.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>It is chosen with the platform&#8217;s interests in mind.</p><p>That is not a conspiracy. It is a business model. And it has been operating on you, across every device you own, in every app you have open, on the platform you are reading this on, without you reading the terms.</p><p>This piece is about what that actually means. Not in the abstract. In the specific. 
And at the end of it there is a guide, free to download, that covers 20 platforms, every setting worth changing, and the honest cost of every decision.</p><h2><strong>In September 2024</strong></h2><p>&#8230;two Harvard students built a tool called I-XRAY. They attached it to a pair of Meta Ray-Ban smart glasses and walked around a university campus. The tool combined the glasses&#8217; camera with publicly available facial recognition software. In real time, it identified strangers. Not just their faces. Their home addresses, their phone numbers, and the names of their family members.</p><p>The recording indicator light on the glasses is small. Most bystanders never notice it. There is no consent mechanism. The person being identified has no idea it is happening.</p><p>In February 2026, a leaked internal Meta memo confirmed the company is building exactly this capability natively into the next version of the glasses. They are calling it &#8220;Name Tag.&#8221; The memo noted the launch should be timed for when civil society groups are &#8220;focused on other concerns.&#8221;</p><p>That is one end of the spectrum.</p><h2>Here is the other end.</h2><p>Since November 2025, LinkedIn has used your posts, your career history, and your profile to train its AI systems and feed Microsoft&#8217;s advertising machine. By default. The setting exists. Most users have never seen it. It lives at Settings and Privacy, then Data Privacy, then Data for Generative AI Improvement.</p><p>The toggle appears off on the page. But the data processing continues unless you also submit a separate Data Processing Objection form. Two steps. One visible. One not.</p><p>Both of those things, the glasses and the toggle, are privacy stories. But they are not the same kind. The glasses are a product decision made without you. The toggle is a transaction you agreed to without reading it.</p><p>Most privacy coverage treats these as equivalent failures. They are not. One requires legislation. The other requires a setting change and thirty seconds of your time.</p><h2>This guide is about the second kind.</h2><div><hr></div><p>The problem with privacy advice is that it usually does one of two things.</p><p>It tells you to turn everything off, without telling you what you lose. Or it tells you the platform is terrible, without telling you what to actually do about it.</p><p>Neither is useful. Because privacy is not binary.</p><p>Turning off LinkedIn&#8217;s AI training means your writing stays yours. It also means the AI features that have learned your career voice go generic. That is a real trade. You should know about it before you make it.</p><p>Turning off Facebook&#8217;s Off-Facebook Activity tracking is one of the highest-impact privacy changes available. It also means every service you accessed via Facebook login now requires a separate account. That is a real cost. You should know about it.</p><p>Turning off Gemini&#8217;s Smart Features in Gmail stops Google&#8217;s AI reading your email. But there are two separate toggles. Most people find one and assume they are done. They are not.</p><p>The TikTok question is not really about settings at all. The 2026 privacy policy explicitly added precise GPS location collection. Biometric data is collected. ByteDance is subject to Chinese national security law, which requires cooperation with state intelligence services on request. The settings you can change are real and worth changing. But they do not change what TikTok is.</p><p>These are not equivalent risks. 
A tool that helps you understand the difference between them is more useful than one that treats every platform the same.</p><h2>That is what this guide attempts to do.</h2><p>For each of the 20 platforms it covers, it names the default trap: what the platform decided for you without asking. It names the trade: the honest reward for accepting that default, and the honest risk of doing so. And for every setting worth changing, it gives you the verified path, the genuine benefit of leaving it on, and the genuine cost of turning it off.</p><p><strong>Every setting path was verified in March 2026.</strong></p><div><hr></div><h2>A few things worth knowing before you read it.</h2><p>On Facebook, the face recognition setting that appears in many privacy guides has been removed. Meta shut down the feature in November 2021 and deleted over a billion facial recognition templates. Any guide still listing it is out of date.</p><p>On Instagram, the end-to-end encrypted DM feature some users opted into is being permanently removed on May 8, 2026. Download your encrypted message history before that date. After May 8, Meta can read all Instagram direct messages.</p><p>On X, the Grok AI training opt-out exists only on desktop. It is not available in the mobile app. Millions of users who have only ever used X on their phones have never had access to it.</p><p>On Apple, Advanced Data Protection, the setting that end-to-end encrypts your iCloud backups, Photos, and Notes, was removed for UK users in February 2025, following a demand from the UK government under the Investigatory Powers Act. New UK users cannot enable it. UK iCloud data remains accessible to Apple and to UK authorities with a legal warrant. Any guide recommending this setting without that caveat is not written for a UK audience.</p><p>On Claude, the AI assistant you may be using to draft strategy, summarise documents, or work through client problems, the September 2025 terms update introduced an opt-in training toggle for Free, Pro, and Max accounts. Opting in extends data retention from 30 days to 5 years. A 60x increase. If you accepted the updated terms without reading them, check Settings, then Privacy, then Help improve Claude.</p><h2>The goal of this guide is not maximum privacy.</h2><p>Maximum privacy means leaving every platform, disabling every feature, and accepting every cost. For most people that is not realistic or desirable. LinkedIn reach matters. Gmail Smart Compose saves time. ChatGPT memory makes the tool more useful. These are real benefits and the guide treats them as such.</p><p>The goal is informed choice. To know what the deal is before you accept it. To understand what you gain by leaving a setting on, and what you lose by turning it off. To make the decision consciously rather than by default.</p><p>Every platform covered in this guide made a choice about where to set the starting point. That choice was not made with your interests in mind. It was made with theirs.</p><p>The guide is an attempt to give you the information you need to decide whether their starting point is also yours.</p><h2>Download the full guide below. It is free.</h2><p>If you find it useful, the most useful thing you can do is share it with colleagues, with your team, with anyone who uses these platforms and has never thought about what they agreed to.</p><p>And if you want more of this: forensic, verified, no flannel. discarded.ai tracks the gap between what AI companies promise and what they actually do. Subscribe free. 
New pieces when there is something worth saying.</p><p><a href="https://drive.google.com/file/d/1bKInApTxvKIzCc2k-egfekr2RXJRly8S/view?usp=sharing">[DOWNLOAD: Your Privacy is a Compromise | discarded.ai | Settings verified March 2026]</a></p><div><hr></div><p><a href="http://discarded.ai">discarded.ai</a></p><p>Tracking the gap between AI promises and AI reality.</p>]]></content:encoded></item><item><title><![CDATA[00: Technical AI Safety: Training: Bluedot.org]]></title><description><![CDATA[Let&#8217;s start with the question: &#8220;How might we build safe AI?&#8221;]]></description><link>https://www.discarded.ai/p/00-technical-safety-training-bluedotorg</link><guid isPermaLink="false">https://www.discarded.ai/p/00-technical-safety-training-bluedotorg</guid><dc:creator><![CDATA[Alan Robertson]]></dc:creator><pubDate>Sun, 15 Mar 2026 16:22:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!KI2e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ea1a5c9-f6f0-4177-80e0-da9328d10ab3_1440x1440.png" length="0" type="image/png"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!KI2e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ea1a5c9-f6f0-4177-80e0-da9328d10ab3_1440x1440.png" width="379" height="379" class="sizing-normal" alt=""></figure></div>
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KI2e!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ea1a5c9-f6f0-4177-80e0-da9328d10ab3_1440x1440.png 424w, https://substackcdn.com/image/fetch/$s_!KI2e!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ea1a5c9-f6f0-4177-80e0-da9328d10ab3_1440x1440.png 848w, https://substackcdn.com/image/fetch/$s_!KI2e!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ea1a5c9-f6f0-4177-80e0-da9328d10ab3_1440x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!KI2e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ea1a5c9-f6f0-4177-80e0-da9328d10ab3_1440x1440.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2>March 2026</h2><p>Tuesday is the day I start my next <a href="https://bluedot.org/">Bluedot impact </a>course: </p><ul><li><p><a href="https://bluedot.org/courses/technical-ai-safety/1/1">Technical AI Safety</a></p></li></ul><p>Before I start - I recommend Blue.dot training to everybody. 
The courses are facilitated and only require a donation at the end of the course.</p><p>This article is on the Technical AI Safety course. Note that Bluedot recommends attending the AGI Strategy course first, which I completed in February 2026.</p><p>Each week I will summarise my progress and reading, and perhaps you too will decide to apply for this course.</p><div><hr></div><p>The <a href="https://bluedot.org/courses/agi-strategy">AGI strategy course</a> focused on the question: <strong>&#8220;How do we make AI go well?&#8221;</strong></p><p>On that course, you identified the future you&#8217;re working toward and understood the key dynamics:</p><ul><li><p><a href="https://bluedot.org/courses/agi-strategy/2/1">Drivers of AI progress</a>: compute, data, algorithms</p></li><li><p><a href="https://bluedot.org/courses/agi-strategy/3/1">Threat pathways</a>: power concentration, gradual disempowerment, catastrophic pandemics, critical infrastructure collapse</p></li><li><p><a href="https://bluedot.org/courses/agi-strategy/4/1">Plans for making AI go well</a>: government control over AGI, hand over control to aligned superintelligence, build defences and diffuse AI</p></li><li><p><a href="https://bluedot.org/courses/agi-strategy/4/2">Layers of defences to build</a>: prevent dangerous AI actions &#8594; constrain dangerous AI capabilities &#8594; withstand dangerous AI actions</p></li></ul><p>This course focuses on defining <strong>what AI systems we are building and how</strong>.</p><p>You will gain the technical foundation to understand what it will actually take to make AI systems safer &#8211; and why it&#8217;s so challenging.</p><p>Throughout the rest of the course, I will:</p><ul><li><p>Diagnose why making AI safe is technically challenging</p></li><li><p>Evaluate current safety techniques: what works, what doesn&#8217;t, where the gaps are</p></li><li><p>Build my own &#8220;kill chain&#8221; showing how defences might break</p></li><li><p>Identify the most promising intervention point for my contribution</p></li><li><p>Leave with a fundable action plan to start shipping</p></li></ul><h2><strong>What this course isn&#8217;t</strong></h2><p>The following topics matter for making AI go well, but Bluedot covers them in separate courses:</p><ul><li><p><strong>AI policy details:</strong> though you&#8217;ll gain the technical grounding for effective AI governance</p></li><li><p><strong>Compute governance</strong>: hardware verification and tracking deserve their own deep dive</p></li><li><p><strong>AI security</strong>: e.g. preventing model theft or escape</p></li><li><p><strong>ML basics</strong>: complete our <a href="https://bluedot-impact.notion.site/AI-Foundations-293f8e69035380f29863c4c92c41fac7?pvs=74">AI foundations</a> modules first if you need them</p></li></ul><p>Let&#8217;s start with the question: <strong>&#8220;How might we build safe AI?&#8221;</strong></p>]]></content:encoded></item></channel></rss>