It always starts the same way.
Someone says the model is working fine. It passed validation. It improved operational efficiency by 17.3%. It got a standing ovation at the product demo.
Then the call comes in.
Not about the 17.3%. Not about efficiency. About the patient who got the wrong dose. The wrong diagnosis. Or no diagnosis at all. About the thing the AI missed. The pattern it saw that wasn’t there. Or the pattern it didn’t see that was.
That’s when they remember they never really audited it.
Not properly.
Because in healthcare, auditing AI is not a nice-to-have. It’s not the compliance team’s paperwork. It is the thing that stands between safety and suffering. And if we’re being honest, the industry is years behind where it needs to be.
So, let’s walk it through.
This is how I would audit AI in healthcare.
And no, it’s not a checklist. It’s a way of thinking. A way of paying attention.
Let’s start with the question we never ask enough.
1. What is the AI actually doing?
Not what the brochure says. Not what the dev team says. What the system is doing, line by line, inference by inference, patient by patient.
Every audit begins with a forensic unpacking of the use case. You need to get painfully clear on what problem the AI is solving. Is it prioritising patients in A&E? Generating discharge summaries? Detecting early signs of stroke on CT scans?
Because you can’t assess risk in a vacuum.
Ask: Is this AI diagnostic, predictive, or administrative? What decision is it informing or automating? Is it replacing a human judgment or supplementing it?
Then go deeper. What’s the decision space? Who acts on the output? How is it framed? Is it binary? Probabilistic? Does the model make the decision, or just nudge it?
This is your first audit line. If you can’t explain what the AI is doing, the audit ends here.
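One way to force that clarity is to write the use case down as a structured record before touching the model. A sketch of what that record might hold; the field names and example values below are assumptions for illustration, not a standard schema.

```python
# A minimal, illustrative use-case record for the start of an audit.
# Field names and example values are assumptions, not a standard schema.
from dataclasses import dataclass

@dataclass
class UseCaseRecord:
    clinical_task: str        # e.g. "early stroke detection on CT"
    function: str             # "diagnostic", "predictive", or "administrative"
    decision_informed: str    # the downstream decision the output feeds
    output_type: str          # "binary", "probabilistic", "ranking", "free text"
    human_role: str           # "replaces judgment" vs "supplements judgment"
    who_acts_on_output: str   # the role that sees and acts on the output

example = UseCaseRecord(
    clinical_task="early stroke detection on CT",
    function="diagnostic",
    decision_informed="radiologist worklist prioritisation",
    output_type="probabilistic",
    human_role="supplements judgment",
    who_acts_on_output="on-call radiologist",
)
print(example)
```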
2. Where did the data come from? And who did it leave out?
The next audit line is data provenance.
I want to see the training set. I want to see the distributions. I want to see where the data was collected, and when. And I want someone to tell me who wasn’t in it.
AI in healthcare is only as safe as its blind spots. If the training data underrepresents ethnic minorities, women, elderly patients, disabled bodies, or rare diseases, the system will fail. It won’t do it loudly. It’ll do it quietly. In the margins. On the edge cases.
In the audit, I ask: Did they use real patient records or synthetic data? Was it annotated by clinicians? Was it balanced for age, race, sex, co-morbidities?
Show me the data lineage. Show me the exclusions. Show me the biases before they show up in the outcomes.
And no, “we used a large dataset” is not an answer.
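What does count as an answer is a disaggregated view of the training set itself. A minimal sketch of that first pass, assuming the data is available as a table and using hypothetical demographic column names:

```python
import pandas as pd

# Hypothetical file and column names; substitute whatever the dataset actually records.
train = pd.read_csv("training_set.csv")

for col in ["ethnicity", "sex", "age_band", "disability_flag"]:
    counts = train[col].value_counts(dropna=False)
    share = counts / len(train)
    print(f"\n{col}:")
    print(pd.DataFrame({"n": counts, "share": share.round(3)}))
    # Flag groups that are nearly absent: the model sees too few of them to learn from.
    thin = share[share < 0.02]
    if not thin.empty:
        print(f"  WARNING: under 2% representation for {list(thin.index)}")
```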
3. What assumptions are baked into the model?
Every model makes assumptions. About what matters. About what counts as ‘normal.’ About what features are relevant, and which ones can be ignored.
That’s where the danger hides.
So I look at the model architecture. I inspect the feature engineering. I interrogate the labels. Were they derived from clinical outcomes? Physician notes? Insurance codes? Self-reports?
Who defined the ground truth? Did they question it?
If you’re auditing an AI that predicts hospital readmission and it uses “patient didn’t come back” as a success metric, it might just be predicting death.
The job here is to surface assumptions before they become operational errors. Ask: What does the model believe about the world? And what if that belief is wrong?
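The readmission example is worth making concrete, because it is cheap to check once you think to look. A minimal sketch, assuming hypothetical outcome columns readmitted_30d and died_within_30d:

```python
import pandas as pd

# Hypothetical columns; the point is to check what "did not come back" really means.
outcomes = pd.read_csv("outcomes.csv")

no_readmission = outcomes[outcomes["readmitted_30d"] == 0]
death_rate = no_readmission["died_within_30d"].mean()

print(f"Patients labelled 'no readmission': {len(no_readmission)}")
print(f"...of whom died within 30 days: {death_rate:.1%}")

# If that percentage is material, the 'success' label is partly a death label,
# and the model's belief about the world is wrong.
```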
4. What are the failure modes? And are they catastrophic?
In clinical AI, failure is not theoretical.
I want a full risk taxonomy. False positives. False negatives. Outliers. Distribution shift. Input errors. Ambiguous outputs. Temporal drift.
What happens when the system fails?
If the AI is recommending chemotherapy protocols, a misclassification is not a rounding error. It’s not a KPI blip. It is a human body misdiagnosed, mistreated, or left untreated.
In the audit, you must map out the severity of each failure. Classify them. Attach them to harm scenarios. What is the worst-case plausible outcome?
Now ask: What’s been done to prevent it? To detect it? To recover from it?
If the answer is, “The model is 96% accurate,” stop the audit. That’s marketing, not safety.
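What does belong in the audit file is an explicit map from failure mode to plausible harm, with the controls named. A minimal sketch with illustrative entries only, not a validated clinical risk assessment:

```python
# Illustrative risk-register entries; the failure modes, severities and controls
# here are examples, not a complete clinical risk assessment.
risk_register = [
    {"failure": "false negative on stroke detection",
     "worst_case": "treatment window missed",
     "severity": "catastrophic",
     "prevention": "conservative operating threshold",
     "detection": "radiologist double-read of low-confidence negatives",
     "recovery": ""},   # left blank here to show what the check below catches
    {"failure": "distribution shift after a new scanner is installed",
     "worst_case": "silent performance drop across a whole site",
     "severity": "major",
     "prevention": "site-level validation before go-live",
     "detection": "monthly drift monitoring",
     "recovery": "rollback to human-only workflow"},
]

for item in risk_register:
    missing = [k for k in ("prevention", "detection", "recovery") if not item[k]]
    if item["severity"] == "catastrophic" and missing:
        print(f"UNMITIGATED: {item['failure']} lacks {missing}")
```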
5. Is the model explainable, or are we flying blind?
Explainability matters.
Not because humans need to interpret every mathematical weight. But because when something goes wrong, someone needs to understand why.
If I’m auditing an AI that contributes to access decisions for mental health services, I want to see which variables influenced that output. I want to test for spurious correlations. I want to probe the model with counterfactuals and boundary cases.
Tools like SHAP and the What-If Tool aren’t bonus features. They are audit baselines.
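Here is a minimal sketch of what that baseline can look like with SHAP, using synthetic data and a generic gradient-boosted classifier as stand-ins for the real model and audit set:

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-ins for the real training data and audit sample, for illustration only.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)),
                 columns=["age", "prior_admissions", "lab_score", "postcode_index"])
y = (X["lab_score"] + 0.5 * X["prior_admissions"] > 0).astype(int)

model = GradientBoostingClassifier().fit(X, y)

explainer = shap.Explainer(model, X)     # picks a model-appropriate explainer
shap_values = explainer(X.iloc[:100])    # per-patient attributions on an audit sample

# Global view: which features drive the model's outputs overall?
global_importance = np.abs(shap_values.values).mean(axis=0)
for name, score in sorted(zip(X.columns, global_importance), key=lambda t: -t[1]):
    print(f"{name:18s} mean |SHAP| = {score:.3f}")

# Local view: why did the model produce this output for this one patient?
shap.plots.waterfall(shap_values[0])

# If something like 'postcode_index' dominates a clinical prediction,
# that is the spurious correlation the audit exists to catch.
```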
If the developers can’t explain how or why the model makes critical predictions, then the deployment must be accompanied by strong safeguards, oversight, and fallback processes.
Opaque models aren’t always disqualifying. But they require proportionate governance.
6. Who can see the data? And is it compliant with HIPAA?
You cannot talk about healthcare AI in the U.S. without talking about HIPAA.
Any AI system trained on or interacting with Protected Health Information (PHI) must comply with strict rules on privacy, access, and disclosure. That includes:
Access controls – Who can query or modify PHI?
Purpose limitation – Are data uses aligned with treatment, payment, or operations?
De-identification – Has PHI been anonymised in line with Safe Harbor or Expert Determination?
Audit trails – Can access and modification be logged and traced?
Vendor controls – Have Business Associate Agreements (BAAs) been signed with third-party AI developers?
And if the model was trained on PHI, even PHI that has since been anonymised, organisations should ask whether downstream commercial use raises ethical or trust concerns, however technically compliant it may be.
You can be compliant and still wrong.
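Compliance claims should also be spot-checked against the data itself. Below is a small illustrative screen of a tabular extract for column names that hint at direct identifiers; the patterns are a hypothetical subset of the Safe Harbor categories, and matching (or not matching) proves nothing on its own:

```python
import re
import pandas as pd

# Column-name patterns suggestive of HIPAA direct identifiers (an illustrative
# subset of the 18 Safe Harbor categories; this is a screening aid, not a
# compliance determination).
IDENTIFIER_PATTERNS = [
    r"name", r"mrn", r"ssn", r"social_security", r"phone", r"fax", r"email",
    r"address", r"zip", r"postcode", r"dob", r"birth", r"ip_addr", r"device_id",
]

extract = pd.read_csv("model_training_extract.csv", nrows=5)  # assumed file

flagged = [col for col in extract.columns
           if any(re.search(p, col, re.IGNORECASE) for p in IDENTIFIER_PATTERNS)]

print("Columns to review with the privacy officer:", flagged or "none matched")
```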
7. Is the model fair, or just accurate on average?
This is where you ask the question that most vendors avoid.
Not “is the model good,” but “who is it good for?”
Does the AI perform consistently across different demographic groups? Does it amplify existing disparities? Has it been evaluated using subgroup performance metrics, or just headline accuracy?
Audits should disaggregate results by race, gender, age, disability, language, and socioeconomic status. It’s not enough to know the model works. You need to know where and for whom it fails.
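A minimal sketch of that disaggregation, assuming a hypothetical audit table with one row per patient holding the prediction, the outcome, and a demographic group label:

```python
import pandas as pd

# Hypothetical audit table with assumed columns: y_true, y_pred, group.
results = pd.read_csv("audit_results.csv")

def subgroup_metrics(df):
    tp = ((df.y_pred == 1) & (df.y_true == 1)).sum()
    fn = ((df.y_pred == 0) & (df.y_true == 1)).sum()
    fp = ((df.y_pred == 1) & (df.y_true == 0)).sum()
    tn = ((df.y_pred == 0) & (df.y_true == 0)).sum()
    return pd.Series({
        "n": len(df),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    })

# Headline accuracy hides exactly what this table exposes.
print(results.groupby("group").apply(subgroup_metrics).round(3))
```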
If it systematically underperforms on underrepresented groups, that’s not a minor defect. That’s a structural design flaw.
And it should be treated as such.
8. Is it regulated as Software as a Medical Device (SaMD)?
Many clinical AIs fall under Software as a Medical Device (SaMD), the FDA's designation for software that performs medical functions on its own, without being part of a hardware device.
That comes with responsibilities:
Pre-market review and risk classification
Change control and versioning policies
Post-market surveillance for real-world drift (a monitoring sketch follows this list)
Transparency on intended use and performance
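Post-market surveillance is the item on that list that never ends. A minimal sketch of routine drift monitoring using a population stability index on a single input feature; the data is synthetic and the 0.2 threshold is a common rule of thumb, not a regulatory requirement:

```python
import numpy as np

def population_stability_index(baseline, current, bins=10):
    """Compare a feature's live distribution against its validation-time baseline."""
    baseline, current = np.asarray(baseline), np.asarray(current)
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    # Widen the outer edges so out-of-range live values still fall into a bin.
    edges[0] = min(edges[0], current.min()) - 1e-9
    edges[-1] = max(edges[-1], current.max()) + 1e-9
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    curr_pct = np.histogram(current, bins=edges)[0] / len(current)
    base_pct = np.clip(base_pct, 1e-6, None)  # avoid log(0) and division by zero
    curr_pct = np.clip(curr_pct, 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))

# Synthetic example: patient age at validation time vs. this month's live inputs.
rng = np.random.default_rng(1)
validation_ages = rng.normal(62, 15, 5_000)
live_ages = rng.normal(70, 12, 1_000)   # the incoming population has shifted

psi = population_stability_index(validation_ages, live_ages)
print(f"PSI = {psi:.2f}")  # rule of thumb: above roughly 0.2 warrants investigation
```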
If the AI assists with diagnosis, prognosis, treatment recommendation, or triage, you must verify its regulatory status.
Audit questions include:
Was it submitted to the FDA?
What risk class was assigned?
Is the current version still within the scope of original approval?
Has the vendor committed to transparency on updates?
No audit is complete without this.
Putting it together: A healthcare AI audit is not one audit. It’s five at once.
If you’re serious about safety, you need to audit:
Clinical Safety – Is it doing harm?
Statistical Performance – Is it working reliably?
Ethical Integrity – Is it fair and just?
Operational Governance – Is it accountable?
Legal and Regulatory Compliance – Is it lawful?
Each of these has its own failure modes. Each can break the system. Each can put a patient at risk.
And if you’re not auditing all five, you’re not really auditing anything.
Why this matters now.
Because in healthcare, the stakes are never theoretical.
A system might pass technical validation. It might meet regulatory minimums. It might win awards. But if it degrades over time, if it excludes patients who don’t fit the data, if it quietly shifts risk onto the vulnerable, then it doesn’t belong in a clinical setting.
Auditing is not a threat to innovation. It’s what makes innovation trustworthy.
It is the moment where we choose to notice the risk, before someone else is forced to live with it.