Critical Investigation & Policy Framework
First, Do No Harm: How to safely deploy AI in healthcare
The AI industry has had its period of voluntary self-governance. It has not used that period well. We examine the failures and chart a path toward mandatory accountability.
Investigative Case Database
The failures documented below are not random. They share common structural causes. What are those causes and what can we learn from them?
Severity Levels
- High Severity: Injury, systemic bias, or clinical mismanagement.
- Medium Severity: Financial waste or dangerous but intercepted recommendations.
Investigation Cases
TruDi Navigation System (Acclarent / J&J)
Year: 2021-2025 | Severity: High
Category: Hardware/AI Integration Failure
Nature of Failure: Reported errors in surgical navigation led to misidentified anatomy during ENT procedures. Reports describe the AI 'misidentifying body parts', resulting in botched surgeries and risks to brain and eye structures.
Clinical Impact: Serious surgical complications and revisions. Highlights the danger of AI-assisted navigation without redundant manual verification.
CheXzero Bias Investigation
Year: 2024 | Severity: High
Category: Algorithmic Bias
Nature of Failure: AI chest X-ray models (CheXzero) exhibited significant performance gaps: underdiagnosing diseases in Black and female patients at much higher rates than in white or male patients.
Clinical Impact: Systemic health inequity. Missed lung diseases and delayed treatments for underserved populations who are already marginalized by the healthcare system.
First-Wave AI-Designed Drugs
Year: 2023-2024 | Severity: Medium
Category: Biotech/Drug Discovery Failure
Nature of Failure: Multiple highly publicized drug candidates designed by AI (e.g., Exscientia, BenevolentAI) failed in Phase I and Phase II clinical trials for lack of efficacy or because of safety issues.
Clinical Impact: Billions in lost investment and years of research time. Tempered the hype that AI 'magic' can bypass the fundamental complexities of human biology.
AI Fetal Ultrasound Systems
Year: 2024 | Severity: Medium
Category: Diagnostic Failure
Nature of Failure: Automated biometric systems misidentified fetal body parts or provided incorrect gestational measurements during routine scans.
Clinical Impact: Incorrect prenatal planning and unnecessary psychological stress for parents. Risks missing critical developmental anomalies.
AI Heart Monitor Algorithms
Year: 2023 | Severity: High
Category: Clinical Monitoring Failure
Nature of Failure: Wearable AI monitors failed to detect significant arrhythmias (like Atrial Fibrillation) in certain physiological conditions, providing a false sense of security.
Clinical Impact: Delayed cardiac intervention. Patients may ignore physical symptoms because the 'AI' stated their heart rhythm was normal.
Epic Sepsis Model
Year: 2021–2023 | Severity: High
Category: Predictive Model Failure
Nature of Failure: Missed 67% of actual sepsis cases; generated alerts on 18% of all hospitalised patients (86% false-alarm rate). Post-hoc analysis revealed data leakage.
Clinical Impact: Severe 'alert fatigue': nurses covered cameras. Delayed antibiotic treatment in true sepsis patients. Model overhauled only after external publication.
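The 86% false-alarm figure follows almost mechanically from pairing low sensitivity with a broad alerting threshold. A minimal sketch of that arithmetic follows; the sepsis prevalence used is an assumption chosen for illustration, not a figure from the published evaluation.

```python
# Illustrative arithmetic only: the prevalence value below is an assumption for
# this sketch, not a number taken from the published Epic sepsis evaluation.
def alert_burden(prevalence: float, sensitivity: float, alert_rate: float):
    """Relate headline screening statistics to positive predictive value (PPV)."""
    caught = prevalence * sensitivity   # share of all patients who are correctly alerted
    ppv = caught / alert_rate           # share of alerts that point at real sepsis
    return ppv, 1 - ppv                 # PPV and false-alarm rate

# Assumed ~7% sepsis prevalence among hospitalised patients (hypothetical),
# combined with the 33% sensitivity and 18% alert rate described above.
ppv, false_alarm = alert_burden(prevalence=0.07, sensitivity=0.33, alert_rate=0.18)
print(f"PPV: {ppv:.0%}, false-alarm rate: {false_alarm:.0%}")
# -> PPV: 13%, false-alarm rate: 87% -- close to the reported 86%
```

At realistic prevalences, alerting on nearly a fifth of all admissions guarantees that the overwhelming majority of alerts are false, which is the structural root of the alert fatigue described above.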
Optum Racial Bias Algorithm
Year: 2019 | Severity: High
Category: Algorithmic Bias
Nature of Failure: Used 'healthcare cost' as a proxy for 'health need', failing to account for the fact that Black patients historically receive less care for the same severity of illness.
Clinical Impact: Black patients had to be significantly sicker than white patients to be assigned the same risk score, and were therefore less likely to be flagged for additional care. The algorithm affected millions of patients annually.
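The mechanism is worth spelling out: when historical spending stands in for clinical need, patients who received less care look less sick. The toy sketch below illustrates that inversion; the patients, dollar figures, and ranking rules are entirely hypothetical.

```python
# Toy illustration of the proxy-label problem: all values below are hypothetical.
from dataclasses import dataclass

@dataclass
class Patient:
    name: str
    chronic_conditions: int   # crude stand-in for true clinical need
    annual_cost_usd: int      # historical spending, shaped by access to care

patients = [
    Patient("A", chronic_conditions=4, annual_cost_usd=3_000),  # same need, less care received
    Patient("B", chronic_conditions=4, annual_cost_usd=9_000),
]

# Proxy target: prioritise by historical cost (what a cost-trained model optimises for).
by_cost = sorted(patients, key=lambda p: p.annual_cost_usd, reverse=True)
# Need-based target: prioritise by clinical burden instead.
by_need = sorted(patients, key=lambda p: p.chronic_conditions, reverse=True)

print("ranked by cost proxy:", [p.name for p in by_cost])  # B ahead of A, despite equal need
print("ranked by need:      ", [p.name for p in by_need])  # a tie: equal clinical burden
```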
Framework for Action
What Needs to Be Done: A Framework for Meaningful Regulation
The failures documented above are not random. They share common structural causes: insufficient pre-market validation, the absence of mandatory post-market surveillance, inadequate diversity requirements in training data, no meaningful accountability for developers when systems cause harm, misaligned incentives, and a regulatory environment that has consistently prioritised innovation speed over patient safety.
These causes have solutions. None of them are technically difficult. All of them require political will, and most require the healthcare AI industry to accept constraints it has successfully resisted to date.
1. Mandatory Prospective Clinical Validation
No AI system intended for clinical use should be deployed to patients without prospective validation in the clinical context in which it will be used. This is the standard applied to every pharmaceutical and every medical device that involves genuine patient risk.
Validation must assess performance across demographic subgroups and integration into the actual clinical workflows in which the system will run.
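As a sketch of the subgroup reporting such validation could require, the fragment below tallies sensitivity per demographic group and flags subgroups too small to support a conclusion; the record fields, group labels, and minimum subgroup size are hypothetical.

```python
# Sketch of per-subgroup performance reporting for a prospective validation study.
# Record fields, group labels, and the minimum-sample threshold are hypothetical.
from collections import defaultdict

def subgroup_sensitivity(records, min_n=100):
    """records: dicts with 'group', 'label' (1 = disease present), 'prediction' (1 = flagged)."""
    tallies = defaultdict(lambda: [0, 0])   # group -> [true positives, condition-positives]
    for r in records:
        if r["label"] == 1:
            tallies[r["group"]][1] += 1
            tallies[r["group"]][0] += r["prediction"]
    return {
        group: {
            "n_positive": n_pos,
            "sensitivity": tp / n_pos,
            "adequately_powered": n_pos >= min_n,   # flag subgroups too small to validate
        }
        for group, (tp, n_pos) in tallies.items()
    }
```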
2. Mandatory Post-Market Surveillance
The 'zombie algorithm' problem, in which deployed AI systems silently degrade as patient populations, clinical practice, and data pipelines drift, is preventable. Systems should report performance metrics to a central registry in near real time. Drops below pre-registered thresholds should trigger review and, where warranted, temporary suspension.
Performance decay is not a hypothetical risk; it is a documented pattern.
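What a threshold-triggered registry check could look like in practice is sketched below; the metric, floor, window, and escalation tiers are illustrative assumptions, not a reference to any existing registry.

```python
# Sketch of a post-market surveillance check: compare a rolling performance
# window against a pre-registered floor and escalate when it is breached.
# The metric names, thresholds, and actions are hypothetical.
from dataclasses import dataclass

@dataclass
class SurveillanceRule:
    metric: str        # e.g. "sensitivity"
    floor: float       # pre-registered minimum acceptable value
    window_days: int   # rolling window the metric is computed over

def evaluate(rule: SurveillanceRule, observed_value: float) -> str:
    """Return the action a registry could trigger for one reporting period."""
    if observed_value >= rule.floor:
        return "continue"                  # performance within pre-registered bounds
    shortfall = rule.floor - observed_value
    if shortfall < 0.05:
        return "flag_for_review"           # modest drift: human review of the deployment
    return "suspend_pending_review"        # large drop: pull the model from clinical use

rule = SurveillanceRule(metric="sensitivity", floor=0.80, window_days=30)
print(evaluate(rule, observed_value=0.72))  # -> suspend_pending_review
```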
3. Algorithmic Bias Audits
Bias in training data is a predictable consequence of historical inequities. Solution: require developers to demonstrate performance equity across demographic subgroups as a precondition of regulatory authorisation.
Independent third-party audits of training data composition must be mandatory.
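One concrete form such an audit could take is a comparison of training-data composition against the population the system will serve. The sketch below assumes hypothetical group labels, reference shares, and an under-representation tolerance.

```python
# Sketch of a training-data composition audit: compare the demographic make-up
# of a training set against a reference population and flag under-representation.
# Group labels, reference shares, and the tolerance are hypothetical.

def composition_audit(training_counts: dict, reference_shares: dict, tolerance: float = 0.5):
    """Flag groups whose share of training data falls below tolerance * reference share."""
    total = sum(training_counts.values())
    findings = {}
    for group, ref_share in reference_shares.items():
        train_share = training_counts.get(group, 0) / total
        findings[group] = {
            "training_share": round(train_share, 3),
            "reference_share": ref_share,
            "under_represented": train_share < tolerance * ref_share,
        }
    return findings

# Hypothetical cohort that under-samples two groups relative to the population it will serve.
print(composition_audit(
    training_counts={"group_a": 9_000, "group_b": 700, "group_c": 300},
    reference_shares={"group_a": 0.60, "group_b": 0.25, "group_c": 0.15},
))
```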
4. Real Accountability & Liability
Ambiguous liability protects developers, not patients. Frameworks must establish clear, non-waivable liability for design failures, inadequate validation, or deployment with known performance limitations.
Liability is the basic condition that makes safety a commercial priority.
5. Moratorium on Unvalidated LLMs
General-purpose LLMs have not been validated as medical devices, yet they are increasingly consulted for medical questions. Any LLM application providing medical advice or diagnostic suggestions should be regulated as a medical device, with all attendant safety requirements.
The fiction that 'general purpose' excludes medical device classification must end.
6. An Independent 'IAEA for AI'
We need an independent body modelled on the IAEA, with the authority and resources to evaluate AI health technologies against clinical evidence and to issue binding guidance.
Independent evaluation is the minimum standard that patients deserve.
7. Mandatory Transparency and Explainability
Black-box models create automation bias. Regulators should require systems used in high-stakes contexts to provide meaningful explanations that support critical appraisal rather than merely decorating the output.
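What a 'meaningful explanation' could mean in practice is easiest to see in the simplest case, a linear risk score, where each input's additive contribution can be reported alongside the prediction. The sketch below uses hypothetical feature names and weights.

```python
# Sketch of the kind of explanation a high-stakes system could be required to surface:
# per-feature contributions for a linear risk score, so a clinician can see *why*
# a patient was flagged. Feature names, weights, and intercept are hypothetical.
import math

WEIGHTS = {"lactate_mmol_l": 0.9, "resp_rate": 0.05, "age_years": 0.01}
INTERCEPT = -6.0

def explained_risk(features: dict):
    """Return the risk estimate plus each feature's additive contribution to the logit."""
    contributions = {name: WEIGHTS[name] * value for name, value in features.items()}
    logit = INTERCEPT + sum(contributions.values())
    risk = 1 / (1 + math.exp(-logit))
    return risk, sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)

risk, drivers = explained_risk({"lactate_mmol_l": 4.2, "resp_rate": 28, "age_years": 71})
print(f"risk={risk:.2f}")
for name, contribution in drivers:
    print(f"  {name}: {contribution:+.2f}")  # what can be appraised, and challenged, at the bedside
```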
Conclusion: The Chernobyl Threshold
The Chernobyl disaster did not happen because nobody knew that reactors could explode. It happened because the institutional, political, and commercial incentives were all aligned against saying so. Safety concerns were suppressed. Warning systems were overridden. The individuals who raised objections were overruled. And then, at 1:23 am on 26 April 1986, the consequences became undeniable.
Healthcare AI is not at 1:23 am yet. People have been harmed. Some have died. The adverse event reports are accumulating. The recall rates are double what they should be. The bias data is published in peer-reviewed journals. The zombie algorithms are degrading in hospitals right now. But we have not yet had a major event which forces political reckoning.
"We do not need a Chernobyl. We have the evidence. We have the framework. What we need is the will to act before the disaster that makes inaction impossible."
The question is whether we will wait for one.
The case for meaningful, mandatory regulation of healthcare AI is not speculative. It is made by the patients in this report and their families who are pursuing wrongful death claims against a ghost in the machine.
Regulation is not the enemy of innovation; it is the condition under which innovation becomes worthy of the name. A medical technology that cannot demonstrate safety and equity across the population it serves is not an innovation. It is an experiment conducted without consent, and the patients are the guinea pigs.
The anniversary of Chernobyl is a useful moment to reflect on what happens when the gap between claimed performance and actual performance is allowed to widen until it becomes catastrophic.
We know the lessons. The question is whether we will apply them.
