Critical Investigation & Policy Framework
First, Do No Harm: How to safely deploy AI in healthcare
The AI industry has had its period of voluntary self-governance. It has not used that period well. We examine the failures and chart a path toward mandatory accountability.
Investigative Case Database
The failures documented below are not random. They share common structural causes. What are those causes and what can we learn from them?
Severity Levels
- High Severity: Injury, systemic bias, or clinical mismanagement.
- Medium Severity: Financial waste or dangerous but intercepted recommendations.
Investigation Cases
TruDi Navigation System (Acclarent / J&J)
Year: 2021-2025 | Severity: High
Category: Hardware/AI Integration Failure
Nature of Failure: Reported errors in surgical navigation led to misidentified anatomy during ENT procedures. Reports describe the AI 'misidentifying body parts', resulting in botched surgeries and risks to brain and eye structures.
Clinical Impact: Serious surgical complications and revisions. Highlights the danger of AI-assisted navigation without redundant manual verification.
CheXzero Bias Investigation
Year: 2024 | Severity: High
Category: Algorithmic Bias
Nature of Failure: AI chest X-ray models (CheXzero) exhibited significant performance gaps: underdiagnosing diseases in Black and female patients at much higher rates than in white or male patients.
Clinical Impact: Systemic health inequity. Missed lung diseases and delayed treatments for underserved populations who are already marginalized by the healthcare system.
First-Wave AI-Designed Drugs
Year: 2023-2024 | Severity: Medium
Category: Biotech/Drug Discovery Failure
Nature of Failure: Multiple highly publicized drug candidates designed by AI (e.g., Exscientia, BenevolentAI) failed in Phase I and Phase II clinical trials for lack of efficacy or because of safety issues.
Clinical Impact: Billions in lost investment and years of research time. Tempered the hype that AI 'magic' can bypass the fundamental complexities of human biology.
AI Fetal Ultrasound Systems
Year: 2024 | Severity: Medium
Category: Diagnostic Failure
Nature of Failure: Automated biometric systems misidentified fetal body parts or provided incorrect gestational measurements during routine scans.
Clinical Impact: Incorrect prenatal planning and unnecessary psychological stress for parents. Risks missing critical developmental anomalies.
AI Heart Monitor Algorithms
Year: 2023 | Severity: High
Category: Clinical Monitoring Failure
Nature of Failure: Wearable AI monitors failed to detect significant arrhythmias (like Atrial Fibrillation) in certain physiological conditions, providing a false sense of security.
Clinical Impact: Delayed cardiac intervention. Patients may ignore physical symptoms because the 'AI' stated their heart rhythm was normal.
Epic Sepsis Model
Year: 2021–2023 | Severity: High
Category: Predictive Model Failure
Nature of Failure: Missed 67% of actual sepsis cases; generated alerts on 18% of all hospitalised patients (86% false-alarm rate). Post-hoc analysis revealed data leakage.
Clinical Impact: Severe 'alert fatigue': nurses covered cameras. Delayed antibiotic treatment in true sepsis patients. Model overhauled only after external publication.
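The 86% false-alarm figure follows almost mechanically from pairing low sensitivity with a broad alerting threshold. A minimal sketch of that arithmetic follows; the sepsis prevalence used is an assumption chosen for illustration, not a figure from the published evaluation.

```python
# Illustrative arithmetic only: the prevalence value below is an assumption for
# this sketch, not a number taken from the published Epic sepsis evaluation.
def alert_burden(prevalence: float, sensitivity: float, alert_rate: float):
    """Relate headline screening statistics to positive predictive value (PPV)."""
    caught = prevalence * sensitivity   # share of all patients who are correctly alerted
    ppv = caught / alert_rate           # share of alerts that point at real sepsis
    return ppv, 1 - ppv                 # PPV and false-alarm rate

# Assumed ~7% sepsis prevalence among hospitalised patients (hypothetical),
# combined with the 33% sensitivity and 18% alert rate described above.
ppv, false_alarm = alert_burden(prevalence=0.07, sensitivity=0.33, alert_rate=0.18)
print(f"PPV: {ppv:.0%}, false-alarm rate: {false_alarm:.0%}")
# -> PPV: 13%, false-alarm rate: 87% -- close to the reported 86%
```

At realistic prevalences, alerting on nearly a fifth of all admissions guarantees that the overwhelming majority of alerts are false, which is the structural root of the alert fatigue described above.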
Optum Racial Bias Algorithm
Year: 2019 | Severity: High
Category: Algorithmic Bias
Nature of Failure: Used 'healthcare cost' as a proxy for 'health need', failing to account for the fact that Black patients historically receive less care for the same severity of illness.
Clinical Impact: Black patients had to be significantly sicker than white patients to be assigned the same risk score, and were therefore less likely to be flagged for additional care. The algorithm affected millions of patients annually.
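The mechanism is worth spelling out: when historical spending stands in for clinical need, patients who received less care look less sick. The toy sketch below illustrates that inversion; the patients, dollar figures, and ranking rules are entirely hypothetical.

```python
# Toy illustration of the proxy-label problem: all values below are hypothetical.
from dataclasses import dataclass

@dataclass
class Patient:
    name: str
    chronic_conditions: int   # crude stand-in for true clinical need
    annual_cost_usd: int      # historical spending, shaped by access to care

patients = [
    Patient("A", chronic_conditions=4, annual_cost_usd=3_000),  # same need, less care received
    Patient("B", chronic_conditions=4, annual_cost_usd=9_000),
]

# Proxy target: prioritise by historical cost (what a cost-trained model optimises for).
by_cost = sorted(patients, key=lambda p: p.annual_cost_usd, reverse=True)
# Need-based target: prioritise by clinical burden instead.
by_need = sorted(patients, key=lambda p: p.chronic_conditions, reverse=True)

print("ranked by cost proxy:", [p.name for p in by_cost])  # B ahead of A, despite equal need
print("ranked by need:      ", [p.name for p in by_need])  # a tie: equal clinical burden
```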
Framework for Action
What Needs to Be Done: A Framework for Meaningful Regulation
The failures documented above are not random. They share common structural causes: insufficient pre-market validation, the absence of mandatory post-market surveillance, inadequate diversity requirements in training data, no meaningful accountability for developers when systems cause harm, misaligned incentives, and a regulatory environment that has consistently prioritised innovation speed over patient safety.
These causes have solutions. None of them are technically difficult. All of them require political will, and most require the healthcare AI industry to accept constraints it has successfully resisted to date.
1. Mandatory Prospective Clinical Validation
No AI system intended for clinical use should be deployed to patients without prospective validation in the clinical context in which it will be used. This is the standard applied to every pharmaceutical and every medical device that involves genuine patient risk.
Validation must assess performance across demographic subgroups and integration into the actual clinical workflows in which the system will run.
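As a sketch of the subgroup reporting such validation could require, the fragment below tallies sensitivity per demographic group and flags subgroups too small to support a conclusion; the record fields, group labels, and minimum subgroup size are hypothetical.

```python
# Sketch of per-subgroup performance reporting for a prospective validation study.
# Record fields, group labels, and the minimum-sample threshold are hypothetical.
from collections import defaultdict

def subgroup_sensitivity(records, min_n=100):
    """records: dicts with 'group', 'label' (1 = disease present), 'prediction' (1 = flagged)."""
    tallies = defaultdict(lambda: [0, 0])   # group -> [true positives, condition-positives]
    for r in records:
        if r["label"] == 1:
            tallies[r["group"]][1] += 1
            tallies[r["group"]][0] += r["prediction"]
    return {
        group: {
            "n_positive": n_pos,
            "sensitivity": tp / n_pos,
            "adequately_powered": n_pos >= min_n,   # flag subgroups too small to validate
        }
        for group, (tp, n_pos) in tallies.items()
    }
```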
2. Mandatory Post-Market Surveillance
The 'zombie algorithm' problem, in which deployed AI systems silently degrade as patient populations, clinical practice, and data pipelines drift, is preventable. Systems should report performance metrics to a central registry in near real time. Drops below pre-registered thresholds should trigger review and, where warranted, temporary suspension.
Performance decay is not a hypothetical risk; it is a documented pattern.
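What a threshold-triggered registry check could look like in practice is sketched below; the metric, floor, window, and escalation tiers are illustrative assumptions, not a reference to any existing registry.

```python
# Sketch of a post-market surveillance check: compare a rolling performance
# window against a pre-registered floor and escalate when it is breached.
# The metric names, thresholds, and actions are hypothetical.
from dataclasses import dataclass

@dataclass
class SurveillanceRule:
    metric: str        # e.g. "sensitivity"
    floor: float       # pre-registered minimum acceptable value
    window_days: int   # rolling window the metric is computed over

def evaluate(rule: SurveillanceRule, observed_value: float) -> str:
    """Return the action a registry could trigger for one reporting period."""
    if observed_value >= rule.floor:
        return "continue"                  # performance within pre-registered bounds
    shortfall = rule.floor - observed_value
    if shortfall < 0.05:
        return "flag_for_review"           # modest drift: human review of the deployment
    return "suspend_pending_review"        # large drop: pull the model from clinical use

rule = SurveillanceRule(metric="sensitivity", floor=0.80, window_days=30)
print(evaluate(rule, observed_value=0.72))  # -> suspend_pending_review
```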
3. Algorithmic Bias Audits
Bias in training data is a predictable consequence of historical inequities. Solution: require developers to demonstrate performance equity across demographic subgroups as a precondition of regulatory authorisation.
Independent third-party audits of training data composition must be mandatory.
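One concrete form such an audit could take is a comparison of training-data composition against the population the system will serve. The sketch below assumes hypothetical group labels, reference shares, and an under-representation tolerance.

```python
# Sketch of a training-data composition audit: compare the demographic make-up
# of a training set against a reference population and flag under-representation.
# Group labels, reference shares, and the tolerance are hypothetical.

def composition_audit(training_counts: dict, reference_shares: dict, tolerance: float = 0.5):
    """Flag groups whose share of training data falls below tolerance * reference share."""
    total = sum(training_counts.values())
    findings = {}
    for group, ref_share in reference_shares.items():
        train_share = training_counts.get(group, 0) / total
        findings[group] = {
            "training_share": round(train_share, 3),
            "reference_share": ref_share,
            "under_represented": train_share < tolerance * ref_share,
        }
    return findings

# Hypothetical cohort that under-samples two groups relative to the population it will serve.
print(composition_audit(
    training_counts={"group_a": 9_000, "group_b": 700, "group_c": 300},
    reference_shares={"group_a": 0.60, "group_b": 0.25, "group_c": 0.15},
))
```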
4. Real Accountability & Liability
Ambiguous liability protects developers, not patients. Frameworks must establish clear, non-waivable liability for design failures, inadequate validation, or deployment with known performance limitations.
Liability is the basic condition that makes safety a commercial priority.
5. Moratorium on Unvalidated LLMs
General-purpose LLMs have not been validated as medical devices, yet they are increasingly consulted for medical questions. Any LLM application providing medical advice or diagnostic suggestions should be regulated as a medical device, with all attendant safety requirements.
The fiction that 'general purpose' excludes medical device classification must end.
6. An Independent 'IAEA for AI'
We need an independent body modelled on the IAEA, with the authority and resources to evaluate AI health technologies against clinical evidence and to issue binding guidance.
Independent evaluation is the minimum standard that patients deserve.
7. Mandatory Transparency and Explainability
Black-box models create automation bias. Regulators should require systems used in high-stakes contexts to provide meaningful explanations that support critical appraisal rather than merely decorating the output.
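What a 'meaningful explanation' could mean in practice is easiest to see in the simplest case, a linear risk score, where each input's additive contribution can be reported alongside the prediction. The sketch below uses hypothetical feature names and weights.

```python
# Sketch of the kind of explanation a high-stakes system could be required to surface:
# per-feature contributions for a linear risk score, so a clinician can see *why*
# a patient was flagged. Feature names, weights, and intercept are hypothetical.
import math

WEIGHTS = {"lactate_mmol_l": 0.9, "resp_rate": 0.05, "age_years": 0.01}
INTERCEPT = -6.0

def explained_risk(features: dict):
    """Return the risk estimate plus each feature's additive contribution to the logit."""
    contributions = {name: WEIGHTS[name] * value for name, value in features.items()}
    logit = INTERCEPT + sum(contributions.values())
    risk = 1 / (1 + math.exp(-logit))
    return risk, sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)

risk, drivers = explained_risk({"lactate_mmol_l": 4.2, "resp_rate": 28, "age_years": 71})
print(f"risk={risk:.2f}")
for name, contribution in drivers:
    print(f"  {name}: {contribution:+.2f}")  # what can be appraised, and challenged, at the bedside
```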
Conclusion: The Chernobyl Threshold
The Chernobyl disaster did not happen because nobody knew that reactors could explode. It happened because the institutional, political, and commercial incentives were all aligned against saying so. Safety concerns were suppressed. Warning systems were overridden. The individuals who raised objections were overruled. And then, at 1:23 am on 26 April 1986, the consequences became undeniable.
Healthcare AI is not at 1:23 am yet. People have been harmed. Some have died. The adverse event reports are accumulating. The recall rates are double what they should be. The bias data is published in peer-reviewed journals. The zombie algorithms are degrading in hospitals right now. But we have not yet had a major event which forces political reckoning.
"We do not need a Chernobyl. We have the evidence. We have the framework. What we need is the will to act before the disaster that makes inaction impossible."
The question is whether we will wait for one.
The case for meaningful, mandatory regulation of healthcare AI is not speculative. It is made by the patients in this report and their families who are pursuing wrongful death claims against a ghost in the machine.
Regulation is not the enemy of innovation; it is the condition under which innovation becomes worthy of the name. A medical technology that cannot demonstrate safety and equity across the population it serves is not an innovation. It is an experiment conducted without consent, and the patients are the guinea pigs.
The anniversary of Chernobyl is a useful moment to reflect on what happens when the gap between claimed performance and actual performance is allowed to widen until it becomes catastrophic.
We know the lessons. The question is whether we will apply them.
