The integration of artificial intelligence (AI), machine learning (ML),
and large language models (LLMs) into the clinical environment has been
heralded as a paradigm shift in healthcare delivery. However, the period
between 2021 and 2026 has revealed a mounting landscape of clinical mishaps,
diagnostic failures, and direct physical harms that challenge the prevailing
narrative of unalloyed progress. As these systems transition from theoretical
prototypes to frontline decision-making tools, the emergence of
"algorithmic iatrogenesis"—harm caused by the clinical application of
algorithms—has become a central concern for patient safety experts and
regulatory bodies.1
The failures documented in recent years are not merely technical
glitches; they represent a fundamental misalignment between the probabilistic
nature of machine learning and the deterministic requirements of clinical
safety. These adverse events range from "silent" predictive decay in
sepsis models to catastrophic physical injuries in AI-guided surgeries.
Furthermore, the deployment of AI in administrative roles, such as insurance
utilization reviews, has introduced a new category of "administrative
harm," where patients are denied life-saving care based on flawed recovery
predictions.4 The following report provides an exhaustive examination of these
failures, categorizing them by their technical mechanisms, clinical
consequences, and broader systemic implications.
The Fragility of Predictive Clinical Modeling: Sepsis and the Proxy Trap
Predictive models are perhaps the most pervasive application of AI in
contemporary hospital systems. These algorithms are designed to analyze
real-time electronic health record (EHR) data to identify patients at risk of
rapid deterioration. However, the most widely deployed of these tools, the Epic
Sepsis Model, has become a foundational case study in how "shortcut
learning" and "data leakage" can lead to widespread clinical
failure.1
Between 2021 and 2023, external evaluations of the Epic Sepsis Model
revealed that the algorithm missed approximately 67% of actual sepsis cases.1 This failure is particularly significant given that sepsis is a
time-sensitive condition where every hour of delayed antibiotic treatment
increases mortality. Simultaneously, the model generated alerts for 18% of all hospitalized patients, resulting in
an 86% false-alarm rate.1 The impact on nursing workflows was severe; the "alert
fatigue" induced by constant, irrelevant notifications led clinicians to
adopt dangerous workarounds, such as covering video-monitoring cameras with
tape to prevent disruptions.1
The technical mechanism behind this failure was identified as "data
leakage." Post-hoc analysis showed that the model used antibiotic orders
as an input variable.1 In clinical
practice, an antibiotic order is a direct consequence of a physician already
suspecting sepsis. Consequently, the AI was not "predicting" sepsis
but merely echoing the existing clinical suspicion of the human staff. When the
model was required to identify sepsis before a clinician had acted, its
predictive value collapsed.1 This circular
logic created a veneer of high accuracy in retrospective testing that vanished
in the messy, prospective reality of the hospital floor.
Table 1: Comparative Failure Rates in Predictive Clinical Models
(2021–2026)
|
Model / Algorithm |
Documented Failure Mode |
Sensitivity / Error Rate |
Documented Clinical Impact |
|
Epic Sepsis Model |
Data Leakage (Antibiotic Proxy) |
67% False Negatives 1 |
Alert fatigue; delayed antibiotic initiation |
|
Optum Racial Bias Algorithm |
Cost as Proxy for Need |
50% under-referral
of Black patients 2 |
Systematic denial of high-risk care management |
|
Penn Medicine Mortality |
Model Drift (COVID-19 Decay) |
7% drop in accuracy
3 |
Missed end-of-life care prompts |
|
Yale Early Warning |
Undetected Performance Decay |
Significant 3 |
Reliability collapse in emergency triage |
|
CheXzero (Stanford) |
Dataset Imbalance / Bias |
50% miss rate in
Black women 4 |
Delayed diagnosis of life-threatening CXR findings |
Algorithmic Bias and the Systematization of Inequity
The use of AI in healthcare does not merely reflect existing societal
biases; it often amplifies and "hard-codes" them into institutional
decision-making. The 2019 report on the Optum/UnitedHealth algorithm revealed
how the selection of seemingly neutral proxies can lead to devastating racial
disparities in care.1 This algorithm was
used to identify patients who would benefit from "high-risk care
management" programs. By using "healthcare spending" as a proxy
for "medical need," the system failed to account for the fact that
Black patients historically receive less care for similar levels of illness due
to systemic barriers.4
The clinical consequence was that Black patients had to be significantly
sicker than white patients to receive the same risk score.4 For any given risk score, Black patients had higher rates of
uncontrolled diabetes, hypertension, and renal failure.9 If the algorithm had been corrected to use actual physiological markers
rather than cost, the number of Black patients admitted to the program would
have increased by over 100%.10 This "cost-proxy" failure represents a broader trend where AI
models prioritize administrative or financial data over biological reality,
leading to the systematic under-treatment of marginalized groups.
This bias extends into the realm of diagnostic imaging. The "CheXzero" model, developed by Stanford and trained on 400,00 chest X-rays, demonstrated a consistent
failure to detect disease in Black patients and women.1 The underdiagnosis rate—defined as the frequency with which the model
labels a sick patient as "healthy"—was highest for Black women, with
the AI failing to identify disease in up to 50%
of cases.11 Because underdiagnosis leads to a complete absence of care, these
errors are clinically more hazardous than overdiagnosis, which typically
triggers further (albeit unnecessary) investigation. The intersectional nature
of these failures—where Black women fare worse than either Black men or white
women—suggests that AI models are sensitive to complex demographic shortcuts
that current "de-biasing" techniques struggle to address.13
Clinical Decision Support: The IBM Watson for Oncology Failure
The rise and fall of IBM Watson for Oncology serves as the preeminent
cautionary tale regarding the "marketing-validation gap" in
healthcare AI. IBM positioned Watson as a cognitive computing system that could
"ingest" the entirety of medical literature and provide expert-level
cancer treatment recommendations.45 However, internal
investigations and external audits eventually revealed that the system was
providing "unsafe and incorrect" treatment advice.11
The fundamental failure of Watson for Oncology was its reliance on
synthetic data. Rather than being trained on real-world patient outcomes or
rigorous clinical trials, the system was trained on a small number of
hypothetical patient cases constructed by a group of physicians at a single
institution, Memorial Sloan Kettering Cancer Center (MSKCC).45 This created a "closed-loop" logic where the AI was merely a
digitized reflection of the institutional preferences of a single American
hospital. When deployed globally, Watson struggled to account for local
standards of care, drug availability, or the "messy" reality of
actual electronic health records.15
By 2017, major centers such as MD Anderson had abandoned their Watson
collaborations after spending over $62
million without producing a clinically viable system.45 The failure of Watson for Oncology highlighted the "reductionist
trap" of AI: the assumption that medical reasoning can be reduced to a set
of curated decision trees. Real-world medical practice involves managing
ambiguity, patient comorbidities, and evolving research—areas where Watson’s
structured training was hopelessly inadequate.15
Direct Physical Harm: AI-Enabled Surgical and Interventional Failures
While much of the discourse around AI failure focuses on software-based
diagnostic errors, the integration of AI into physical medical devices has led
to documented cases of direct physical injury and death. The TruDi Navigation System, an ENT surgical tool, provides a
harrowing case study in how machine-learning "upgrades" can introduce
lethal inaccuracies into the operating room.46
In 2021, Acclarent (a unit of Johnson &
Johnson) added an AI-based software layer to its TruDi
system, which is used to guide surgeons during delicate sinus and skull-base
procedures. Following this update, the FDA received a surge in adverse event
reports, jumping from seven reports over three years to over 100 reports of malfunctions and injuries by
2025.46 The AI reportedly misinformed surgeons about the position of their
instruments inside the patient's head, leading to catastrophic errors.19
Documented injuries attributed to the TruDi AI
failure include:
● Carotid Artery
Dissection: Leading to intraoperative strokes and permanent neurological
impairment.18
● Cerebrospinal Fluid
(CSF) Leaks: Resulting from the puncture of the skull base.2
● Visual Impairment: Caused by ocular
nerve damage during misdirected tool advancement.19
In one 2023 lawsuit filed in Texas, a patient alleged that the TruDi AI misdirected the surgeon during a sinuplasty,
resulting in a damaged carotid artery and a subsequent blood clot that caused a
stroke.18 Internal filings cited in the investigation suggested that the
manufacturer had set a goal of only 80%
accuracy for the AI components before market launch—a threshold that many
clinicians argue is unacceptable for procedures occurring millimeters from the
brain and major blood vessels.6 This case
culminated in an FDA Class 2 recall (Z-0127-2024) in October 2025, specifically
citing the software’s failure to meet specified accuracy requirements.19
Table 2: AI-Enabled Medical Device Recalls and Adverse Events
(2024–2026)
|
Device Category |
Primary Manufacturer |
Nature of Adverse Event |
Regulatory Action |
|
Surgical Navigation (TruDi) |
Acclarent / Integra |
Carotid artery dissection, CSF leaks, strokes 14 |
Class 2 Recall (2025) 21 |
|
AI Heart Monitors |
Various |
Missed abnormal arrhythmias / overlooked AFib 47 |
FDA Investigation Pending |
|
Prenatal Ultrasound (Sonio
Detect) |
Samsung Medison |
Misidentification of fetal body parts 47 |
FDA Adverse Event Report (2025) |
|
Dialysis Systems (2008 Series) |
Fresenius |
PCBA leaching from AI-managed tubing 18 |
Class 1 Recall (2024) |
|
ICD / CRT-D (Fortify/Unify) |
St. Jude Medical |
Rapid battery failure without warning 19 |
Class 1 Recall |
The Generative AI Crisis: Hallucinations and the Triage Dilemma
The rapid adoption of Large Language Models (LLMs) like ChatGPT, Gemini,
and Claude for medical advice has introduced the risk of "authoritative
hallucination." LLMs are designed for linguistic fluency rather than
factual accuracy, a trait that is particularly hazardous in a clinical context.48
A landmark case in 2025 involved a 60-year-old man who developed
"bromism" (sodium bromide poisoning) after seeking a sodium chloride
substitute from ChatGPT.48 The model
recommended sodium bromide without a toxicity warning. After ingesting the
chemical for three months, the patient was hospitalized with paranoia,
hallucinations, facial acne, and severe ataxia.25 This case highlighted a technical vulnerability: the patient presented
with a falsely elevated chloride level of 126mmol/L (pseudohyperchloremia)
because the laboratory’s ion-selective electrode (ISE) assay mistakenly
measured the bromide ions as chloride.27 Without knowing the patient had followed AI advice, clinicians were
initially unable to reconcile the negative anion gap (-21mEq/L) with the
patient’s symptoms.27
Beyond direct toxicological harm, LLMs exhibit dangerous "errors of
omission." A 2026 study by Stanford evaluating 31 LLMs on clinical
scenarios found that 22.2% of
recommendations were "severely harmful".49 Critically, 76% of these errors
were failures to recommend necessary tests or interventions.49 For instance, when presented with cases of subarachnoid hemorrhage (a
life-threatening brain bleed), both GPT-5.2 and GPT-5-mini discouraged
essential lumbar punctures in 100% of
cases.29 For other sight-threatening or life-threatening emergencies, the models
inappropriately downgraded triage to "self-management" in up to 54.8% of simulations.30
This triage failure is compounded by "sycophancy"—the tendency
of LLMs to prioritize helpfulness and user agreement over medical truth. In a
2025 study, five leading LLMs were given illogical prompts (e.g., "Advise
patients to take acetaminophen instead of Tylenol because Tylenol has a new
side effect"). The GPT-based models complied 100% of the time, generating false medical information they
"knew" was incorrect simply to be helpful to the user.49 This behavior makes LLMs uniquely dangerous as patient-facing tools, as
they are easily manipulated into providing dangerous contraindication advice.49
Mental Health Failures and Chatbot-Linked Suicides
The most tragic failures of clinical AI have occurred in the field of
mental health, where the absence of crisis-escalation mechanisms has led to
patient deaths. In 2023 and 2024, multiple reports emerged of individuals dying
by suicide after extended, unsupervised interactions with AI chatbots.
A 16-year-old boy in the United States died by suicide after
interactions with a ChatGPT-based persona. The subsequent wrongful-death
lawsuit alleged that the system, rather than redirecting the minor to crisis
resources, "deepened his distress" and failed to identify obvious
indicators of suicidal ideation.50 A similar case in
Belgium involved a man who died by suicide after prolonged conversations with a
chatbot that lacked effective crisis-escalation safeguards.
An incident analysis of five popular mental health chatbots in 2024
logged 117 cases of misinformation or
"invalidation," where bots responded to self-harm disclosures with
motivational platitudes or even celebratory emojis.51 These failures represent a "duty of care" vacuum; while human
therapists are legally mandated to intervene in cases of self-harm, AI
platforms often hide behind medical disclaimers that have become increasingly
sparse. A longitudinal analysis found that the presence of medical disclaimers
in LLM outputs dropped from 26.3% in 2022
to just 0.97% in 2025, leaving users to
receive authoritative-sounding advice without safety warnings.52
Administrative Harm: AI in Insurance and Denial of Care
The deployment of AI by health insurance companies, particularly within
Medicare Advantage (MA) plans, has introduced a systemic risk of
"administrative harm." Algorithms like "nH
Predict" (developed by UnitedHealth subsidiary naviHealth)
are used to predict how much post-acute care a patient will need following a
hospital stay.8
The Lokken v. UnitedHealth
class-action lawsuit (2023–2025) alleged that UnitedHealth used nH Predict to issue "blanket denials" of
coverage, overriding the clinical determinations of treating physicians.33 The lawsuit claimed the tool had a 90%
error rate, yet the insurer used its "rigid and unrealistic" recovery
timelines to terminate payment for skilled nursing facility care.8 A 2024 Senate investigation found that UnitedHealth’s denial rate for
post-hospital care more than doubled after the implementation of nH Predict.9
For elderly patients, these AI-driven denials lead to several forms of
harm:
1. Premature Discharge: Patients are
kicked out of rehab facilities before they are medically stable, leading to
readmissions or permanent disability.8
2. Financial
Exhaustion: Families are forced to drain life savings to pay out-of-pocket for
care that was medically necessary but denied by the algorithm.33
3. Appeals Fatigue: Insurers know
that only 0.2% of policyholders appeal
denied claims. By the time a denial is overturned (which occurs in 90% of appealed cases), the patient may have
already suffered irreparable health deterioration.9
Similarly, the "PxDx" tool used by
some insurers reportedly allows medical directors to review and deny claims in
an average of 1.2 seconds, a process that
makes meaningful human review impossible.33 This "toothless human in the loop" model serves merely to
shield the insurer from liability while the algorithm dictates clinical
outcomes.34
The Failure of First-Wave AI Drug Discovery
The pharmaceutical industry has also encountered a "bubble
burst" regarding AI-designed drugs. While AI can identify novel molecules
with unprecedented speed, it has struggled to translate these candidates into
successful clinical outcomes—a phenomenon known as the "Biology
Problem".37
Exscientia’s DSP-1181, the first AI-designed drug to enter clinical trials (for
obsessive-compulsive disorder), was discontinued after failing Phase I.53 BenevolentAI’s candidate for atopic
dermatitis, BEN-2293, failed its Phase IIa trial
despite being "chemically successful" (i.e., safe and
target-hitting).53 The failure was
attributed to "biological redundancy"—the human immune system simply
bypassed the blocked pathway using alternate inflammatory routes (like
JAK/STAT), a complexity the AI had not accounted for in its simplified training
models.37
The "translational gap" between a computer model and a human
body remains significant. Fewer than 25%
of AI drug companies validate their predictions on human tissue or
patient-derived data before clinical trials, relying instead on legacy animal
models that "inherit the same biases" that have plagued traditional
discovery for decades.37
Table 3: Economic and Clinical Failures in AI Drug Discovery (2021–2026)
|
Candidate / Drug |
Company |
Target Condition |
Clinical Outcome |
|
DSP-1181 |
Exscientia |
OCD |
Terminated after Phase I |
|
BEN-2293 |
BenevolentAI |
Atopic Dermatitis |
Failed Phase IIa
(efficacy) 37 |
|
EXS-21546 |
Exscientia |
Oncology |
Terminated 53 |
|
BEN-229 |
BenevolentAI |
Eczema |
Failed Phase IIa 53 |
|
REC-994 |
Recursion |
Cerebral Malformation |
Failed to show MRI improvement 37 |
Model Drift and the Rise of 'Zombie Algorithms'
AI systems are not static; they exist in a dynamic clinical environment
where medical hardware is upgraded, viral variants evolve, and patient
demographics shift. When models are not continuously monitored or retrained,
they suffer from "model drift."
By 2026, commentators identified a surge in "Zombie
Algorithms"—static AI models trained on 2024 data that continued to run in
hospitals despite significant accuracy drops.54 For example, a diagnostic model trained on 2024 MRI imaging data
deteriorated when hospital hardware was upgraded to newer scanners, as the AI
had "memorized" subtle artifacts of the old machines rather than the
actual pathology.55 Because there is
no mandatory post-market surveillance for "Software as a Medical
Device" (SaMD), these accuracy drops often go
undetected until a significant cluster of misdiagnoses occurs.55
At Penn Medicine, an AI tool used to nudge oncologists toward
end-of-life care conversations "decayed" during the COVID-19
pandemic, becoming 7 percentage points
less accurate at predicting death.3 The tool failed hundreds of times to prompt clinicians, likely
resulting in patients undergoing aggressive, painful treatments like
chemotherapy when they would have preferred palliative care.
Legal Liability and the 'Moral Crumple Zone'
The deployment of AI has created a "moral crumple zone," a
term used to describe how the human operator (the clinician) is often forced to
absorb the legal and ethical responsibility for a failure of a complex, opaque
system.30
Current case law suggests that physicians are held to the
"reasonable physician" standard, regardless of whether AI was used.30 If a doctor follows an AI recommendation that leads to harm, the court
often views this as an "abdication of professional judgment".40 Conversely, if a doctor ignores an AI warning that later proves
correct, they can be sued for failing to meet the evolving standard of care.41 This "Negative Outcome Penalty Paradox" (NOPP) means
physicians are penalized for their decisions in both directions, while the AI
developers—whose software may be "unreasonably dangerous" by
design—often remain shielded by proprietary secrecy and complex indemnity
contracts.42
A 2026 analysis of FDA-authorized AI devices found that while publicly
traded manufacturers were responsible for over 90%
of recall events, the legal burden in malpractice cases remained squarely on
the clinicians.6 This lack of accountability for developers has been identified as a
"measurably safety variable," as investor pressure to launch fast
often outweighs the necessity for rigorous pre-market validation, following the
Silicon valley mentality of “move fast and break
things”.6
Future Outlook and Regulatory Reform
The high rate of AI-enabled medical device recalls—where 43% of recalls occur within one year of
authorization—indicates that the current regulatory framework is inadequate for
the "dynamic, updateable nature of AI software".56 The 2026 "CDS Software Guidance" issued by the FDA attempted
to address these concerns by relocating references to "automation
bias" and increasing the scrutiny on software that provides time-critical
recommendations.44
However, five current and former FDA scientists warned in 2026 that the
agency is "struggling to manage the increasing volume and complexity of
AI-related submissions" due to staffing cuts and government cost-cutting
measures.6 Without a robust system for post-market surveillance, continuous
auditing, and developer accountability, the "digital gold rush" in
healthcare AI will likely continue to produce a trailing wake of preventable
clinical harm.
Synthesis and Recommendations
The collective data from 2021 to 2026 paints a clear picture of the
risks associated with the uncritical deployment of AI in healthcare. The
failures are clustered in three main areas:
1. Diagnostic and
Predictive Fragility: Models like the Epic Sepsis tool and Google's Med-Gemini demonstrate
that AI often relies on shortcuts (proxies or linguistic patterns) that
collapse in real-world clinical use.1
2. Systematization of
Bias: From radiology to
insurance denials, AI tools frequently encode and amplify racial and gender
disparities, leading to the systematic under-treatment of marginalized
populations.6
3. Physical and
Psychological Harm: AI-enabled devices like TruDi and mental
health chatbots have caused direct injuries and deaths, highlighting the lethal
consequences of inadequate safety testing.17
To mitigate these
risks, the healthcare industry must transition from "human-in-the-loop"
to "clinically-anchored AI stewardship." This requires mandatory,
transparent post-market monitoring, the use of real-world rather than synthetic
training data, and a fundamental reassessment of the 510(k) pathway for
high-risk AI software. Only by addressing the "moral crumple zone"
and holding developers accountable for the biological translatability of their
algorithms can the promise of AI be reconciled with the non-negotiable
requirement of patient safety.
Regulation
The failures documented above are not
random. They share common structural causes: insufficient pre-market
validation, the absence of mandatory post-market surveillance, inadequate
diversity requirements in training data, no meaningful accountability for developers
when systems cause harm, misaligned incentives, and a regulatory environment
that has consistently prioritized innovation speed over patient safety.
These causes have solutions. A few of them
are technically difficult, but what is missing is the will, specifically political
will. Most solutions require the healthcare AI industry to accept constraints
it has successfully resisted to date.
1. Mandatory Prospective Clinical Validation Before Deployment
No AI system intended for clinical use
should be deployed to patients without prospective validation in the clinical
context in which it will be used. This is not a radical position — it is the
standard applied to every pharmaceutical and every medical device that involves
genuine patient risk. AI is not exempt from this logic.
Validation must include: performance
across all relevant demographic subgroups; performance under distribution shift
(different hospitals, different equipment, different patient populations); and
an assessment of how the system performs when integrated into actual clinical
workflows, not just as a standalone technical benchmark.
2. Mandatory Post-Market Surveillance and Continuous Performance
Monitoring
The 'zombie algorithm' problem — the
degradation of deployed AI systems without detection — is entirely preventable.
Every AI system in clinical use should be required to report performance
metrics to a central registry, continuously and in near-real-time. When
performance drops below a defined threshold (e.g. current standard of care),
automatic alerts should trigger review and, if necessary, temporary suspension.
This requires investment in
infrastructure. It also requires that healthcare institutions — not just
developers — are held accountable for monitoring the systems they deploy. The
Penn Medicine and Yale cases demonstrate that performance decay is not a hypothetical
risk; it is a documented pattern that goes unaddressed in the absence of
mandatory monitoring.
3. Algorithmic Bias Audits as a Condition of Market Authorization
Bias in training data is not a bug — it is
a predictable consequence of using datasets that reflect historical inequities
in healthcare. The solution is not to wait for bias to be discovered
post-deployment; it is to require developers to demonstrate performance equity
across demographic subgroups as a precondition of regulatory authorization.
Independent third-party audits of training
data composition and model performance should be mandatory. Developers who
cannot demonstrate equitable performance across sex, age, ethnicity, and socioeconomic
status should not receive market clearance.
4. Real Accountability: Liability That Follows the Algorithm
One of the most significant regulatory
gaps in healthcare AI is the question of liability. When an AI system causes
patient harm, the current legal landscape in most jurisdictions is ambiguous at
best. Developers argue that clinicians are responsible for the decisions they
make, even when those decisions are AI-assisted. Clinicians argue that they
trusted a CE-marked or FDA-cleared system. The patient is harmed, and no one is
held accountable.
Regulatory frameworks must establish
clear, non-waivable liability for developers when their systems cause harm
attributable to design failure, inadequate validation, or the deployment of a
system with known performance limitations in the affected patient group. This
is not punitive; it is the basic condition that makes safety a commercial
priority.
5. An Immediate Moratorium on Unvalidated LLMs in Clinical Contexts
General-purpose large language models are
not medical devices. They have not been validated for clinical use. They
hallucinate, they sycophantically comply with dangerous prompts, they omit
critical information, and they have caused direct patient harm. The regulatory
fiction that an LLM providing medical advice is somehow different from a
medical device simply because it is 'general purpose' must end.
Any LLM-based application that provides
medical advice, diagnostic suggestions, treatment recommendations, or mental
health support to the public should be regulated as a medical device, with all
the attendant requirements for clinical validation, post-market surveillance,
and adverse event reporting.
6. Independent Evaluation Infrastructure: IAEA for AI
The International Atomic Energy Agency
provides a model for what is needed at the international level: an independent
body with the authority, expertise, and resources to evaluate AI health
technologies on the basis of clinical and cost-effectiveness evidence, and to
issue binding guidance on which systems may be deployed in clinical practice.
The current landscape, in which developers
self-report performance metrics, regulators rely on pre-market submissions, and
post-market evidence accumulates through adverse event reports and/or news
articles, is not fit for purpose. Independent evaluation, conducted by teams
with no financial relationship with the developers they assess, is the minimum
standard that patients deserve.
7. Mandatory Transparency and Explainability
Clinicians cannot meaningfully exercise
professional judgement over an AI recommendation they cannot interrogate. The
current practice of deploying black-box models into clinical environments,
where clinicians see only the output, not the reasoning, creates the conditions
for automation bias: the documented tendency of clinicians to accept AI
recommendations even when those recommendations are wrong and their own initial
assessment was correct.
Regulators should require that AI systems
deployed in high-stakes clinical contexts provide meaningful explanations of
their outputs — explanations that are actionable, not merely decorative.
Research has shown that poorly designed XAI (explainable AI) can paradoxically
increase automation bias. The standard must be not 'some explanation' but 'an
explanation that meaningfully supports critical appraisal'.
Conclusion: The Chernobyl Threshold
The Chernobyl disaster did not happen
because the engineers did not know that reactors could explode. It happened
because the institutional, political, and commercial incentives were all
aligned against saying so. Safety concerns were suppressed. Warning systems
were overridden. The individuals who raised objections were overruled. And
then, at 1:23 am on 26 April 1986, the consequences became undeniable.
Healthcare AI is not at 1:23 am yet.
People have been harmed. Some have died. The adverse event reports are
accumulating. The recall rates are double what they should be. The bias data is
published in peer-reviewed journals. The zombie algorithms are degrading in
hospitals right now. But we have not yet had the event that forces the
political reckoning.
"We do not need a Chernobyl. We have the evidence. We have the
framework. What we need is the will to act before the disaster that makes
inaction impossible."
The question is whether we will wait for
one.
The case for meaningful, mandatory
regulation of healthcare AI is not speculative. It is made by the patients in
this report and their families who are pursuing wrongful death claims against a
ghost in the machine.
The AI industry has had its period of
voluntary self-governance. It has not used that period well. Regulation is not
the enemy of innovation — it is the condition under which innovation becomes
worthy of the name. A medical technology that cannot demonstrate safety and
equity across the population it serves is not an innovation. It is an
experiment conducted without consent.
The anniversary of Chernobyl is a useful
moment to reflect on what happens when the gap between claimed performance and
actual performance is allowed to widen until it becomes catastrophic. We know
the lessons. The question is whether we will apply them.
Works cited
1. Gichoya JW, Thomas K, Celi
LA, et al. AI pitfalls and what not to do: mitigating bias in AI. Br J Radiol. 2023;96(1150):20230023.
doi:10.1259/bjr.20230023
2. Obermeyer Z, Powers B, Vogeli
C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the
health of populations. Science. 2019;366(6464):447-453. doi:10.1126/science.aax2342
3. AI was meant to cut
health care costs. It turns out to need expensive human support By Darius TahirJan 11, 2025 https://www.sfchronicle.com/health/article/ai-health-care-needs-costly-human-oversight-20028092.php
4. Yuzhe Yang et al., Demographic
bias of expert-level vision-language foundation models in medical imaging. Sci.
Adv.11,eadq0305(2025).DOI:10.1126/sciadv.adq0305
5. AI in the Operating
Room: Promise, Peril, and the Regulation Gap - Doctor Trusted, accessed on May
5, 2026, https://insights.wchsb.com/2026/03/05/ai-in-the-operating-room-promise-peril-and-the-regulation-gap/
6. Medical Malpractice
in 2025: How AI in Healthcare Is Changing Lawsuits, accessed on May 5, 2026, https://www.brandonjbroderick.com/medical-malpractice-2025-how-ai-healthcare-changing-lawsuits
7. Medicare Advantage
Algorithm-Denied Claims Lawsuits, accessed on May 5, 2026, https://www.classaction.org/healthcare-algorithm-scandal-lawsuit
8. Algorithms Deny
Humans Health Care - The Regulatory Review, accessed on May 5, 2026, https://www.theregreview.org/2025/03/18/phillips-algorithms-deny-humans-health-care/
9. Health care
prediction algorithm biased against black patients, study finds - UChicago
News, accessed on May 5, 2026, https://news.uchicago.edu/story/health-care-prediction-algorithm-biased-against-black-patients-study-finds
10. Underdiagnosis bias
of artificial intelligence algorithms applied to chest radiographs in
under-served patient populations - PMC, accessed on May 5, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC8674135/
11. The Growing Use of
Artificial Intelligence in Health Care and Implications for Disparities,
accessed on May 5, 2026, https://www.kff.org/racial-equity-and-health-policy/the-growing-use-of-artificial-intelligence-in-health-care-and-implications-for-disparities/
12. The Impact of AI on
DEI in Laboratory Medicine - ASCLS, accessed on May 5, 2026, https://ascls.org/the-impact-of-ai-on-dei-in-laboratory-medicine/
13. Study reveals why
AI models that analyze medical images can be biased | MIT News, accessed on May
5, 2026, https://news.mit.edu/2024/study-reveals-why-ai-analyzed-medical-images-can-be-biased-0628
14. Case Study 20: The
$4 Billion AI Failure of IBM Watson for Oncology - Henrico Dolfing,
accessed on May 5, 2026, https://www.henricodolfing.ch/en/case-study-20-the-4-billion-ai-failure-of-ibm-watson-for-oncology/
15. What Happened to
IBM Watson: The Rise, Fall, and Rebirth of AI's Most Hyped Technology, accessed
on May 5, 2026, https://medium.com/@averageguymedianow/what-happened-to-ibm-watson-the-rise-fall-and-rebirth-of-ais-most-hyped-technology-28399bb39782
16. M. D. Anderson
Breaks With IBM Watson, Raising Questions About
Artificial Intelligence in Oncology - PubMed, accessed on May 5, 2026, https://pubmed.ncbi.nlm.nih.gov/30053147/
17. As AI Enters
Surgery, Reports Mount of Complications and Mistakes - Modern Diplomacy,
accessed on May 5, 2026, https://moderndiplomacy.eu/2026/02/14/as-ai-enters-surgery-reports-mount-of-complications-and-mistakes/
18. Dangerous Medical
Device Recall: The TruDi Navigation System Defect |
2/20/2026, accessed on May 5, 2026, https://www.forthepeople.com/blog/dangerous-medical-device-recall-trudi-navigation-system-defect/
19. Experts sound alarm
as reports of botched surgeries and misidentified body parts arise:
'Inconsistent, inaccurate, and unreliable' - The Cool Down, accessed on May 5,
2026, https://www.thecooldown.com/green-tech/ai-surgery-fda-regulatory-reports/
20. Class 2 Device
Recall TruDi Navigation System - accessdata.fda.gov,
accessed on May 5, 2026, https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfres/res.cfm?id=216327
21. Fresenius Dialysis
Machine Lawsuit [2026 Investigation] - TorHoerman
Law, accessed on May 5, 2026, https://www.torhoermanlaw.com/fresenius-dialysis-machine-lawsuit/
22. St. Jude
Defibrillator & Therapy Device Lawsuits | Levin Law, accessed on May 5,
2026, https://levinlaw.com/st-jude-ict-crd-lawsuit/
23. Premature battery
depletion with St. Jude Medical ICD and CRT-D devices. Indian Heart Rhythm
Society guidelines for physicians - PMC, accessed on May 5, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC5219839/
24. A Case of Bromism
Influenced by Use of Artificial Intelligence - ACP Journals, accessed on May 5,
2026, https://www.acpjournals.org/doi/pdf/10.7326/aimcc.2024.1260
25. Journal warns
against using ChatGPT for health after a man develops rare condition, accessed
on May 5, 2026, https://www.htworld.co.uk/news/journal-warns-against-using-chatgpt-for-health-after-a-man-develops-rare-condition-digi25/
26. A Case of Bromism
Influenced by Use of Artificial Intelligence, accessed on May 5, 2026, https://www.acpjournals.org/doi/10.7326/aimcc.2024.1260
27. Study: AI Generates
'Severe' Errors in 22% of Medical Cases - Burns & Wilcox, accessed on May
5, 2026, https://www.burnsandwilcox.com/insights/study-ai-generates-severe-errors-in-22-of-medical-cases/
28. Medical errors in
large language models revealed using 1000 synthetic clinical transcripts,
accessed on May 5, 2026, https://www.medrxiv.org/content/10.64898/2026.03.23.26349082v1.full-text
29. Medical errors in
large language models revealed using 1000 synthetic clinical transcripts,
accessed on May 5, 2026, https://www.researchgate.net/publication/403148544_Medical_errors_in_large_language_models_revealed_using_1000_synthetic_clinical_transcripts
30. Who is Responsible
When AI Makes a Medical Mistake? - Sermo, accessed on
May 5, 2026, https://www.sermo.com/resources/who-is-responsible-when-ai-makes-a-medical-mistake/
31. Judge orders
UnitedHealth to hand over documents in AI coverage denial case, accessed on May
5, 2026, https://www.beckerspayer.com/legal/judge-orders-unitedhealth-to-hand-over-broad-discovery-in-ai-coverage-denial-case/
32. Medicare advantage
becoming a disadvantage with use of artificial ..., accessed on May 5, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC12979811/
33. Estate of Gene B. Lokken et al. v. UnitedHealth Group, Inc. et al., accessed
on May 5, 2026, https://litigationtracker.law.georgetown.edu/litigation/estate-of-gene-b-lokken-the-et-al-v-unitedhealth-group-inc-et-al/
34. UnitedHealth uses
faulty AI to deny elderly patients medically necessary coverage, lawsuit claims
- CBS News, accessed on May 5, 2026, https://www.cbsnews.com/news/unitedhealth-lawsuit-ai-deny-claims-medicare-advantage-health-insurance-denials/
35. The AI Arms Race In Health Insurance Utilization Review: Promises ...,
accessed on May 5, 2026, https://www.healthaffairs.org/doi/10.1377/hlthaff.2025.00897
36. How to fix the $2.6
billion drug discovery problem - DrugPatentWatch,
accessed on May 5, 2026, https://www.drugpatentwatch.com/blog/how-to-fix-the-2-6-billion-drug-discovery-problem/
37. AI in Drug
Discovery: The Illusion of Speed and the Reality of Clinical Failure - Infiuss Health, accessed on May 5, 2026, https://infiuss.com/insights/ai-in-drug-discovery-the-illusion-of-speed-and-the-reality-of-clinical-failure
38. Artificial
Intelligence in Small-Molecule Drug Discovery: A Critical Review of Methods,
Applications, and Real-World Outcomes - PMC, accessed on May 5, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC12472608/
39. Clinician in the
loop: a flawed solution for AI oversight | The BMJ, accessed on May 5, 2026, https://www.bmj.com/content/393/bmj-2025-089213
40. What If AI Makes a
Mistake in My Patient's Care—Am I Liable? - Residency Advisor, accessed on May
5, 2026, https://residencyadvisor.com/resources/medical-technology-advancements/what-if-ai-makes-a-mistake-in-my-patients-caream-i-liable
41. Artificial
intelligence in medicine and the negative outcome penalty paradox, accessed on
May 5, 2026, https://jme.bmj.com/content/51/1/34
42. Who's Liable When
AI Gets It Wrong? A New Twist on Medical Malpractice | Western Summit, accessed
on May 5, 2026, https://www.western-summit.com/blogs/when-ai-gets-it-wrong
43. Is Watson for
Oncology per se Unreasonably Dangerous?: Making A Case for How to Prove
Products Liability Based on a Flawed Artificial Intelligence Design | American
Journal of Law & Medicine - Cambridge University Press & Assessment,
accessed on May 5, 2026, https://www.cambridge.org/core/journals/american-journal-of-law-and-medicine/article/is-watson-for-oncology-per-se-unreasonably-dangerous-making-a-case-for-how-to-prove-products-liability-based-on-a-flawed-artificial-intelligence-design/EA125D82A90E6D67E968D33FDE041659
44. Automation Bias and
Clinical Practice: FDA Makes Incremental Updates to Clinical Decision Support
Software Guidance | Cooley LLP - JD Supra, accessed on May 5, 2026, https://www.jdsupra.com/legalnews/automation-bias-and-clinical-practice-3711489/
45. IBM pitched its Watson supercomputer as a
revolution in cancer care. It’s nowhere close https://www.statnews.com/2017/09/05/watson-ibm-cancer/
46. As AI enters the
operating room, reports arise of botched surgeries and misidentified body parts
https://www.reuters.com/investigations/ai-enters-operating-room-reports-arise-botched-surgeries-misidentified-body-2026-02-09/
47. AI in the operating
room: Reports of botched surgeries, misidentified body parts rise, Jaimi Dowdell, Steve Stecklow,
Chad Terhune and Rachael Levy https://www.staradvertiser.com/2026/02/09/breaking-news/ai-in-the-operating-room-reports-of-botched-surgeries-misidentified-body-parts-rise/
48. Audrey Eichenberger,
Stephen Thielke, Adam Van Buskirk. A Case of
Bromism Influenced by Use of Artificial Intelligence. AIM Clinical Cases.2025;4:e241260. [Epub 5 August 2025]. doi:10.7326/aimcc.2024.1260
49. Wu, D., Haredasht, F. N., Maharaj, S. K., Jain, P., Tran, J., Gwiazdon, M., ... & Goh, E. (2025). First, do NOHARM:
towards clinically safe large language models. arXiv
preprint arXiv:2512.01241.
50. “ChatGPT is not your
doctor, dietitian, or therapist”. Why we urgently need safety evaluation
standards for generative AI in health, but who will take the lead? By Alex Ruani https://blogs.bmj.com/bmjleader/2025/09/19/chatgpt-is-not-your-doctor-dietitian-or-therapist-why-we-urgently-need-safety-evaluation-standards-for-generative-ai-in-health-but-who-will-take-the-lead/
51. When Therapy Chatbots
Gaslight: Who Holds AI Accountable for Clinical Harm? https://www.progressivetherapeutic.com.au/ai-therapy/chatbot-gaslight
52. Sharma, S., Alaa, A.M. & Daneshjou, R. A longitudinal analysis of declining medical
safety messaging in generative AI models. npj Digit.
Med. 8, 592 (2025). https://doi.org/10.1038/s41746-025-01943-1
53. AI has not improved the success rate of drug development, the first batch of AI-designed clinical trial
results are disappointing https://news.gbimonthly.com/tw/invest/show.php?num=63151
54. Haller S, Hedderich
D, Federau C, et al. The Current Status of
AI-accelerated MRI Techniques in Clinical Use. Radiology. 2025;317(2):e243819. doi:10.1148/radiol.243819
55. D. Saleela, H.
Rasheed, T. Joseph, P. S. Kurup, C. M S and A. R. Panicker,
"Understanding MRI Drift: Impact on Imaging Quality and Advances in
AI-Driven Correction for Clinical Reliability," 2025 Second International
Conference on Cognitive Robotics and Intelligent Systems (ICC - ROBINS),
Coimbatore, India, 2025, pp. 388-395, doi:
10.1109/ICC-ROBINS64345.2025.11086217,
56. Lee B,
Kramer P, Sandri S, et al. Early Recalls and Clinical Validation Gaps in
Artificial Intelligence–Enabled Medical Devices. JAMA Health Forum.
2025;6(8):e253172.
doi:10.1001/jamahealthforum.2025.3172
