AI Tool Hallucinations in Medical Transcriptions

New AI tool to help doctors is “hallucinating”

OpenAI’s speech-to-text tool Whisper, introduced two years ago, is now widely used by healthcare AI company Nabla, which relies on it to transcribe medical conversations for its 45,000 clinicians across more than 85 organizations, including the University of Iowa Health Care.

Recent research, however, has raised concerns about Whisper’s reliability. Studies have found that the tool sometimes “hallucinates,” generating statements that were never spoken, which could pose risks in medical settings. A researcher from the University of Michigan found hallucinations in 80% of the Whisper transcriptions examined, while an unnamed developer encountered similar issues in half of more than 100 hours of transcriptions. Another engineer reported inaccuracies in almost all of 26,000 transcripts processed by Whisper.

Alondra Nelson, a professor at the Institute for Advanced Study in Princeton, NJ, told the Associated Press that these transcription errors could have “really grave consequences” in medical environments, as “nobody wants a misdiagnosis.”

Earlier this year, researchers from Cornell, NYU, the University of Washington, and the University of Virginia examined Whisper’s hallucination frequency using 13,140 audio segments (each averaging 10 seconds) from TalkBank’s AphasiaBank, a database that includes speech samples from people with aphasia, a language disorder. Their findings showed 312 instances of fully hallucinated phrases or sentences that were not present in the original audio. Furthermore, 38% of these hallucinations included harmful or stereotypical language irrelevant to the context.

The study also suggests that Whisper may exhibit a “hallucination bias,” potentially adding errors more frequently for certain groups, such as individuals with aphasia, those with speech disorders like dysphonia, elderly speakers, or non-native speakers of a language. The researchers noted, “Our findings indicate that hallucination bias could arise for any demographic with speech impairments, affecting accuracy disproportionately.”