A few months prior, a doctor demonstrated an AI transcription tool used to record and summarize patient meetings. While the outcome was satisfactory in that instance, researchers cited by ABC News have found that this is not always the case with OpenAI’s Whisper. The tool, employed by many hospitals, sometimes invents information entirely. A company called Nabla uses Whisper for a medical transcription tool that, according to ABC News, has transcribed an estimated 7 million medical conversations. The service is reportedly used by more than 30,000 clinicians and 40 health systems. Nabla is aware of Whisper’s tendency to hallucinate and says it is working to address the issue.
A study conducted by researchers from Cornell University and the University of Washington, among others, found that Whisper produced hallucinations in about 1 percent of transcriptions. These hallucinations included entirely fabricated sentences, sometimes with violent or nonsensical content, during periods of silence in the recordings. The researchers collected audio samples from TalkBank’s AphasiaBank for the study, noting that silence is particularly common when people with a language disorder called aphasia are speaking. One of the researchers, Allison Koenecke of Cornell University, shared examples from the study in a thread.
The study found that hallucinations sometimes included invented medical conditions or phrases typical of YouTube videos, such as “Thank you for watching!” (OpenAI reportedly used more than a million hours of YouTube videos to train GPT-4.) The researchers presented their findings in June at the Association for Computing Machinery’s FAccT conference in Brazil. It remains unclear whether the study has undergone peer review.
OpenAI spokesperson Taya Christianson said in an emailed statement to The Verge that the company takes the problem seriously and is continually working to reduce hallucinations. She noted that the usage policies for Whisper on OpenAI’s API platform prohibit its use in high-stakes decision-making contexts, and that the model card for open-source use recommends against deploying it in high-risk domains. OpenAI also thanked the researchers for sharing their findings.