OpenAI’s Whisper tool raises concerns by adding false information to medical records

OpenAI’s Whisper transcription tool, designed to convert speech to text, reportedly introduces inaccuracies or "hallucinations" in its transcriptions, causing concern in high-stakes sectors like healthcare.

Launched in 2022, Whisper has been found to add fabricated details, such as racial descriptions or imaginary medications, to its transcriptions. This is particularly troubling in doctor-patient consultations, where misinformation could harm patients.

A report from the Associated Press revealed that Whisper’s automatic speech recognition (ASR) system has a high likelihood of generating inaccurate text.

Interviews with developers, software engineers, and researchers indicate that the tool can insert words that were never spoken, including sensitive or dangerous content. In one cited case, a simple sentence was distorted to include unrelated violent and racial descriptions. This hallucination issue often stems from background noise or pauses, which Whisper misinterprets as meaningful speech.
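
Because noise and pauses are named as a common trigger, one way to picture a safeguard is a review filter over Whisper's per-segment output. The following is a minimal sketch, not a method described in the report: it assumes the segment dictionaries returned by the open-source Whisper Python package, which include `no_speech_prob` and `avg_logprob` fields, and it runs on hand-written sample data rather than a real recording. The function name and both thresholds are hypothetical and would need tuning.

```python
from typing import Dict, List

# Illustrative thresholds (assumptions, not recommended values).
NO_SPEECH_THRESHOLD = 0.6     # above this, the segment is probably silence or noise
AVG_LOGPROB_THRESHOLD = -1.0  # below this, the model was unsure of its own text


def flag_suspect_segments(segments: List[Dict]) -> List[Dict]:
    """Return segments whose text may not correspond to real speech."""
    suspects = []
    for seg in segments:
        likely_silence = seg.get("no_speech_prob", 0.0) > NO_SPEECH_THRESHOLD
        low_confidence = seg.get("avg_logprob", 0.0) < AVG_LOGPROB_THRESHOLD
        if likely_silence or low_confidence:
            suspects.append(seg)
    return suspects


# Hand-written sample in the shape of Whisper's segment output.
sample = [
    {"start": 0.0, "end": 2.5, "text": "Take one tablet daily.",
     "no_speech_prob": 0.02, "avg_logprob": -0.21},
    {"start": 2.5, "end": 6.0, "text": "Thank you for watching.",
     "no_speech_prob": 0.91, "avg_logprob": -1.40},
]

for seg in flag_suspect_segments(sample):
    print(f"review {seg['start']:.1f}-{seg['end']:.1f}s: {seg['text']}")
```

A filter like this only marks segments for human review. It would not catch hallucinations that the model produces with high confidence, so it cannot replace checking the transcript against the original audio.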

Whisper’s hallucinations pose a particular risk because the tool is built into applications used in critical fields.

For instance, Paris-based Nabla has integrated Whisper into a transcription tool used by over 30,000 clinicians and 40 healthcare systems. The tool has recorded more than seven million transcriptions, and because the original audio is not retained, there is no recording against which potential errors can be checked and corrected.

The problem extends to accessibility tools for the deaf and hard-of-hearing, where accuracy is essential and verification can be challenging.

Researchers report that hallucinations may appear in as many as 80% of transcriptions, and one developer found them in nearly every one of the 26,000 transcripts he created. In response, OpenAI has acknowledged the need to improve Whisper’s accuracy and reduce hallucinations, and says it plans to use feedback to refine the model in future updates.

OpenAI originally described Whisper as having human-level accuracy and a robust ability to handle accents, noise, and technical jargon, but the persistent hallucinations show that it does not yet meet that standard in practice.
