Tech giant OpenAI has touted its AI-powered transcription tool Whisper as having “near human-level robustness and accuracy.”
But Whisper has a major flaw: It is prone to making up chunks of text or even entire sentences, according to interviews with more than a dozen software engineers, developers and academic researchers.
Some of the fabricated text, known in the industry as hallucinations, can include racial commentary, violent rhetoric and even imagined medical treatments, these experts said.
Experts said such fabrications are a problem because Whisper is used in a large number of industries around the world to translate and transcribe interviews, generate transcripts in popular consumer technologies and create subtitles for videos.
More worrying, they said, is the rush by medical centers to use Whisper-based tools to transcribe patients' consultations with doctors, despite OpenAI's warnings that the tool should not be used in “high-risk domains.”
It's difficult to pinpoint the full extent of the problem, but researchers and engineers said they frequently encountered Whisper hallucinations in their work.
For example, a researcher at the University of Michigan who conducted a study on public meetings said he found hallucinations in eight out of 10 audio transcripts he examined, before he began trying to improve the model.
A machine learning engineer said he initially detected hallucinations in about half of the more than 100 hours of Whisper transcripts he analyzed.
A third developer said he found hallucinations in nearly every one of the 26,000 transcripts he created with Whisper.
Problems persist even with short, well-recorded audio samples. A recent study by computer scientists revealed 187 hallucinations in more than 13,000 clear audio clips they examined.
Researchers said that trend would translate into tens of thousands of faulty transcriptions across millions of recordings.
Such mistakes can have “really serious consequences,” especially in hospitals, said Alondra Nelson, who led the White House Office of Science and Technology Policy during the Biden administration until last year.
“No one wants a wrong diagnosis,” said Nelson, a professor at the Institute for Advanced Study in Princeton, New Jersey.
“There has to be a higher bar.”
The prevalence of these types of hallucinations has led experts, advocates, and former OpenAI employees to call on the federal government to consider regulating AI.
They said OpenAI at least needs to address the issues.
“This seems solvable if the company is willing to prioritize it,” said William Saunders, a San Francisco-based research engineer who resigned from OpenAI in February over concerns about the company's direction. “It's problematic if you put this out there and people are overconfident about what it can do and integrate it into all these other systems.”
An OpenAI spokesperson said the company is constantly studying how to reduce hallucinations and expressed gratitude for the researchers' findings, adding that OpenAI is incorporating feedback into model updates.
While most developers assume that transcription tools misspell words or make other mistakes, engineers and researchers said they had never seen another AI-powered transcription tool hallucinate as much as Whisper.
Professors Allison Koenecke of Cornell University and Mona Sloane of the University of Virginia examined thousands of short clips they obtained from TalkBank, a research repository hosted at Carnegie Mellon University.
They found that nearly 40% of the hallucinations were harmful or concerning because the speaker could be misinterpreted or misrepresented.
In one example they uncovered, a speaker said, “He, the boy, was going to, I'm not sure exactly, take the umbrella.”
But the transcription software added: “He took a big piece of a cross, a teeny, small piece… I'm sure he didn't have a terror knife so he killed a number of people.”
Researchers aren't certain what causes Whisper and similar tools to hallucinate, but software developers said the fabrications tend to occur amid pauses, background sounds or music playing.