
Imagine calling the Social Security Administration and asking, “Where is my April payment?” only to have a chatbot respond, “Canceling all future payments.” Your check has just fallen victim to “hallucination,” a phenomenon in which an automatic speech recognition system outputs text that bears little or no relation to the input.
Hallucinations are one of the many issues that plague so-called generative artificial intelligence systems like OpenAI’s ChatGPT, xAI’s Grok, Anthropic’s Claude or Meta’s Llama. These are design flaws, problems rooted in the architecture of these systems. Yet these are the same types of generative AI tools that DOGE and the Trump administration want to use to replace, in one official’s words, “the human workforce with machines.”
This is terrifying. There is no “one weird trick” that removes experts and creates miracle machines capable of doing everything humans can do, but better. The prospect of replacing federal workers who handle critical tasks, ones where errors could mean life or death for hundreds of millions of people, with automated systems that can’t even perform basic speech-to-text transcription without fabricating large swaths of text is catastrophic. If these systems cannot reliably parrot back the exact information given to them, their outputs will be riddled with errors, leading to inappropriate and even dangerous actions. Automated systems cannot be trusted to make decisions the way that federal workers, actual people, can.
Historically, “hallucination” hasn’t been a major issue in speech recognition. That is, although earlier systems could garble specific phrases or misspell words, they didn’t produce large chunks of fluent, grammatically correct text that was never uttered in the corresponding audio input. But researchers have shown that recent speech recognition systems like OpenAI’s Whisper can produce entirely fabricated transcriptions. Whisper is a model that has been integrated into some versions of ChatGPT, OpenAI’s famous chatbot.
For example, researchers from four universities analyzed short snippets of audio transcribed by Whisper, and found completely fabricated sentences, with some transcripts inventing the races of the people being spoken about, and others even attributing murder to them. In one case a recording that said, “He, the boy, was going to, I’m not sure exactly, take the umbrella” was transcribed with additions including: “He took a big piece of a cross, a teeny, small piece…. I’m sure he didn’t have a terror knife, so he killed a number of people.” In another example, “two other girls and one lady” was transcribed as “two other girls and one lady, um, which were Black.”
In the age of unbridled AI hype, with the likes of Elon Musk claiming to build a “maximally truth-seeking AI,” how did we come to have less reliable speech recognition systems than we did before? The answer is that while researchers working to improve speech recognition used their contextual knowledge to create models uniquely suited to that specific task, companies like OpenAI and xAI claim to be building something akin to “one model for everything” that can perform many tasks, including, according to OpenAI, “tackling complex problems in science, coding, math, and similar fields.” To do this, these companies use model architectures that they believe can serve many different tasks and train these models on vast amounts of noisy, uncurated data, instead of using system architectures and training and evaluation datasets that best fit the specific task at hand. A tool that supposedly does everything won’t be able to do it all well.
The current dominant method of building tools like ChatGPT or Grok, which are advertised along the lines of “one model for everything,” uses some variation of large language models (LLMs), which are trained to predict the most likely sequences of words. Whisper simultaneously maps the input speech to text and predicts what comes immediately next, a “token,” as output. A token is a basic unit of text, such as a word, number, punctuation mark or word segment, used to analyze textual data. So giving the system two disparate jobs, speech transcription and next-token prediction, in combination with the large, messy datasets used to train it, makes hallucinations more likely.
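To make “tokens” and next-token prediction concrete, here is a minimal sketch in Python. It uses the open-source Hugging Face transformers library and the small GPT-2 language model purely as stand-ins; it illustrates the general technique, not Whisper’s actual code or OpenAI’s internals.

```python
# A minimal sketch of tokenization and next-token prediction, the mechanism
# described above. GPT-2 and the Hugging Face "transformers" library are
# used only for illustration; this is NOT Whisper's implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "He, the boy, was going to take the"

# Step 1: split the text into "tokens" -- word pieces, punctuation and so on.
encoded = tokenizer(text, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0].tolist()))

# Step 2: ask the model which token is most likely to come next.
with torch.no_grad():
    logits = model(**encoded).logits  # a score for every token in the vocabulary

next_token_id = logits[0, -1].argmax().item()
print(tokenizer.decode(next_token_id))
```

Notice that the final step consults no audio at all: the model simply guesses the most statistically plausible continuation. A speech recognizer built around this kind of next-token prediction can therefore fill in fluent words that were never spoken, which is exactly the hallucination behavior described above.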