A.I. Hallucinations Are Getting Worse, Even as New Systems Become More Powerful


Last month, an A.I. bot that handles technical support for Cursor, an up-and-coming tool for computer programmers, alerted several customers about a change in company policy. It said they were no longer allowed to use Cursor on more than one computer.

In angry posts to internet message boards, the customers complained. Some canceled their Cursor accounts. And some got angrier when they realized what had happened: the A.I. bot had announced a policy change that did not exist.

“We have no such policy. You’re of course free to use Cursor on multiple machines,” the company’s chief executive and co-founder, Michael Truell, wrote in a Reddit post. “Unfortunately, this is an incorrect response from a front-line A.I. support bot.”

More than two years after the arrival of ChatGPT, tech companies, office workers and everyday consumers are using A.I. bots for a wide variety of tasks. But there is still no way of ensuring that these systems produce accurate information.

The newest and most powerful technologies, the so-called reasoning systems from companies such as OpenAI, Google and the Chinese start-up DeepSeek, are generating more errors, not fewer. Even as their math skills have notably improved, their handle on facts has gotten shakier. It is not entirely clear why.

Today’s A.I. bots are based on complex mathematical systems that learn their skills by analyzing enormous amounts of digital data. They do not, and cannot, decide what is true and what is false. Sometimes they simply make things up, a phenomenon some A.I. researchers call hallucinations. On one test, the hallucination rates of the newer A.I. systems were as high as 79 percent.

These systems use mathematical probabilities to guess the best response, not a strict set of rules defined by human engineers. So they make a certain number of mistakes. “Despite our best efforts, they will always hallucinate,” said Amr Awadallah, the chief executive of Vectara, a start-up that builds A.I. tools for businesses, and a former Google executive. “That will never go away.”

For several years, this phenomenon has raised concerns about the reliability of these systems. Though they are useful in some situations, such as writing term papers, summarizing office documents and generating computer code, their mistakes can cause problems.

The bots tied to search engines like Google and Bing sometimes generate search results that are laughably wrong. If you ask them for a good marathon on the West Coast, they might suggest a race in Philadelphia. If they tell you the number of households in Illinois, they might cite a source that does not include that information.

Those hallucinations may not be a big problem for many people, but they are a serious issue for anyone using the technology with court documents, medical information or sensitive business data.

“You spend a lot of time trying to figure out which responses are factual and which aren’t,” said Pratik Verma, co-founder and chief executive of Okahu, a company that helps businesses navigate the hallucination problem. “Not dealing with these errors properly basically eliminates the value of A.I. systems, which are supposed to automate tasks for you.”

Cursor and Mr. Truell did not respond to requests for comment.

For more than two years, companies like OpenAI and Google steadily improved their A.I. systems and reduced the frequency of these errors. But with the use of new reasoning systems, errors are rising. The latest OpenAI systems hallucinate at a higher rate than the company’s previous system, according to the company’s own tests.

The company found that o3, its most powerful system, hallucinated 33 percent of the time when running its PersonQA benchmark test, which involves answering questions about public figures. That is more than twice the hallucination rate of OpenAI’s previous reasoning system, called o1. The new o4-mini hallucinated at an even higher rate: 48 percent.

When running another test called SimpleQA, which asks more general questions, the hallucination rates for o3 and o4-mini were 51 percent and 79 percent. The previous system, o1, hallucinated 44 percent of the time.

In a paper detailing the tests, OpenAI said more research was needed to understand the cause of these results. Because A.I. systems learn from more data than people can wrap their heads around, technologists struggle to determine why they behave in the ways they do.

“Hallucinations are not inherently more prevalent in reasoning models, though we are actively working to reduce the higher rates of hallucination we saw in o3 and o4-mini,” said Gaby Raila, a company spokeswoman. “We’ll continue our research on hallucinations across all models to improve accuracy and reliability.”

Hannaneh Hajishirzi, a professor at the University of Washington and a researcher at the Allen Institute for Artificial Intelligence, is part of a team that recently devised a way of tracing a system’s behavior back to the individual pieces of data it was trained on. But because systems learn from so much data, and because they can generate almost anything, this new tool can’t explain everything. “We still don’t know how these models work exactly,” she said.

Tests by independent companies and researchers indicate that hallucination rates are also rising for reasoning models from companies such as Google and DeepSeek.

Since late 2023, Mr. Awadallah’s company, Vectara, has tracked how often chatbots veer from the truth. The company asks these systems to perform a straightforward task that is readily verified: summarize specific news articles. Even then, chatbots persistently invent information.

Vectara’s original research estimated that in this situation chatbots made up information at least 3 percent of the time and sometimes as much as 27 percent.

In the year and a half since, companies like OpenAI and Google pushed those numbers down into the 1 or 2 percent range. Others, such as the San Francisco start-up Anthropic, hovered around 4 percent. But hallucination rates on this test have risen with reasoning systems. DeepSeek’s reasoning system, R1, hallucinated 14.3 percent of the time. OpenAI’s o3 climbed to 6.8.

(The New York Times has sued OpenAI and its partner, Microsoft, accusing them of copyright infringement regarding news content related to A.I. systems. OpenAI and Microsoft have denied those claims.)

For years, companies like OpenAI relied on a simple concept: the more internet data they fed into their A.I. systems, the better those systems would perform. But they eventually used up just about all the English text on the internet, which meant they needed a new way of improving their chatbots.

So these companies are leaning more heavily on a technique that scientists call reinforcement learning. With this process, a system can learn behavior through trial and error. It is working well in certain areas, such as math and computer programming. But it is falling short in other areas.

“The way these systems are trained, they will start focusing on one task and start forgetting about others,” said Laura Pérez-Beltrachini, a researcher at the University of Edinburgh who is among a team closely examining the hallucination problem.

Another problem is that reasoning models are designed to spend time “thinking” through complex problems before settling on an answer. As they try to tackle a problem step by step, they run the risk of hallucinating at each step. The errors can compound as they spend more time thinking.

The latest bots reveal each step to users, which means the users may see each error, too. Researchers have also found that in many cases, the steps a bot displays are unrelated to the answer it eventually delivers.

“What the system says it is thinking is not necessarily what it is thinking,” said Aryo Pradipta Gema, an A.I. researcher at the University of Edinburgh and a fellow at Anthropic.
