OpenAI Latest AI Models Show Higher Hallucination Rates

Image credits: Telegraph

OpenAI's latest AI models, o3 and o4-mini, push technical boundaries in several areas. Yet despite these advances, they come with a troubling drawback: they hallucinate more often than older models. These hallucinations, in which the models generate false or misleading information, represent a long-standing and unresolved challenge in artificial intelligence.

Hallucinations remain a major obstacle even for the most capable systems. Historically, each new generation of models has hallucinated slightly less than the one before. However, OpenAI's internal testing reveals that o3 and o4-mini hallucinate more frequently than their predecessors, a reversal that has raised eyebrows.

The issue is all the more concerning because these models are branded as “reasoning models.” In theory, they should handle complex tasks more logically and with greater precision. Yet, according to OpenAI's technical report, both o3 and o4-mini produce more hallucinated responses than o1, o1-mini, and o3-mini, and even more than GPT-4o, which isn't classified as a reasoning model.

Accuracy Gains Come with a Tradeoff

OpenAI suggests that the increased hallucination rate might stem from the nature of reasoning models themselves. These models generate more claims overall, meaning that while they may produce more accurate insights, they also produce more inaccuracies. The effect is most visible in PersonQA, OpenAI’s internal benchmark that tests models on their knowledge of individuals.

In PersonQA evaluations, o3 hallucinated 33% of the time. That's double the hallucination rate of o1 (16%) and more than double that of o3-mini (14.8%). The o4-mini model performed even worse, hallucinating on 48% of PersonQA questions.
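To make those percentages concrete, here is a minimal sketch of how a hallucination rate on a question-answering benchmark could be tallied. PersonQA is an internal OpenAI benchmark, so the dataset layout, the grading rule, and the model_answer() helper below are hypothetical placeholders, not OpenAI's actual evaluation code.

```python
# Hypothetical sketch: tally a hallucination rate over a QA benchmark.
# The dataset format, grading rule, and model_answer() helper are
# placeholders; PersonQA itself is internal to OpenAI.

def model_answer(question: str) -> str:
    """Placeholder for a call to the model being evaluated."""
    raise NotImplementedError

def hallucination_rate(dataset: list[dict]) -> float:
    """Fraction of answers that assert claims unsupported by the
    reference facts (grading deliberately simplified)."""
    hallucinated = 0
    for item in dataset:  # each item: {"question": str, "facts": [str, ...]}
        answer = model_answer(item["question"])
        # Count the answer as hallucinated if none of the reference
        # facts for that person appear in it (a crude stand-in for a
        # real grader).
        if not any(fact.lower() in answer.lower() for fact in item["facts"]):
            hallucinated += 1
    return hallucinated / len(dataset)
```

Under that kind of tally, o3's 33% means roughly one in three PersonQA answers contained an unsupported claim, and o4-mini's 48% means nearly one in two.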

External researchers have found similar issues. Transluce, a nonprofit AI lab, observed o3 falsely describing actions it never performed. In one case, o3 claimed it had run code on a 2021 MacBook Pro outside of ChatGPT and copied the results into its response — something the model isn’t even capable of doing.

Neil Chowdhury, a researcher at Transluce and former OpenAI employee, speculated on the cause. “Our hypothesis is that the kind of reinforcement learning used for o-series models may amplify issues that are usually mitigated (but not fully erased) by standard post-training pipelines,” he explained.

Sarah Schwettmann, co-founder of Transluce, added that o3’s hallucination tendency may undercut its practical value.

Still, some in the industry are finding reasons to use the models. Kian Katanforoosh, a Stanford adjunct professor and CEO of AI training company Workera, said his team is actively testing o3 for code-related tasks. “It’s a step above the competition,” he noted. But he also confirmed that hallucinations occur — especially in the form of broken website links that o3 invents and presents as real.
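Katanforoosh's point about invented links suggests a simple guard that teams can apply on their own: verify every URL in a model's output before trusting it. The sketch below is a generic illustration of that idea, not something OpenAI or Workera has described; it assumes the requests package, and the URL pattern and timeout are arbitrary choices.

```python
# Minimal sketch: flag URLs in model output that do not resolve, as a
# guard against fabricated links. A generic mitigation, not anything
# described by OpenAI or Workera. Requires the `requests` package.
import re
import requests

URL_PATTERN = re.compile(r"https?://\S+")

def find_dead_links(model_output: str, timeout: float = 5.0) -> list[str]:
    """Return URLs from the text that fail to respond with a non-error status."""
    dead = []
    for raw in URL_PATTERN.findall(model_output):
        url = raw.rstrip(".,;:)'\"")  # trim trailing punctuation
        try:
            resp = requests.head(url, timeout=timeout, allow_redirects=True)
            if resp.status_code >= 400:
                dead.append(url)
        except requests.RequestException:
            dead.append(url)
    return dead

# Usage: dead = find_dead_links(answer_text)
# If the list is non-empty, warn that those links may be hallucinated.
```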

These types of hallucinations are especially problematic in business applications that demand high accuracy. In industries like legal services, healthcare, or finance, incorrect information can result in serious consequences. For example, no law firm would want a model that injects fake legal citations or incorrect client data into a contract.

Searching for a Solution

One promising way to improve model accuracy is through web search integration. OpenAI’s GPT-4o with search capability achieves 90% accuracy on SimpleQA, another internal benchmark. If the reasoning models could similarly leverage real-time search, it might help reduce hallucinations — though privacy concerns may arise, especially if prompts are exposed to third-party services.
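The article does not detail how such an integration works under the hood, but the general pattern is retrieval-augmented prompting: fetch sources first, then ask the model to answer only from them. The sketch below illustrates that pattern under stated assumptions; the search_web() helper is a placeholder for whatever search API is available, the model name is only an example, and the client call assumes the official openai Python package.

```python
# Generic retrieval-augmented prompting sketch, in the spirit of the
# search-backed setup described above. search_web() is a placeholder
# for a real search API; the model name is just an example.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def search_web(query: str, k: int = 3) -> list[str]:
    """Placeholder: return the top-k text snippets from a search API."""
    raise NotImplementedError

def answer_with_sources(question: str) -> str:
    snippets = search_web(question)
    context = "\n".join(f"- {s}" for s in snippets)
    prompt = (
        "Answer using only the sources below. If they do not contain "
        f"the answer, say so.\n\nSources:\n{context}\n\nQuestion: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # example model name; substitute as needed
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

Grounding answers in retrieved text does not eliminate hallucinations, but it gives the model verifiable material to draw on and gives users a way to check its claims, at the cost of sending prompts to a search provider, which is the privacy tradeoff noted above.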

The broader AI community has shifted its focus toward reasoning models in the past year, largely because scaling traditional models has been delivering diminishing returns. Reasoning models promise better performance on a range of tasks without requiring ever-larger amounts of training compute and data. But if scaling these models also increases hallucinations, it poses a serious dilemma for the field.

For now, OpenAI acknowledges the problem but offers no concrete solution. “Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability,” said OpenAI spokesperson Niko Felix in an email.

Solving the hallucination issue has become more urgent. As AI plays a larger role in sensitive and high-stakes environments, users will need to trust the information these systems provide. That trust is harder to build when the best-performing models are also the most likely to get the facts wrong.
