Large language models provide unreliable answers about public services, Open Data Institute finds

Research questions AI’s trustworthiness in giving people accurate information about government services

Popular large language models (LLMs) are unable to provide reliable information about key public services such as health, taxes and benefits, the Open Data Institute (ODI) has found.

Drawing on more than 22,000 LLM prompts designed to reflect the kinds of questions people would ask artificial intelligence (AI)-powered chatbots, such as, “How do I apply for universal credit?”, the research raises concerns about whether chatbots can be trusted to give accurate information about government services.

The publication of the research follows the UK government’s announcement of partnerships with Meta and Anthropic at the end of January 2026 to develop AI-powered assistants for navigating public services.

“If language models are to be used safely in citizen-facing services, we need to understand where the technology can be trusted and where it cannot,” said Elena Simperl, the ODI’s director of research.

Responses from models – including Anthropic’s Claude-4.5-Haiku, Google’s Gemini-3-Flash and OpenAI’s ChatGPT-4o – were compared directly with official government sources. 
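The ODI’s own evaluation harness is not reproduced here, but the approach described – posing each prompt to a model and scoring the answer against official guidance – can be sketched in outline. The example below is an illustrative simplification, not the ODI’s methodology: query_model is a placeholder for whichever chatbot API is under test, and matching hypothetical “key facts” drawn from GOV.UK pages stands in for the study’s actual grading.

```python
# Illustrative sketch only, not the ODI's evaluation code.
# query_model() is a stand-in for whichever chatbot API is under test,
# and the key-fact check is a deliberately crude proxy for expert grading.

from dataclasses import dataclass


@dataclass
class TestCase:
    prompt: str           # a question a member of the public might ask
    key_facts: list[str]  # phrases taken from the official GOV.UK guidance


def query_model(prompt: str) -> str:
    """Stand-in for a call to the model being evaluated."""
    # A real harness would call the vendor's API here; this canned reply
    # simply keeps the example self-contained and runnable.
    return "You can apply for Universal Credit online through GOV.UK."


def grade(answer: str, case: TestCase) -> float:
    """Return the fraction of official key facts present in the answer."""
    answer_lower = answer.lower()
    hits = sum(fact.lower() in answer_lower for fact in case.key_facts)
    return hits / len(case.key_facts)


cases = [
    TestCase(
        prompt="How do I apply for universal credit?",
        key_facts=["apply online", "gov.uk"],  # hypothetical reference facts
    ),
]

for case in cases:
    answer = query_model(case.prompt)
    print(f"{case.prompt} -> {grade(answer, case):.0%} of key facts covered")
```

In a real study, the grading would be far more involved than substring matching – the point of the sketch is only the overall loop of prompting each model and checking its answer against the official source.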

The results showed many correct answers, but also significant variation in quality, particularly for specialised or less common queries.

They also showed that chatbots rarely admitted when they didn’t know the answer to a question, and attempted to answer every query even when their responses were incomplete or wrong.

Burying key facts

Chatbots also often provided lengthy responses that buried key facts or extended beyond the information available on government websites, increasing the risk of inaccuracy.

Meta’s Llama 3.1 8B incorrectly stated that a court order is essential to add an ex-partner’s name to a child’s birth certificate. If followed, this advice would lead to unnecessary stress and financial cost.

ChatGPT-OSS-20B incorrectly advised that a person caring for a child whose parents have died is only eligible for Guardian’s Allowance if they are the guardian of a child who has died. 

It also incorrectly stated that the applicant was ineligible if they received other benefits for the child. 

Simperl said that for citizens, the research highlights the importance of AI literacy, while for those designing public services, “it suggests caution in rushing towards large or expensive models, which risk vendor lock-in, given how quickly the technology is developing. We also need more independent benchmarks, more public testing, and more research into how to make these systems produce precise and reliable answers.”

The second International AI safety report, published on 3 February, made similar findings regarding the reliability of AI-powered systems, noting that while there have been improvements in recalling factual information since the 2025 safety report, “even leading models continue to give confident but incorrect answers at significant rates”.

Following incorrect advice

It also highlighted users’ propensity to follow incorrect advice from automated systems generally, including chatbots, “because they overlook cues signalling errors or because they perceive the automation system as superior to their own judgement”.

The ODI’s research also challenges the idea that larger, more resource-intensive models are always a better fit for the public sector, with smaller models in many cases delivering comparable results at lower cost than large, closed-source models such as ChatGPT.

Simperl warned that governments should avoid locking themselves into long-term contracts on the strength of models temporarily outperforming one another on price or benchmarks.

Commenting on the ODI’s research during a launch event, Andrew Dudfield, head of AI at Full Fact, highlighted that because the government’s position is pro-innovation, regulation is currently framed around principles rather than detailed rules.

“The UK may be adopting AI faster than it is learning how to use it, particularly when it comes to accountability,” he said.

Trustworthiness 

Dudfield noted that what makes this work compelling is that it focuses on real user needs, but that trustworthiness needs to be evaluated from the perspective of the person relying on the information, not from the perspective of demonstrating technical capability.

“The real risk is not only hallucination, but the extent to which people trust plausible-sounding responses,” he said.

Asked at the same event if the government should be building its own systems or relying on commercial tools, Richard Pope, researcher at the Bennett School of Public Policy, said the government needs “to be cautious about dependency and sovereignty”.

“AI projects should start small, grow gradually and share what they are learning,” he said, adding that public sector projects should prioritise learning and openness rather than rapid expansion.


Simperl highlighted that AI creates the potential to tailor information for different languages or levels of understanding, but that those opportunities “need to be shaped rather than left to develop without guidance”.

Meanwhile, a January 2026 Gartner study found that, with new AI models launching every week, the increasingly large volume of unverified and low-quality data generated by AI systems was a clear and present threat to the reliability of LLMs.

Large language models are trained on data scraped from the web, books, research papers and code repositories. Many of these sources already contain AI-generated data, and at the current rate of expansion, they may soon all be populated with it.

Highlighting how future LLMs will increasingly be trained on the outputs of current ones as the volume of AI-generated data grows, Gartner said there is a risk of models collapsing entirely under the accumulated weight of their own hallucinations and inaccurate realities.

Managing vice-president Wan Fui Chan said that organisations could no longer implicitly trust data, or assume it was even generated by a human.

Chan added that as AI-generated data becomes more prevalent, regulatory requirements for verifying “AI-free” data will intensify in many regions.
