The UK can blame its bad immigration data on Hungary, one of the eight central and eastern European countries which joined the European Union in 2004. Unlike most existing EU members, the UK allowed citizens of the new member states to move to the country and work without restrictions, expecting 5,000 to 13,000 people to arrive each year. But this was a massive underestimate, prompting accusations that immigration was out of control and arguably contributing to Britain’s exit from the EU.

Based on the results of the 2021 Census, the country which sent the most people to the UK was Poland, followed by Romania. But Hungary is the home of budget airline Wizz Air, which as part of keeping down costs tends to use smaller airports such as Luton, Birmingham and Sheffield Doncaster.

Also to keep down costs, the International Passenger Survey run by the Office for National Statistics (ONS) at the time focused on Heathrow, Gatwick and Manchester. As a result, it didn’t notice increasing numbers of eastern Europeans using budget flights run by Wizz Air and others.

Georgina Sturge, a statistician for the House of Commons Library research service, highlights the episode in her new book, Bad Data, as an example of how data collection can go awry. The passenger survey had been set up in the 1960s, when far fewer people travelled internationally, more left the UK permanently than arrived, and most people required visas.

“People didn’t tend to travel in large droves from Poznań to Doncaster in the past,” says Sturge. “Unfortunately for the statisticians, who hadn’t even stationed anyone there to do the survey at the time, that was exactly what people started to do.”

Sturge says the UK has excellent official data in some areas, including health, traffic accident statistics and much of the ONS’s output. The Office for Statistics Regulation maintains a list of approved national statistics which she describes as the gold standard.

“But ultimately, if we’re asked a question or we need to produce some briefing material on something and there is any data out there which seems remotely reliable, we will pretty much end up using it,” she says of her work for MPs and their staff. “From our perspective, it’s about explaining the caveats.” This means thinking about where data comes from, how it is collected and for what purpose, considering the human processes involved rather than just the technical matter of getting hold of it.

Replication crisis

Parliamentarians are not alone in being hungry for data, and not too picky about what they consume. Recent years have seen several scientific fields threatened by a replication crisis, where the results of research published in peer-reviewed journals cannot be reproduced by others repeating the work, in some cases because the data has errors or is faked.

Researchers who rely on such research data may find their work is undermined, but the risk can be lessened by using services that carry out reliability checks on papers. Healthcare journalist and academic Ivan Oransky co-founded Retraction Watch, a database of scientific papers that have been withdrawn. Its data is used by publishers and companies to check references through bibliographic management software including EndNote, Papers and Zotero, as well as digital library service Third Iron.

“We would be happy to work with more, and to have our database integrated into the manuscript management systems that publishers use,” he says. However, he adds, the bigger problem lies in inaccurate papers and data that have not been retracted, making it worth using post-publication review services such as PubPeer, of which he is a volunteer director.

More generally, he adds that researchers are well-advised to follow the Russian proverb, “trust, but verify”, adopted by former US president Ronald Reagan in nuclear disarmament talks with the Soviet Union. Researchers should aim to obtain and analyse the original data before relying on it for a project or further research. “That may seem inefficient, but it’s far better than being caught unaware when a project is much further along,” says Oransky.

Another approach is to improve the classification of scientific data, particularly that held in text.
Neal Dunkinson, vice-president of solutions and professional services for semantic analytics company SciBite, says the word “hedgehog” in a genetics paper may refer to the sonic hedgehog gene that helps control how bodies develop from embryos, named after the video game character, or it may refer to the small, spiny mammal in general.

Cambridge-based SciBite, which was bought by Dutch scientific publisher Elsevier in 2020, has developed a service to automate the tagging of mentions of 40,000 genes to standard identities, making searches of papers, slides and electronic lab notebooks more precise. To do so, it has built lists of acronyms, alternative names and spellings, and common misspellings. As well as applying it to existing material, it offers a real-time option that prompts researchers to add tags through drop-down lists or the equivalent of a spellchecker.

Dunkinson says that good-quality data in life sciences should be “fair” – findable, accessible, interoperable and reusable. “We don’t at the moment critique the quality of the information written down – that’s about repeatability in the experimental process – but how usable is that information, is it tagged properly, is it stored correctly, do people know where it is, is it in the right formats,” he says.
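
As a rough illustration of the synonym-list approach to tagging described above, the short Python sketch below maps different spellings of a gene name to a single identifier. It is not SciBite's implementation: the synonym table, the identifier labels and the function name tag_mentions are all hypothetical, chosen only to show how lists of acronyms, alternative names and misspellings can resolve to one standard identity.

```python
import re

# Hypothetical synonym table (an assumption for illustration only):
# each surface form, in lower case, maps to one made-up standard identifier.
SYNONYMS = {
    "sonic hedgehog": "GENE:SHH",
    "shh": "GENE:SHH",               # acronym
    "sonic hedghog": "GENE:SHH",     # common misspelling
    "hedgehog": "ANIMAL:hedgehog",   # the small, spiny mammal
}

# Try longer names first so "sonic hedgehog" is matched before "hedgehog".
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(s) for s in sorted(SYNONYMS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def tag_mentions(text):
    """Return (matched text, identifier) pairs in order of appearance."""
    return [
        (m.group(0), SYNONYMS[m.group(0).lower()])
        for m in _PATTERN.finditer(text)
    ]


if __name__ == "__main__":
    sample = "Sonic hedgehog (SHH) signalling was studied; a hedgehog was seen nearby."
    for mention, identifier in tag_mentions(sample):
        print(mention, "->", identifier)
```

A plain lookup like this cannot tell from context whether a bare “hedgehog” means the gene family or the animal, and it would also tag an unrelated “shh”, which is why a real service needs far larger curated lists and some form of disambiguation.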