Hadoop starts to trumpet way through UK public sector

Big data technology Hadoop is starting to appear in patches of the UK public sector, including GCHQ, HMRC, the Home Office and the NHS

Hadoop was named after a child’s toy: software developer Doug Cutting borrowed the word from his son for “a way to take a bunch of computers and make them appear as one computer to software”, according to a video he recorded for the big data management software’s recent 10th anniversary. But Hadoop’s first public sector users were far from cuddly.

In 2009, the US National Security Agency (NSA) acknowledged it was using Hadoop, and in 2011 it transferred control of its data storage and retrieval system Accumulo, built on top of Hadoop, to software foundation Apache as an open source project, giving it the same status as Hadoop itself.

GCHQ also lists Hadoop as a technology used by staff working in a wide variety of areas, including data access, data science, networks, high-performance technology, the internet and data mining. The agency has joined the NSA in contributing to the Hadoop ecosystem by making Gaffer, a framework for building systems to store and process graphs on a Hadoop cluster running Accumulo, available through GitHub.

The use of Hadoop is now spreading across the UK public sector, with HM Revenue and Customs (HMRC) and the Home Office involved in significant projects. Both are doing so with commercial distributors of the software: Cloudera for HMRC, and Hortonworks for the Home Office.

Cloudera at HMRC

“HMRC has built an enterprise data hub – a powerful central repository for all of its data, which will help it to personalise services to customers and strengthen its compliance work,” says a spokesperson for the tax agency.

“HMRC will be able to store and analyse data using a mix of open source and closed source tools, and commodity hardware, representing better value for money for taxpayers. With HMRC’s data in Hadoop it will deliver new operational efficiency combined with the ability to analyse the data and gain insights that were previously extremely difficult to discover.”

HMRC spent £7.4m on Cloudera in January 2015, and a further £860,000 in February 2016. The Cloudera project represented what a document released in July 2015 called “further spend” on the enterprise data hub. The supplier says the project has been running for two years.

According to Cloudera, the National Crime Agency’s National Cyber Crime Unit is also “developing a Cloudera platform to gain new insights and greater intelligence from the varied types of data that it uses”.

Meanwhile, the Office for National Statistics reported it had spent £242,000 with Cloudera in May 2016, and has been advertising for a supplier with experience of Cloudera’s software stack to work on its address index referencing service, which will be used as part of the 2021 Census.

Untapped Hadoop potential

“Government, like a lot of old industry, has to go through that digital transformation, modernising its digital architecture, breaking down those silos,” says Stephen Line, Cloudera vice-president for northern Europe. “The UK is not necessarily behind or ahead particularly.”

But he adds there is untapped potential for smaller parts of the public sector to make use of Hadoop. Local authorities looking for fraud or NHS organisations working to improve their performance could benefit.

Line points to an unnamed US hospital group using Cloudera to reduce re-admission rates by predicting which patients have a high risk of re-admission and giving them additional medical care. Cloudera says its customer avoids 6,000 re-admissions annually and saves $76m in costs and Medicare penalties through use of the system. At present, the supplier does not have any NHS customers, although Line says there are “ongoing conversations”.

Hortonworks at the Home Office

Spending data from the Home Office records two payments to Cloudera’s rival Hortonworks: £53,000 in August 2014 and £61,000 in October of that year.

Hortonworks refused to contribute to this article, but in September 2015 the Home Office’s senior enterprise architect, Wayne Horkan, told a company event that its attraction was its “alignment to open source… it protects us from vendor change and lock-in, which we are not too keen on at the moment”. In comments published by Diginomica, he added: “The other piece is that you roll up everything together, get consistent build and delivery – that’s really useful to us. There is also a maturity in the ecosystem.”

The Home Office is using Hadoop as part of its Technology Platforms for Tomorrow programme to link up numerous databases managed by the department, according to a presentation made at a Hadoop Users Group UK meeting earlier this year and reported by The Register. This could allow policing and security data on individuals to be joined up for the use of border officials and police officers, and also for machine learning.

The Home Office’s work with Hadoop was criticised by Liberal Democrat leader Tim Farron, who said: “Trying to get away with a substantial change simply by labelling it as IT replatforming is simply unacceptable.” He told The Register: “Trying to bypass parliament is not an option and the home secretary must come clean about her real intentions,” referring to now prime minister Theresa May.

MapR in India and the US DHS

MapR, another distributor of Hadoop, has not announced any UK public sector clients, but does count India’s Aadhaar national biometric identity scheme as a customer. Aadhaar, which uses both iris and fingerprint scans and recently enrolled its one billionth subject, has millions of concurrent access requests and uses four different datacentres.

Jack Norris, MapR’s senior vice-president of data and applications, says the software is also used in the US by the Department of Homeland Security to analyse every entrance and exit from the US, by the Drug Enforcement Administration to identify targets, and by the military to collect data from video feeds and drones. But its use is also growing in healthcare, for genomic sequencing and to reduce the incidence of sepsis by tracking hospital infections.

Norris says there is potential for the public sector to use Hadoop to replace sampling and surveys with analysis of actual data, which can be more cost-effective as well as comprehensive. “In the past, you were rewarded for sampling the data, because it was so expensive to scale,” he says. But now, “rather than looking at a sample of the data, you can actually evaluate the whole population, and anomalies are much easier to detect”.
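Norris’s point about sampling can be illustrated with a toy example. The sketch below is not drawn from MapR or any customer system: the record layout, the threshold and the function names are all assumptions made for illustration. It plants three outliers in 100,000 made-up records, then compares a scan of every record with a scan of a 1% random sample.

```python
import random


def make_records(n, anomaly_ids, seed=0):
    """Build n hypothetical (record_id, amount) pairs with a few planted outliers."""
    rng = random.Random(seed)
    records = []
    for record_id in range(n):
        # Planted anomalies get an obviously extreme amount; the rest are ordinary.
        amount = 10_000.0 if record_id in anomaly_ids else rng.uniform(10.0, 100.0)
        records.append((record_id, amount))
    return records


def find_anomalies(records, threshold=1_000.0):
    """Return the ids of records whose amount exceeds an assumed threshold."""
    return {record_id for record_id, amount in records if amount > threshold}


anomaly_ids = {123, 45_678, 99_999}
records = make_records(100_000, anomaly_ids)

# Scanning the whole population is guaranteed to see every planted anomaly.
full_scan = find_anomalies(records)

# A 1% random sample only sees the anomalies that happen to be drawn.
sample = random.Random(1).sample(records, 1_000)
sampled = find_anomalies(sample)

print(f"full scan found {len(full_scan)} anomalies, sample found {len(sampled)}")
```

With only three anomalies among 100,000 records, each one has a 1% chance of landing in the sample, so a sampled analysis will usually miss most or all of them, while the full scan cannot.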

Handling sensitive data

There can be problems when the data concerned is sensitive or personal. Norris says that for financial services customers that need help with modelling, MapR has produced synthetic versions of data, a process described by the firm’s chief applications architect Ted Dunning in a book he co-authored, entitled Sharing big data safely. “They described the dataset, but no private information and no details entered our office, just descriptions of what that data looked like,” says Norris. Using this, MapR was able to develop a model with the customer.
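As a rough idea of what working from “descriptions of what that data looked like” could involve, the sketch below generates made-up rows from a description alone. It is not the method set out in Dunning’s book or used by MapR: the schema format, the column names and the `synthesize` function are hypothetical, and real synthetic data would also need to mimic the distributions and correlations of the original.

```python
import random

# Hypothetical description of a dataset: column names, types and ranges only.
# No real records are involved at any point.
SCHEMA = {
    "age": {"type": "int", "min": 18, "max": 90},
    "balance": {"type": "float", "min": 0.0, "max": 50_000.0},
    "account_type": {"type": "category", "values": ["current", "savings", "loan"]},
}


def synthesize(schema, n_rows, seed=0):
    """Generate n_rows of synthetic data that match the described shape."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        row = {}
        for name, spec in schema.items():
            if spec["type"] == "int":
                row[name] = rng.randint(spec["min"], spec["max"])
            elif spec["type"] == "float":
                row[name] = round(rng.uniform(spec["min"], spec["max"]), 2)
            elif spec["type"] == "category":
                row[name] = rng.choice(spec["values"])
            else:
                raise ValueError(f"unsupported column type: {spec['type']}")
        rows.append(row)
    return rows


rows = synthesize(SCHEMA, 5)
for row in rows:
    print(row)
```

A model developed against rows like these can later be run by the data owner on the real dataset, so the sensitive records never have to leave the owner’s premises.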

Ken Heafield, a lecturer in data science at University of Edinburgh’s school of informatics, says there are other potential uses for Hadoop in the state sector. “A really nice use of Hadoop is exploiting value in one’s archives,” he says.

But he adds that some see it as the default option for any dataset that is too big to fit on a spreadsheet. “There are much simpler tools for processing data that is about a gigabyte in size,” he says, such as commercial software. “Maybe in the terabyte range, which does exist in the public sector, then Hadoop becomes the appropriate tool.

“The definition of big data I like to give to my students is: When you’ve done a reasonable job at optimising on a single machine and it’s still too slow.”

Hadoop alternatives

Heafield says that even for those handling big data there are alternatives to Hadoop, which is best suited for use with data stored on disks. Apache Spark is one.

“Spark really shines if all of your data will fit into RAM on the cluster, which is true for a lot of problems,” he says. “If the data can be processed in RAM, Spark will be much faster than Hadoop.”
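The difference Heafield describes can be sketched in plain Python. The code below uses neither the Hadoop nor the Spark API; the function names and the sample lines are assumptions for illustration. It runs the same word count twice: once writing the intermediate results to a file between stages, in the way Hadoop’s MapReduce persists data to disk, and once passing them directly in memory, which is the pattern Spark favours.

```python
import json
import os
import tempfile
from collections import Counter

# Made-up input lines, standing in for a much larger dataset.
LINES = ["tax return filed", "tax refund issued", "return processed"]


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    return [(word, 1) for line in lines for word in line.split()]


def reduce_phase(pairs):
    """Sum the counts for each word."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


def run_on_disk(lines):
    """Write intermediate pairs to a file, then read them back before reducing."""
    pairs = map_phase(lines)
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(pairs, f)
        with open(path) as f:
            pairs_from_disk = [tuple(p) for p in json.load(f)]
    finally:
        os.remove(path)
    return reduce_phase(pairs_from_disk)


def run_in_memory(lines):
    """Hand intermediate pairs straight to the next stage without touching disk."""
    return reduce_phase(map_phase(lines))


disk_counts = run_on_disk(LINES)
memory_counts = run_in_memory(LINES)
print(disk_counts == memory_counts, memory_counts)
```

Both routes give the same answer. The disk route pays for a write and a read at every stage boundary, a cost that grows with the number of stages and the volume of data, but it does not require the working set to fit in the cluster’s RAM.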

GCHQ lists Spark as one of its technologies, and Cloudera, Hortonworks and MapR have all brought the software into what they offer. They have made similar moves with Apache Storm, which is designed to process streaming data in real time.

The university’s director of partnerships, David Richardson, warns that the state sector’s use of Hadoop and related software may be slowed by a shortage of appropriately trained staff. “We put out a number of graduates with skills in this area, but a lot more are required,” he says.

The best results come from those who can understand both Hadoop and its alternatives and also the nature of specialist public sector data, such as NHS medical records. Such people are “incredibly valuable”, says Heafield, who believes that government organisations will benefit from having such staff in-house to commission and manage big data projects.
