Is Hadoop fit for government IT purpose?

Privacy campaigners have reacted with alarm to a story that the Home Office will use Hadoop to centralise all its databases.

The story, written by Alexander Martin and originally published on The Register, was taken up last week, though without attribution, by the Financial Times.

Its civil libertarian thrust is that the government might be planning to use the open source Hadoop software stack to create one new “mega database” to rule them all, and the citizens of the UK, without their knowledge or consent.

The Register elicited an outraged comment from Liberal Democrat leader Tim Farron to the effect that it was “simply unacceptable” to smuggle such a substantial data storage “replatforming” change under the Technology Platforms for Tomorrow (TPT) programme, which is led by Sarah Wilkinson, chief technology officer at the Home Office.

The FT additionally sourced concerned comment from a privacy lawyer and Privacy International.

The original story is based on sight of a presentation given to a Hadoop Users Group UK meeting earlier this year.

It raises the broad question: “is Hadoop fit for government IT use?” Is there enough governance and security built into it for use in government data stores?

The Hadoop Distributed File System (HDFS) is not a database. It is a file store. A simple analogy would be to say it is like the C drive on a Windows PC (I owe this analogy to Kevin Long, a data consultant formerly of Teradata). But, combined with other elements of the Hadoop family of technologies, it can be used to store and interrogate data of volumes, varieties, and velocities beyond the competence of relational databases. However, doing so is not simple and requires expensive data engineers, as well as data scientists to do the analysis.

This is why organisations such as banks, telecoms providers, and gaming web sites have been building data science teams to exploit the Hadoop ecosystem, which is now in its tenth year. It would be odd indeed if government departments were not using it, for science projects if not for full-blown production use cases.

Indeed, Wayne Horkan, senior enterprise architect at the Home Office, said in September last year, at the opening of Hadoop distributor Hortonworks’ new office in the City of London, that his team, which is mainly focused on data use for UK border control, had been exploring open source Hadoop. He added that they saw the supplier’s commitment to pure open source as “what we’ve enjoyed … it protects us from vendor lock in”.

The crucial thing for him with Hadoop was, he said, “re-scheming, breaking away from legacy relational thinking. That [what] kills us.

“[We] have a large number of existing databases and sets of data that we would like to bring together and get as much intelligence from as possible”.

(There is a full transcript of Wayne Horkan’s description of his team’s work in an article by Derek du Preez on the Diginomica web site.)

Cloudera and HMRC

Cloudera, another commercial Hadoop distributor, has worked with a second UK government department, HMRC, which has used the supplier’s “Enterprise Data Hub” for the department’s data analytics Connect programme, spending £7.4m in the first quarter of 2015.

Amr Awadallah, chief technology officer at Cloudera, told Computer Weekly in December 2013 – off the record at that time – that though their European paying customers were drawn mostly from web companies and telecoms, they had been talking to the Government Digital Service (GDS) in the UK, as well as Mark Dearnley, the then recently appointed CIO of HMRC. Those conversations may have resulted in the £7.4m engagement reported in 2015. Awadallah also said their second “most advanced” customer in the US was the federal government.

The governance and security features of Cloudera’s “enterprise data hub” – based on Hadoop, but not reducible to it – had, he said, largely been added at the behest of IT professionals at customer organisations, especially in US finance, healthcare and government. “Early customers pushed us very hard on security. I’d like to say a lot of that came from our thinking, and we are geniuses, but no, it came from them – features like access control, data lineage, how data is transformed over time, and back-up and recovery came from them pushing us”.

The other main Hadoop distributors – Hortonworks and MapR – will make similar cases. But as deployment of the Hadoop ecosystem in government IT goes beyond science and experiment to production systems, the issue of governance will come back again and again. And its vulnerability to critique on civil liberties grounds – as a Silicon Valley-originated open source technology more extensive than relational databases in its reach to unstructured data, such as photos, audio, and text – does make Hadoop’s fitness for government IT use controversial.