Microsoft gets flak over "rubbish" UK data

Data experts and government officials have fingered Microsoft’s popular Excel spreadsheet as a source of gremlins troubling the UK government’s reform programme.

Four years after the coalition government began its attempt to create the most open and transparent democracy in the world, technical problems persist.

With only half a year until the end of its term, the coalition may leave government with its flagship transparency reform in a bodge.

It sought simply to publish public records on the internet as open data: a form people could scrutinize easily using computers. It would create an army of “armchair auditors” who would hold public bodies to account.

But the scheme has languished since its launch four years ago: irregularities still plague the data, making it difficult for all but computer experts to scrutinize it.

Traced to its source, the bug leads to the same problem that upset other coalition transparency reforms.

That was vested interests. As it happens, prime minister David Cameron’s plan for government was all about dismantling vested interests. His reforms were humbled by the same vested interests they sought to undo.

The gremlins that infested his government’s data came from a forsaken backwater of computing called character encoding (apparently overlooked in the coalition government’s plans). The Cabinet Office admitted character encoding was still a problem for the most crucial part of its transparency reforms after the issue was exposed in Computer Weekly last week.

Even the World Wide Web Consortium (W3C), the high temple of the global computing movement that inspired Cameron’s transparency reforms, has struggled with encoding.

In a nutshell

The encoding problem was thus: the government had no standard way to encode data – no standard way to take the letters and numbers people read on screens and to represent them in codes computers could handle. So its attempts to publish public spending were flawed by the incompatibility of the data they released.

The Department for Work and Pensions, for example, publishes around a quarter of a million spending records every year.

The whole point of doing this was to help private companies compete over public services and contracts; and to help patriotic citizens terrorize public bodies with awkward questions about the minutiae of their budgetary records.

The entire initiative would be futile if the public could not easily draw meaning from that data. But the data was being released in batches that were incompatible with one another.

One of the reasons for this shambles was encoding. It was illustrated by a similar problem W3C had with video formats. Software vendors cornered the digital video market with “proprietary” video codecs. That is, they claimed property rights over the codes that represented to a computer the images people see. Market forces gave their codecs power. That power forced people to use their video encodings rather than anybody else’s. And that undermined the principle of the web that communication would be uninhibited by vested interests. That was where the coalition government was coming from with its own policy to deliver open data. It had to be uninhibited.

Cameron and W3C wanted it to be like pen and paper. Imagine if it was the other way around. Imagine a world that had “proprietary” pens.

You would sit down to write in lament of vested interests only to find your Bic pen, say, would only write on Bic paper. So instead of writing poetry you’d be shaking out your piggy bank or breaking rocks to get money to buy Bic’s proprietary paper.

The Conservative Party formulated a computer-led reform plan that would tolerate no proprietary claims over the vehicles of digital communication.

In reality, their plan presumed to defy global commercial forces: companies like Adobe and Microsoft, with a global base of customers already tied into using their proprietary formats. A laissez faire legislator like Cameron would move these vested interests as he might move a sand dune with his hands, and some blowing and coaxing.

But that was a bit of a side-show really. It got more attention than it deserved because office documents were something everyone could relate to.

Cameron’s liberation policy was concerned primarily with the only thing his government had rights over itself: its own data.

Labour <=> Conservative

The idea, as presented by the prime minister: transparent budgets and open data would make government more efficient and accountable. Costs would be cut. Plebs would be empowered.

That was actually how Gordon Brown, the last prime minister, put it just before he got voted out of office in 2010.

You could swap his name with Cameron’s and (largely) not tell the difference in what they said.

You could trace Brown’s data liberty spiel to the same wellspring as Cameron’s: national nerd hero and world-wide web founder Sir Tim Berners Lee.

Both prime ministers took up the cause Berners Lee had dedicated his life to: the common basis of communicating, sharing and combining data that was the foundation of the world-wide web. Out of respect for him, they made this a cause of national pride and the basis of reform.

Hence prime ministers Brown and Cameron planned for government data to be “linked”, in the way Berners Lee had been urging it should be.

That meant it had to be possible to take any bucket of data and combine it with any other, and to arrange the lot in any way your fancy chose. That meant the data had to be good quality. It had to be comprehensible to computers.

Between them, Brown and Cameron set Berners Lee up with a £10m office called the Open Data Institute, to aid UK policy implementation of his ideas. A sort of colonial office of the W3C (of which Berners Lee is founding director), it was going to make sure UK data lived up to the reform spiel.

Cameron kicked it all off in 2010 by publishing public spending records as open data. Four years later, that data is effectively incomprehensible. The ODI is still trying to make it linkable. Britain’s aspirations to be the most open and transparent government in the world, the world leader in open data, the most efficient, open and responsive government in the world, are therefore still a work in progress.

Some of the most prominent government data experts confirmed what the data itself had already said about its own poor quality.


UK spending data was “horrendous”, Jeni Tennison, technical director of Berners Lee’s Open Data Institute, told Computer Weekly.

“It’s ridiculous,” she said.

Even when computer experts tried to link this data they had to jump through such hoops that it was “shocking”, said Tennison, who got an OBE for her work last year and sits on the Cabinet Office Open Standards Board and Open Data Panel.

“It isn’t like we are in a state where the data is basically okay and it just takes a bit of effort to put it together. We are talking about a state where it’s basically rubbish,” she said.

Companies that set themselves up to do innovative things with UK spending data had to spend 80 per cent of their time simply tidying it up so they could even start to work with it.

UK spending data was rubbish because it had incompatible encoding. Staff were largely powerless to do anything about it, because their software was at fault.

Microsoft’s Excel spreadsheet has got most of the blame for this.

The problem, according to Tennison and other experts, and just about any forum that addresses the subject online, was Microsoft’s atrocious handling of UTF-8, the character encoding widely favoured as the lingua franca of open data.

UTF-8 became the encoding of choice for the UK government as well as the world wide web. But most of government was using Microsoft software. Microsoft’s UTF-8 incompatibilities have long been condemned by experts. The problem was inherent to both Microsoft Windows and its applications, most notably Excel. Users could circumvent it by following complicated instructions. But the workarounds were arduous. This was problematic for government, where most staff used Microsoft software but were apparently not shown how to get it to work with UTF-8. More recent versions of Microsoft software employed encodings related to UTF-8 but not compatible with it.

“Popular spreadsheet applications”, as Tennison put it, made it hard for users to encode their data in a format that would be universally compatible.

Technical obstacles

“When you export from popular spreadsheet applications you don’t get control over encoding and it usually chooses a bad one,” she said. “It usually won’t be UTF-8. It will usually be something like Windows 1252.”

Windows 1252 was an old, proprietary Microsoft encoding. The result, said Tennison, was the data contained characters incomprehensible to other people and programs. Their systems – unless they were using Microsoft Excel on a Microsoft Windows computer – interpreted the incomprehensible characters as “garbage”.
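The effect Tennison describes can be reproduced in a few lines of Python. This is only a sketch: the supplier name is hypothetical and was chosen because it contains one non-ASCII character, and the helper `read_as_utf8` is an illustrative name, not part of any government tool.

```python
# Hypothetical supplier name containing one non-ASCII character.
name = "Béla Catering Ltd"

# What a Windows-1252 export writes: "é" becomes the single byte 0xE9.
cp1252_bytes = name.encode("cp1252")

# What a UTF-8 export writes: "é" becomes the two bytes 0xC3 0xA9.
utf8_bytes = name.encode("utf-8")


def read_as_utf8(raw: bytes) -> str:
    """Decode bytes as a UTF-8-only consumer would, marking invalid bytes."""
    return raw.decode("utf-8", errors="replace")


# Windows-1252 bytes read as UTF-8: the lone 0xE9 is invalid and is
# replaced by U+FFFD, the replacement character.
garbled = read_as_utf8(cp1252_bytes)

# UTF-8 bytes read as Windows-1252: each byte becomes its own character.
mojibake = utf8_bytes.decode("cp1252")

print(garbled)
print(mojibake)
```

Either way, the reader ends up with a name that no longer matches the one the publisher typed.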

“It can cause problems matching stuff up,” she said. “If you have the name correct in some data and not in other data then you can’t match those two names together. And therefore you can’t put the data together accurately.”
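A rough Python sketch of the matching problem, under stated assumptions: the supplier name and amounts are invented, and `repair` and `combine` are hypothetical helpers. The `repair` function only undoes one specific mix-up (UTF-8 text read as Windows-1252) and is a heuristic rather than a general cure.

```python
# Two hypothetical spending extracts that should refer to the same supplier.
# The name uses a typographic apostrophe (U+2019).
clean = {"Smith’s Cleaning Ltd": 1200}

# The second extract was written as UTF-8 but read back as Windows-1252,
# so the apostrophe arrives as three stray characters.
damaged_name = "Smith’s Cleaning Ltd".encode("utf-8").decode("cp1252")
damaged = {damaged_name: 800}


def repair(text: str) -> str:
    """Undo a UTF-8-read-as-Windows-1252 mix-up; otherwise return text as-is."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def combine(a: dict, b: dict) -> dict:
    """Sum amounts per supplier name across two extracts."""
    totals = dict(a)
    for supplier, amount in b.items():
        totals[supplier] = totals.get(supplier, 0) + amount
    return totals


# Without repair, one supplier is counted as two.
naive = combine(clean, damaged)

# With repair, the two records line up under one name.
fixed = combine(clean, {repair(k): v for k, v in damaged.items()})

print(len(naive), len(fixed))
```

The repair step is exactly the kind of tidying that has to happen before any linking can start.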

Ian Makgill, managing director of Spend Network, a start-up trying to clean up government spending data, concurred with the ODI.

“A lot of the problems are with Microsoft Excel not being able to output open [data] because it likes proprietary formats,” he said.

“It’s damaging. Microsoft’s handling of these things is a problem. Different versions of Microsoft Excel have different formats.

“They default to proprietary formats… because that makes data available in other products,” said Makgill, who is regularly cited by other prominent UK data experts as the leading authority on government spending data quality.

Makgill and other experts said the Microsoft problem was not only its handling of UTF-8, but the difficulties it created for people who wanted to publish their open data in a universally compatible file format. HM Treasury said in 2010 its open data should be published in .csv file format (comma-separated values). But Microsoft didn’t handle this most simple of file formats well. This had further helped degrade the UK’s open data quality.

Hushed words

Computer Weekly learned through an unofficial government channel that the UK Cabinet Office, which is responsible for the UK’s open source, open data and open standards policy, also blamed Microsoft’s software for hindering its work.

“There are several issues with saving UTF-8-compliant .csv files from Excel,” said a source close to the Cabinet Office.

Another Cabinet Office source said government data was going out with mistranslated pound signs after being exported by Excel. Government guidance in 2010 said departments should leave pound signs off their payment amounts. But departments still put them in. So their output was garbled. Makgill said apostrophes caused similar problems.
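A short Python sketch of how a pound sign gets mistranslated, and of one way a data user might still recover the amount. The payment value is invented and `parse_amount` is a hypothetical helper; it assumes amounts use a dot as the decimal point and commas only as thousands separators.

```python
import re
from decimal import Decimal


def parse_amount(cell: str) -> Decimal:
    """Keep only digits, the decimal point and a minus sign, then parse."""
    digits = re.sub(r"[^0-9.\-]", "", cell)
    return Decimal(digits)


# The same invented payment as it might appear in three exported files.
as_intended = "£1,250.00"

# Written as UTF-8, read as Windows-1252: a stray "Â" appears before "£".
utf8_read_as_cp1252 = as_intended.encode("utf-8").decode("cp1252")

# Written as Windows-1252, read as UTF-8: "£" becomes U+FFFD.
cp1252_read_as_utf8 = as_intended.encode("cp1252").decode("utf-8", errors="replace")

for cell in (as_intended, utf8_read_as_cp1252, cp1252_read_as_utf8):
    print(repr(cell), parse_amount(cell))
```

Leaving the pound sign out of the amount column, as the 2010 guidance asked, avoids the problem at source.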

These hushed words, by the way, were from officials in a government that stands for transparency. Its transparency only applies in areas where it is in its own interest to cause disruption. That does not extend to itself.

“Microsoft data files are always a bit of a challenge,” said Harvey Lewis, head of data analytics at consulting firm Deloitte.

But data quality was not a big issue for Lewis.

The government had rushed its data out in 2010 in response to Sir Tim Berners Lee’s famous geek plea for “raw data now!”, made at a 2009 conference for the sci-tech elite in California.

The government had always intended to get its data out first and then clean it up later.

And, said Lewis, open data had been for government primarily an innovation policy – a means to stimulate the economy. If companies like Spend Network were to thrive by selling linked data services built on government data that had to be cleaned up before it could be linked, the public might have to accept that government would go on spewing out raw data.

Treasury oopsy

HM Treasury did indeed tell civil servants they should publish data now and perfect it later. It even referenced Sir Tim’s own advice.

“The focus of the guidance is on how, pragmatically, to make the data available quickly rather than seeking to achieve full alignment across every entity,” it said.

“Publishing raw data quickly is an immediate priority, but we are working towards producing structured, regularly updated data published using open standards,” it said.

People in and around the Cabinet Office said the ongoing problem is that people don’t know how to persuade their Microsoft software to output in a universally compatible format. Four years on, they still needed training. And UK data was still rubbish.
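For publishers who cannot change what their spreadsheet exports, one workaround is to convert the file afterwards. The Python sketch below assumes the export is known to be Windows-1252; `reencode_csv` is a hypothetical helper and the sample row is invented.

```python
def reencode_csv(data: bytes, src_encoding: str = "cp1252") -> bytes:
    """Decode CSV bytes from src_encoding and re-encode them as UTF-8.

    Raises UnicodeDecodeError if the bytes are not valid in src_encoding.
    """
    text = data.decode(src_encoding)
    return text.encode("utf-8")


# An invented two-line export, as a Windows-1252 spreadsheet might write it.
exported = "Supplier,Amount\r\nBéla Catering Ltd,£1250.00\r\n".encode("cp1252")

# The same content, now readable by any UTF-8 consumer.
converted = reencode_csv(exported)

print(converted.decode("utf-8"))
```

The conversion is only safe when the source encoding is known. Guessing it from the bytes alone is unreliable, which is part of why mixed batches are so hard to clean up after the fact.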

But HM Treasury, overseen by the National Archives, established conditions for their own data initiative to struggle when they issued the guidance that set it off in 2010. They instructed government officers to publish their data in a standard Microsoft Windows encoding. It assumed they would be using Microsoft software. It imagined alternative encodings as a future possibility.

Bigger picture

Even the W3C has meanwhile struggled to establish UTF-8 as a standard way of encoding .csv files on the web.

It set up a working group last December that won’t publish its conclusions until August 2015. It does have more to contend with than character encodings. But character encoding was one of its most thorny issues, said Tennison, who co-chairs the CSV on the Web Working Group, which is addressing the issue for the W3C.

“We are leaning in the direction of UTF-8,” she said. “It should be UTF-8”.

ODI has simultaneously been trying to persuade government departments to clean their data up using a tool it produced, and to join a certification scheme to improve other elements of their data publications. Departments have shown little interest, despite the poor state of government data.

Vested Interests

Some departments have been so reluctant to even release data that Spend Network had to demand its release under Freedom of Information law. Wigan Council would only release spend data after the Information Commissioner intervened. The Ministry of Justice fought all the way to an Information Tribunal.

A similar initiative by London Mayor Boris Johnson floundered for six years because civil servants refused to allow their data to be published. The open data initiative was part of the Conservative Party’s plan to break up the public sector. Gordon Brown’s proposals were not dissimilar.

Johnson put it in his 2008 manifesto with help from Cameron’s campaign team. His Greater London Authority’s Oversight Committee said last June something ought to be done about London’s poor spending transparency. Civil servants were not co-operating. It traced the problem to the vested interests of the companies whose business dealings were exposed in the spending records. Civil servants might also have had an interest in non-co-operation with the means of their own demise. The coalition plan has aimed for 80 per cent cuts in operational jobs in the civil service.

The coalition claimed on coming to government that its primary interest was challenging the vested interests of corporate IT suppliers. Those interests have prevented it even publishing its own data effectively. Its grander plan to challenge vested interests it saw in the public sector was consequently obstructed.

Microsoft would not talk about either UTF-8 encoding or its problem with .csv files.

“Modern versions [of Microsoft software] support the most popular standard document formats including PDF, ODF, and Open XML,” it said in a written statement.

This, it said, meant applications such as Excel would export “to other programmes which use open standards”. It said people should contact their Microsoft supplier if they had any issues.