Businesses dig for treasure in open data

Open data, a movement which promises access to vast swaths of information held by public bodies, has started getting its hands dirty, or rather its feet.

Before a spade goes in the ground, construction and civil engineering projects face a great unknown: what is down there? In the UK, should someone discover anything of archaeological importance, a project can be halted – sometimes for months – while researchers study the site and remove artefacts.

Research by the City of London revealed the average cost of archaeological work in its jurisdiction is between 1% and 3% of total construction costs. However, the local authority says: “The archaeological cost exceeds 3% only in very rare circumstances but more frequently it is significantly less than 1%.”

For projects worth hundreds of millions – if not billions – of pounds, developers must accept this wide uncertainty in their costs as part of the price of creating housing, commercial buildings and infrastructure. Any narrowing of this uncertainty could help in planning and funding these projects, and this is where researchers in open data now believe they can help.

Open data analytics to aid archaeologists

During an open innovation day hosted by the Science and Technology Facilities Council (STFC) in June 2014, open data services and technology firm Democrata proposed that analytics could predict the likelihood of unearthing an archaeological find in any given location. This would help developers understand the likely risks to construction and would assist archaeologists in targeting digs more accurately. The idea was inspired by a presentation at the event from the UK's Archaeology Data Service.

The proposal won support from the STFC which, together with IBM, provided a nine-strong development team and access to the Hartree Centre’s supercomputer – a 131,000-core high-performance facility. For natural language processing of historic documents, the system uses two components of IBM’s Watson – the AI service that famously won the US TV quiz show Jeopardy!. The system also uses SPSS modelling software, the R language for algorithm development and Hadoop data repositories.

Democrata co-founder and CEO Geoff Roberts says if a large engineering or building company hits a piece of archaeology that can stop its project, it can have a huge impact on costs and the project's length.

“What we are trying to do is bring prediction and visualisation into the equation, to show where there might be archaeology," he says. "They are the outputs. There is prior art in this at a very small level by academics, but no one has approached this with a big data and big analytics frame of mind.”

The proof of concept draws together data from the University of York's archaeological data, the Department of the Environment, English Heritage, Scottish Natural Heritage, Ordnance Survey, Forestry Commission, Office for National Statistics, the Land Registry and others.

Looking at data in a different way

The first challenge was getting to grips with these disparate sources, says Democrata co-founder and former SAS chief technology officer for the UK and Ireland John Morton.

“The data has usually been created for one specific business purpose," he says. "This is not a new problem – it is something you get in large corporations. But you need to look at the data in a slightly different way for a new purpose, so you start to look at completeness. 

"The challenge is, if there are not any values, what is the default? It opens up the need to have standards and specifications around the data. This is one of the challenges that some organisations have – they start to think the data is not good enough when actually it probably is."

Morton says the team was able to tell organisations where some data sets had not been digitised, helping them understand their own data. In fact, some data from the Archaeology Data Service was corrupt, with incomplete CSV files, which the project helped to identify and correct.
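The kind of completeness check Morton describes can be illustrated with a short Python sketch. The article does not say how the team inspected its files, so this is only one plausible approach: the function name, the column names and the sample rows are all invented for illustration.

```python
import csv
import io


def completeness_report(csv_text, columns):
    """Return (fill rate per column, line numbers of malformed rows).

    A row counts as malformed when its field count differs from the header,
    which is one simple symptom of a truncated or incomplete CSV file.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    filled = {name: 0 for name in columns}
    malformed = []
    total = 0
    for row in reader:
        if len(row) != len(header):
            malformed.append(reader.line_num)
            continue
        total += 1
        record = dict(zip(header, row))
        for name in columns:
            # An empty or whitespace-only cell is treated as a missing value.
            if record.get(name, "").strip():
                filled[name] += 1
    rates = {name: (filled[name] / total if total else 0.0) for name in columns}
    return rates, malformed


# Invented sample: row 2 has no period, row 3 is cut short, row 4 has no coordinates.
sample = (
    "site_id,period,easting,northing\n"
    "1,Roman,460000,452000\n"
    "2,,460500,452300\n"
    "3,Bronze Age,4605\n"
    "4,Medieval,,\n"
)

rates, malformed = completeness_report(sample, ["site_id", "period", "easting"])
print(rates)
print(malformed)
```

A report like this makes the "what is the default?" question concrete: it shows which columns have gaps before anyone decides how to fill them.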

Meanwhile, the team also needed to do some coding to help integrate the data. For example, some organisations described locations using post codes, while others used grid references.
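As an illustration of that integration work, the sketch below converts a grid reference into numeric coordinates that can be compared across data sets. It assumes the references follow the Ordnance Survey National Grid convention of two letters plus an even number of digits. The article does not specify the format or the team's code, so the function name is hypothetical. Postcodes would need a separate lookup table, which is not shown.

```python
def grid_ref_to_easting_northing(grid_ref):
    """Convert an OS National Grid reference such as 'TQ 30 80' to metres.

    Returns the (easting, northing) of the south-west corner of the square.
    """
    ref = grid_ref.replace(" ", "").upper()
    letters, digits = ref[:2], ref[2:]
    if len(letters) != 2 or not all("A" <= c <= "Z" and c != "I" for c in letters):
        raise ValueError(f"bad grid letters in {grid_ref!r}")
    if len(digits) % 2 or len(digits) > 10 or (digits and not digits.isdecimal()):
        raise ValueError(f"bad grid digits in {grid_ref!r}")

    # Letter positions in a 25-letter alphabet (the grid skips 'I').
    l1 = ord(letters[0]) - ord("A")
    l2 = ord(letters[1]) - ord("A")
    if l1 > 7:
        l1 -= 1
    if l2 > 7:
        l2 -= 1

    # Index of the 100 km square, counted from the grid's false origin.
    e100km = ((l1 - 2) % 5) * 5 + (l2 % 5)
    n100km = (19 - (l1 // 5) * 5) - (l2 // 5)
    if not (0 <= e100km <= 6 and 0 <= n100km <= 12):
        raise ValueError(f"grid square outside the national grid in {grid_ref!r}")

    # Pad each half to five digits so a coarse reference maps to whole metres.
    half = len(digits) // 2
    easting = e100km * 100_000 + int(digits[:half].ljust(5, "0"))
    northing = n100km * 100_000 + int(digits[half:].ljust(5, "0"))
    return easting, northing


print(grid_ref_to_easting_northing("TQ 30 80"))      # (530000, 180000)
print(grid_ref_to_easting_northing("SE 6000 5200"))  # (460000, 452000)
```

Once every record carries an easting and northing, distances between sites, roads and water sources can be calculated with ordinary arithmetic.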

The system analyses sets of indicators of archaeology, including historic population dispersal trends, specific geology, flora and fauna considerations, as well as proximity to a water source, a trail or road, standing stones and other archaeological sites. Earlier studies created a list of 45 indicators, which was whittled down to seven for the proof of concept. The team used logistic regression to assess the relationship between input variables and come up with its prediction.
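For readers unfamiliar with logistic regression, a toy version is sketched below. It is not the team's model, which was built in R and SPSS on seven indicators. This is plain Python with two made-up indicators (distance to water and distance to a known site, in kilometres), and the training data is invented purely to show the mechanics.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_probability(weights, bias, x):
    """Estimated probability of a find for one location's indicator values."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic(rows, labels, learning_rate=0.1, epochs=2000):
    """Fit weights by batch gradient descent on the logistic (log) loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = predict_probability(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Invented training data: [km to water, km to nearest known site]; 1 = find, 0 = none.
rows = [[0.2, 0.5], [0.5, 0.3], [0.8, 1.0], [2.5, 3.0], [3.0, 2.2], [4.0, 3.5]]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = fit_logistic(rows, labels)
print(predict_probability(weights, bias, [0.3, 0.4]))  # location close to both
print(predict_probability(weights, bias, [3.5, 3.0]))  # location far from both
```

The output is a probability rather than a yes or no, which is what lets a developer weigh the risk at a given site against the cost of investigating it.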

The initial project funding was £25,000, excluding the computing power and skills donated to it. Engineering firms, environmental impact assessors and archaeologists will be invited to review the proof of concept, initially covering England and Wales, towards the end of January 2015, Roberts says.

EU puts €14.4m into big data

Both the UK government and the EU gave the open data movement a boost in late 2014. In November, the EU dedicated €14.4m (£11m) to three open data initiatives. Meanwhile, UK Cabinet Office minister Francis Maude says the Environment Agency will be releasing environmental data relevant to flood risk, including 15-minute readings from every river level sensor in the UK (see box).

Data deluge to combat flood risks

Following the flooding of early 2014, the UK government brought together 200 software developers and computer programmers to access data previously only available at a cost to a small number of insurance companies. The data sets included 15-minute readings from every river level sensor in the UK, according to Cabinet Office minister Francis Maude.

“Within two days, they came up with a range of solutions to help – from a phone service that connects people with their energy supplier in a power cut, to an app that alerts Twitter users to local volunteering opportunities,” he says.

In December 2014, Maude announced these data sets would be available to all as open data, bringing forward support funding from the Cabinet Office in the hope that technology firms will get information about flooding to the public faster and in a way people find useful and engaging.

“There’s huge potential for technology ‘mash-ups’ between flood data and, for example, Google Maps, making it more accessible and easier to use. This is just the beginning,” he says.

Forrester principal analyst Jennifer Bélissent says the push toward open data began with government austerity and a desire to be seen to be transparent and accountable. Since then, organisations both public and private have been looking to exploit open data, she says. For example, retailers are looking more seriously at demographic data before opening stores, while estate agents can provide house buyers with aggregated data from sources including crime statistics, school ratings and local average salaries.

But it is an area that needs to mature before government and private bodies can start to use open data and know what they are going to get out of each project, says Bélissent. “One of the challenges is understanding value. Developing a business case and return on investment [ROI] for open data is difficult. There is still a lot of experimentation going on.”

According to Bélissent, public sector organisations wanting to exploit open data to direct policy and manage services face shortcomings in technology, process and governance.

“They may say they have a strategy but then you ask if they have processes in place and fewer say they do," she says. "Then you ask if they have the proper architecture to best leverage the data. The answer is fewer still. Then you ask if they can measure ROI and fewer can. Lastly, you ask if they are good at hiring and retaining talent to do this and maybe only a quarter say they do. To get value from their open data they need to leverage best practices.”

Third parties are likely to step in, as government bodies are not inclined to make data easy to digest, says Sean Owen, director of data science for Europe at Cloudera, a big data platform provider.

“If it is safe to release data, generally speaking, public bodies are fine to do so," he says. "The issue is maintaining and updating it. In an era of budget cuts where there is no direct incentive for the agency to bother too much, the tendency is to dump data and put it online. Maybe that is fine. Maybe we don’t need the government to update every application programming interface, or do formatting and indexing. We can leave that to third-party vendors who can make sense of it.”

Like archaeology, open data can be messy and offers no certain returns. But with public bodies willing to release data, and third-party suppliers keen to help businesses and government agencies exploit it, the movement is unlikely to be consigned to history.
