I have watched attempts to produce automated means of tracking and tracing the provenance of on-line data for well over a decade - as a succession of snake-oil salesmen have tried to persuade naive users and politicians that their mash-up tools will turn an "on-line waste tip of unvalidated government data files" into something more than e-slurry.
I had hoped to have a speaker on progress with the Semantic Web at the recent "Uncovering the truth" workshop on data quality organised by the Information Society Alliance (EURIM) and the Audit Commission because I had long thought it provides part of the "answer".
However, Sean Barker has suggested that it is little more than the latest excuse for not applying traditional data standards: an expensive academic exercise that will lead nowhere. I therefore asked him to do a "guest blog". I will not comment further and await your comments.
"In a service-based contract, the user effectively hires some equipment, and every so often, sends it back to the supplier for maintenance. In this context, some years ago I was party to a discussion on data quality for feedback data. When a civil servant in the room realised that we wanted to assess government departments, with an oleaginous smile, he observed that as they set the contract, it was up to them what quality of data they would provide. To which an industrialist said, "Yes, indeed, but you realise we will charge you a risk premium against our costs for your bad data?" The oleaginous smile more than disappeared. Data quality is something that has to be measures in pounds and pence. But what has this to do with the Semantic Web?
The Semantic Web is a marvel that will allow you to ask your phone for "a decent Chinese restaurant near Covent Garden tonight", and it will come back with a list of restaurants, ranked by proximity, having filtered out those with poor ratings or no tables. All you need to do then is pick your preference, and it will book for you. If you think an app like this is a long way off, then check out Siri (though only in the US as yet). However, somewhere under the hood are hand-written translators that integrate the various services the app uses - Semantic Web informed perhaps, but not actually Semantic Web.
The myth of the Semantic Web is that it will be the silver bullet that solves all data interoperability problems automagically. The reality is that it will solve a number of very specific problems, but on the Web, what will cripple it is data quality. This is not the simple problem of data errors, but goes to the heart of much that is wrong with the Semantic Web. Computers are always part of a system that involves human goals and aspirations, and yet the semantics of the Semantic Web is only a mathematical exactitude about the relationship between two otherwise undefined symbols. To make those symbols useful, somebody has to use their brain, and make sure that the symbols mean exactly what they are supposed to mean - which is where data quality comes in.
Data quality is about ensuring that data means exactly what it says. Unfortunately that means that everybody who sees the data should understand it in the same way, and on the Web, that means anyone in the world. If you have ever worked on a data integration problem, you know how hard that is to achieve even in a single company. For example, how long is a man-year: 2200 hours (the number of hours I get paid, including holidays)? Or 1700 hours (the hours I'm supposed to be clocked in)? Or 1500 hours (the hours a project manager can expect, after allowing for training, etc.)? Getting such facts wrong wastes hours sorting out the systems that have used them.
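The man-year ambiguity can be put in a few lines of Python. This is only an illustrative sketch: the hour figures are the three quoted above, while the dictionary keys and the function names are assumptions made up for the example, not anything drawn from a real system.

```python
# Hours per man-year under the three readings quoted in the text.
# The key names are hypothetical labels chosen for this sketch.
HOURS_PER_MAN_YEAR = {
    "paid": 2200,        # hours paid, including holidays
    "clocked_in": 1700,  # hours supposed to be clocked in
    "project": 1500,     # hours a project manager can expect
}


def man_years_to_hours(man_years, definition):
    """Convert man-years to hours under an explicitly named definition."""
    return man_years * HOURS_PER_MAN_YEAR[definition]


def spread(man_years):
    """Largest disagreement, in hours, between any two definitions."""
    hours = [man_years_to_hours(man_years, d) for d in HOURS_PER_MAN_YEAR]
    return max(hours) - min(hours)


if __name__ == "__main__":
    # The same "10 man-years" under each reading, then the gap between them.
    for name in HOURS_PER_MAN_YEAR:
        print(name, man_years_to_hours(10, name))
    print("spread", spread(10))
```

Two systems that exchange the bare figure "10 man-years" without naming which reading they use can differ by thousands of hours, and nothing in the data itself will flag it.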
Unfortunately there seem to be far more academic brownie points in writing papers involving complex proofs for obscure points in logic than in solving the poorly characterised problems of data quality, or in training people to understand how they - and other people - actually use data. Which is why I (among others) wonder whether the £40 million being used to set up the new British Institution for Web Science is money well spent.
While I would not want to see academics starve, the government would be better off implementing conventional data standards, such as EDXL for communication between emergency services. And the cost savings for government from actually using existing standards could be enormous. The medium is the message, and in this case, the message "create a Semantic Web Institute" is that data interoperability is a terrifyingly difficult technical problem best solved by academics. It is not. It is a painfully methodical approach to finding out what people say and what exactly they mean by what they say, and then checking whether two people say and mean the same thing. It's more about people than machines, and particularly about understanding precisely what they mean when they say something. For those brought up with the oleaginous obfuscations of "Yes, Minister", this is probably why the Sir Humphreys of the civil service would rather we were distracted by a Web Science Institute."
P.S. From Sean Barker on 29th April - I cite Siri as an example of semantic web type applications. It may be worth adding a comment that they have been bought by Apple.