Are big data and master data management connected?
A former colleague of mine once described the software industry as a fashion business, writes Andy Hayler, and there is nothing more fashionable now than big data. Every conference with big data in the title is packed, and every vendor is frantically updating their PowerPoint slides to tell their big data story (a few are even writing some code to go with their PowerPoint).
This trend has not gone unnoticed in the world of master data management (MDM), but here the link is a little hazy. If you sell databases for a living then there is an obvious connection, but master data (shared data such as customer, product and location) is usually relatively small in volume: a few million product data records would count as large, and only in B2C companies do you get customer records in the tens or sometimes low hundreds of millions, volumes that current database technology handles comfortably.
Big data is about much larger volumes, so large that current databases struggle to handle them. We are not talking about accounting data here, but usually data generated on the web or by machines, such as web logs or sensor data. A jet airliner generates a massive 20TB of diagnostic data an hour. Such volumes are why current databases are straining to cope. In 2003 the largest data warehouse in the world was 30TB in size but, by late 2012, Teradata had 25 customers with petabyte-sized data warehouses, an increase of over 30 times in a decade.
However, while such data volumes are clearly a challenge, it is less clear how connected they are to master data. The Information Difference, an analyst firm of which I am CEO, conducted a survey in late 2012 to try and get some hard data to help peer through the hype.
In the survey, 209 companies shared their views on the subject, split roughly evenly between North American and European companies, plus a few (11%) Asian respondents.
A resounding 77% of companies claimed big data was important to them, not in itself surprising, given the current level of media attention on the subject. What was striking was that no less than 19% of companies had a live, production big data application, with a further 20% due to go live by the end of 2012. Of course the survey respondents were to some extent self-selecting, in that they knew the topic was big data. Even so, this high level of active projects – as distinct from research – was much greater than we had anticipated.
Hadoop used in anger
Of those with active projects, 80% were using Hadoop, evidently the preferred technology for tackling such data. Hadoop is a combination of a distributed file system (HDFS) and a distributed programming model (MapReduce), maintained as an open-source project under the Apache Software Foundation.
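For readers who have not met the MapReduce model, the idea can be sketched in a few lines of plain Python. This is a single-process illustration only, not Hadoop code: the log format, the sample data and the function names are all assumptions made for the example. A real Hadoop job would spread the map and reduce steps across many machines reading from HDFS.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical web log lines in the form "<customer_id> <url>".
LOG_LINES = [
    "c1 /home",
    "c2 /products",
    "c1 /products",
    "c3 /home",
    "c1 /checkout",
]


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (url, 1) pair for each log line."""
    _customer_id, url = line.split()
    yield url, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values for one key into a total."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


HITS = run_job(LOG_LINES)
```

Because each map call looks at one line and each reduce call looks at one key, the work can be split across as many machines as the data requires, which is the property that makes the model attractive for very large volumes.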
The next area the survey went into was just how big is big data? Intriguingly, of the 209 respondents, only 23% had big data applications at over 100TB in size, just 10% over 500TB. Yet the problem is growing at a fair lick: fully 49% said their data volumes were growing annually at 20-50% and a fifth had growth rates over 50%. The same sample had live MDM implementations in 56% of cases, with a further 14% imminent, in line with previous Information Difference surveys. Hence, from the sample base, there were plenty of companies with both MDM and active big data projects.
So, the big question was: are these linked? Fully 59% claim that they are, with only 7% seeing no link between MDM and big data. The survey asked a number of further questions about the likely ways in which the two could interact: you can imagine that an existing MDM hub could provide the (say) customer data that could help drive an analysis of web traffic, perhaps looking for multi-channel behaviour of existing customers.
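As a rough illustration of the first direction, the sketch below uses customer records from a master data hub to pick out existing customers who appear on more than one channel in a stream of web events. It is a toy example in plain Python: the customer IDs, names, channels and the function name are all hypothetical, and a real hub would sit behind a database or service rather than a dictionary.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical extract from an MDM hub: customer ID -> master name.
MASTER_CUSTOMERS = {
    "c1": "Alice Example",
    "c2": "Bob Example",
}

# Hypothetical big data events: (customer ID, channel).
WEB_EVENTS = [
    ("c1", "web"),
    ("c1", "mobile"),
    ("c2", "web"),
    ("c2", "web"),
    ("c9", "mobile"),  # not a known customer, so ignored here
]


def multi_channel_customers(
    events: Iterable[Tuple[str, str]],
    customers: Dict[str, str],
) -> Dict[str, List[str]]:
    """Return known customers seen on two or more distinct channels."""
    channels: Dict[str, Set[str]] = defaultdict(set)
    for customer_id, channel in events:
        if customer_id in customers:  # the hub decides who counts
            channels[customer_id].add(channel)
    return {
        customers[customer_id]: sorted(seen)
        for customer_id, seen in channels.items()
        if len(seen) > 1
    }


RESULT = multi_channel_customers(WEB_EVENTS, MASTER_CUSTOMERS)
```

The master data does the job a dimension table does in a data warehouse: it tells the analysis which of the raw events belong to a customer the business already knows.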
It is also possible to imagine the opposite direction, with big data analysis throwing up new master data that could be fed into a master data hub. It turned out that 67% of survey respondents saw MDM driving big data, rather than the other way around, with just 17% seeing big data producing new master data. The most popular choice was for existing MDM data to help drive big data searches. The survey also asked what the key future requirements would be, and here the most popular request was for the capability to auto-identify master data within big data datasets, such as spotting customer accounts. Just 8% felt it was important that MDM technologies use big data techniques to speed up their processing.
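To make the idea of spotting customer accounts concrete, here is a minimal Python sketch. It assumes, purely for illustration, that account IDs follow the pattern "ACC-" plus six digits and that the hub's known accounts fit in a dictionary; the account names, log lines and function name are invented. Real auto-identification would need fuzzier matching than a single regular expression.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Hypothetical accounts held in an MDM hub: account ID -> master name.
MASTER_ACCOUNTS = {
    "ACC-000123": "Acme Ltd",
    "ACC-000456": "Globex plc",
}

# Assumed account ID format: "ACC-" followed by exactly six digits.
ACCOUNT_PATTERN = re.compile(r"\bACC-\d{6}\b")

# Hypothetical raw records, such as lines from logs or machine feeds.
RAW_RECORDS = [
    "2012-11-02 login ok account=ACC-000123 ip=10.0.0.5",
    "2012-11-02 sensor reading 42.1",
    "2012-11-03 payment ref ACC-000999 declined",
]


def tag_master_data(
    records: Iterable[str],
    accounts: Dict[str, str],
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split account IDs found in raw text into known and unknown."""
    known: List[Tuple[str, str]] = []
    unknown: List[str] = []
    for record in records:
        for account_id in ACCOUNT_PATTERN.findall(record):
            if account_id in accounts:
                known.append((account_id, accounts[account_id]))
            else:
                unknown.append(account_id)
    return known, unknown


KNOWN, UNKNOWN = tag_master_data(RAW_RECORDS, MASTER_ACCOUNTS)
```

The known matches show MDM driving the big data search; the unknown IDs are candidates for the opposite flow, where big data throws up possible new master data for a data steward to review.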
The survey also asked what connection, if any, was perceived between big data and existing initiatives around data quality and data governance. The response was clear: a resounding 94% felt data governance was either “important” or “essential” to big data. It was almost as clear with data quality, with 80% saying data quality was of “key importance” to big data projects. These are certainly worthy intentions, although whether companies really live up to these ideals in practice is less clear: separate Information Difference surveys show that around 30% of companies have no data quality tools at all, and few of the 70% that do have them widely deployed across their enterprises. It is also far from clear that many data quality suppliers today have a compelling big data story.
What does this all mean? Firstly, it is clear that big data is not just hype: a sizeable minority of companies are engaged in big data projects, not just research. It was also striking how strong the desire was to tie big data in with existing initiatives such as data governance and even data quality. It was also clear that MDM is seen as a resource to drive big data analysis. To use data warehousing terminology, MDM can provide the dimensions for big data facts. Perhaps not everything is new after all.
About the author
Andy Hayler is co-founder and CEO of The Information Difference and a keynote speaker at conferences on master data management, data governance and data quality. He is also a restaurant critic and author (www.andyhayler.com).