Big data and master data management more coupled than expected

Master data management provides the dimensions for big data facts, as a slew of big data projects highlight a link to data governance and quality

Are big data and master data management connected?

A former colleague of mine once described the software industry as a fashion business, writes Andy Hayler, and there is nothing more fashionable now than big data. Every conference with big data in the title is packed, and every vendor is frantically updating their PowerPoint slides to tell their big data story (a few are even writing some code to go with their PowerPoint).

This trend has not gone unnoticed in the world of master data management (MDM), but here the link is a little hazy. If you sell databases for a living then there is an obvious connection, but master data (shared data such as customer, product and location) is usually relatively small in volume: a few million product data records would count as large, and only in B2C companies do you get customer records in the tens or sometimes low hundreds of millions. Current database technology deals with such volumes comfortably.

Big data is about much larger volumes, so large that current databases struggle to handle them. We are not talking about accounting data here, but usually data generated on the web or via machines, such as web logs or sensor data. A jet airliner generates a massive 20TB of diagnostic data an hour. Such volumes are why current databases are straining to cope. In 2003 the largest data warehouse in the world was 30TB in size but, by late 2012, Teradata had 25 customers with petabyte-sized data warehouses, an increase of over 30 times in a decade.

However, while such data volumes are clearly a challenge, it is less clear how connected they are to master data. The Information Difference, an analyst firm of which I am CEO, conducted a survey in late 2012 to try to get some hard data to help peer through the hype.

In the survey, 209 companies shared their views on the subject, split roughly evenly between North American and European companies, plus a few (11%) Asian respondents.

A resounding 77% of companies claimed big data was important to them, not in itself surprising, given the current level of media attention on the subject. What was striking was that no less than 19% of companies had a live, production big data application, with a further 20% due to go live by the end of 2012. Of course the survey respondents were to some extent self-selecting, in that they knew the topic was big data. Even so, this high level of active projects – as distinct from research – was much greater than we had anticipated.

Hadoop used in anger

Of those with active projects, 80% were using Hadoop, evidently the preferred technology to tackle such data. Hadoop combines a distributed file system (HDFS) with a distributed programming model (MapReduce), and is developed as an open-source project under the Apache Software Foundation.
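To give a feel for the MapReduce model mentioned above, here is a rough Python sketch that runs the three phases (map, shuffle, reduce) in a single process over a few made-up web log lines. It is not Hadoop code and uses no Hadoop API; the log format and every function name are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical web log lines: "<timestamp> <customer_id> <page>"
LOG_LINES = [
    "2012-11-01T10:00 C001 /home",
    "2012-11-01T10:01 C002 /products",
    "2012-11-01T10:02 C001 /checkout",
    "2012-11-01T10:03 C003 /home",
    "2012-11-01T10:04 C001 /home",
]


def map_phase(line):
    """Map step: emit one (page, 1) pair per log line."""
    _timestamp, _customer_id, page = line.split()
    yield page, 1


def shuffle(pairs):
    """Shuffle step: group emitted values by key between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


page_hits = run_job(LOG_LINES)
print(page_hits)  # {'/home': 3, '/products': 1, '/checkout': 1}
```

In a real cluster the map and reduce functions would run on many machines against blocks of files held in HDFS; the sketch only shows the shape of the programming model.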

The next area the survey went into was just how big is big data? Intriguingly, of the 209 respondents, only 23% had big data applications at over 100TB in size, just 10% over 500TB. Yet the problem is growing at a fair lick: fully 49% said their data volumes were growing annually at 20-50% and a fifth had growth rates over 50%. The same sample had live MDM implementations in 56% of cases, with a further 14% imminent, in line with previous Information Difference surveys. Hence, from the sample base, there were plenty of companies with both MDM and active big data projects.
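To put those growth rates in perspective, a small worked calculation helps. The sketch below assumes constant compound growth and a hypothetical 100TB starting volume (neither assumption comes from the survey) and shows how quickly volumes double at 20% and at 50% a year.

```python
import math


def doubling_time_years(annual_growth):
    """Years for a volume to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


def projected_volume_tb(start_tb, annual_growth, years):
    """Volume after compounding the annual growth rate for a number of years."""
    return start_tb * (1 + annual_growth) ** years


START_TB = 100  # hypothetical starting point, not a survey figure

for rate in (0.20, 0.50):
    print(
        f"{rate:.0%} a year: doubles in {doubling_time_years(rate):.1f} years, "
        f"{START_TB}TB becomes {projected_volume_tb(START_TB, rate, 5):.0f}TB after 5 years"
    )
# 20% a year: doubles in 3.8 years, 100TB becomes 249TB after 5 years
# 50% a year: doubles in 1.7 years, 100TB becomes 759TB after 5 years
```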

So, the big question was: are these linked? Fully 59% claim that they are, with only 7% seeing no link between MDM and big data. The survey asked a number of further questions about the likely ways in which the two could interact: you can imagine that an existing MDM hub could provide the (say) customer data that could help drive an analysis of web traffic, perhaps looking for multi-channel behaviour of existing customers.
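As a minimal sketch of that first direction, the Python below uses a tiny, invented set of customer master records to give meaning to invented web traffic events. The record layouts, IDs and function names are all hypothetical; a real MDM hub would expose its data through its own interfaces.

```python
from collections import Counter

# Hypothetical master data, one golden record per customer.
CUSTOMER_MASTER = {
    "C001": {"name": "Alice Example", "segment": "retail"},
    "C002": {"name": "Bob Example", "segment": "business"},
}

# Hypothetical clickstream events (the big data side), keyed by customer ID.
WEB_EVENTS = [
    {"customer_id": "C001", "channel": "web", "page": "/checkout"},
    {"customer_id": "C001", "channel": "mobile", "page": "/home"},
    {"customer_id": "C002", "channel": "web", "page": "/products"},
    {"customer_id": "C999", "channel": "web", "page": "/home"},  # not in the master data
]


def channels_by_customer(events, master):
    """Collect the channels used by each customer that exists in the master data."""
    channels = {}
    for event in events:
        customer_id = event["customer_id"]
        if customer_id in master:
            channels.setdefault(customer_id, set()).add(event["channel"])
    return channels


def multi_channel_customers(events, master):
    """Names of known customers seen on more than one channel."""
    seen = channels_by_customer(events, master)
    return sorted(master[cid]["name"] for cid, used in seen.items() if len(used) > 1)


def events_by_segment(events, master):
    """Roll up event counts by customer segment taken from the master record."""
    return Counter(
        master[event["customer_id"]]["segment"]
        for event in events
        if event["customer_id"] in master
    )


print(multi_channel_customers(WEB_EVENTS, CUSTOMER_MASTER))  # ['Alice Example']
print(events_by_segment(WEB_EVENTS, CUSTOMER_MASTER))
```

The events carry only an ID; the name and segment that make the analysis useful come from the master record, which is the sense in which master data supplies context for the raw traffic.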

It is also possible to imagine the opposite direction, with big data analysis throwing up new master data that could be fed into a master data hub. It turned out that 67% of survey respondents saw MDM driving big data, rather than the other way around, with just 17% seeing big data producing new master data. The most popular choice was for existing MDM data to help drive big data searches. The survey also asked what key future requirements would be and here the most popular request was for the capability to auto-identify master data within big data datasets, such as spotting customer accounts. Just 8% felt it was important that MDM technologies use big data techniques to speed up their processing.
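What spotting customer accounts in raw data might look like can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the account number format (ACC- followed by six digits), the sample lines and the function name. Real products would need far more robust matching than a regular expression.

```python
import re

# Hypothetical set of account numbers already held in the master data hub.
KNOWN_ACCOUNTS = {"ACC-100234", "ACC-100777"}

# Assumed account format: "ACC-" followed by exactly six digits.
ACCOUNT_PATTERN = re.compile(r"\bACC-\d{6}\b")

# Hypothetical raw lines from mixed sources (web logs, sensors, chat transcripts).
RAW_LINES = [
    "GET /orders?account=ACC-100234 200",
    "sensor 17 temp=71.2",
    "support chat: customer quoted ACC-100777 and ACC-555000",
]


def spot_accounts(lines, known):
    """Split account-like strings into those already mastered and those that are not."""
    matched, unknown = set(), set()
    for line in lines:
        for account in ACCOUNT_PATTERN.findall(line):
            if account in known:
                matched.add(account)
            else:
                unknown.add(account)
    return matched, unknown


matched, unknown = spot_accounts(RAW_LINES, KNOWN_ACCOUNTS)
print(sorted(matched))  # ['ACC-100234', 'ACC-100777']
print(sorted(unknown))  # ['ACC-555000']
```

The matched set illustrates existing master data being recognised inside big data; the unknown set hints at the less popular direction, where analysis turns up candidate records that might be fed back into the hub after review.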

The survey also asked what connection, if any, was perceived between big data and existing initiatives around data quality and data governance. The response was clear: a resounding 94% felt data governance was either “important” or “essential” to big data. It was almost as clear with data quality, with 80% saying data quality was of “key importance” to big data projects. These are certainly worthy intentions, although whether companies really live up to these ideals in practice is less clear: separate Information Difference surveys show that around 30% of companies have no data quality tools at all, and few of the 70% that do have them widely deployed across their enterprises. It is also far from clear that many data quality suppliers today have a compelling big data story.

What does this all mean? Firstly, it is clear that big data is not just hype: a sizeable minority of companies are engaged in big data projects, not just research. It was also striking how strong the desire was to tie big data into existing initiatives such as data governance and even data quality. It was also clear that MDM is seen as a resource to drive big data analysis. To use data warehousing terminology, MDM can provide the dimensions for big data facts. Perhaps not everything is new after all.

About the author

Andy Hayler is co-founder and CEO of The Information Difference and a keynote speaker at conferences on master data management, data governance and data quality. He is also a restaurant critic and author (

1 comment

Detailed archiving of your info, be it family albums or music files, will be an ongoing task but well worthwhile.
With a few keystrokes you can automatically catalog existing "MPG" video, "MP3" audio, "JPG" pictures and "TXT" email records & notes etc.
Moments later you will be randomly sampling your data treasures.

It takes some time for your personal database to become large enough to make searching interesting. If your memory is short then not so long.

This app plows through text files at 20,000,000 CPS and beyond on an 8-year-old laptop.

I have completed over four dozen telephone and cable system billing conversions (ETL). 95% of that data came in text files; a small percentage was packed integers, real numbers etc. The export data included toll files, work orders, customer details etc. These data were no problem for this app.

Many of these IT jobs were for some of the largest companies in Western Canada and the US.
The so-called "Big Data" isn't that BIG for today's computers. I have more personal data than all the ETLs combined.

To keep you in touch with the massive amounts of data you'll collect, the app can randomly sample video or audio segments as easily as family pictures.
Text data can be displayed "in context" or "matching lines only" along with match counts, line counts and elapsed time.

Without a random option, computer resources go unused and your data mining tools will fall short.
Video playback options such as fast forward, slow motion and large-font captioning mixed with video segments are a few of the main features. There is no more useful app than this.

See the thread "nobody shares knowledge better than this" for all the details