This article is part of our Essential Guide: Data centre analysis and business intelligence guide

'Big data' applications bring new database choices, challenges

The dormitory world of databases has been roused from slumber by “big data” and associated technologies: MapReduce, Hadoop, MPP, NoSQL. Andy Hayler details big data’s impact and the issues it raises.

I started out my career as a systems programmer and database administrator, working on what was then the state of the art in the world of databases: IMS, from IBM. Companies needed somewhere to put (and sometimes even retrieve) their data, and once things had moved beyond basic file systems, databases were the way to go.

The volumes of data that had to be handled back then seem amusingly modest by today’s “big data” applications standards, with IBM’s 3380 mainframe able to store what seemed like a capacious 2.5 GB of data when it was launched in 1980. To put data into IMS, you needed to understand how to navigate the physical structure of the database itself, and it was a radical step indeed when IBM launched DB2 in 1983. In this new approach, programmers would write in a language called SQL and the database management system itself would figure out the best access path to the data via an “optimiser.”

For more on big data applications and technologies

The story of Hadoop and big data analytics in 2011, from

How big data puts enterprise architects on a journey of discovery

Join the big data debate: Buy tools now or fix data management issues first?

I recall some of my colleagues’ deep scepticism about such dark arts, but the relational approach caught on, and within a few years the database world was a competitive place, with Ingres, Informix, Sybase and Oracle all battling it out along with IBM in the enterprise transaction processing market. A gradual awareness that such databases were all optimised for transaction processing performance allowed a further range of innovation, and the first specialist analytical databases appeared. Products from smaller companies, with odd names like Red Brick and Essbase, became available and briefly a thousand database flowers bloomed.

By the dawn of the millennium, the excitement had died down, and the database market had seen dramatic consolidation. Oracle, IBM and Microsoft dominated the landscape, having either bought out or crushed most of the competition. Object databases were snuffed out, and the database administrator beginning his or her career in 2001 could look forward to a stable world of a few databases, all based on SQL. Teradata had carved out a niche at the high-volume end, and Sybase had innovated with columnar instead of row-based storage, but the main debates in database circles were reduced to arguments over arcane revisions to the SQL standard. The database world had seemingly grown up and put on a cardigan and slippers.

How misleading that picture turned out to be. Few appreciated at the time that rapid growth in both the volume and types of data that companies collect was about to challenge the database incumbents and spawn another round of innovation. Whilst Moore’s Law was holding for processing speed, it was most decidedly not working for disk access speed. Solid-state drives helped some, but they were, and still are, very expensive. Database volumes were increasing faster than ever due primarily to the explosion of social media data and machine-generated data, such as information from sensors, point-of-sale systems, mobile phone network, Web server logs and the like.

In 1997 (according to Winter Corp., which measures such things), the largest commercial database in the world was 7 TB in size, and that figure had only grown to about 30 TB by 2003. Yet it more than tripled to 100 TB by 2005, and by 2008 the first petabyte-sized database appeared. In other words, the largest databases increased tenfold in size between 2005 and 2008. The strains of analysing such volumes of data started to stretch and exceed the capacity of the mainstream databases.

Enter MPP, columnar and Hadoop
The database industry has responded in a number of ways. Throwing hardware at the problem was one way. Massively parallel processing (MPP) databases allow database loads to be split amongst many processors. The columnar data structure pioneered by Sybase turned out to be well suited to analytical processing workloads, and a range of new analytical databases sprang up, often combining columnar and MPP approaches. The giant database vendors responded with either their own versions or by simply purchasing upstart rivals. For example, Oracle brought out its Exadata offering, IBM purchased Netezza and Microsoft bought DATAllegro. There are also a range of independent alternatives remaining on the market.

However, the big data challenge is of such a scale that more radical approaches have also appeared. Google, having to deal with exponentially growing Web traffic, devised an approach called MapReduce, designed to work with a massively distributed file system. That work inspired an open source technology called Hadoop, along with an associated file system called HDFS. New databases followed that spurned SQL entirely or in large part, endeavouring to allow more predictable scalability and eliminating the constraints of a fixed database schema.

This “NoSQL” approach brings with it a range of issues. A generation of programmers and software products have relied on a common standard for database access, with the removal of the need to understand internal database structure allowing considerable productivity gains. Programming for big data applications is an altogether trickier affair, and IT departments that are staffed with people who understand SQL are ill-equipped to tackle the world of MapReduce programming, parallel programming and key-value databases that is starting to represent the state of the art in tackling very large data sets. There are also considerable challenges to the new database technologies in coping with high availability, guaranteed consistency and tolerance to hardware failure, things which many organisations had previously started to take for granted.

Of course, not everyone is equally affected by such developments. The big data issues are most acutely felt in certain industries, such as Web marketing and advertising, telecoms, retail and financial services, and certain government activities. Understanding the relationships between data is important in areas as diverse as fraud detection, counter-terrorism, medical research and energy metering.

However, the recent data explosion is going to make life difficult in many industries, and those companies that can adapt well and gain the ability to analyse such data will have a considerable advantage over those that lag. New skill sets are going to be needed, and these skills will be scarce. Companies need to explore the newer approaches to handling large data volumes and begin to understand the limitations and challenges that come with technologies like Hadoop and NoSQL databases, if they are to avoid being swept away by the big data tidal wave.

About the author

Andy Hayler is co-founder and CEO of analyst firm The Information Difference and a regular keynote speaker at international conferences on MDM, data governance and data quality. He is also a respected restaurant critic and author (see

Read more on Database management