Hadoop for big data puts architects on journey of discovery

Solving the problems posed by big data has led architects beyond traditional database and business intelligence technologies to open source Hadoop.

“Problems don’t care how you solve them,” said James Kobielus, an analyst with Forrester Research, in a recent blog on the topic of ‘big data’. “The only thing that matters is that you do indeed solve them, using any tools or approaches at your disposal.”

In recent years, that imperative -- to solve the problems posed by big data -- has led data architects at many organisations on a journey of discovery. Put simply, the traditional database and business intelligence tools that they routinely use to slice and dice corporate data are just not up to the task of handling big data.

To put the challenge into perspective, think back a decade: few enterprise data warehouses (EDWs) grew as large as a terabyte. By 2009, Forrester analysts were reporting that over two-thirds of EDWs deployed were in the 1 to 10 terabyte range. By 2015, they claim, a majority of EDWs in large organisations will be 100 terabytes or larger, “with petabyte-scale EDWs becoming well entrenched in sectors such as telecoms, financial services and web commerce.”

What’s needed to analyse these big data stores are special new tools and approaches that deliver “extremely scalable analytics,” said Kobielus. By “extremely scalable,” he means analytics that can deal with data that stands out in four key ways: for its volume (from hundreds of terabytes to petabytes and beyond); its velocity (up to and including real-time, sub-second delivery); its variety (diverse structured, unstructured and semi-structured formats); and its volatility (wherein scores to hundreds of new data sources come and go from new applications, services, social networks and so on).

Hadoop for Big Data on the Rise

One of the principal approaches to emerge in recent years has been Apache Hadoop, an open source software framework that supports data-intensive distributed analytics involving thousands of nodes and petabytes of data.

The underlying technology was originally invented at search engine giant Google by in-house developers looking for a way to usefully index textual data and other “rich” information and present it back to users in meaningful ways. They called this technology MapReduce, and today’s Hadoop is an open-source implementation of it, used by data architects for analytics that are deep and computationally intensive.

In a recent survey of enterprise Hadoop users conducted by Ventana Research on behalf of Hadoop specialist Karmasphere, 94% of Hadoop users reported that they can now perform analyses on large volumes of data that weren’t possible before; 88% said that they analyse data in greater detail; and 82% can now retain more of their data.

It’s still a niche technology, but Hadoop’s profile received a serious boost over the past year, thanks in part to start-up companies such as Cloudera and MapR that offer commercially licensed and supported distributions of Hadoop. Its growing popularity is also the result of serious interest shown by EDW vendors like EMC, IBM and Teradata. EMC bought Hadoop specialist Greenplum in June 2010; Teradata announced its acquisition of Aster Data in March 2011; and IBM announced its own Hadoop offering, InfoSphere BigInsights, in May 2011.

What stood out about Greenplum for EMC was its x86-based, scale-out MPP, shared-nothing design, said Mark Sears, architect at the new organization EMC Greenplum. “Not everyone requires a set-up like that, but more and more customers do,” he said.

“In many cases,” he continued, “these are companies that are already adept at analysis, at mining customer data for buying trends, for example. In the past, however, they may have had to query a random sample of data relating to 500,000 customers from an overall pool of data of 5 million customers. That opens up the risk that they might miss something important. With a big data approach, it’s perfectly possible and relatively cost-effective to query data relating to all 5 million.”

Hadoop Ill Understood by the Business

While there’s a real buzz around Hadoop among technologists, it’s still not widely understood in business circles. In the simplest terms, Hadoop is engineered to run across a large number of low-cost commodity servers and distribute data volumes across that cluster, keeping track of where individual pieces of data reside using the Hadoop Distributed File System (HDFS). Analytic workloads are performed across the cluster, in a massively parallel processing (MPP) model, using tools such as Pig and Hive. Results, meanwhile, are delivered as a unified whole.
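
To make that description concrete, here is a minimal sketch in plain Python (not Hadoop code) of the split, process-in-parallel and merge pattern just described. Everything in it is an illustrative assumption: the function names (`split_into_blocks`, `map_block`, `reduce_partials`, `run_job`), the round-robin placement (real HDFS block placement works differently), the thread pool standing in for cluster nodes, and the sample transaction data.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(records, num_nodes):
    """Spread records round-robin over simulated nodes (a stand-in for HDFS)."""
    blocks = [[] for _ in range(num_nodes)]
    for i, record in enumerate(records):
        blocks[i % num_nodes].append(record)
    return blocks


def map_block(block):
    """Runs on one node: total the spend per customer for that node's data only."""
    partial = defaultdict(float)
    for customer_id, amount in block:
        partial[customer_id] += amount
    return partial


def reduce_partials(partials):
    """Merge the per-node partial totals into one unified result."""
    totals = defaultdict(float)
    for partial in partials:
        for customer_id, amount in partial.items():
            totals[customer_id] += amount
    return dict(totals)


def run_job(records, num_nodes=4):
    """Distribute the data, process every block in parallel, then merge."""
    blocks = split_into_blocks(records, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(map_block, blocks))
    return reduce_partials(partials)


# Hypothetical sample data: (customer, transaction amount)
transactions = [
    ("alice", 20.0), ("bob", 5.0), ("alice", 7.5), ("carol", 12.0), ("bob", 1.0),
]
totals = run_job(transactions, num_nodes=3)
```

The point of the sketch is that no single node ever needs to see the whole data set: each works on its own block, and only the much smaller partial results are brought together at the end.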

Earlier this year, Gartner analyst Marcus Collins mapped out some of the ways he’s seeing Hadoop used today: by financial services companies, to discover fraud patterns in years of credit-card transaction records; by mobile telecoms providers, to identify customer churn patterns; by academic researchers, to identify celestial objects from telescope imagery.

“The cost-performance curve of commodity servers and storage is putting seemingly intractable complex analysis on extremely large volumes of data within the budget of an increasingly large number of enterprises,” he concluded. “This technology promises to be a significant competitive advantage to early adopter organisations.”

Early adopters include daily deal site Groupon, which uses the Cloudera distribution of Hadoop to analyse transaction data generated by over 70 million registered users worldwide, and NYSE Euronext, which operates the New York Stock Exchange as well as other equities and derivatives markets across Europe and the US. NYSE Euronext uses EMC Greenplum to manage transaction data that is growing, on average, by as much as 2 TB each day. US bookseller Barnes & Noble, meanwhile, is using Aster Data nCluster (now owned by Teradata) to better understand customer preferences and buying patterns across three sales channels: its retail outlets, its online store and e-reader downloads.

Skills Deficit in Big Data Analytics

While Hadoop’s cost advantage might give it widespread appeal, the skills challenge it presents is more daunting, according to Collins. “Big data analytics is an emerging field, and insufficiently skilled people exist within many organisations or the [wider] job market,” he said.

Skills-related issues that stand in the way of Hadoop adoption include the technical nature of the MapReduce framework, which requires users to develop Java code, and the lack of institutional knowledge on how to architect the Hadoop infrastructure. Another impediment to adoption is the lack of support for analysis tools used to examine data residing within the Hadoop architecture.

That, however, is starting to change. For a start, established BI tools vendors are adding support for Apache Hadoop: one example is Pentaho, which introduced support in May 2010 and, subsequently, specific support for the EMC Greenplum distribution.

Another sign of Hadoop’s creep towards the mainstream is support from data integration suppliers, such as Informatica. One of the most common uses of Hadoop to date has been as an engine for the ‘transformation’ stage of ETL (extraction, transformation, loading) processes: the MapReduce technology lends itself well to the task of preparing data for loading and further analysis in both conventional RDBMS-based data warehouses and those based on HDFS/Hive. In response, Informatica announced in May 2011 that integration is underway between EMC Greenplum and its own Data Integration Platform.
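
A minimal sketch, in plain Python rather than actual Hadoop or Informatica code, suggests why the transformation stage fits this model: each raw record can be cleaned independently of every other, so the work can be handed out across nodes as a map step. The record layout (customer ID, country code, amount), the function names and the sample lines are all hypothetical.

```python
def transform_record(raw_line):
    """Clean one raw line; return a tuple ready for loading, or None to drop it."""
    fields = [field.strip() for field in raw_line.split(",")]
    if len(fields) != 3:
        return None  # wrong number of columns
    customer_id, country, amount = fields
    try:
        return (int(customer_id), country.upper(), round(float(amount), 2))
    except ValueError:
        return None  # non-numeric ID or amount


def transform_stage(raw_lines):
    """Apply the transform to every line and keep only the valid rows."""
    cleaned = (transform_record(line) for line in raw_lines)
    return [row for row in cleaned if row is not None]


# Hypothetical raw extract containing messy and malformed lines
raw = [" 1001, gb, 19.99 ", "1002,us,5", "bad line", "1003,fr,n/a"]
rows = transform_stage(raw)
```

Because `transform_record` needs nothing beyond the single line it is given, the same function could in principle run on thousands of nodes at once, each working through its own slice of the extract.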

Additionally, there’s the emergence of Hadoop appliances, as evidenced by EMC’s Greenplum HD (a bundle that combines MapR’s version of Hadoop, the Greenplum database and a standard x86-based server), announced in May, and more recently (August 2011), the Dell/Cloudera Solution (which combines Dell PowerEdge C2100 servers and PowerConnect switches with Cloudera’s Hadoop distribution and its Cloudera Enterprise management tools).

Finally, Hadoop is proving well suited to deployment in cloud environments, a fact which gives IT teams the chance to experiment with pilot projects on cloud infrastructures. Amazon, for example, offers Amazon Elastic MapReduce as a hosted service running on its Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3). The Apache distribution of Hadoop, moreover, now comes with a set of tools designed to make it easier to deploy and run on EC2.

“Running Hadoop on EC2 is particularly appropriate where the data already resides within Amazon S3,” said Collins. If the data does not reside on Amazon S3, he said, then it must be transferred: the Amazon Web Services (AWS) pricing model includes charges for network costs, and Amazon Elastic MapReduce pricing is in addition to normal Amazon EC2 and S3 pricing. “Given the volume of data required to run big data analytics, the cost of transferring data and running the analysis should be carefully considered,” he cautioned.
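
The kind of back-of-envelope comparison Collins describes might be sketched as follows. This is an assumption-laden illustration, not AWS's pricing formula: the function name `estimate_cost`, the simple per-gigabyte and per-node-hour model, and above all the rates are placeholders chosen for easy arithmetic, not real AWS prices. S3 storage charges are left out for brevity.

```python
def estimate_cost(data_gb, already_in_s3, transfer_rate_per_gb,
                  nodes, hours, ec2_rate_per_node_hour, emr_rate_per_node_hour):
    """Rough cost of one analysis run, in whatever currency the rates use."""
    # Data already held in S3 is assumed to incur no transfer charge.
    transfer = 0.0 if already_in_s3 else data_gb * transfer_rate_per_gb
    ec2 = nodes * hours * ec2_rate_per_node_hour
    # The Elastic MapReduce charge is paid on top of the EC2 charge.
    emr = nodes * hours * emr_rate_per_node_hour
    return {"transfer": transfer, "ec2": ec2, "emr": emr,
            "total": transfer + ec2 + emr}


# Placeholder rates only; substitute current published prices before use.
PLACEHOLDER_RATES = dict(
    transfer_rate_per_gb=0.125,
    ec2_rate_per_node_hour=0.5,
    emr_rate_per_node_hour=0.25,
)

in_s3 = estimate_cost(data_gb=2048, already_in_s3=True,
                      nodes=10, hours=4, **PLACEHOLDER_RATES)
outside_s3 = estimate_cost(data_gb=2048, already_in_s3=False,
                           nodes=10, hours=4, **PLACEHOLDER_RATES)
```

Comparing the two totals shows how the transfer charge can outweigh the cost of the compute itself when large volumes of data start outside S3.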

The bottom line is that “Hadoop is the future of the EDW and its footprint in companies’ core EDW architectures is likely to keep growing throughout this decade,” said Kobielus.
