Using big data and Hadoop 2: New version enables new applications
The capabilities of Hadoop, the open-source technology stack for big data analysis, recently got a lot bigger.
With the general-availability release of Hadoop 2 in October 2013, developers can now perform a greater range of data-crunching tasks, beyond Hadoop’s previous scope of batch processing, directly within the stack itself. The new release also lets them run multiple workloads on the same Hadoop cluster.
These improvements represent a “major advancement” of the stack, according to Merv Adrian, an analyst with IT market research company Gartner. Hadoop 2, he says, offers new opportunities for information managers and addresses shortcomings in previous versions.
But, he adds, it will further complicate support models and supplier selection for customers. For customers in the UK and Europe, where adoption lags the US by some distance, that raises a difficult question: is the world of Hadoop now running way ahead of their big data ambitions?
First, it’s important to understand how Hadoop can offer new capabilities over previous versions - and the best person to ask is Arun Murthy who, alongside his role as founder and architect at Hadoop distributor Hortonworks, also led the open-source effort to develop and release Apache Hadoop 2.
“Hadoop 2 is not a release number,” he insists. “It’s a second-generation architecture.” The distinction is important, he adds, because the amount of re-engineering required to move Hadoop beyond batch processing and into the world of real-time analytics has been substantial.
New version, new capabilities
In short, an open-source development community led by Murthy has taken MapReduce, Hadoop’s programming framework for processing large data sets on a cluster of commodity servers, and broken it up into two areas of function: job processing and resource management.
In the new version, MapReduce 2 is responsible for job processing, running on top of a new layer in the Hadoop stack, YARN (Yet Another Resource Negotiator), which handles resource management. This reconfiguration means programmers can run multiple applications in Hadoop, including MapReduce for batch processing of data, all sharing common resource management, provided by YARN.
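The split between job processing and resource management can be illustrated with a toy model. The sketch below is not the YARN API: the `ResourceManager` class, its methods, the application names and the memory figures are all hypothetical, chosen only to show several applications drawing on one shared pool of cluster resources.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    """Toy stand-in for a shared resource-management layer (not the YARN API)."""

    total_memory_mb: int
    allocations: dict = field(default_factory=dict)  # application name -> MB granted

    @property
    def free_memory_mb(self) -> int:
        # Whatever has not been granted to an application is still free.
        return self.total_memory_mb - sum(self.allocations.values())

    def request(self, app: str, memory_mb: int) -> bool:
        """Grant the request only if enough free memory remains in the pool."""
        if memory_mb <= 0 or memory_mb > self.free_memory_mb:
            return False
        self.allocations[app] = self.allocations.get(app, 0) + memory_mb
        return True

    def release(self, app: str) -> None:
        """Return everything an application holds to the shared pool."""
        self.allocations.pop(app, None)


# Hypothetical cluster with 8 GB of memory shared by three kinds of workload.
rm = ResourceManager(total_memory_mb=8192)
batch_ok = rm.request("mapreduce-batch", 4096)    # batch job gets half the pool
stream_ok = rm.request("stream-processing", 2048) # a second framework shares it
too_big = rm.request("interactive-sql", 4096)     # refused: only 2048 MB is free
rm.release("mapreduce-batch")                     # batch job finishes
retry_ok = rm.request("interactive-sql", 4096)    # now the request fits
```

In this simplified picture, the batch job is just one tenant among several. The resource layer knows nothing about how each application processes data, only how much of the cluster it holds.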
That will make a huge difference to programmers, says Willem van Asperen, a data science specialist at management consultancy firm PA Consulting Group. “The old resource manager was tuned to batch jobs: you made sure all the data was available, you ran the job and downloaded the results,” he says.
“With YARN, you’ve got a far more open and flexible application-programming interface. This means that it is now easy for other frameworks to use the resource-management layer - not just the batch processing of MapReduce, but a whole host of online and immediate-results frameworks are underway. They run on top of Hadoop, but give the user the immediate response that is vital to many use cases.”
Or, to use an analogy employed by Arun Murthy: while Hadoop 1 was like running Microsoft Windows to run only Notepad, Hadoop 2 enables you to have Word, SharePoint, PowerPoint and Excel, too.
PA's van Asperen says changes to the Hadoop Distributed File System (HDFS) – also included in Apache Hadoop 2 – provide new failover capabilities, for better availability of the stack. “All of a sudden, Hadoop has become a platform for fault-tolerant, resilient, online big-data analysis - it’s a big step forward,” he says.
A matter of choice
So which applications or analytical workloads are customers that are up to speed on Hadoop 2 likely to choose, or find available?
According to Merv Adrian at Gartner, other processing engines to run on top of YARN may come from third parties or Apache projects, but are likely to include real-time event processing, graph processing, search and text indexing and in-memory processing.
But while the emergence of this “bring your own Hadoop” ecosystem will expand the possibilities for using Hadoop in big data projects, he adds, it will also introduce complexities for users “and demand new architectural and vendor-management thinking”.
In short, users have two possible scenarios to choose between: running components from a single supplier on their Hadoop cluster to perform different analytical workloads, or running components from multiple suppliers.
The first scenario has the advantages of tight integration between components and a single point of contact for support - but comes with obvious risks of vendor lock-in. That could be a problem if one supplier does not provide the applications or analytical depth that a business use-case requires, Adrian says.
The second scenario, meanwhile, introduces several different issues. Integrating applications with the Hadoop platform may be relatively easy, given that the Apache codebase is publicly available via its open-source model, but dealing with multiple suppliers is likely to lengthen development and deployment times and increase training overhead.
Either way, these may not be issues that European organisations are ready to wrestle with just yet. The major Hadoop distributors have opened offices in the region only in the last two years. A Europe-wide survey, published by big-data integration specialist Syncsort in July 2013, showed that 64% of respondents were experimenting with Hadoop or had been using it for over a year, but it might be sensible to assume that those respondents were drawn from organisations with a more sophisticated understanding of big data than most. Hadoop adoption in the region, the survey admits, “is much more of a marathon than a sprint”.
A bigger question, perhaps, is this: as European organisations prepare for that marathon, to what extent have their "training programmes" taken Hadoop 2 into account, with all the technical changes and new deployment considerations that it involves?
For this article, four Hadoop distributors - Cloudera, Hortonworks and MapR (the three leading providers in Europe, according to Syncsort’s survey), along with WANdisco - were asked to provide details of customers in Europe already using Hadoop 2. They were unable to do so but, in fairness, that might be more of a reflection of customers’ willingness to talk publicly than of their actual use of, or experimentation with, the updated version.
But according to Eddie Short, head of data and analytics for Europe, the Middle East and Africa at management consultancy KPMG, many European organisations are still wrestling with Hadoop 1.
“The US is far ahead on this journey. The rest of the world, including Europe, has a long way to go. They’re already struggling to get to grips with what was already there in Hadoop 1, so I’m not sure that Hadoop 2’s had much of an impact here yet,” he says.
Among clients with whom he speaks regularly, he says, Hadoop skills remain in short supply and come at a premium, and most have yet to move beyond pilot-deployment mode.
But, as Arun Murthy of Hortonworks points out, “Different organisations want to do different things with data”, and Hadoop 2 does a much better job of supporting that.
A credit card company, for example, might want to use Apache Storm for stream processing to look for patterns and anomalies in credit-card use in order to detect fraud in real time; it might wish to use Apache Tez to run interactive SQL queries to see in what locations a cloned credit card was used; and it can still use batch processing to identify wider patterns across all customers and cards.
“And with YARN, they can do all that on one platform, without the need for independent, bolt-on systems for Hadoop that all need to be managed and monitored separately,” says Murthy.
Commercial availability of YARN-friendly applications will also be a factor in user adoption, Murthy concedes, but a lively ecosystem is already building. Take, for example, the Apache Giraph graph-processing application. It is already used by sophisticated Hadoop users, such as Facebook and LinkedIn, to reveal connections between the individuals who populate their social networking platforms - in other words, "who knows who".
YARN, he repeats, is much more than a new component for Hadoop. It’s a datacentre operating system. “There’s a huge amount of development work needed to add to its features, but what we’re providing here with Hadoop 2 is better for Hadoop, better for the ecosystem and better for the enterprise customer.”