The bubble of hype surrounding Apache Hadoop – that darling of big data enthusiasts such as Amazon and Yahoo – may be about to burst.
According to a recent report by analyst Gartner, investment in the open-source environment – which enables the distributed processing of large datasets across commodity computing clusters – remains “tentative” in the face of what Gartner describes as “sizeable challenges around business value and skills”.
A survey of 284 of Gartner's large Research Circle members, comprising IT and business leaders, found 26% were experimenting with, piloting or all-out deploying Hadoop, while a further 18% planned to do so in the next two years.
Some 57% cited a lack of skills as the biggest inhibitor to adoption, while 49% were unclear about how to gain value from the system. This meant that, in many cases, organisations simply did not consider implementing it a corporate priority, while others considered it overkill for their business problems.
So what has gone wrong – and if Hadoop is not set to take over the world any time soon, is it doomed to simply fade into obscurity? Or does it have a niche role to play in a world increasingly beguiled by the possibilities of big data?
Chris Brown, big data lead at high-performance computing consultancy OCF, believes one of the issues is that the technology is simply not suitable for most organisations unless they are processing vast amounts of data – at least 1TB.
“The lighthouse accounts for this are Amazon, Yahoo and Walmart, which are huge corporations – but we just don’t have that many in the UK apart from a few telcos, retailers or financial services organisations,” he says.
“So, for the small and medium-sized companies that are in the majority here, it’s huge overkill and is just too big an exercise for them to take on.”
Another issue is generating a return on investment (ROI) – a situation exacerbated by the skills shortage Gartner cites, as it inevitably makes expertise expensive to buy in.
But such skills are rare since the “data scientist” role is an amalgam of “an engineer and a statistician”, which amounts to a “new, hybrid breed of person”, says Euan Robertson, chief technology officer at data analytics consultancy Aquila Insight.
A further plank in generating ROI is ensuring that use cases based on industry and customer requirements can be justified.
“Hadoop is great if you have 100 million active users at a time, like King’s Candy Crush. Or if you’re a bank wanting to analyse your customer records. Or you’re wanting to ask ‘what is the sentiment on Twitter about us at the moment?’,” says Robertson.
“If you want to process more than 160GB per day, your relational database model will start creaking, and the same applies if you’re processing disparate sets of unstructured data such as free text, call logs or tweets.”
But again, Hadoop is not always the best technology, even when processing big data. In some instances, users are starting to replace it with Apache Spark, which provides an alternative to MapReduce, the framework early versions of Hadoop were tied to.
“Hadoop is very much for a class of problems that run overnight, but if you’re wanting to do something on an intraday basis you’d use Spark, which is part of the same ecosystem but does analysis in real time,” says Robertson.
Hadoop, by contrast, follows a batch-processing model – “or long-term memory, where you build up experiences over a long time, whereas Spark would be more short-term memory”, he adds.
OCF’s Brown agrees. “Hadoop was very important historically, because it opened up a new line of thinking. Before it came along, although people did some analytics and big data, it wasn’t high on most agendas. But the technology is now starting to be superseded, in some cases, by other things like Spark, which does a slightly different job but faster,” he concludes.
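For readers unfamiliar with the batch model both consultants describe, the following is a minimal, single-machine sketch of the map, shuffle and reduce steps that MapReduce is built around. It is an illustration only, not Hadoop's actual API: the function names and the sample records are hypothetical, and a real cluster would spread each phase across many machines and write intermediate results to disk.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    # Emit a (word, 1) pair for every word in one input record.
    return [(word.lower(), 1) for word in record.split()]


def shuffle_phase(pairs):
    # Group every emitted value under its key, as the framework
    # would do between the map and reduce steps.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Collapse all values for one key into a single result.
    return key, sum(values)


def run_batch(records):
    # The whole input is processed in one pass: nothing is returned
    # until every record has been mapped, grouped and reduced.
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Hypothetical sample input standing in for a day's worth of free text.
records = ["great service today", "service was slow", "great app"]
counts = run_batch(records)
```

Because results only appear once the entire batch has run, this style suits overnight jobs rather than the intraday analysis Robertson mentions.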
Case study: JustGiving
“Although it wouldn’t have been impossible to do what we wanted without Hadoop, I wouldn’t like to think what the cost and resource requirement would have been,” says Mike Bugembe, chief analytics officer at online fundraising platform JustGiving.
The organisation first started working with the technology in early 2013 when it introduced a proof of concept, running on Microsoft’s Azure cloud environment, to enable it to start up and close down Hadoop clusters in line with demand.
The aim of the initiative was to analyse the decision-making process people engage in when making donations, to “understand the barriers to generosity” and make the experience less transactional and “more meaningful and engaging”, says Bugembe.
To this end, the organisation developed a special algorithm to determine what characteristics people tend to display when they care about a given charity, as well as the optimum time to interact with them during the donation process.
The system was set up to analyse transactional data from 23 million people around the world who had given money to about 20 million charities over the course of 14 years, and the relationships between them.
“Looking at relationships not only increases size, but we’re also no longer talking about structured data – it’s a graph with in excess of 80 million nodes, dealing with 285 million relationships,” says Bugembe. “So if you’re trying to do calculations across the graph quickly over a short period, SQL databases just won’t cut it.”
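To give a sense of what a calculation “across the graph” involves, here is a small sketch that stores donors and charities as nodes in an adjacency list and walks two hops to find related charities. This is not JustGiving's algorithm, which the company has not published here: the data, the node names and the `related_charities` helper are all hypothetical. In a relational database, each hop would typically require another self-join on the donations table.

```python
from collections import defaultdict

# Hypothetical (donor, charity) relationships.
donations = [
    ("alice", "charity_a"),
    ("alice", "charity_b"),
    ("bob", "charity_a"),
    ("bob", "charity_c"),
    ("carol", "charity_c"),
]

# Build an undirected adjacency list: every donor and charity is a node.
graph = defaultdict(set)
for donor, charity in donations:
    graph[donor].add(charity)
    graph[charity].add(donor)


def related_charities(donor):
    # Find charities supported by people who share at least one
    # charity with this donor, excluding the donor's own charities.
    own = graph[donor]
    related = set()
    for charity in own:
        for other in graph[charity]:
            if other != donor:
                related |= graph[other] - own
    return related


suggestions = related_charities("alice")
```

On five relationships the traversal is trivial; at hundreds of millions of relationships, the number of paths to follow grows quickly, which is the scaling problem Bugembe describes.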
The initiative resulted in the creation of a new social media part of the JustGiving platform to encourage interaction between those raising money for a cause and those giving to them via now-standard tools such as "like", "share" and "care" buttons.
Bugembe says the service is now generating many more return visits. What's more, 16% of visitors now go on to make a donation, compared with e-commerce averages in the low single digits.
But he acknowledges that it will take time to generate a return on investment on the system, not least because “finding individuals with all of the skills required is incredibly difficult” – and costly.
Such skills include understanding how to manipulate data and statistics, knowledge of machine learning and application development expertise. While JustGiving bought in such expertise in some areas, it trained others in-house. But the most important factor was that they “all had more than a passing interest in the other fields” and were willing to learn, says Bugembe.
The Hadoop team now comprises 14 staff out of an organisation of 160, some 40% of whom are technology workers.
But JustGiving is also currently testing Spark, with the aim of introducing it to work alongside Hadoop. At the moment, Hadoop undertakes a nightly batch process to update all of the graph's 80 million nodes but, when analysing data in real time, it handles only a subset of the data to make it more manageable.
“But Spark will allow us to do more streaming real-time calculations – and, for us, the combination of the two would really add value,” says Bugembe. “We don’t need to do everything in real time, but it would give us the flexibility to use Python and R. I don’t think one will supersede the other though as they live very well together.”
Case study: Postcode Anywhere
“Hadoop will die and it will probably be a relatively slow, painful death. But it will go because, although it solved certain problems using big, distributed computers, it’s not especially good,” says Jamie Turner, chief technology officer of Postcode Anywhere.
The organisation, which was set up in 2001 and provides customers such as Tesco, Man City Football Club and Fiat with cloud-based address management services, started evaluating the offering about 18 months ago.
The initial supposition had been that Hadoop could form the basis of a new product to model behaviour and mood that had been originally developed to run on a much smaller scale in-house, but which the firm had decided to scale up and commercialise through a startup company called Triggar.
The aim was to create an offering that could respond to changes in online behaviour to help retailers turn more visitors into customers. Although currently in closed beta, the launch could take place as early as autumn 2015.
“Our use case was machine learning, which means processing a lot of data with a lot of maths on top,” explains Turner. “But it was a classic case where Hadoop was not good at all. It processes things based on disk as a simple way to deal with it, but it makes it horribly slow, especially on memory-intensive exercises like this.”
Other challenges included tools that were not particularly easy to work with and which, on the development side, tended to have limited expressivity.
Choosing Spark over Hadoop
“We bailed with Hadoop quickly, but my early impressions were that it seemed really complicated and not massively sophisticated under the hood,” says Turner.
As a result, the decision was made to go with the new kid on the block, Spark, instead. Turner explains the rationale: “Spark is good at micro-batch jobs. It’s better at job management, has a better way of recovering from a failed node and is 100 times faster than Hadoop, so it was a no-brainer for us.”
Other advantages of Spark are that it is “a bit easier” to set up, but also seems to have a “more supportive” open-source community around it, which “makes more contributions, is more active and has some good financial backing behind it”, he adds.
On the downside, however, it is just as difficult to find suitable Spark skills as it is to find Hadoop ones. “Big data comes with a big headache as there are a lot of moving parts in the infrastructure and the skills side of things is a constant battle,” says Turner. “The number of people who understand this stuff and can manage it is rare and they tend to be gobbled up by big players such as the banks, which are always ahead of the curve and will pay silly money for them.”
Nonetheless, he believes Spark is already taking over from Hadoop as the big data processing vehicle of choice.
“You could say that Spark is Hadoop version 2 because it fixes many of its fundamental problems. They do the same job, but with Spark, it’ll run faster on less hardware, you’ve got a fighting chance of knowing what to do if things go wrong and you have a better chance of fixing it. So it’s better from every angle,” Turner concludes.