Apache Spark speeds up big data decision-making

Spark, the open-source cluster computing framework from Apache, promises to complement Hadoop batch processing

There is more to big data than Hadoop, but the trend is hard to imagine without it. Its distributed file system (HDFS) is helping businesses to store unstructured data in vast volumes at speed, on commodity hardware at previously unimaginable costs.

But there are downsides. The MapReduce programming model that accesses and analyses data in HDFS can be difficult to learn and is designed for batch processing. This is fine if applications can wait for answers to analytical questions, but if time is important, MapReduce can hold them back.

Matt Aslett, research director for data platforms and analytics at 451 Research, says Hadoop has opened up opportunities for organisations to store and process data that had previously been ignored, but applications such as fraud detection, online advertising analytics and e-commerce recommendation engines need a more rapid turnaround from data to conclusion.

“Batch processing is OK, but if it takes an hour or two, it’s not great for these applications,” he says.

The technology that promises to overcome some of these problems is Spark, the open-source cluster computing framework from the Apache Software Foundation. “With Spark and in-memory processing, you can get the response down to seconds, allowing real-time, responsive applications,” says Aslett.

Huge interest in Spark

“Interest in Spark has been bubbling under for a while, but now there is huge interest. Part of that is because Hadoop providers are getting behind it, possibly to complement Hadoop batch processing, enabling more in-memory for real-time applications. Cloudera is an early company to push it and see it as potential long-term replacement for MapReduce.”

Spark was born out of a research project at the University of California Berkeley’s AMPLab. In 2009, then PhD student Matei Zaharia developed the code that went open source in 2010. In 2013, the project was donated to the Apache Software Foundation and switched its licence to Apache 2.0. 

In 2013, AMPLab recorded Spark running 100 times faster than MapReduce on certain applications. In February 2014, Spark became an Apache top-level project.

For more about Apache Spark

  • What is Apache Spark?
  • Databricks talks up a storm about its new Spark cloud offering in an effort to distinguish the data processing engine from MapReduce and the Hadoop stack
  • Nicole Laskowski on how data scientists look to Apache Spark to ask the really big data questions and get answers faster.

Spark was developed as part of the Berkeley Data Analytics Stack, enabled by the Yarn resource manager accessing HDFS data. It can also be used on file systems apart from HDFS.

But there is cause for corporate users to be cautious of Spark, steeped as it is in open source.

Chris Brown, big data lead at high-performance computing consultants OCF, says: “Big data is still a new concept and we’ve never come across a customer that asked us to do anything with Spark.

“There are a couple of issues. Firstly, Hadoop is still immature: there are not millions of customers, there are thousands. Secondly, open-source projects like to move on quickly, whereas businesses want production environments to be stable and not change things at the same rate.”

Nonetheless, Spark is finding a home alongside proprietary software. Postcodeanywhere, a provider of address data to popular e-commerce and retail websites, has been using Spark internally for more than a year to help understand and predict customer behaviour on its platform, enabling the company to improve service.

Open-source projects like to move on quickly, whereas businesses want production environments to be stable

Chris Brown, OCF

Spark’s speed and flexibility make it ideal for rapid, iterative processes such as machine learning, which Postcodeanywhere has been able to exploit (see panel below).

Chief technology officer Jamie Turner says Postcodeanywhere’s main services are built on a Microsoft .Net framework, and incorporating open-source code took a while to get used to.

“This is our first foray into anything open source,” he says. “You tend to see quite a lot of volatility in code base. You see bugs coming in and then disappearing between different distributions.

“We knew that, for what we wanted, SQL systems would not work economically, in terms of licences, and technically, in terms of scale. But open-source technology is not well documented. What you save in licensing costs, you spend in man-power trying to understand it.”

Machine-learning capability

Postcodeanywhere is now developing its internal use of Spark’s machine-learning capability as a service for clients wanting to better understand and predict customer behaviour.

Although the performance benefits of Spark in-memory are not disputed, not all applications run 100 times faster. OCF uploaded the Hermann Hesse novel Siddartha on both HDFS and Spark in-memory to compare the time to count the words in the 700MB file. Hadoop was able to complete the task in 686 seconds; Spark could do it in 53 seconds, or 13 times faster.

Spark can help where they want to build full data transformation and write code to build a large-scale data model

Sean Owen, Cloudera

But Spark does not only offer performance benefits in-memory. Disk-based analysis is also improved dramatically. In autumn 2014, Databricks, founded by Spark’s initial developers, broke the world record in sorting 100TB of data on disk. 

Spark used 206 Amazon Elastic Compute Cloud machines to complete the task in 23 minutes. The previous 72-minute record was set by Hadoop MapReduce using 2,100 machines. Spark was three times faster using one-tenth of the machines.

This has implications for the way data scientists work in business. Spark not only enables applications to perform analytics in-memory at a faster rate, it can transform the productivity of data scientists querying data and building algorithms from disk-based data. 

Sean Owen, director of data science at Cloudera, says: “It makes it easier for data scientists building operational systems. Spark can help where they want to build full data transformation and write code to build a large-scale data model. It is a much better substrate than what we had a year or so ago.”

Zubin Dowlaty, vice-president, head of innovation and development at big data consultancy Mu Sigma, says that although IT departments can use Spark and HDFS to work on new data sets with new programming tools, the hardest thing to change is the mindset.

“They need to shake themselves up with respect to the agility these new tools provide,” he says. “They need to shake up the business to think bigger. Computation can really scale now, so you can do much more. But the leadership is not coming from the CTO, it is coming from the CMO or other c-suite executives.”

Underpinning in-memory analytics and machine learning while boosting the productivity of data science, Spark promises low-cost assistance for IT departments grappling with big data problems. The question is: will they be able to help their business peers understand the value of these new applications and approaches to data science?

Case study: Postcodeanywhere ready to launch machine-learning tools based on Spark

Address validation service Postcodeanywhere knows it could have a problem at the heart of its business model. Essentially, the service uses freely available data to help e-commerce websites identify customers’ locations and help to auto-complete address forms.

CTO Jamie Turner says: “We are selling someone else’s data, which could, in theory, become a commodity. The difference with us is the quality of service and usability of the technology.”

Founded 15 years ago, Postcodeanywhere is still trying new ideas to improve its service. One of them is to predict when customers will have a service issues, and to take pre-emptive action.

Developed in-house, the system is written in Scala, a Java-like language, and exploits Spark machine learning using NoSQL database Cassandra and Elasticsearch queries, all running in Windows. The front end is written in C# on Microsoft’s .Net framework.

By capturing that data, you can start to find out what is significant without making an early determination based on what your thoughts are.
Jamie Turner, Postcodeanywhere

Postcodeanywhere tries to capture as much information as possible about customers and their experience using the service. This includes account information, usage variations, errors, support cases augmented with other data such as IP addresses, indicating locations, as well as the mobile network carrier, network connection and customer operating systems.

By constantly analysing this data, the machine-learning model figures out which factors, in combination, could signal a negative experience.

“By capturing that data, you can start to find out what is significant without making an early determination based on what your thoughts are,” says Turner.

Predict a bad experience

Building these models to predict when a customer likely to have a bad experience allows Postcodeanywhere to take steps to improve the situation. This could mean presenting a customer with new content, placing a new product on a particular page, reducing pricing, or a proactive service call to the customer.

“It offers enormous benefits and makes a real difference to us,” says Turner. “It gives us a strong heads-up to where things are going well and where they are not. At moment it is about defending against possible problems.”  

Postcodeanywhere is developing the system as a commercial product, called Triggar, designed to help companies better understand and predict their customer experiences. This could assist with defensive action, as it is currently used, and even pin-point opportune moments for upselling.

Such an approach would be impractical without Spark, says Turner. “It is unique in giving us the scale and the speed that we want. Other approaches are either slower, or going to cost more.”

The system runs on racks of commodity Dell hardware and currently works through a terabyte of data every few days.

Turner adds: “It’s not insubstantial, but not really big data. We are looking at bigger engines like Cassandra, Elasticsearch and Spark so it can scale to petabytes, because that is what likely to happen as soon as a project really gets cracking.”

Read more on Big data analytics

Data Center
Data Management