Apache Spark grows in popularity as Hadoop-based data lakes fill up

Apache Spark is growing in popularity and finding real-time uses across Europe, including in online betting and on the railways, often running on top of Hadoop

At Sky Betting and Gaming (Sky Bet), promotions are the key to bringing in punters keen to place a stake on a winning team or horse, to try their luck at poker, or to go “eyes down” in a game of bingo.

A well-timed offer of a free bet, or an invitation to join a “no-lose” promotion that guarantees a payout regardless of whether they win or lose, goes a long way with customers, improving the rate at which Sky Bet – which has offices in Rome and Guernsey, as well as the UK – can acquire, engage and retain them. But matching the right offer to the right customer shouldn’t be a game of chance, according to Andy Walton, the company’s head of data.

“We spend a frightening amount each year on free bets and bonuses for our customers, right across our sports betting and gaming products, but especially in gaming. Promotions drive that business, whereas sports betting is more driven by the sporting calendar,” he says. The more accurately that Sky Bet can target promotions, he says, the better the returns it can expect from that spend.

Walton and his team look to open-source big data framework Apache Spark for answers. Right now, he says, the company’s legacy promotions systems fall into two camps. On one hand, there are those that offer promotions in more or less real time, but don’t know a lot about individual customers, “so they can’t make precise decisions about who to offer promotions”. On the other, there are those that know a lot about customers, but are based on batch processing “so you can’t offer the promotion in a timely manner”.

Walton hopes to bridge that gap by building a real-time, rules-based promotion engine in Apache Spark, running on the company’s Cloudera-based Hadoop infrastructure. This will enable Sky Bet to combine in-depth information about individual customers, such as the customer segment to which they belong, the bets or games they prefer, and their overall value to the company, with real-time activity information that shows what they’re doing on the platform at any particular time, “so we can make some data-rich decisions about what promotion to offer to them right now”. Sky Bet's Italian product is launching later this year, supported from the Rome office.
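To make the idea concrete, here is a minimal sketch of how a rules-based engine might join a stored customer profile with a live activity event to pick a promotion. It is plain Python rather than Spark, and it is not Sky Bet's implementation: the class names, fields, rule conditions and promotion codes are all hypothetical, chosen only to mirror the kinds of data Walton describes (segment, preferred product, customer value and current activity).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Every name, field and rule below is a hypothetical illustration,
# not taken from Sky Bet's systems.


@dataclass(frozen=True)
class CustomerProfile:
    """Slow-changing, batch-derived knowledge about a customer."""
    customer_id: str
    segment: str            # e.g. "casual" or "high_value"
    preferred_product: str  # e.g. "poker", "bingo", "football"
    lifetime_value: float


@dataclass(frozen=True)
class ActivityEvent:
    """A real-time event describing what the customer is doing right now."""
    customer_id: str
    product: str
    action: str  # e.g. "login", "bet_placed", "game_lost"


@dataclass(frozen=True)
class Rule:
    """A named condition over (profile, event) and the promotion it unlocks."""
    name: str
    promotion: str
    matches: Callable[[CustomerProfile, ActivityEvent], bool]


# Rules are checked in order; the first match wins.
RULES: List[Rule] = [
    Rule(
        name="high-value customer has just lost a game",
        promotion="free_bet_10",
        matches=lambda p, e: p.segment == "high_value" and e.action == "game_lost",
    ),
    Rule(
        name="customer logs in to their preferred product",
        promotion="deposit_bonus",
        matches=lambda p, e: e.action == "login" and e.product == p.preferred_product,
    ),
]


def choose_promotion(
    profiles: Dict[str, CustomerProfile],
    event: ActivityEvent,
    rules: List[Rule] = RULES,
) -> Optional[str]:
    """Return the first promotion whose rule matches, or None if none applies."""
    profile = profiles.get(event.customer_id)
    if profile is None:
        # Without profile data there is nothing to base a targeted offer on.
        return None
    for rule in rules:
        if rule.matches(profile, event):
            return rule.promotion
    return None
```

In a streaming deployment, the lookup in `profiles` would be replaced by a join between the event stream and a profile table, but the decision logic would keep the same shape: rich, slow-moving customer data on one side, and a fast-moving event on the other.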

Apache Spark, with its Spark Streaming capability (an extension of the core Spark API), wasn’t the only Apache framework Sky Bet looked at. The company also considered Apache Storm and Apache Samza, but settled on Spark partly because of its close integration with Cloudera’s Hadoop distribution, and partly because of the strides that Spark has made in the last year. In other words, Walton and his team know that they’ll be able to tap into a wealth of advice and support for this project, both from Cloudera and from the wider Spark community. “Spark is just better known and more people are using it,” he says.

Spark spikes in popularity

That certainly seems to be true. In November last year, Cloudera product marketing manager Alex Gutow blogged that Spark had become one of the most popular Apache Software Foundation (ASF) projects and was attracting 50% more activity than the core Apache Hadoop project itself. And at the Spark Summit in June this year, an update by executives from Databricks, the company behind Spark, put the number of contributors to the project at more than 1,000.

How that translates into enterprise deployments is rather less clear: most companies are tight-lipped about their explorations of Apache Spark and the technology itself is only a little over three years old in any case. What’s clear is that Spark Streaming’s ability to process real-time data is a big driver of the framework’s overall success: according to Gutow’s blog, it’s already being used by financial services companies for fraud detection, by healthcare companies for predicting sepsis occurrences in patients, in retail for managing inventory and in advertising to assess the performance of individual online ads in real time.

Spark 2.0, announced in late July after some months of tech previews, adds a new Structured Streaming API that brings interactive SQL querying to the Spark Streaming library. This offers the promise of integrating real-time and batch processing far more closely. For example, developers could add real-time processing to update frequently queried tables on a more continuous basis, or add real-time extensions to batch jobs. Previously, it was necessary to keep batch and real-time processing pretty separate. Now, with a single API, they can be united in a single process.

Spark requires storage

But there’s still a major role for Hadoop in all this. An important aspect of Spark deployment is that the technology does not provide a distributed storage system. That’s key, because distributed storage is what allows vast, multi-petabyte datasets to be stored across clusters of low-cost commodity servers. So any company wanting to use Spark must also implement a scalable and reliable information storage layer – which, in many cases, is proving to be the Hadoop Distributed File System (HDFS).

“Spark adoption is proceeding aggressively in the Hadoop space”, says a recent report by Carl Olofson, an analyst at IT market research company IDC. In a March 2016 survey of more than 200 IT professionals, just 37% of Hadoop users indicated that they use only MapReduce or its related processing techniques (such as Hive, for example), to handle data in their Hadoop implementations, while more than half (53%) are using Spark. “Most of those started with MapReduce, but are now adopting Spark. The rest are migrating older workloads to Spark, or using Spark exclusively,” Olofson says.

“While there has been every indication in the Hadoop community that Spark is the preferred analytic environment for managing Hadoop data, these results suggest that users are moving more rapidly to Spark than is common for an emerging technology. This is significant, because Spark is still rapidly evolving, and so committing to Spark involves accepting a considerable degree of ongoing rework,” he says, adding, “the popularity of Spark suggests that MapReduce, as the primary vehicle for managing Hadoop data, is on the decline.”

Either way, many large multinational firms that have devoted time and money to establishing Hadoop-based “data lakes” view Apache Spark as a way to keep milking those previous investments.

Rick Farnell is co-founder and senior vice-president of international operations at Think Big Analytics, a big data systems integrator that was acquired by data warehousing company Teradata in September 2014. Think Big’s customers include Siemens and disk drive manufacturer HGST (part of Western Digital).

Eliano Marques, the firm’s principal data scientist, describes a recent Spark-on-Hadoop project carried out on behalf of a European railway network operator. “We were focusing on collecting data from switches in the network that close and open to allow trains to pass along the correct route. Every day, that company generates millions of data points from trains crossing switches, relating to the speed, age and weight of trains, as well as data relating to the switches themselves: how often they open and close, how long it takes them to do so, the angles they reach.”

Using Spark on the company’s Hadoop infrastructure, Think Big was able to demonstrate how that information might be used to predict switch failures. “If you’re able to make sense of that data, you can go back to the operations team with a report that shows exactly which switches are likely to fail in, say, the next 48 hours. A maintenance team can then be sent out to fix any problems. Okay, there will sometimes be false reports and a cost related to that, but if you’re correct in the prediction with sufficient frequency, there’s a huge payback to be had.”
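The article does not say how Think Big modelled switch failures, so the following is only an illustrative sketch of the general idea, using a simple statistical threshold in place of whatever model the project actually used. It flags a switch when its latest closing time sits far above its own historical average. The function name, the input layout and the threshold of three standard deviations are all assumptions made for the example.

```python
from statistics import mean, stdev
from typing import Dict, List


def flag_at_risk_switches(
    history: Dict[str, List[float]],
    latest: Dict[str, float],
    z_threshold: float = 3.0,
) -> List[str]:
    """Return IDs of switches whose latest closing time is unusually slow.

    history: switch ID -> past closing times in seconds (hypothetical layout)
    latest:  switch ID -> most recent closing time in seconds
    A switch is flagged when its latest reading is more than `z_threshold`
    sample standard deviations above its own historical mean.
    """
    flagged = []
    for switch_id, times in history.items():
        # Need a latest reading and at least two past readings for a stdev.
        if switch_id not in latest or len(times) < 2:
            continue
        sigma = stdev(times)
        if sigma == 0:
            # No historical variation: a z-score is undefined, so skip.
            continue
        z_score = (latest[switch_id] - mean(times)) / sigma
        if z_score > z_threshold:
            flagged.append(switch_id)
    return sorted(flagged)
```

The threshold embodies the trade-off Marques describes: lowering it catches more genuine failures but sends maintenance teams out on more false reports, while raising it does the reverse.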

Meanwhile, at Sky Bet, Andy Walton is hoping for a payback, too, but stresses the company’s roll-out of its Spark-based real-time promotions engine will be iterative. “What we’re trying to establish is the capabilities that the platform needs to deliver, generically, and then the use cases that will be of most value to the business,” he says.

“We’re picking a path through in which we identify, say, three capabilities that we can use to unlock a given use case. If we introduce a fourth and a fifth capability, we can unlock further use cases, and so on. That will mean that as we iteratively build this platform and its capabilities, we can deliver new value every step of the way. It’s not really a project, as such – it’s more an ongoing programme of work, but I reckon over time we’ll be delivering incremental value faster and faster.”

The company has already done a lot of work on developing its skills base in preparation, according to Sky Bet data team tester Alex Rolls. With the implementation of the Hadoop platform, it has established a good base of knowledge in managing distributed systems and Linux skills. From a development perspective, Spark requires good Scala skills, he adds, “but we’re finding that Java developers make the jump to Scala fairly easily.” Experience of messaging systems, such as RabbitMQ and Kafka, is also useful.

While Sky Bet is targeting promotions to begin with, Walton can already see that the ability to process real-time events using Spark Streaming could also be useful in detecting fraud and handling customer support issues. “From a Hadoop perspective, this is a new exploitation phase for us. We’re still early on in our work with Spark, and it won’t be without its challenges, but we’re pretty excited about what we can do with it.”  
