Hazelcast: the rise of the 'data worker' & the In-Memory Data Grid

Open source in-memory data grid (IMDG) company Hazelcast has integrated with Apache Spark, the open source data processing engine.

What is an IMDG?

But back to basics for a moment, what is an IMDG anyway, really?

Essentially then, in an IMDG we see data distributed (that's why it's called a grid, obviously) and stored across multiple servers, each of which must be in an 'active' state so that the data itself resides in main memory. The 'data model' in deployment here is most typically object-oriented and non-relational.

So then, an IMDG is all about scalability (you can reduce or increase the number of servers in the grid) and main memory access (for speed and performance) for what are, these days, very typically big data analytics functions.
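To make the idea concrete, here is a minimal sketch (not taken from Hazelcast's documentation; the class name, map name and values are illustrative) of what the grid looks like from Java code, assuming the core Hazelcast API: two embedded members form a cluster, and a distributed map's entries are partitioned across their main memory rather than held on any single server.

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class GridSketch {
    public static void main(String[] args) {
        // Start two embedded members; they discover each other and
        // form a cluster, partitioning data between them.
        HazelcastInstance member1 = Hazelcast.newHazelcastInstance();
        HazelcastInstance member2 = Hazelcast.newHazelcastInstance();

        // An IMap is a distributed map: entries live in the main
        // memory of the grid, spread across the active members.
        IMap<String, Integer> scores = member1.getMap("scores");
        scores.put("alice", 42);

        // The same entry is visible from the other member,
        // and the grid scales by adding or removing members.
        IMap<String, Integer> sameMap = member2.getMap("scores");
        System.out.println(sameMap.get("alice"));

        Hazelcast.shutdownAll();
    }
}
```

Adding a third member would simply trigger a rebalance of the partitions; the application code above would not change.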

Turbo access to iterative data

Hazelcast has blended its IMDG approach with Apache Spark in a kind of IMDG booster turbo approach that is designed to enable data workers to execute streaming, machine learning or SQL workloads which require fast iterative access to datasets.

The firm argues that by combining the two technologies, data developers can now go beyond “the historical limitations” (so says Hazelcast) of a single Java Virtual Machine (JVM).

Resilient Distributed Dataset (RDD)

One of the key driving forces behind the widespread developer adoption of Apache Spark is its easy-to-use Application Programming Interfaces (APIs) for operating on large datasets, one of which is the Resilient Distributed Dataset (RDD). At its core, an RDD is an immutable distributed collection of elements of data, partitioned across nodes in a cluster to provide fault tolerance and parallel access to data.
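Those two properties — immutability and partitioning — show up directly in Spark's Java API. As a rough sketch (the app name, numbers and partition count are illustrative, running on a local master rather than a real cluster): `parallelize()` splits a collection into partitions, and transformations like `map()` never mutate an RDD, they return a new one.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class RddSketch {
    public static void main(String[] args) {
        // local[2] simulates a two-core cluster for illustration.
        SparkConf conf = new SparkConf()
                .setAppName("rdd-sketch")
                .setMaster("local[2]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // parallelize() distributes the data into 2 partitions,
            // which is what enables parallel access.
            JavaRDD<Integer> numbers =
                    sc.parallelize(Arrays.asList(1, 2, 3, 4, 5), 2);

            // RDDs are immutable: map() leaves 'numbers' untouched
            // and produces a new RDD of squares.
            JavaRDD<Integer> squares = numbers.map(n -> n * n);

            System.out.println(squares.getNumPartitions());
            System.out.println(squares.reduce(Integer::sum));
        }
    }
}
```

Because each transformation records its lineage rather than copying data, a lost partition can be recomputed — that is where the "resilient" part of the name comes from.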

“Both of these key features are a natural fit with Hazelcast, as they’re essential building blocks for any performant distributed compute capability,” says the firm.

The data worker ‘must-have’

Greg Luck, CEO of Hazelcast, has said that the feedback his team gets from the community is that any big data solution needs to be able to distribute processing and storage across machines whilst maintaining a flexible and convenient programming interface.

“Without these functionalities, it becomes impossible to build enterprise applications which are expected to process more and more data,” said Luck.

Demo application

To demonstrate the potential of integrating Apache Spark into a Hazelcast IMDG application, BetLeopard, an example sports betting application, has been developed. Put simply, BetLeopard is a bet engine that scales across multiple JVMs by sharing events via Hazelcast IMDG partitions, with a query engine that uses Spark to provide real-time risk and analytics for future events.

The combination of Hazelcast’s advanced in-memory compute capabilities and distributed store, alongside Spark’s query and analytics capabilities creates a powerful gaming solution. In addition, the integration provides a solid base for the next generation of JVM applications.

The code is available at https://github.com/hazelcast/betleopard.