DataStax: spreading Marmite on graph data theory

This is a guest post for the Computer Weekly Developer Network written by Patrick McFadin in his role as chief evangelist for Apache Cassandra at ‘commercial Cassandra’ company DataStax.

In my role, I have to travel a lot to spread the word about a popular NoSQL database platform. This travel means I get exposed to lots of new and interesting foods. On my last trip to the UK, I was exposed to Marmite. For those who are in the US, it’s a difficult taste to describe but “malty salty” is the closest term I can think of.

Aside from being an acquired taste – you either love the stuff, or want to kill it with fire and hammers – a little Marmite goes a long way.

Developers, Marmite & discrete objects


DataStax’s McFadin: “Yeah right, British food sure has improved, pass the Marmite.”

So why should developers think about Marmite today? Well, today’s applications have more data available to them, as well as creating more data that can be stored over time. Tools like Hadoop have made it easier to store those huge amounts of data over time, while cloud services from Amazon, Google and Microsoft have reduced the cost of storing data too.

So why do I argue that data should be treated more like Marmite, i.e. spread thinly?

One of the answers is latency. For cloud applications that can be accessed from anywhere, having all your data physically located in one geographic location can contribute to increased response times. Distributing data across multiple locations – now much simpler with cloud – can put data closer to end users, thus cutting out distance latency.

This approach works well with most data types – however, graph data is an exception to this. Graph data concentrates on the relationships that might exist between discrete objects and makes it easier to model those relationships in order to create valuable insights. Looking at links between objects – often termed edges – it’s possible to spot patterns much more easily than when using other methods of data modelling and management.

As more cloud applications look to solve business challenges around topics like fraud detection, social networking, buyer behaviour analysis or personalisation of services, the potential role of graph data grows too. For example, if I like Marmite, what else might I like?
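To make the "what else might I like?" question concrete, here is a minimal sketch in plain Python rather than in any particular graph database. The people, products and the recommend function are all hypothetical, made up for illustration. It treats each "likes" relationship as an edge and walks two hops: from me to the things I like, then to other people who like them, then to what else those people like.

```python
from collections import Counter

# Hypothetical "likes" edges: person -> set of products (illustrative data only).
likes = {
    "patrick": {"marmite", "cheddar"},
    "alice": {"marmite", "cheddar", "crumpets"},
    "bob": {"marmite", "crumpets"},
    "carol": {"jam"},
}


def recommend(person, likes):
    """Suggest items liked by people who share at least one like with `person`."""
    mine = likes[person]
    scores = Counter()
    for other, theirs in likes.items():
        # Skip the person themselves and anyone with no tastes in common.
        if other == person or not (mine & theirs):
            continue
        # Every item the neighbour likes that `person` does not yet like gets a vote.
        for item in theirs - mine:
            scores[item] += 1
    # Most votes first; ties broken alphabetically so the order is deterministic.
    return [item for item, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


print(recommend("patrick", likes))  # prints ['crumpets']
```

A real graph database would express the same traversal as a query over stored vertices and edges, but the shape of the question is the same: follow the relationships rather than scanning rows.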

However, that graph implementation has to be distributed in approach too. Previously, that was difficult to achieve at scale.

Today, more options for running distributed graph database implementations are available. This geographic spread of data – whether it is held in document store, column or graph database format – means that developers can help those applications to scale faster without having to sacrifice ease of management over time.

Turning chaos into order

Alongside this ability to distribute data, developers also have to consider how the app is populated with data. For Internet of Things (IoT) applications, this is particularly obvious as there is sensor data to deal with from hundreds or potentially thousands of devices, or millions if you believe the predictions from Gartner et al. All of those devices have the potential to create and send data back to a central application.

So, what can you do with all of that data?

It has to be processed sequentially in order to be accurate and valuable. With so many small transactions or updates coming through, managing this can be extremely difficult, especially when working with an app infrastructure that can be distributed.

To overcome this, streaming messaging platforms can be used. This might look like a point of centralisation, but tools like Apache Kafka can take a distributed approach to delivering streams of data. When you link this to distributed processing and storage, you can then design for full fault-tolerance across the application.
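One way to see how a stream can be both distributed and sequential is key-based partitioning: every message with the same key lands in the same partition, and order is preserved within that partition. The sketch below is a plain-Python simulation of that idea, not Kafka client code. The device names, the partition count and the use of CRC32 as the hash are assumptions for illustration and do not reflect Kafka's actual partitioner.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # assumed value for illustration


def partition_for(key, num_partitions=NUM_PARTITIONS):
    # CRC32 is a stand-in for any stable hash; it is NOT Kafka's own partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events, num_partitions=NUM_PARTITIONS):
    """Append each (device_id, seq, reading) event to the log for its key's partition."""
    partitions = defaultdict(list)
    for device_id, seq, reading in events:
        pid = partition_for(device_id, num_partitions)
        partitions[pid].append((device_id, seq, reading))
    return partitions


# Hypothetical sensor readings, interleaved as they might arrive from devices.
events = [
    ("sensor-a", 1, 20.1),
    ("sensor-b", 1, 18.4),
    ("sensor-a", 2, 20.3),
    ("sensor-c", 1, 22.0),
    ("sensor-b", 2, 18.6),
    ("sensor-a", 3, 20.2),
]

partitions = route(events)
for pid in sorted(partitions):
    print(pid, partitions[pid])
```

Each partition can then be consumed independently and in parallel, while the readings for any single device still arrive in the order they were sent.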

This approach is useful when looking at real-time services – where applications have to take data and then perform analytic functions close to the transaction taking place. A recommendation engine would be a good example here, as would fraud detection.

Testing times for distributed applications

As developers look at building new cloud applications based on these different approaches to handling data, there is one potential issue that can arise. The move to distributed computing – whether it’s using more public cloud, NoSQL databases, streaming data or a mix of all of them – does mean looking at the testing phase differently. A lot of the existing testing processes and collaboration steps that exist across the software creation process have traditional software products in mind.

In a distributed application, by contrast, a single request can run across multiple sections of the stack. It’s therefore critical to rethink testing processes so that effort is spent on the right tests. This includes looking at performance and failure modes across all the locations where the application might be hosted.

Alongside this, looking at how data is managed in these distributed environments in practice can help avoid some of the issues that can crop up around data being saved and accessed in multiple locations.

Marmite, so good it now comes in lip-salve form

One step to take is to make the testing and monitoring distributed as well. While producing a load on your systems, create random failures in the application stack. Understand how it fails or recovers and use that data to harden your systems. Alongside this, integrating distributed monitoring tools such as Zipkin can help ensure that you have visibility from test to production.
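As a rough illustration of injecting random failures while under load, the sketch below runs a batch of requests against a stand-in service that fails at a configurable rate, then counts how often a simple retry recovers. Everything here is hypothetical: FlakyService, call_with_retry and run_load are names invented for the example, and a real test would target actual components and proper fault-injection tooling rather than an in-process stub.

```python
import random


class FlakyService:
    """Hypothetical stand-in for one component of the application stack."""

    def __init__(self, failure_rate, rng):
        self.failure_rate = failure_rate
        self.rng = rng

    def call(self, request):
        # Inject a failure at random, at the configured rate.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        return f"ok:{request}"


def call_with_retry(service, request, attempts=3):
    """Return (result, attempts_used); result is None if every attempt failed."""
    for attempt in range(1, attempts + 1):
        try:
            return service.call(request), attempt
        except ConnectionError:
            continue
    return None, attempts


def run_load(num_requests=1000, failure_rate=0.2, seed=42):
    """Drive load through the flaky service and record how it fails or recovers."""
    rng = random.Random(seed)  # seeded so a test run can be repeated
    service = FlakyService(failure_rate, rng)
    stats = {"succeeded": 0, "failed": 0, "retries": 0}
    for i in range(num_requests):
        result, attempts_used = call_with_retry(service, i)
        stats["retries"] += attempts_used - 1
        if result is None:
            stats["failed"] += 1
        else:
            stats["succeeded"] += 1
    return stats


print(run_load())
```

The counts that come back are the kind of data you can use to harden the system, for example by tuning retry limits or by deciding where a fallback is needed.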

Taken together, there are many potential opportunities to make use of new and open source components within cloud applications based on distributed computing models. However, turning this potential into production-ready services that can scale requires forward planning, experience in how to design data schemas and thought about how to put these applications together in ways that suit both the end users and the developer.

In order to make the most of these new data models, building up skills will be essential. Getting more experience with NoSQL and streaming – unlike Marmite – is one area where you can’t get enough of a good thing.