IoT Back to Basics, chapter 4: IoT projects risk failure without careful consideration of data management processes and analytics. Their ultimate goal after all is to glean valuable information from data coming from the ‘things’ on the network – sensors and smart devices – in order to act on it.
So I thought I’d look at some of the novel trends in in-memory data processing: in-memory databases as well as data fabrics and data streaming engines.
The use of memory in computing is not new. But while memory is faster than disk by an order of magnitude, it is also an order of magnitude more expensive. That has for the most part left memory relegated to acting as a caching layer, while nearly all of the data is stored on disk. However in recent years, the cost of memory has been falling, making it possible to put far larger datasets in memory for data processing tasks, rather than use it simply as a cache.
It’s not just that it is now possible to store larger datasets in memory for rapid analytics, it is also that it is highly desirable. In the era of IoT, data often streams into the data centre or the cloud – the likes of sensor data from anything from a production line to an oilrig. The faster the organization is able to spot anomalies in that data, the better the quality of predictive maintenance. In-memory technologies are helping firms see those anomalies close to, or in, real-time. Certainly much faster than storing data in a disk-based database and having to move packets of data to a cache for analytics.
I expect take-up of in-memory data processing to accelerate dramatically, as companies come to grips with their data challenges and move beyond more traditional data analytics in the era of IoT. In-memory databases are 10 to 100 times faster than traditional databases, depending on the exact use case. When one considers that some IoT use cases involve the collection, processing and analysis of millions of events per second, you can see why in-memory becomes so much more appealing.
There’s another big advantage with in-memory databases. Traditionally, databases have been geared toward one of two main uses: handling transactions, or enabling rapid analysis of those transactions – analytics. The I/O limitations of disk-based databases meant that those handling transactions would slow down considerably when also being asked to return the results of data queries. That’s why data was often exported from the transactional database into another platform – a data warehouse – where it could more rapidly be analyzed without impacting the performance of the system.
Hybrid operational and analytical databases
With in-memory databases, it’s becoming increasingly common for both operational and analytic workloads to be able to run in memory rather than on disk. With an in-memory database, all (or nearly all) of the data is held in memory, making reads and writes an order of magnitude faster – so much so that both transactional duties and analytic queries can be handled by the same database.
There are a number of in-memory database players vying for what has become an even more lucrative market in the era of IoT. The largest incumbent database vendors such as Oracle, IBM and Microsoft have added in-memory capabilities to their time-tested databases. SAP has spent many millions of dollars educating the market about the benefits of its in-memory HANA database, saying it will drop support for all other third party databases under its enterprise software by 2025. There are also smaller vendors vying for market share such as Actian, Altibase, MemSQL and VoltDB.
Data grids & fabrics
Then there is the in-memory data grid (sometimes known as a data fabric) segment. This is an in-memory technology that you ‘slide’ between the applications and the database, thereby speeding up the applications by keeping frequently-used data in memory. It acts as a large in-memory cache, but using clustering techniques (hence being called an in-memory grid) it’s possible to store vast amounts of data on the grid.
In recent years their role has evolved beyond mere caching. They still speed up applications and reduce the load on the database, and have the advantage of requiring little or no rewriting of applications, or interference with the original database. But now as well as caching, they are being pressed into action as data platforms in their own right: they can be queried (very fast, in comparison with a database), they add another layer of high availability and fault tolerance – possibly across data centers – and they are increasingly being used as a destination for machine learning.
There are data grid offerings from a handful of vendors, amongst them Oracle, IBM, Software AG, Amazon Web Services, Pivotal, Red Hat, Tibco, GigaSpaces, Hazelcast, GridGain Systems and ScaleOut Software.
Data streaming engines
The third category, streaming, is also notable in the context of the Internet of Things. Data streaming involves the rapid ingestion and movement of data from one source to another data store. It employs in-memory techniques to give it the requisite speed. Streaming engines ingest data, potentially filter some of it, and also perform analytics on it. They can raise alerts, help to detect patterns, and start to form a level of understanding of what is actually going on with the data (and hence with the sensors, actuators or systems that are being monitored).
While streaming was largely confined to the lowest-latency environments, such as algorithmic trading in the financial sector, more and more use cases in the IoT space are latency sensitive: e-commerce, advertising, online gaming and gambling, sentiment analysis and more.
There are relatively few vendors with data streaming technology. But they include IBM with Streams, Amazon Web Services’ Kinesis in the cloud, Informatica with its Ultra Messaging Streaming Edition, SAS’ Event Stream Processing (ESP), Impetus Technologies with its StreamAnalytix and also TIBCO, Software AG and SAP (which bought StreamBase Systems, Apama and Aleri, respectively).
Smaller competitors include DataTorrent, which has a stream processing application that sits on a Hadoop cluster and can be used to analyze the data as it streams in, and SQL-based event-processing specialist SQLstream. Another young company is Striim.
In the open source space, Apache Spark Streaming and Apache Storm both offer streaming – most vendors have added support for Spark rather than Storm. But that, as they say, is a story for another day.