This is a guest blog by Michael Hausenblas, Chief Data Engineer at MapR Technologies.
According to Gartner’s Hype Cycle, the Internet of Things (IoT) is supposed to peak in 2014. This sounds like a good time to look into opportunities and challenges for data processing in the context of IoT.
So, what is IoT in a nutshell? It is the concept of a ubiquitous network of devices that facilitates communication between the devices themselves, as well as between the devices and human end users. We can group use cases by the scope of an application: Personal IoT, where the focus is on a single person (such as the quantified self); Group IoT, which covers a small group of people (e.g. the smart home); Community IoT, usually in the context of public infrastructure such as smart cities; and finally Industrial IoT, one of the most mature areas of IoT, dealing with apps either within an organization (smart factory) or between organizations (such as a retailer supply chain).
It is fair to say that the data IoT devices generate lends itself to the ‘Big Data approach’, that is, using scale-out techniques on commodity hardware in a schema-on-read fashion, along with community-defined interfaces, such as Hadoop’s HDFS or the Spark API. Now, why is that so? Well, in order to develop a full-blown IoT application you need to be able to capture and store all the incoming sensor data to build up the historical references (the volume aspect of Big Data). Then, there are dozens of data formats in use in the IoT world and none of the sensor data is relational per se (the variety aspect of Big Data). Last but not least, many devices generate data at a high rate, and in an IoT context we usually have to cope with data streams (the velocity aspect of Big Data).
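To make the schema-on-read idea a bit more concrete, here is a minimal sketch in Python. It assumes a toy in-memory store and two made-up record formats (JSON and comma-separated); the names `RAW_STORE`, `read_with_schema` and `mean_temperature` are hypothetical and only serve as illustration. The point is that records are kept exactly as they arrived and a structure is imposed only when they are read.

```python
import json

# Raw records are stored exactly as they arrived; no schema is enforced on write.
# The formats and field names below are made up for illustration.
RAW_STORE = [
    '{"device": "thermo-1", "ts": 1400000000, "temp_c": 21.5}',
    'thermo-2,1400000060,22.0',
    '{"device": "thermo-1", "ts": 1400000120, "temp_c": 21.7, "battery": 0.93}',
]


def read_with_schema(raw):
    """Interpret one raw record as a (device, ts, temp_c) tuple at read time."""
    raw = raw.strip()
    if raw.startswith("{"):
        # JSON record: pick the fields this query cares about, ignore the rest.
        rec = json.loads(raw)
        return rec["device"], int(rec["ts"]), float(rec["temp_c"])
    # Otherwise assume a comma-separated record: device,ts,temp_c
    device, ts, temp_c = raw.split(",")
    return device, int(ts), float(temp_c)


def mean_temperature(store, device):
    """Average temperature for one device, or None if it has no readings."""
    temps = [t for d, _, t in map(read_with_schema, store) if d == device]
    return sum(temps) / len(temps) if temps else None
```

Calling `mean_temperature(RAW_STORE, "thermo-1")` averages the two JSON readings, while the extra `battery` field is simply ignored by this particular query. A different query could apply a different schema to the same raw data without any migration step.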
Before we go into architectural considerations, let’s have a look at common requirements for an IoT data processing platform:
· Native raw data support. Both in terms of data ingestion and processing, the platform should be able to natively deal with IoT data.
· Support for a variety of workload types. IoT applications usually require that the platform support stream processing from the get-go and handle low-latency queries against semi-structured data items, at scale.
· Business continuity. Commercial IoT applications usually come with SLAs in terms of uptime, latency and disaster recovery metrics (RTO/RPO). Hence, the platform should be able to guarantee those SLAs, innately. This is especially critical in the context of IoT applications in domains such as healthcare, where people’s lives are at stake.
· Security & Privacy. The platform must ensure secure operation. Currently, this is considered challenging to achieve end to end. Last but not least, the platform must guarantee the privacy of its users, from data provenance support through data encryption to masking.
Now, we come back to the architectural considerations. While there are no widely accepted reference architectures yet, a number of proposals exist. All of them have one thing in common, though, which can be summarised in the term polyglot processing. This is the concept of combining multiple processing modes (from batch through stream to low-latency queries) within a platform; two of the more popular and well-understood example architectures in this context are Nathan Marz’s Lambda Architecture and Jay Kreps’ Kappa Architecture.
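As a rough illustration of polyglot processing, the following Python sketch mimics the general shape of the Lambda Architecture with a per-device event counter: a batch layer recomputes a view from the full master dataset, a speed layer keeps an incremental view of recent events, and queries merge both. The class and method names are hypothetical, everything lives in memory, and the hand-over between batch and speed layer is far simpler than in a real deployment, so treat it as a sketch of the idea rather than a reference implementation.

```python
from collections import defaultdict


class LambdaSketch:
    """Toy in-memory model of batch, speed and serving layers (illustrative only)."""

    def __init__(self):
        self.master = []                    # append-only master dataset
        self.batch_view = {}                # view precomputed by the batch layer
        self.speed_view = defaultdict(int)  # events seen since the last batch run

    def ingest(self, device, count=1):
        """Store the raw event and update the real-time view incrementally."""
        self.master.append((device, count))
        self.speed_view[device] += count

    def run_batch(self):
        """Recompute the batch view from scratch over the whole master dataset."""
        view = defaultdict(int)
        for device, count in self.master:
            view[device] += count
        self.batch_view = dict(view)
        # The batch view now covers everything, so the speed view can be reset.
        self.speed_view.clear()

    def query(self, device):
        """Serve a low-latency answer by merging the batch and speed views."""
        return self.batch_view.get(device, 0) + self.speed_view.get(device, 0)
```

For example, after three calls to `ingest("sensor-a")`, `query("sensor-a")` returns 3 from the speed view alone; after `run_batch()` the same answer comes from the batch view, and newly ingested events are again added on top by the speed view. A Kappa-style design would, broadly speaking, drop the separate batch layer and rebuild views by replaying the event log through the stream processor.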
With this we conclude our little excursus into data processing challenges and opportunities in the context of the Internet of Things and we’re looking forward to a lively discussion concerning the requirements and potential reference architectures.
About the author
Michael Hausenblas is the Chief Data Engineer for MapR. His background is in large-scale data integration, the Internet of Things, and web applications.