
Democratise internet of things data analytics with virtualisation

Accenture Labs authors contend that data virtualisation can unlock business value from the IoT

Data powers today’s digital business. It holds knowledge about the clients, transactions and products at the heart of a business. It is a resource that organisations compete over to unlock the promise of high-value analytics.

But it remains hard to derive value from enterprise data. Putting data into a usable format requires specialised skills, and the process is made more complex by the need to navigate a variety of data architectures. The creation of data-processing pipelines that bring together data for application use can take several weeks, if not months, to develop.

Typical enterprise data architectures comprise silos where data is distributed across a variety of data stores. For example, a retailer may have product catalogue, point-of-sale, e-commerce, supply chain and customer data, all in different stores.

Traditionally, data silos are created for organisational reasons where data is owned by different groups. Today, these silos are exacerbated by organisations' need to adopt technologies such as data lakes, NoSQL and in-memory to accommodate a variety of data formats and to meet bespoke performance requirements, such as the ability to scale the speed and volume at which to capture and process data.

Data-processing pipelines implement the extract-transform-load (ETL) functions that are needed to prepare data for application needs. For example, to implement daily reports that summarise sales numbers from both point-of-sale and e-commerce sites requires building an interface to extract the data from its respective stores, transform it into a consistent format for querying and analysis, and then load it into the report application.
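
The daily sales report described above can be sketched as a tiny extract-transform-load pipeline. This is only an illustration: the sample feeds, their field names and the pence-based amounts are assumptions made for the example, not details of any real retailer's stores.

```python
import csv
import io
import json
from collections import defaultdict
from decimal import Decimal

# Hypothetical sample extracts; real stores and schemas would differ.
POS_CSV = "date,store,amount_pence\n2016-01-04,Leeds,1250\n2016-01-04,York,800\n"
ECOM_JSON = '[{"day": "2016-01-04", "total": "19.99"}, {"day": "2016-01-05", "total": "5.00"}]'


def extract_pos(text):
    """Extract: read point-of-sale rows from a CSV export."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_ecom(text):
    """Extract: read e-commerce orders from a JSON export."""
    return json.loads(text)


def transform(pos_rows, ecom_rows):
    """Transform: normalise both feeds to (date, channel, amount in pence)."""
    unified = [(r["date"], "pos", int(r["amount_pence"])) for r in pos_rows]
    unified += [(r["day"], "ecom", int(Decimal(r["total"]) * 100)) for r in ecom_rows]
    return unified


def load(rows):
    """Load: aggregate into the structure the report application reads."""
    report = defaultdict(int)
    for date, _channel, pence in rows:
        report[date] += pence
    return dict(report)


daily_sales = load(transform(extract_pos(POS_CSV), extract_ecom(ECOM_JSON)))
print(daily_sales)
```

Even in this toy form, each new source means another extract function and another branch in the transform step, which is where the development effort accumulates.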

Pipeline development is time-consuming and requires expertise from dedicated teams of IT and business intelligence (BI) personnel.

Adding new use cases requires modifying existing pipelines. For example, enhancing a sales report with customer demographics from a recent marketing campaign requires integrating the marketing and customer data. The implementation overhead means new hypotheses may be too costly to test, and even then the results may arrive too late to act upon.

Data lakes

Even the promise of the latest Hadoop-based data lake fails to fully address this need to simplify and accelerate data use. Data lakes strive to address this issue by bringing together a variety of data within the same system, but this approach is limited. Taking all sources on board is not feasible for compliance, security, organisational and performance reasons. The data lake is just another silo that needs to be integrated.

Also, the act of loading data within a data lake or into another staging area creates a copy of the original data that then needs to be synchronised. Otherwise, these versions evolve separately, presenting stale copies and multiple versions of data upon which decisions are made.

The result is that data's promise to produce new insights and support real-time decision-making is unfulfilled. Leveraging data across sources is costly and complex, and remains within the realm of dedicated specialists and IT projects.

Data virtualisation democratises data access

The success of a digital business relies on its ability to fully exploit its data assets. It requires the ability to quickly create new data pipelines, to add new sources, and to enable direct access to the original source of truth. Agility is key to proving a new hypothesis and capturing new opportunities – it must not take months or weeks, but needs to happen in days or even hours.

Most critical is the ability to open up data usage beyond IT and BI specialists, and to democratise all these features for data consumers, which include data scientists, business analysts and partners.

Data virtualisation

One approach that addresses this aim is data virtualisation. Virtualised or “logical” views present data from across data stores and representations in a single, query-able format. This abstracts away the details of the specific data stores to be accessed through a set of pre-built integrations and transformations. Creation of these views is accessible and uses standard SQL, a language more commonly understood by data scientists, BI analysts and IT practitioners than many of today’s architecture-specific protocols.

These views simplify data consumption by presenting data across sources in relation to each other as if in a single relational table. For example, a virtual view that joins up sales transactions with customer profile and loyalty information from separate data stores enables applications to pull together a particular customer’s profile information, as well as their past orders, by making a single query instead of having to deal with multiple data sources and formats.
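
As a rough illustration of the customer example above, the sketch below uses SQLite from Python's standard library as a stand-in for a virtualisation layer. Three attached in-memory databases play the role of separate stores, and a temporary view joins them. The table names, columns and the `customer_360` view name are assumptions for the example; a real product would connect to heterogeneous stores rather than three SQLite databases.

```python
import sqlite3

# Three separate in-memory databases stand in for three separate data stores.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS loyalty")

conn.execute("CREATE TABLE main.orders (order_id INTEGER, customer_id INTEGER, total REAL)")
conn.execute("CREATE TABLE crm.profiles (customer_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE loyalty.points (customer_id INTEGER, balance INTEGER)")

conn.executemany("INSERT INTO main.orders VALUES (?, ?, ?)",
                 [(1, 7, 20.0), (2, 7, 5.5), (3, 8, 9.0)])
conn.executemany("INSERT INTO crm.profiles VALUES (?, ?)", [(7, "Ada"), (8, "Grace")])
conn.executemany("INSERT INTO loyalty.points VALUES (?, ?)", [(7, 120), (8, 40)])

# A TEMP view may reference tables in any attached database, so it can act
# as the single logical table that applications query.
conn.execute("""
    CREATE TEMP VIEW customer_360 AS
    SELECT p.customer_id, p.name, l.balance, o.order_id, o.total
    FROM crm.profiles AS p
    JOIN loyalty.points AS l ON l.customer_id = p.customer_id
    JOIN main.orders AS o ON o.customer_id = p.customer_id
""")

# One query returns profile, loyalty and order data for a single customer.
rows = conn.execute(
    "SELECT name, balance, order_id, total FROM customer_360 "
    "WHERE customer_id = ? ORDER BY order_id",
    (7,),
).fetchall()
print(rows)
```

The application issues one SQL statement against the view and never needs to know which store holds which column.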

The key feature of the data virtualisation layer is its ability to generate a query execution plan and optimise its processing for incoming application queries as they arrive, based on the details in the logical view. For each query, the virtualisation layer generates an execution plan that describes the order and mechanism to retrieve data and normalise its format across disparate sources, schemas and types (for example, CSV, JSON, Parquet, XML).
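
To make the idea of an execution plan more concrete, here is a deliberately simplified sketch. It builds one plan step per source, reads CSV and JSON payloads, and normalises every row to the same shape before filtering. The source names, the `temp_c` column and the plan structure are all invented for illustration; real planners also reorder joins, estimate costs and handle many more formats.

```python
import csv
import io
import json

# Hypothetical sources: name -> (format, raw payload).
SOURCES = {
    "sensors_csv": ("csv", "device,temp_c\na1,21.5\nb2,30.0\n"),
    "sensors_json": ("json", '[{"device": "c3", "temp_c": 35.2}]'),
}


def plan(min_temp):
    """Build one step per source, recording how to read it and what to filter."""
    return [
        {"source": name, "format": fmt, "filter": ("temp_c", ">=", min_temp)}
        for name, (fmt, _payload) in SOURCES.items()
    ]


def read(fmt, payload):
    """Read a payload and normalise each row to {'device': str, 'temp_c': float}."""
    if fmt == "csv":
        raw = csv.DictReader(io.StringIO(payload))
    elif fmt == "json":
        raw = json.loads(payload)
    else:
        raise ValueError(f"unsupported format: {fmt}")
    for record in raw:
        yield {"device": str(record["device"]), "temp_c": float(record["temp_c"])}


def execute(steps):
    """Run the steps in order and union the normalised, filtered rows."""
    out = []
    for step in steps:
        fmt, payload = SOURCES[step["source"]]
        column, _op, bound = step["filter"]  # only '>=' is supported in this sketch
        out.extend(row for row in read(fmt, payload) if row[column] >= bound)
    return out


result = execute(plan(30.0))
print(result)
```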


Unlike full-blown ETL processes, the data stays in place within the source systems and is accessed dynamically when a query is submitted to the abstraction, which presents a single source of truth. No data is accessed when a virtual view is created or published. Instead, data is processed each time the view is queried, or synchronised at specified trigger intervals, ensuring the freshness of data for business decisions.
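
The difference between defining a view and querying it can be shown in a few lines. In this sketch, `LogicalView` is a hypothetical class, not part of any product: creating it stores only a reference to a fetch function, and the source is read each time `query` is called, so changes made to the source after the view was created are still visible.

```python
class LogicalView:
    """Hypothetical view: holds a way to fetch data, not the data itself."""

    def __init__(self, fetch):
        self._fetch = fetch  # nothing is read from the source here

    def query(self, predicate=lambda row: True):
        # The source is read at query time, so results reflect its current state.
        return [row for row in self._fetch() if predicate(row)]


# A list stands in for a source system; a counter tracks how often it is read.
source_table = [{"sku": "A", "stock": 3}]
stats = {"reads": 0}


def fetch_stock():
    stats["reads"] += 1
    return list(source_table)


view = LogicalView(fetch_stock)
reads_after_create = stats["reads"]  # still zero: defining the view read nothing

source_table.append({"sku": "B", "stock": 0})  # the source changes after view creation

in_stock = view.query(lambda row: row["stock"] > 0)
all_rows = view.query()
print(reads_after_create, in_stock, len(all_rows), stats["reads"])
```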

Data virtualisation solutions have been around for more than 15 years, and there are both commercial and open-source options. Typical functionality includes query planning and optimisation algorithms, SQL predicate push down, and a pre-built set of connectors for integrating a variety of data sources out of the box. And, of course, it includes standard enterprise security features, such as authentication, authorisation and role-based access control.
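
Predicate push down is easiest to see by counting rows. The sketch below uses an in-memory SQLite table as a stand-in for a remote source, with made-up sensor readings. Without push down, every row crosses the connection and is filtered afterwards; with push down, the filter travels to the source as a WHERE clause and only matching rows come back.

```python
import sqlite3

# An in-memory SQLite table stands in for a remote source system.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE readings (device TEXT, temp_c REAL)")
source.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a1", 21.5), ("b2", 30.0), ("c3", 35.2)],
)


def fetch_without_pushdown(min_temp):
    """Pull every row across, then filter in the virtualisation layer."""
    transferred = source.execute(
        "SELECT device, temp_c FROM readings ORDER BY device"
    ).fetchall()
    return len(transferred), [row for row in transferred if row[1] >= min_temp]


def fetch_with_pushdown(min_temp):
    """Send the predicate to the source so only matching rows are transferred."""
    transferred = source.execute(
        "SELECT device, temp_c FROM readings WHERE temp_c >= ? ORDER BY device",
        (min_temp,),
    ).fetchall()
    return len(transferred), transferred


naive_count, naive_rows = fetch_without_pushdown(30.0)
pushed_count, pushed_rows = fetch_with_pushdown(30.0)
print(naive_count, pushed_count, naive_rows == pushed_rows)
```

Both functions return the same answer, but the second moves fewer rows, which matters far more when the source holds millions of readings rather than three.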

Traditionally, when the number of data stores was relatively constant, pipeline creation and integration could be done manually. But today’s data environments are more complex than ever, with disparate organisations, architectures and types of data. As the importance of digital grows as a means of differentiating a business through services powered by analytics and information, so does the importance of agility and interoperability to take advantage of new data sources, over which organisations compete.

Data virtualisation is a vital component of modern data architecture and offers a standard interface that scales pipeline creation by simplifying data integration and access. Virtualisation enables scalable self-service data exploration and analysis that democratises data use by analytics, business and IT users. 

Case study: Data virtualisation at scale of the IoT

We recently tested the viability of data virtualisation for managing data in digital platforms for the internet of things (IoT).

Accenture Technology Labs, along with Accenture Resources, has been creating digital platforms for the IoT that handle the large volumes of data created by a huge number of sensors and devices, which must be processed both in batch and in real time. At its heart, the platform is a version of a Lambda architecture that comprises a number of data stores that handle the fast writes of streaming real-time data, the comprehensiveness of the massive quantity of data, and the service needs for application consumption.

For our first step in using data virtualisation, we focused on a version of Lambda that uses Cassandra for serving and capturing streaming data from sensors and devices, and Amazon S3 for comprehensive storage. We also selected Metanautix Quest, whose roots stem from Google’s Dremel project, as the data virtualisation solution to query these large datasets at the IoT scale because of its ability to scale both horizontally and vertically.

We had previously implemented manual pipelines for migrating, tracking and querying data across data stores. Migration followed a usual tiered storage approach, directing newly arrived hot data to the write-optimised operational store Cassandra, and later compressing and migrating that data into S3 for archival after a predetermined period of months.
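
The tiered storage approach can be sketched as follows. Plain dictionaries stand in for Cassandra and S3, and the 90-day cut-off, key naming and record layout are assumptions chosen for the example; the article does not describe the actual migration code.

```python
import gzip
import json
from datetime import datetime, timedelta

hot_store = {}  # stand-in for the write-optimised operational store
archive = {}    # stand-in for object storage: key -> gzip-compressed bytes


def write_reading(device, ts, value):
    """Newly arrived data always lands in the hot store."""
    hot_store[(device, ts)] = value


def migrate(now, max_age):
    """Compress readings older than max_age into the archive and drop them from hot."""
    cutoff = now - max_age
    old = {key: value for key, value in hot_store.items() if key[1] < cutoff}
    if not old:
        return 0
    records = [
        {"device": device, "ts": ts.isoformat(), "value": value}
        for (device, ts), value in sorted(old.items())
    ]
    key = f"readings/before-{cutoff:%Y-%m-%d}.json.gz"  # hypothetical key scheme
    archive[key] = gzip.compress(json.dumps(records).encode("utf-8"))
    for stale_key in old:
        del hot_store[stale_key]
    return len(old)


write_reading("a1", datetime(2016, 1, 1), 20.1)
write_reading("a1", datetime(2016, 5, 30), 22.4)
moved = migrate(now=datetime(2016, 6, 1), max_age=timedelta(days=90))

archived = json.loads(gzip.decompress(next(iter(archive.values()))))
print(moved, len(hot_store), archived)
```

Once data has been split this way, any query that spans the cut-off has to read from both places, which is the problem the next paragraphs address.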

Querying the data either meant connecting to a single source, in which case results were limited to the data held there (for example, only what was in Cassandra), or required creating pipelines with custom logic to fetch and ingest data from each store and then join the results.

Logical views

With the introduction of Metanautix, we started by replacing our existing pipelines with data virtualisation’s logical views. Out of the box, we used Metanautix’s existing connector and compression handling to access S3, and its trigger-based actions to set up the migration. At the beginning, there was no out-of-the-box connector for Cassandra, so we used a user-defined function (UDF) to create a call-out to store and fetch the data.

To a user, the complexity of the multiple data stores and formats sat behind a logical view that would fetch and unite the results from all sources. The consumer could query the data regardless of whether it resided on Cassandra or S3.
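
From the consumer's side, the effect resembles the sketch below. A list stands in for recent rows in Cassandra and a gzip-compressed JSON blob stands in for an archived object in S3. The function names and record layout are hypothetical, and this is not how Metanautix implements its views; it only shows the shape of the abstraction.

```python
import gzip
import json

# Recent rows, as they might sit in the operational store.
hot_rows = [{"device": "a1", "ts": "2016-05-30T00:00:00", "value": 22.4}]

# Older rows, compressed as they might sit in archival object storage.
archived_blob = gzip.compress(
    json.dumps([{"device": "a1", "ts": "2016-01-01T00:00:00", "value": 20.1}]).encode("utf-8")
)


def scan_hot():
    """Read rows from the hot store stand-in."""
    return iter(hot_rows)


def scan_archive():
    """Decompress and read rows from the archive stand-in."""
    return iter(json.loads(gzip.decompress(archived_blob)))


def readings_view(device):
    """Unite rows from both stores; the caller never sees where each row lives."""
    rows = [
        row
        for scan in (scan_hot, scan_archive)
        for row in scan()
        if row["device"] == device
    ]
    return sorted(rows, key=lambda row: row["ts"])


history = readings_view("a1")
print([row["value"] for row in history])
```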

The ease of creating the pipelines allowed this use case to be set up in a matter of days, with the most complexity lying in the custom UDF. This set-up was a great way to quickly evaluate the ease of use and for a user to see the relations across the data.

But to scale it, we wanted a more optimised solution than a UDF and worked with Metanautix, which created a new native connector for Cassandra. This native connector allowed logical views to be configured within minutes using standard SQL, which is familiar to a broad range of data users.

Performance time

The native connector also brought significant improvements in performance time. We were able to achieve reads of up to 130,000 rows per second. We reached this limit not because of Metanautix, but because we were taxing our single-node Cassandra instance, which was running on an Amazon Elastic Compute Cloud c3.2xlarge instance.

We also added Postgres and MySQL as alternative data stores to introduce relational stores for traditional BI reporting to the mix and to add new data to enhance the sensor readings in Cassandra and S3. With existing connectors for Postgres and MySQL, the change was again straightforward and now unlocked that data for use across the IoT platform.

Our first steps with data virtualisation for IoT have been very promising. IoT presents a use case that must take advantage of new data as it becomes available, which requires easy onboarding of the associated data store technologies and shielding users from their complexity. Data virtualisation is a key enabler of this capability and is required to scale the application of data.

Srinivas Yelisetty and Teresa Tung work in Accenture’s Technology Labs
