Cloudera: we've seen the future... and it's mixed data workloads


Cloudera is an interesting company. Interesting in that it bills itself as a data management, analytics and machine learning specialist… three ‘disciplines’ that one might have expected to find in three different firms.

Given this supposed breadth, the firm now welcomes Apache Kudu (as many readers will know, an open source storage engine for fast analytics on fast-moving data), which ships as a generally available component of Cloudera Enterprise 5.10.

What is fast data?

All data is fast really… but we use the term to convey the notion that data is time sensitive, i.e. we don’t want data to reside in the ‘data lake’, where it sits unstructured, full of potential but essentially unused.

As TechTarget defines it, “The term fast data is often associated with self-service BI and in-memory databases.  The concept plays an important role in native cloud applications that require low latency and depend upon the high I/O capability that all-flash or hybrid flash storage arrays provide.”

Kudu simplifies the path to real-time analytics, allowing users to act on data as it happens and so make better business decisions.

Complex lambda architecture (mixed workloads)

“Real-time data analysis has been a challenge for enterprises because it required a complex lambda architecture to merge together real-time stream processing and batch analytics. Kudu eases that architecture with a single storage engine that addresses both needs,” said Charles Zedlewski, senior vice president of products at Cloudera. “The high-demand workloads in place today, which include a growing number of new machine-learning models, can identify cybersecurity threats, predict maintenance issues in the Industrial Internet of Things (IIoT), and bring much more accuracy to all types of online reporting.”
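The simplification Zedlewski describes can be illustrated with a toy sketch: instead of maintaining a separate batch layer and speed layer and merging their results at query time, a single mutable store (the role Kudu plays) accepts upserts from the stream and serves analytic scans directly. This is illustrative Python only, not the Kudu API; the class and method names here are invented for the example.

```python
# Toy illustration of why a mutable storage engine collapses the
# lambda architecture: one store takes streaming upserts AND serves
# analytic scans, so no batch/speed-layer merge step is needed.
# These names are invented for the example; this is not the Kudu API.

class MutableTable:
    """Stand-in for a Kudu-like table: primary-key upserts plus scans."""

    def __init__(self):
        self.rows = {}  # primary key -> row dict

    def upsert(self, key, row):
        # Streaming writes update rows in place (no append-only log
        # that a separate batch job must later reconcile).
        self.rows[key] = row

    def scan(self, predicate=lambda row: True):
        # Analytic reads see the latest data immediately.
        return [row for row in self.rows.values() if predicate(row)]


# "Stream" side: events arrive and update device readings in place.
table = MutableTable()
table.upsert("sensor-1", {"device": "sensor-1", "temp": 20.0})
table.upsert("sensor-2", {"device": "sensor-2", "temp": 31.5})
table.upsert("sensor-1", {"device": "sensor-1", "temp": 22.5})  # update

# "Batch" side: the same table answers the analytic query directly.
hot = table.scan(lambda row: row["temp"] > 30.0)
print([row["device"] for row in hot])  # only sensor-2 exceeds 30.0
```

In a real deployment the stream side would be something like StreamSets Data Collector writing into Kudu, and the scan side an Impala SQL query over the same table.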

Kudu was designed to take advantage of modern hardware, such as solid-state storage and increasingly affordable, abundant RAM.

Further here… we know that Kudu is purpose-built for fast, large-scale analytic scans over rapidly updating data – necessary for time series data, machine data analytics, online reporting and other analytic or operational workloads.

I’ve seen the future… and it’s mixed data workloads

“Incorporating Apache Kudu into CDH will greatly simplify execution of the mixed workloads our customers increasingly utilise once they migrate their enterprise data warehouse and real-time streams to Hadoop. The Cloudera-certified StreamSets Data Collector natively supports Kudu as a plug-and-play dataflow destination, and StreamSets Dataflow Performance Manager helps assure the continuous availability and accuracy of the data flowing into Kudu,” said Arvind Prabhakar, chief technology officer at StreamSets.

For additional background here: back in September 2015, Cloudera announced the public beta release of Apache Kudu and, two months later, donated Kudu to the Apache Software Foundation (ASF) to open it to the broader development community.