Databricks styles itself as an analytics software company with a ‘unified’ approach. The unification in this sense is supposed to suggest that the software can be applied across a variety of datasets, exposed to a variety of programming languages and work with a variety of call methods to extract analytical results from the data itself.
This UAP (as in Unified Analytics Platform) product also does the job of data warehouse, data lake and streaming analytics in one product, rather than three different ones.
So, more unifying unity, essentially… and that unification spans three basic data domains.
- Data warehouse – trusted data that has to be reliable, kept for years – but is difficult to get everyone using
- Data lake – great for storing huge volumes of data, but data scientists are typically the only ones able to use it
- Data that exists in streaming analytics engines – great for working on data mid-stream, but can’t do the jobs of the other products.
It also unifies the data into one place so you can make it easier for data engineering, data science and business teams to all get what they need out of big data.
Ground-level definitions out of the way, what has Databricks been doing to add to unified utopia?
The company has this month announced version 2.3.0 of the Apache Spark open-source cluster-computing framework on Databricks’ Unified Analytics Platform. This means that the company now supports Apache Spark 2.3 within its compute engine, Databricks Runtime 4.0, which is now generally available.
In addition to support for Spark 2.3, Databricks Runtime 4.0 introduces new features including performance optimizations and Machine Learning Model Export, which is intended to simplify production deployments.
“The community continues to expand on Apache Spark’s role as a unified analytics engine for big data and AI. This is a major milestone to introduce the continuous processing mode of Structured Streaming with millisecond low-latency, as well as other features across the project,” said Matei Zaharia, creator of Apache Spark and chief technologist and co-founder of Databricks. “By making these innovations available in the newest version of the Databricks Runtime, Databricks is immediately offering customers a cloud-optimised environment to run Spark 2.3 applications with a complete suite of surrounding tools.”
The Databricks Runtime, built on top of Apache Spark, is the cloud-optimised core of the Databricks Unified Analytics Platform that focuses on making big data and artificial intelligence accessible.
In addition to introducing stream-to-stream joins and extending new functionality to SparkR, Python, MLlib and GraphX, the new release provides a millisecond-latency Continuous Processing mode for Structured Streaming.
In continuous mode, new records are processed immediately upon arrival rather than being collected into micro-batches, reducing latencies to milliseconds and satisfying low-latency requirements.
This means that developers can elect either mode—continuous or micro-batching—depending on their latency requirements to build real-time streaming applications with fault-tolerance and reliability guarantees.
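To make that choice between modes concrete, here is a minimal Python sketch of how a developer might pick a trigger for a Structured Streaming query. The helper names (`choose_trigger`, `start_rate_to_console`) and the 100 ms cut-off are hypothetical and purely illustrative; they are not taken from Databricks or the Spark project. The only Spark API assumed is the `trigger()` call on a PySpark stream writer, which in Spark 2.3 accepts either a `processingTime` or a `continuous` keyword.

```python
# Hypothetical threshold for illustration only; not a Spark or Databricks figure.
CONTINUOUS_CUTOFF_MS = 100


def choose_trigger(max_latency_ms):
    """Return keyword arguments for DataStreamWriter.trigger().

    Tight latency budgets get continuous processing; everything else
    falls back to micro-batching.
    """
    if max_latency_ms < CONTINUOUS_CUTOFF_MS:
        # In continuous mode the interval is how often progress is
        # checkpointed, not how often records are processed.
        return {"continuous": "1 second"}
    # Micro-batch mode: one batch is started per interval.
    return {"processingTime": "{} milliseconds".format(max_latency_ms)}


def start_rate_to_console(spark, max_latency_ms):
    """Start a toy streaming query on an existing Spark 2.3+ session.

    Uses the built-in 'rate' test source and the console sink, with no
    aggregation, so the same query can run under either trigger.
    """
    events = (
        spark.readStream
        .format("rate")
        .option("rowsPerSecond", 10)
        .load()
    )
    return (
        events.writeStream
        .format("console")
        .trigger(**choose_trigger(max_latency_ms))
        .start()
    )
```

Calling `start_rate_to_console(spark, 10)` would request continuous processing, while `start_rate_to_console(spark, 500)` would request half-second micro-batches. The query definition itself is the same in both cases; only the trigger changes.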
The new model export capability also enables data scientists to deploy machine learning models into real-time business processes.
There is, arguably, unification here of many types and at many levels.