Databricks hoists mainsail on flagship open source projects

Data and AI company Databricks has announced several contributions to popular data and AI open source projects including Delta Lake, MLflow and Apache Spark.

At the Data + AI Summit, the largest gathering of the open source data and AI community, Databricks announced that the company will contribute all features and enhancements it has made to Delta Lake to the Linux Foundation and open source all Delta Lake APIs as part of the Delta Lake 2.0 release.

In addition, the company announced MLflow 2.0, which includes MLflow Pipelines, a new feature intended to speed up ML model deployment.

The company also introduced Spark Connect, to enable the use of Spark on virtually any device, and Project Lightspeed, the next generation of the Spark Structured Streaming engine for data streaming on the lakehouse.

Delta Lake 2.0 is intended to improve query performance for Delta Lake users and enable everyone to build a data lakehouse on open standards.

With this contribution, Databricks customers and the open source community will benefit from the full functionality of Delta Lake 2.0. The Delta Lake 2.0 Release Candidate is now available and is expected to be fully released later this year.

The Delta Lake ecosystem is a community of over 6,400 members, with developers from more than 70 contributing organisations.

“From the beginning, Databricks has been committed to open standards and the open source community. We have created, contributed to, fostered the growth of, and donated some of the most impactful innovations in modern open source technology,” said Ali Ghodsi, co-founder and CEO of Databricks. “Open data lakehouses are quickly becoming the standard for how the most innovative companies handle their data and AI. Delta Lake, MLflow and Spark are all core to this architectural transformation, and we’re proud to do our part in accelerating their innovation and adoption.”

MLflow set a standard for ML platforms. The release of MLflow 2.0 introduces MLflow Pipelines to the platform, decreasing time to production and improving execution at scale through standardisation.

MLflow Pipelines

MLflow Pipelines offers data scientists pre-defined, production-ready templates based on the type of model they are building, allowing them to bootstrap and accelerate model development without intervention from production engineers.

“The Delta Lake project is seeing phenomenal activity and growth trends indicating the developer community wants to be a part of the project. Contributor strength has increased by 60% during the last year and the growth in total commits is up 95% and the average lines of code per commit is up 900%. We are seeing this upward velocity from contributing organisations like Uber Technologies, Walmart and CloudBees, Inc., among others,” said executive director of the Linux Foundation, Jim Zemlin.

Spark Connect

As a unified engine for large-scale data analytics, Spark scales to handle data sets of all sizes. However, the lack of remote connectivity and the burden of applications being developed and run on the driver node hinder its ability to meet the requirements of modern data applications.

To tackle this, Databricks introduced Spark Connect, a client and server interface for Apache Spark, based on the DataFrame API, that will decouple the client from the server for better stability and allow for built-in remote connectivity. With Spark Connect, users will be able to access Spark from any device.
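For readers who want a feel for what decoupling the client from the server means in practice, the idea can be sketched in a few lines of plain Python. This is a toy illustration only and is not Spark Connect's actual API or wire protocol: the names `RemoteFrame`, `ToyServer` and `collect`, the JSON request format and the sample `trips` table are all hypothetical. The client merely records DataFrame-style operations and serialises them, while the server holds the data and does the work.

```python
import json
from dataclasses import dataclass, field


@dataclass
class RemoteFrame:
    """Hypothetical client-side handle: records operations, holds no data."""
    table: str
    ops: list = field(default_factory=list)

    def filter(self, column, minimum):
        # Return a new handle with the filter appended to the recorded plan.
        step = {"op": "filter", "column": column, "min": minimum}
        return RemoteFrame(self.table, self.ops + [step])

    def select(self, *columns):
        # Return a new handle with the projection appended to the recorded plan.
        step = {"op": "select", "columns": list(columns)}
        return RemoteFrame(self.table, self.ops + [step])

    def to_request(self):
        # Serialise the plan; in this toy model this string is all that
        # would travel from the client to the server.
        return json.dumps({"table": self.table, "ops": self.ops})


class ToyServer:
    """Hypothetical server: owns the data and executes received plans."""

    def __init__(self, tables):
        self.tables = tables

    def execute(self, request):
        plan = json.loads(request)
        rows = self.tables[plan["table"]]
        for op in plan["ops"]:
            if op["op"] == "filter":
                rows = [r for r in rows if r[op["column"]] >= op["min"]]
            elif op["op"] == "select":
                rows = [{c: r[c] for c in op["columns"]} for r in rows]
            else:
                raise ValueError(f"unknown operation: {op['op']}")
        return json.dumps(rows)


def collect(frame, server):
    # Stand-in for a network round trip: send the plan, decode the rows.
    return json.loads(server.execute(frame.to_request()))


# Made-up sample data, held only by the server.
server = ToyServer({"trips": [{"city": "Oslo", "km": 12},
                              {"city": "Lima", "km": 3}]})
result = collect(RemoteFrame("trips").filter("km", 5).select("city"), server)
print(result)
```

Because the client in this sketch holds only a description of the query, it needs no data and no execution engine of its own, which is the property that makes a thin client on almost any device plausible.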

Project Lightspeed

In collaboration with the Spark community, Databricks also announced Project Lightspeed, the next generation of the Spark streaming engine.

As the diversity of applications moving into streaming data has increased, new requirements have emerged to support data streaming, one of the most in-demand data workloads for the lakehouse. Spark Structured Streaming has been widely adopted since the early days of streaming because of its ease of use, performance, large ecosystem, and developer communities.

With that in mind, Databricks will collaborate with the community and encourage participation in Project Lightspeed to improve performance, expand ecosystem support for connectors, enhance functionality for processing data with new operators and APIs, and simplify deployment, operations, monitoring and troubleshooting.