
How Databricks is easing lakehouse adoption

Databricks is making it easier for organisations to adopt a data lakehouse architecture through support for industry-specific file formats, data sharing and stream processing, among other areas

Migrating on-premise databases and data platforms to a cloud-based data lakehouse can be challenging for organisations that run multiple data management products on different systems to serve different needs.

But with a data lakehouse, an architectural approach that combines the best of data warehouses and data lakes, data is kept in a single storage system, with different applications sharing the same copy of the data.

Matei Zaharia, co-founder and chief technology officer at Databricks, said the challenges with moving to a data lakehouse architecture are often associated with change management, adding that Databricks has capabilities to address those challenges.

For example, organisations with distributed ownership of data can use Databricks’ Unity Catalog to assign owners to different kinds of data and manage that data in a distributed manner.

“We also have an open source feature called Delta Sharing that lets you share a table, or part of it, with another department, even on a different platform such as Amazon Elastic MapReduce. A lot of that is possible because it’s all based around open data formats,” he added.
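As a rough illustration of what Zaharia describes, a recipient on another platform could read a shared table with the open source `delta-sharing` Python client, which addresses a table as `<profile-file>#<share>.<schema>.<table>`. The sketch below is not taken from Databricks; the profile file name and the share, schema and table names are hypothetical, and the name validation is an assumption of this example rather than a rule of the protocol.

```python
def shared_table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Build the '<profile>#<share>.<schema>.<table>' address the client expects."""
    for part in (share, schema, table):
        # Assumption of this sketch: reject names that would make the address ambiguous.
        if not part or "." in part or "#" in part:
            raise ValueError(f"invalid name component: {part!r}")
    return f"{profile_path}#{share}.{schema}.{table}"


def load_shared_table(url: str, limit=None):
    """Read a shared table into a pandas DataFrame (requires `pip install delta-sharing`)."""
    import delta_sharing  # imported lazily so the URL helper works without the package

    return delta_sharing.load_as_pandas(url, limit=limit)


if __name__ == "__main__":
    # Hypothetical names; the profile file is issued by whoever shares the data.
    url = shared_table_url("config.share", "sales_share", "emea", "orders")
    print(url)
```

Passing a small `limit` to `load_shared_table` is one way to preview a table before pulling it in full.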

Since the start of the year, Databricks has been introducing industry offerings that can understand very specific file formats. Zaharia said these capabilities were built on “solution accelerators” to help customers solve problems unique to their industry.

“In healthcare, for example, many organisations have electronic medical records and produce data in the same formats,” he said. “Finance is very similar – we recently announced an integration with Legend, an open source platform for modelling data and doing various computations in finance.”

With open source roots, Databricks was also behind the open source Delta Lake lakehouse project, which Zaharia said is aligned with the market preference for open formats. While he noted that “open source is a force we can’t ignore”, certain enterprise-grade features are only available through Databricks’ commercial software-as-a-service (SaaS) offerings.

These include serverless computing, which supports low-latency database queries with high availability and spares organisations the higher cost of maintaining pools of servers to run queries.

Zaharia claimed that Databricks runs over 50 million virtual machines across the top three hyperscalers, more than any of the cloud providers’ own data services. “We run a larger workload than all of them, and we’re on all three clouds which is hard for people to replicate,” he said.

Integrations with analytics tools and business applications can be challenging if a data platform does not have the necessary connectors. To that end, Databricks has integrations with tools like Fivetran for ingesting data, and for business applications, it has reverse extract, transform and load (ETL) tools that can publish data into those applications.

Zaharia said: “You can compute some data in the lakehouse and then publish into Salesforce or Workday and make it available to business users directly. We also have great integrations with the major BI [business intelligence] tools like Power BI and Tableau.”

Amid growing demand for real-time access to data, Databricks has been investing heavily in data streaming capabilities, such as Project Lightspeed for faster stream processing with Apache Spark. It has also launched new data governance tools that will help organisations to manage data quality, access controls and privacy, as well as audit and understand how data is being used.

“We also have a data marketplace which has more than 1,000 companies providing data to each other using Delta Sharing,” Zaharia added. “We’re unique in the industry in that we’re creating an open sharing protocol so that many platforms can connect to us.”
