Spark Summit 17: Databricks launches Delta as purified data lake

At Spark Summit Europe 2017 in Dublin, Databricks, the event's organiser and the inventor and distributor of Spark, announced Delta, a system intended to combine data lakes and data warehouses.

Databricks, the inventor and commercial distributor of the Apache Spark processing platform, has announced a system called Delta, which it believes will appeal to CIOs as a data lake, a data warehouse and a “streaming ingest system”. It is said to eliminate the need for extract, transform and load (ETL) processes.

The supplier’s CEO and co-founder, Ali Ghodsi, made the announcement at the Spark Summit in Dublin.

Databricks Delta will be a component of the supplier’s Unified Analytics Platform that runs in the cloud. Databricks said in a statement that with Delta, “enterprise organisations no longer need complex, brittle extract, transform and load processes that run across a variety of systems and create high latency just to obtain access to relevant, business-critical data.”

In an interview at the conference with Computer Weekly, Ghodsi said: “Delta is essentially a data lake that has the capability of data warehousing. It also stores extra ‘control’ information with the data it puts in the system – statistical information about the data itself.

“This can be useful for when you start asking questions of the data. It makes that analysis faster. We also validate that the data is correct when it comes into the data lake. Otherwise you store up problems for the future. For example, if Celsius values change to Fahrenheit [in a data store].”
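The ingest-time validation Ghodsi describes can be illustrated with a minimal Python sketch. It is not Databricks code: the `Reading` record, the expected unit and the plausible temperature range are all hypothetical, chosen only to show how a record that has silently switched from Celsius to Fahrenheit could be rejected before it reaches the lake.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """Hypothetical sensor record used only for this illustration."""
    sensor_id: str
    temperature: float
    unit: str


# Assumptions for the sketch: the store expects Celsius, and values
# outside this range are treated as implausible.
EXPECTED_UNIT = "C"
PLAUSIBLE_RANGE_C = (-90.0, 60.0)


def validate(reading):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    if reading.unit != EXPECTED_UNIT:
        problems.append(f"unit is {reading.unit!r}, expected {EXPECTED_UNIT!r}")
    low, high = PLAUSIBLE_RANGE_C
    if not low <= reading.temperature <= high:
        problems.append(f"temperature {reading.temperature} outside {low}..{high}")
    return problems


def ingest(readings):
    """Split incoming records into accepted ones and rejected ones with reasons."""
    accepted, rejected = [], []
    for reading in readings:
        problems = validate(reading)
        if problems:
            rejected.append((reading, problems))
        else:
            accepted.append(reading)
    return accepted, rejected
```

Rejecting or quarantining bad records at write time, rather than discovering them at query time, is the point Ghodsi makes about not storing up problems for the future.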

Ghodsi said the company started working on Delta a year and a half ago in response to customer problems dealing with multiple data warehouses and data lakes.

“We created Spark to simplify this stuff, and found we hadn’t. Our customers were telling us their data warehouses were performant, but expensive. And their data lakes were full of junk. So, we went back to the drawing board, rather than continuing to patch things up incrementally,” he said.

In a statement released for the conference, he said: “Delta combines the reliability and performance of data warehouses with the scale of data lakes and low-latency of streaming systems. With this unified management system, enterprises now benefit from a simplified data architecture, up to 100x increase in query performance, and faster access to relevant data.”

In the same statement, one customer, Greg Rokita, executive director of technology at a US car shopping website, said: “Obtaining real-time customer and revenue insights is critical to our business. But we’ve always been challenged with complex ETL processing that slows down our access to data.

“Delta allows us to overcome this roadblock by blending the performance of a data warehouse with the scale and cost-efficiency of a data lake,” added Rokita.

Talking to Computer Weekly, Yonatan Aharon, engineering manager of data platform at Berlin-based travel tours information website GetYourGuide, said: “To me, Delta will be a data warehouse using Spark and Databricks.

“In a data lake, data is often unclean and unstructured. We want to serve our business users data that is clean, structured and fast performing. That would be a huge step forward,” said Aharon. At present, GetYourGuide is still using a Postgres database for data warehousing.

Delta is said to deliver a “unified data management system [that] simplifies pipelines by allowing Delta tables to be used as a data source and sink”, as well as “automating the compaction of small files for efficient reads” and “intelligent data skipping and indexing”.
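Databricks has not detailed here how its data skipping works, but the general technique can be sketched in a few lines of Python. The sketch assumes that minimum and maximum values of a column are recorded for each data file, which is the kind of statistical “control” information Ghodsi mentions; the `FileStats` type and file names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for one column."""
    path: str
    min_value: int
    max_value: int


def files_to_scan(stats, lower, upper):
    """Keep only files whose [min, max] range overlaps the query range.

    Files whose range lies entirely outside [lower, upper] cannot contain
    matching rows, so they are skipped without being read.
    """
    return [
        s.path
        for s in stats
        if s.max_value >= lower and s.min_value <= upper
    ]


stats = [
    FileStats("part-0", 0, 99),
    FileStats("part-1", 100, 199),
    FileStats("part-2", 200, 299),
]

# A query for values between 150 and 180 only needs the second file.
print(files_to_scan(stats, 150, 180))  # ['part-1']
```

The saving grows with the number of files whose ranges fall outside the query, which is also why compacting many small files into fewer, well-organised ones helps reads.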

The system stores all its data in Amazon S3, and the company said it can be accessed from any Spark application running on the Databricks platform through the standard Spark application programming interfaces (APIs).

According to Databricks, Delta also integrates with the Databricks Enterprise Security model, including cell-level access control, auditing and HIPAA-compliant processing. Data is stored inside the customer’s own cloud storage account “for maximum control”.
