This is a guest post for Computer Weekly Open Source Insider written by Alex Merced in his capacity as developer advocate (and data lakehouse evangelist) at Dremio – as explained here the Dremio platform enables users to use cloud data storage for data lakes, with the ability to organise and query the data for business intelligence, operations and data analytics.
Merced writes in full as follows…
As data volumes and use cases grow, no single platform can scale fast enough to be your solution for everything.
That means there’s a need to store data in a way that allows several tools to efficiently work with it – and this is where data lake table formats and the data lakehouse architecture fit in.
Apache Iceberg is at the forefront of responding to this need and it’s vital for organisations to understand this disruptive technology.
But let’s just stand back for a moment and remind ourselves – what is a data lakehouse and why does it matter?
What is a data lakehouse?
A data lakehouse is an architectural paradigm that combines the best of both data lakes and data warehouses.
Traditional data warehouses come with structure and performance but lack flexibility – and data lakes are highly scalable and flexible but often suffer from performance issues and lack of schema enforcement. A lakehouse, on the other hand, aims to offer the best of both worlds. It offers a unified platform that supports business intelligence (BI) and machine learning (ML) on all of your structured and semi-structured data without the traditional drawbacks. This consolidation is significant as it promotes efficiency, reduces data silos, and supports both operational and analytical workloads.
So then to Apache Iceberg – what exactly is it and how does it enable a data lakehouse?
Apache Iceberg is an open source table format that provides a more efficient way to query large datasets in data lakes by adding a metadata layer for robust query planning. It ensures high performance, scalability and fine-grained data management, thus acting as a bridge in creating a data lakehouse. Iceberg provides a specification that any tool can use to support reads and writes to your data lake.
While data lakes traditionally store data in flat files with minimal structure, Iceberg introduces a well-defined table metadata structure atop this data, turning the lake into a performant analytical warehouse. By enabling features like ACID transactions, schema evolution and more, Iceberg ensures data reliability across the several tools that support it without sacrificing data lakes’ flexibility.
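To make this concrete, a query engine such as Apache Spark is pointed at an Iceberg catalogue with a handful of configuration properties – a sketch only, where the catalogue name `my_catalog` and the REST endpoint address are placeholders for your own deployment:

```properties
# Enable Iceberg's SQL extensions in Spark
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
# Register a catalogue (name 'my_catalog' is illustrative)
spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.my_catalog.type=rest
spark.sql.catalog.my_catalog.uri=http://localhost:8181
```

Any other engine with Iceberg support – Dremio, Trino, Flink and so on – can then read and write the same tables through the same catalogue, which is exactly the multi-tool interoperability the lakehouse promises.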
Iceberg’s key features
- ACID Transactions: These ensure data integrity by supporting atomic commits.
- Schema Evolution: This allows developers and data scientists to take actions that modify the data schema without breaking old data.
- Fine-grained Partitioning: This provides efficient querying by organising data into smaller chunks based on column values. Iceberg also has novel partitioning features in Partition Evolution and Hidden Partitioning.
- First-class Deletes: These support upserts (update/insert) and deletes natively, ensuring timely and accurate data representation. Apache Iceberg also supports merge-on-read patterns for improving the performance of frequent updates.
- Versioning: With this, users can efficiently retain old versions of data, providing a historical view and easy rollbacks. File, Table and Catalog level versioning are all possible with Apache Iceberg natively or with external tools.
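Hidden partitioning is worth a closer look, since it is one of Iceberg’s most distinctive features. The sketch below models the idea in plain Python – the function and variable names are illustrative, not the actual Iceberg API: the table stores a transform (here, “day”) from a source timestamp column to a partition value, so users filter on the raw timestamp and the engine derives which partitions to scan.

```python
from datetime import datetime, timezone

def day_transform(ts: datetime) -> str:
    """Iceberg-style 'day' partition transform: timestamp -> date string."""
    return ts.date().isoformat()

# Data files grouped by the derived partition value (illustrative layout).
partitions = {
    "2024-03-01": ["file_a.parquet"],
    "2024-03-02": ["file_b.parquet", "file_c.parquet"],
}

def plan_scan(query_ts: datetime) -> list:
    """Prune the scan to only the files whose partition matches the predicate.

    The user writes WHERE event_ts = <timestamp>; the transform is applied
    behind the scenes -- that is the 'hidden' part of hidden partitioning.
    """
    return partitions.get(day_transform(query_ts), [])

files = plan_scan(datetime(2024, 3, 2, 15, 30, tzinfo=timezone.utc))
```

Because the transform lives in table metadata rather than in user queries, the partitioning scheme can later evolve (partition evolution) without rewriting queries or old data.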
What is Iceberg’s architecture?
Catalogue: A mechanism for tracking the different Iceberg tables in your lakehouse, along with a reference to each table’s latest metadata file. This allows reads to find the latest metadata and allows writes to determine whether concurrent writes have completed, maintaining ACID guarantees.
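The way the catalogue upholds ACID guarantees is essentially an atomic compare-and-swap of the metadata pointer. The following is a hypothetical model of that behaviour, not the real catalogue API: a commit succeeds only if no concurrent writer has moved the pointer since it was read, so a losing writer must retry on top of the new state.

```python
from typing import Optional

class Catalog:
    """Toy model of an Iceberg catalogue's table-pointer bookkeeping."""

    def __init__(self):
        # One pointer per table: the path of the latest metadata file.
        self._pointers = {}

    def current_metadata(self, table: str) -> Optional[str]:
        return self._pointers.get(table)

    def commit(self, table: str, expected: Optional[str], new: str) -> bool:
        """Atomically swap the metadata pointer (compare-and-swap).

        Fails if another writer committed since `expected` was read,
        which forces this writer to retry against the updated table state.
        """
        if self._pointers.get(table) != expected:
            return False  # a concurrent commit won; caller must retry
        self._pointers[table] = new
        return True

cat = Catalog()
assert cat.commit("sales", None, "v1.metadata.json")      # first commit lands
assert not cat.commit("sales", None, "v2.metadata.json")  # stale writer loses
assert cat.commit("sales", "v1.metadata.json", "v2.metadata.json")
```

This optimistic-concurrency pattern is what lets several independent engines write to the same table safely without a central lock.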
Metadata File: The root of a table’s metadata, this file tracks the table’s schema, partitioning, and other configurations. It points to manifest lists that provide a snapshot of the table’s current and historical state.
Manifest List: A list that points to all the manifest files in the table for a particular snapshot. The manifest list provides stats on each manifest so query engines can prune manifests from scanning based on metrics like partition values.
Manifest: This comprises a list of the actual data files (like Parquet or Avro) and metadata about them, such as added/deleted files, file path, partition, number of records, and more. A query engine can use this metadata to prune individual files from the query plans.
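The payoff of this metadata tree is two-level pruning: an engine can skip whole manifests using the stats in the manifest list, then skip individual data files using the metadata in each surviving manifest. The sketch below models that hierarchy with simplified stand-in classes – the names and fields are illustrative, not the structures defined in the Iceberg specification:

```python
from dataclasses import dataclass

@dataclass
class DataFile:
    path: str
    partition_day: str
    record_count: int

@dataclass
class Manifest:
    files: list
    # Partition stats recorded in the manifest list entry for this manifest:
    min_day: str
    max_day: str

@dataclass
class Snapshot:  # reached from the metadata file via the manifest list
    manifests: list

def scan(snapshot: Snapshot, day: str) -> list:
    """Two-level pruning: drop manifests by their stats, then drop
    individual data files by their partition values."""
    paths = []
    for m in snapshot.manifests:
        if not (m.min_day <= day <= m.max_day):
            continue  # whole manifest pruned without opening it
        paths += [f.path for f in m.files if f.partition_day == day]
    return paths

snap = Snapshot([
    Manifest([DataFile("a.parquet", "2024-01-01", 10)], "2024-01-01", "2024-01-31"),
    Manifest([DataFile("b.parquet", "2024-02-05", 20)], "2024-02-01", "2024-02-28"),
])
hits = scan(snap, "2024-02-05")
```

In a real table the pruning decisions happen before any Parquet or Avro data file is opened, which is where much of Iceberg’s query-planning efficiency comes from.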
Iceberg transforms the data lake
An efficient and flexible data management system is paramount in an era of massive data proliferation. The concept of a data lakehouse attempts to harness the power of both data lakes and data warehouses and Apache Iceberg plays a pivotal role in achieving this vision. With its architecture and features, Iceberg provides the structure and reliability required to transform a traditional data lake into a high-performance analytical workspace.
As data continues to grow and become an even more critical asset for businesses, technologies like Iceberg will undoubtedly shape the future of data storage and analysis.