
Data Lake / Lakehouse Guide: Powered by Data Lake Table Formats (Delta Lake, Iceberg, Hudi)

Simon Späti
August 25, 2022
15 min read

Data Lake Table Format Comparison: Delta Lake vs Apache Hudi vs Apache Iceberg

Delta Lake has the most GitHub stars and, since the release of Delta Lake 2.0, is arguably the most mature. Apache Iceberg and Hudi have a much more diverse set of GitHub contributors than Delta, whose contributions come roughly 80% from Databricks.

Hudi has been open-source the longest and has the most features. Iceberg and Delta have gained great momentum with the recent announcements, Hudi provides the most convenience for streaming workloads, and Iceberg supports the most data lake file formats (Parquet, Avro, ORC).

A comprehensive overview of read/write features from Onehouse.ai:

A comprehensive overview of Data Lake Table Formats Read/Write Features by Onehouse.ai (reduced to rows with differences only)

And data lake table services comparison by Onehouse.ai:

A comprehensive overview of Data Lake Table Formats Services by Onehouse.ai (reduced to rows with differences only)

Please check the full article Apache Hudi vs. Delta Lake vs. Apache Iceberg for a fantastic and detailed feature comparison, including illustrations of table services and supported platforms and ecosystems. Two other excellent comparisons are Comparison of Data Lake Table Formats by Dremio and Hudi, Iceberg and Delta Lake: Data Lake Table Formats Compared by LakeFS.

🔗 There is also an interesting comment on Hudi versioning, explaining how Hudi supports different source systems and how its versioning is based on commits and can be maintained per source system.
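What all three table formats have in common is the core mechanism the comparisons above evaluate: an ordered commit log layered over immutable data files, which is what turns a pile of Parquet files into an ACID table with snapshots and time travel. The following is a toy, stdlib-only sketch of that idea (the file names and JSON shape are illustrative and not the real Delta, Iceberg, or Hudi protocol):

```python
import json
import tempfile
from pathlib import Path


def commit(table: Path, added_files: list) -> int:
    """Append one JSON commit file (00000000.json, 00000001.json, ...),
    loosely modeled on Delta's _delta_log convention."""
    log = table / "_log"
    log.mkdir(parents=True, exist_ok=True)
    version = len(list(log.glob("*.json")))
    (log / f"{version:08d}.json").write_text(json.dumps({"add": added_files}))
    return version


def snapshot(table: Path, version=None) -> set:
    """Replay commits up to `version` to reconstruct the visible file set.
    Passing an older version gives you 'time travel' for free."""
    files = set()
    for path in sorted((table / "_log").glob("*.json")):
        if version is not None and int(path.stem) > version:
            break
        files.update(json.loads(path.read_text())["add"])
    return files


table = Path(tempfile.mkdtemp()) / "events"
commit(table, ["part-000.parquet"])
commit(table, ["part-001.parquet"])
print(snapshot(table))             # latest snapshot: both files
print(snapshot(table, version=0))  # time travel: only the first commit
```

Because each commit is a single atomic file append, readers always see a consistent snapshot even on an object store with no transactions, which is exactly the trick the real formats refine with manifests, statistics, and conflict detection.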

Data Lake Trends in the Market

The market for open-source data lake table formats is hot after the recent announcements at the Snowflake Summit and the Data & AI Summit. Snowflake and Databricks announced a significant step with Apache Iceberg Tables (Explainer Video), combining the capabilities of open-source Apache Iceberg with Apache Parquet. Databricks, in turn, open-sourced all of Delta Lake with Delta Lake 2.0, including previously premium features such as OPTIMIZE and Z-ORDER.
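To make the now open-sourced Z-ORDER feature concrete: the idea behind z-ordering (shown here as a simplified bit-interleaving sketch, not Databricks' actual implementation) is to sort rows so that values close in *several* columns at once land near each other on disk, which improves file-level data skipping:

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into one sort key (a Z-order curve)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x contributes the even bits
        z |= ((y >> i) & 1) << (2 * i + 1)  # y contributes the odd bits
    return z


# Rows as (col_a, col_b); a plain sort on col_a would scatter col_b,
# while the z-value keeps rows close in BOTH columns together.
rows = [(3, 7), (3, 8), (100, 7), (4, 7)]
rows.sort(key=lambda r: z_value(*r))
print(rows)
```

With data laid out this way, min/max statistics per file become selective for queries filtering on either column, so the engine can skip most files. The real OPTIMIZE ... ZORDER BY command does this rewriting for you across the table's data files.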

Other market trends further commercialize the data lake table formats, such as Onehouse for Apache Hudi and both Starburst and Dremio coming out with Apache Iceberg offerings. Google announced BigLake with Iceberg support in April, and it now supports Hudi and Delta as well.

There is a big race around data lake table formats; every big vendor either has one itself or is searching for the perfect open-source one. By now, you should also understand why. The good news for us all is that these technologies are built on open-source data lake file formats (Apache Parquet, ORC, Avro).

For Example, All Features Are Open-Sourced with Delta Lake 2.0

How to Turn Your Data Lake into a Lakehouse

An essential part of a data lake and lakehouse is data governance, which mainly covers data quality, observability, monitoring, and security. Without it, you'll head straight for a data swamp.

Data governance is a big topic at larger companies, and lakehouse implementations help here: they focus on reliability and strong governance and ship more integrated features. But much of data governance is also about putting the right processes and access rights in place, letting cross-functional teams work with data quickly and transparently.

To summarize the essential parts so far, to extend simple S3 storage into a full-fledged data lakehouse, you can follow these steps:

  1. Choose the suitable data lake file format.
  2. Combine it with the data lake table format that supports your use case best.
  3. Choose a cloud provider and storage layer to store the actual files in.
  4. Build data governance on top of your lakehouse and inside your organization.
  5. Load your data into the data lake or lakehouse.
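The final loading step can be sketched with nothing but the standard library. The snippet below (an illustrative local stand-in: a real pipeline would write Parquet to S3/GCS/ADLS rather than CSV to a temp directory) writes records into a Hive-style partitioned layout, the `key=value` directory convention that data lake file and table formats build on:

```python
import csv
import tempfile
from pathlib import Path

# Sample records to land in the lake, partitioned by event_date.
records = [
    {"event_date": "2022-08-01", "user": "a", "amount": 10},
    {"event_date": "2022-08-01", "user": "b", "amount": 25},
    {"event_date": "2022-08-02", "user": "a", "amount": 5},
]

lake = Path(tempfile.mkdtemp()) / "lakehouse" / "events"
for row in records:
    # Hive-style partition directory, e.g. events/event_date=2022-08-01/
    part_dir = lake / f"event_date={row['event_date']}"
    part_dir.mkdir(parents=True, exist_ok=True)
    out = part_dir / "part-000.csv"
    is_new = not out.exists()
    with out.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["user", "amount"])
        if is_new:
            writer.writeheader()
        writer.writerow({"user": row["user"], "amount": row["amount"]})

print(sorted(p.relative_to(lake).as_posix() for p in lake.rglob("*.csv")))
```

This layout is what enables partition pruning: a query filtering on one date only needs to read that date's directory, and the table format's metadata makes that skipping even cheaper.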

ℹ️ Alternatives, or when not to use a data lake or lakehouse: when you need a database, don't use JSON files as a substitute for a Postgres DB. And when you need a quick and fast way of querying multiple data sources without moving data, you can leverage data virtualization technologies.

Wrapping Up

In this article, we learned the difference between a data lake and a data lakehouse, what the market is doing in 2022, and how to turn a data lake into a data lakehouse. We covered its three levels, with the storage layer and the data lake file formats at the bottom and the data lake table formats with their powerful features on top, and looked at the open-source table formats out there: Apache Hudi, Apache Iceberg, and Delta Lake.

Another question is how to get the data into your data lake or lakehouse. We at Airbyte can support you with our 190+ source connectors for integrating your data. If you want to build a data lake hands-on, step by step, we have two tutorials: one on Building an Open Data Lakehouse with Dremio and another using the Delta Lake table format, Loading Data into a Databricks Lakehouse and Running Simple Analytics.

If you enjoyed this blog post, you might want to check out more on Airbyte's Blog. You can also join the conversation on our Community Slack Channel, participate in discussions on Airbyte's Discourse, or sign up for our newsletter. Furthermore, if you are interested in Airbyte as a fully managed service, you can try Airbyte Cloud for free!
