Apache Iceberg Spark

12/7/2023

As a Developer Advocate for Dremio, I spend a lot of time researching technology and best practices around engineering Data Lakehouses and sharing what I learn through content for Subsurface - The Data Lakehouse Community. One of the major topics I've been diving deep into is Data Lakehouse table formats, which let you take the files on your data lake and group them into tables that data processing engines like Dremio can operate on. What I'd like to do today is show you how to very quickly get a Docker container up and running so you can get hands-on with Apache Iceberg and Spark. Do keep an eye out for an even more in-depth introduction on Subsurface.

Introduction to Table Formats and Apache Iceberg

Before we get into our exercise, here is some content to help introduce you to Apache Iceberg and the world of Data Lakehouse table formats.

Meetup: Comparison of Data Lakehouse Table Formats
Meetup: Apache Iceberg: An Architectural Look Under the Covers
DataNation Podcast: Episode of Table Formats
Blog: How maintain Apache Iceberg Tables
Blog: Apache Iceberg's Hidden Partitioning
Blog: Migrating Apache Iceberg tables from Hive
Blog: Table Format Comparison (Iceberg, Hudi, Delta Lake)
Blog: Table Format Comparison - Governance
Blog: Table Format Comparison - Partitioning

For this tutorial you need to have Docker installed, as we will be using this Docker image I created for easy hands-on experimenting with Apache Iceberg, Apache Hudi and Delta Lake. You can get it up and running easily with a single Docker command.

Note: This container was built on a 64-bit Linux machine, so the image may not work on an M1/ARM chipset. If it doesn't, all you have to do is rebuild the image; you can find the Dockerfiles for this image in this repo.

Once the Docker image is running, you can easily open up Spark with any of the table formats using the following commands:

iceberg-init - to open Spark Shell with Apache Iceberg configured.
hudi-init - to open Spark Shell with Apache Hudi configured.
delta-init - to open Spark Shell with Delta Lake configured.

This blog will focus on Apache Iceberg, but feel free to play with the other table formats using their documentation.

Start the Docker Container

docker run -it --name format-playground alexmerced/table-format-playground

After running iceberg-init, we are inside of SparkSQL, where we can run SQL statements against the Iceberg catalog that was configured by the iceberg-init script. If you are curious about the settings I used, you can run cat iceberg-init.bash back in the terminal.

Keep in mind, we are not working with a traditional database but with a data lakehouse, so we are creating and reading files that would exist in your data lake storage (AWS/Azure/Google Cloud). It may feel like working with a traditional database, and that is the beauty that table formats like Iceberg enable: working with files stored in our data lake in the same way we work with data in a database or data warehouse. To create a new Iceberg table, we can just run a single SQL command from this prompt.

If you want to use this container again in the future, you can restart it by its name, format-playground, instead of creating a new one.

Now you know how to quickly set yourself up so you can experiment with Apache Iceberg. Check out the Iceberg docs for many of its great features, such as Time Travel, Hidden Partitioning, Partition Evolution, Schema Evolution, ACID transactions and more. One of the best aspects of Iceberg is that so many tools are building support for it, such as Dremio (which is also an Iceberg contributor).
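As a rough sketch of the table-creation step described in this post, the script below writes a few Spark SQL statements to a scratch file so they can be pasted into the SparkSQL prompt. Everything specific here is an assumption, not taken from the post: the file name iceberg_demo.sql, the catalog name iceberg (check cat iceberg-init.bash for the real one), the namespace db, and the table and column names.

```shell
#!/bin/sh
# Write illustrative Spark SQL statements to a scratch file.
# ASSUMPTIONS (hypothetical, not from the original post):
#   - the catalog configured by iceberg-init is named "iceberg"
#   - "db" and "demo_people" are made-up namespace and table names
cat > iceberg_demo.sql <<'SQL'
-- Create an Iceberg table in the assumed catalog and namespace
CREATE TABLE iceberg.db.demo_people (id INT, name STRING) USING iceberg;
-- Each INSERT commits new data files and a new table snapshot
INSERT INTO iceberg.db.demo_people VALUES (1, 'Ada'), (2, 'Grace');
-- Read the rows back through the table
SELECT * FROM iceberg.db.demo_people;
SQL

# Show the statements so they can be copied into the SparkSQL prompt
cat iceberg_demo.sql
```

The USING iceberg clause is what tells Spark to create the table in the Iceberg format. If the catalog in your session has a different name, replace the iceberg prefix accordingly.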
