Workshop: Build Portable Data Lake

In this hands-on workshop, you'll learn how to build a basic yet functional portable data lake that sidesteps cloud vendor lock-in.

With open-source technologies like Iceberg, Delta, and DuckDB at the forefront, we'll explore the power of portable data runtimes, embedded catalogs and cloud-agnostic compute solutions.

We'll evaluate the alternatives, discuss current industry limitations, and explain why we chose the solution we implemented.

We will then walk you through building a portable data lake from scratch, while understanding the trade-offs of using open-source tools in real-world scenarios.

What's covered?

In this workshop, you'll get hands-on experience with a variety of powerful open-source tools that will empower you to build your data lake.

You'll learn about the current state of the industry and how to sidestep its limitations.
We will compare our options: building with Iceberg, Delta, or a different stack altogether.
Then we will choose a stack that is not currently vendor-locked and build a functional portable data lake.
With dlt, Parquet and DuckDB, we will manage our data loading and storage.
We will explore using Ibis as an embedded catalog and the benefits of this approach.
We will explore how Polars fits into this stack to accelerate data exploration.
Finally, we will explore how to make this data accessible to other compute engines.

Materials

Video recording: https://youtube.com/live/qQ5kQPI1xSM?feature=share
Google Colab: https://colab.research.google.com/drive/1mZGJGDZ7cOmYmQRuewEHnGnPgoqZOh51?usp=drive_link
