docs: delta lake arrow integration page (delta-io#1914)

# Description Adds a documentation page on the Delta Lake Arrow integration.
ion-elgreco · Dec 1, 2023 · d966178 · d966178
1 parent 5c90586
commit d966178
Show file tree

Hide file tree

Showing 2 changed files with 109 additions and 0 deletions.
diff --git a/docs/integrations/delta-lake-arrow.md b/docs/integrations/delta-lake-arrow.md
@@ -0,0 +1,108 @@
+# Delta Lake Arrow Integrations
+
+Delta Lake tables can be exposed as Arrow tables and Arrow datasets, which allows for interoperability with a variety of query engines.
+
+This page shows you how to convert Delta tables to Arrow data structures and teaches you the difference between Arrow tables and Arrow datasets.
+
+## Delta Lake to Arrow Dataset
+
+Delta tables can easily be exposed as Arrow datasets.  This makes it easy for any query engine that can read Arrow datasets to read a Delta table.
+
+Let's take a look at the h2o groupby dataset that contains 9 columns of data.  Here are three representative rows of data:
+
+```
++-------+-------+--------------+-------+-------+--------+------+------+---------+
+| id1   | id2   | id3          |   id4 |   id5 |    id6 |   v1 |   v2 |      v3 |
+|-------+-------+--------------+-------+-------+--------+------+------+---------|
+| id016 | id046 | id0000109363 |    88 |    13 | 146094 |    4 |    6 | 18.8377 |
+| id039 | id087 | id0000466766 |    14 |    30 | 111330 |    4 |   14 | 46.7973 |
+| id047 | id098 | id0000307804 |    85 |    23 | 187639 |    3 |    5 | 47.5773 |
++-------+-------+--------------+-------+-------+--------+------+------+---------+
+```
+
+Here's how to expose the Delta table as a PyArrow dataset and run a query with DuckDB:
+
+```python
+import duckdb
+from deltalake import DeltaTable
+
+table = DeltaTable("delta/G1_1e9_1e2_0_0")
+dataset = table.to_pyarrow_dataset()
+quack = duckdb.arrow(dataset)
+quack.filter("id1 = 'id016' and v2 > 10")
+```
+
+Here's the result:
+
+```
+┌─────────┬─────────┬──────────────┬───────┬───────┬─────────┬───────┬───────┬───────────┐
+│   id1   │   id2   │     id3      │  id4  │  id5  │   id6   │  v1   │  v2   │    v3     │
+│ varchar │ varchar │   varchar    │ int32 │ int32 │  int32  │ int32 │ int32 │  double   │
+├─────────┼─────────┼──────────────┼───────┼───────┼─────────┼───────┼───────┼───────────┤
+│ id016   │ id054   │ id0002309114 │    62 │    95 │ 7180859 │     4 │    13 │  7.750173 │
+│ id016   │ id044   │ id0003968533 │    63 │    98 │ 2356363 │     4 │    14 │  3.942417 │
+│ id016   │ id034   │ id0001082839 │    58 │    73 │ 8039808 │     5 │    12 │ 76.820135 │
+├─────────┴─────────┴──────────────┴───────┴───────┴─────────┴───────┴───────┴───────────┤
+│ ? rows (>9999 rows, 3 shown)                                                 9 columns │
+└────────────────────────────────────────────────────────────────────────────────────────┘
+```
+
+Arrow datasets allow for the predicates to get pushed down to the query engine, so the query is executed quickly.
+
+## Delta Lake to Arrow Table
+
+You can also run the same query with DuckDB on an Arrow table:
+
+```python
+quack = duckdb.arrow(table.to_pyarrow_table())
+quack.filter("id1 = 'id016' and v2 > 10")
+```
+
+This returns the same result, but it runs slower.
+
+## Difference between Arrow Dataset and Arrow Table
+
+Arrow Datasets are lazy and allow for full predicate pushdown unlike Arrow tables which are eagerly loaded into memory.
+
+The previous DuckDB queries were run on a 1 billion row dataset that's roughly 50 GB when stored as an uncompressed CSV file.  Here are the runtimes when the data is stored in a Delta table and the queries are executed on a 2021 Macbook M1 with 64 GB of RAM:
+
+* Arrow table: 17.1 seconds
+* Arrow dataset: 0.01 seconds
+
+The query runs much faster on an Arrow dataset because the predicates can be pushed down to the query engine and lots of data can be skipped.
+
+Arrow tables are eagerly materialized in memory and don't allow for the same amount of data skipping.
+
+## Multiple query engines can query Arrow Datasets
+
+Other query engines like DataFusion can also query Arrow datasets, see the following example:
+
+```python
+from datafusion import SessionContext
+
+ctx = SessionContext()
+ctx.register_dataset("my_dataset", table.to_pyarrow_dataset())
+ctx.sql("select * from my_dataset where v2 > 5")
+```
+
+Here's the result:
+
+```
++-------+-------+--------------+-----+-----+--------+----+----+-----------+
+| id1   | id2   | id3          | id4 | id5 | id6    | v1 | v2 | v3        |
++-------+-------+--------------+-----+-----+--------+----+----+-----------+
+| id082 | id049 | id0000022715 | 97  | 55  | 756924 | 2  | 11 | 74.161136 |
+| id053 | id052 | id0000113549 | 19  | 56  | 139048 | 1  | 10 | 95.178444 |
+| id090 | id043 | id0000637409 | 94  | 50  | 12448  | 3  | 12 | 60.21896  |
++-------+-------+--------------+-----+-----+--------+----+----+-----------+
+```
+
+Any query engine that's capable of reading an Arrow table/dataset can read a Delta table.
+
+## Conclusion
+
+Delta tables can easily be exposed as Arrow tables/datasets.
+
+Therefore any query engine that can read an Arrow table/dataset can also read a Delta table.
+
+Arrow datasets allow for more predicates to be pushed down to the query engine, so they can perform better performance than Arrow tables.
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -37,6 +37,7 @@ nav:
     - api/schema.md
     - api/storage.md
   - Integrations:
+    - Arrow: integrations/delta-lake-arrow.md
     - pandas: integrations/delta-lake-pandas.md
 not_in_nav: |
   /_build/