
Support File Skipping #13

Open · wants to merge 2 commits into main
Conversation

JHibbard (Contributor) commented May 17, 2023

You can make Delta Lake queries faster with column projection and predicate pushdown. Both techniques accelerate reads and subsequent queries by reducing the amount of data sent to the Ray cluster.

This PR adds a filters argument to the read_delta method in deltaray. The argument accepts a PyArrow dataset expression (see the PyArrow Expression documentation). This lets you query Delta tables with Ray while taking advantage of both column pruning and, now, predicate pushdown. Example below:

# Standard Libraries
from pathlib import Path

# External Libraries
import deltaray
import deltalake as dl
import pyarrow.compute as pc
import pandas as pd


# Create a Delta Table
cwd = Path.cwd()
table_uri = f'{cwd}/delta-table'
df = pd.DataFrame({
    'id': [0, 1, 2, ], 
    'name': ['Bill', 'Sue', 'Rose'],
})
dl.write_deltalake(table_uri, df)
for person in [{'id': 3, 'name': 'Jake', }, {'id': 4, 'name': 'Sally'}, ]:
    df = pd.DataFrame([person])
    dl.write_deltalake(table_uri, df, mode='append')

# Read the table with a pushdown filter (id > 3) and column projection (name only)
filters = pc.field('id') > pc.scalar(3)
dataset = deltaray.read_delta(table_uri, filters=filters, columns=['name'])

If accepted, this PR will close issue 1 by supporting file skipping via filters.

Successfully merging this pull request may close these issues.

Allow users to supply filters so file skipping is possible