
Support File Skipping #13

Open · wants to merge 2 commits into main
Conversation

JHibbard (Contributor) commented May 17, 2023

You can make Delta Lake queries faster with column projection and predicate pushdown. Both techniques accelerate reads and subsequent queries by reducing the amount of data sent to the Ray cluster.

This PR adds a filters argument to the read_delta method in deltaray. The argument accepts a PyArrow dataset expression (see the PyArrow Expression documentation). This lets you query Delta tables with Ray while taking advantage of both column pruning and, now, predicate pushdown. Example below:

# Standard Libraries
from pathlib import Path

# External Libraries
import deltaray
import deltalake as dl
import pyarrow.compute as pc
import pandas as pd


# Create a Delta Table
cwd = Path.cwd()
table_uri = f'{cwd}/delta-table'
df = pd.DataFrame({
    'id': [0, 1, 2, ], 
    'name': ['Bill', 'Sue', 'Rose'],
})
dl.write_deltalake(table_uri, df)
for person in [{'id': 3, 'name': 'Jake', }, {'id': 4, 'name': 'Sally'}, ]:
    df = pd.DataFrame([person])
    dl.write_deltalake(table_uri, df, mode='append')

# Read the table with a pushdown filter (id > 3) and column projection (name only)
filters = pc.field('id') > pc.scalar(3)
dataset = deltaray.read_delta(table_uri, filters=filters, columns=['name'])

If accepted, this PR will close issue 1 by supporting file skipping via filters.

Successfully merging this pull request may close these issues.

Allow users to supply filters so file skipping is possible