Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Use Daft s3 credentials chain for deltalake reads #2486

Merged
merged 3 commits into from
Jul 8, 2024

Conversation

jaychia
Copy link
Contributor

@jaychia jaychia commented Jul 8, 2024

Before: Without user intervention (by passing in explicit credentials), Daft would pass in None for credentials when instantiating the DeltaTable objects, causing auth issues with S3.

After: We detect and special-case when there are no credentials being passed in, and try to use Daft's credentials chain to override the empty credentials before instantiating the DeltaTable

image

@github-actions github-actions bot added the bug Something isn't working label Jul 8, 2024
@jaychia jaychia requested a review from universalmind303 July 8, 2024 19:46
Copy link

codecov bot commented Jul 8, 2024

Codecov Report

Attention: Patch coverage is 66.66667% with 4 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@55fc032). Learn more about missing BASE report.
Report is 1 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #2486   +/-   ##
=======================================
  Coverage        ?   63.52%           
=======================================
  Files           ?      950           
  Lines           ?   106718           
  Branches        ?        0           
=======================================
  Hits            ?    67796           
  Misses          ?    38922           
  Partials        ?        0           
Files Coverage Δ
daft/delta_lake/delta_lake_scan.py 87.37% <66.66%> (ø)

@universalmind303
Copy link
Contributor

not directly related, but it may be worth mentioning in the docstrings that s3 object stores will try to obtain credentials from the env if not specified.

@jaychia
Copy link
Contributor Author

jaychia commented Jul 8, 2024

not directly related, but it may be worth mentioning in the docstrings that s3 object stores will try to obtain credentials from the env if not specified.

Good call.

If no creds are specified, Daft will attempt to obtain credentials on distributed workers (i.e. runs credentials discovery on each node, per-worker process!)

I think our recommendation is actually to do credential discovery on the driver, and then explicitly pass that into Daft so that we can skip credentials discovery on the worker nodes. We've had issues in the past where we bring down the IMDS metadata servers on the worker nodes because we're too aggressive with asking for credentials.

@jaychia jaychia merged commit 510786b into main Jul 8, 2024
45 checks passed
@jaychia jaychia deleted the jay/deltalake-use-s3-env branch July 8, 2024 23:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants