Update ADLSGen2-HOWTO.md #560
I had a hard time figuring out how to connect to a delta table that is stored in ADLS Gen2 and only found a way by digging into the source code. I would like to save other people the same trouble by adding this to the docs.
```
delta = DeltaTable("adls2://<accountname>/<filesystem>/<path to table>")
dataFrames = delta.to_pyarrow_table().to_pandas()
```
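For readers who land here with the same problem, here is a minimal sketch of how the snippet above might be wrapped. The helper names `adls2_table_uri` and `load_as_pandas` are hypothetical and not part of `deltalake`; only `DeltaTable`, `to_pyarrow_table()` and `to_pandas()` come from the snippet in this PR. It assumes the `adls2://<accountname>/<filesystem>/<path to table>` layout shown above and that credentials are already configured for the Azure backend.

```python
def adls2_table_uri(account: str, filesystem: str, table_path: str) -> str:
    """Build an adls2:// table URI (hypothetical helper, not a deltalake API)."""
    for name, value in (("account", account), ("filesystem", filesystem)):
        # Account and filesystem are single URI segments, so reject slashes.
        if not value or "/" in value:
            raise ValueError(f"{name} must be a non-empty name without '/'")
    cleaned = table_path.strip("/")
    if not cleaned:
        raise ValueError("table_path must not be empty")
    return f"adls2://{account}/{filesystem}/{cleaned}"


def load_as_pandas(uri: str):
    """Open the Delta table at `uri` and return it as a pandas DataFrame."""
    # Imported lazily so the URI helper can be used without deltalake installed.
    from deltalake import DeltaTable

    return DeltaTable(uri).to_pyarrow_table().to_pandas()
```

Usage would look like `load_as_pandas(adls2_table_uri("<accountname>", "<filesystem>", "<path to table>"))`.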
Thanks! Would you like to add this to the usage docs as well?
It looks like I might have gotten the wrong prefix when I added this 🤦:
https://github.com/delta-io/delta-rs/blame/main/python/docs/source/usage.rst#L59-L60
Why use this non-standard URI scheme for ADLS?
Should either use abfss if the underlying storage layer "knows" adls or use https with the appropriate blob storage endpoint that underlies every adls account.
Actually, googling a bit, it seems like abfss is somewhat standard: https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction-abfs-uri
Thanks @zeevm
@dgcaron Does that scheme work for you?
abfs[s] refers to the Hadoop File System driver (see the docs linked by @wjones127), which is not being used here.
Using https URLs would be inconsistent with the fact that delta-rs accepts URIs with a cloud specific scheme ("s3://" and "gs://") for the Amazon and Google backends.
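To make the consistency argument concrete, here is a rough Python sketch of scheme-based backend selection. It is only an illustration and not the actual delta-rs dispatch code (which lives in the Rust crate); `SCHEME_TO_BACKEND` and `backend_for` are hypothetical names, and treating a URI without a scheme as a local path is an assumption made for this example.

```python
from urllib.parse import urlsplit

# Cloud specific schemes mentioned in this thread, one per backend.
SCHEME_TO_BACKEND = {
    "s3": "Amazon S3",
    "gs": "Google Cloud Storage",
    "adls2": "Azure Data Lake Storage Gen2",
}


def backend_for(table_uri: str) -> str:
    """Pick a storage backend from the URI scheme (illustrative only)."""
    scheme = urlsplit(table_uri).scheme
    if not scheme:
        # Assumption for this sketch: no scheme means a local filesystem path.
        return "local filesystem"
    try:
        return SCHEME_TO_BACKEND[scheme]
    except KeyError:
        raise ValueError(f"unsupported URI scheme: {scheme!r}") from None
```

With a table like this, `abfss://` and `https://` URIs are rejected up front instead of being silently routed to a backend that cannot handle them.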
Why use this non-standard URI scheme for ADLS? Should either use abfss if the underlying storage layer "knows" adls or use https with the appropriate blob storage endpoint that underlies every adls account.
delta-rs relies on the atomic rename functionality of ADLS Gen2 and cannot work with "plain" Blob Storage.
abfss was what I tried initially, and unsuspecting users who don't read the manual will most likely do the same. After failing and not finding the right scheme, I went into the code to see what it expects. To save anyone else the search, I thought it was a good idea to add it to the docs.
@dgcaron do you want to include the python usage docs update in this PR as well?
Yes sure, I'll do that today
I'd think fixing the docs isn't the issue. The implementation should be fixed to use either abfss, if delta-rs knows how to talk to the adls endpoint, or https, if it talks to the blob storage endpoint. 'adls2' is never the right scheme to use.
Hi @zeevm, I addressed all these points above, stated again here with some more elaboration:
Based on those two points, it makes sense to use a cloud specific URI scheme in delta-rs, but not […]
Also: the delta-rs Azure backend relies on the atomic rename functionality of Azure Data Lake Storage (ADLS) Gen2 and doesn't work with Azure Blob Storage. The […]
Did I break the build here?
Clippy errors can show up because of toolchain updates. Take a look at the error; it has a suggestion for fixing it. My PR #556 has the same clippy error, so I will be fixing it there if you want to wait for my merge.
Yeah, I'll wait for your merge, as it is a linter error in files I didn't touch at all.
Actually, @houqp already fixed these yesterday. Just fetch upstream on your branch.
Do I need to do anything else to get this merged?
No, someone with merge privileges has to review, approve, and merge.
Sorry, there were some intermittent CI errors, so I was waiting for the CI rerun before merging. Thanks again @dgcaron for the fix! And thank you @thovoll and @wjones127 for the review :)
Thanks everyone! @dgcaron @wjones127 @houqp