Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update ADLSGen2-HOWTO.md #560

Merged
merged 3 commits into from
Mar 7, 2022
Merged

Update ADLSGen2-HOWTO.md #560

merged 3 commits into from
Mar 7, 2022

Conversation

dgcaron
Copy link
Contributor

@dgcaron dgcaron commented Feb 16, 2022

Description

I had a hard time figuring out how to connect to a delta table that is stored in ADLS Gen2 and only found a way by digging into the source code. I would like to save other people the same trouble by adding this to the docs.

Related Issue(s)

Documentation

I had a hard time figuring out how to connect to a delta table that is stored in ADLS Gen2 and only found a way by digging into the source code. I would like to save other people the same trouble by adding this to the docs.

delta = DeltaTable("adls2://<accountname>/<filesystem>/<path to table>")
dataFrames = delta.to_pyarrow_table().to_pandas()
```
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Would you like to also add this to the usage docs as well?

It looks like I might have gotten the wrong prefix when I add this 🤦:

https://github.com/delta-io/delta-rs/blame/main/python/docs/source/usage.rst#L59-L60

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use this non-standard uri schema for ADLS?
Should either use abfss if the underlying storage layer "knows" adls or use https with the appropriate blob storage endpoint that underlies every adls account.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, googling a bit, it seems like abfss is somewhat standard: https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction-abfs-uri
Thanks @zeevm

@dgcaron Does that schema work for you?

Copy link
Contributor

@thovoll thovoll Feb 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

abfs[s] refers to the Hadoop File System driver (see the docs linked by @wjones127), which is not being used here.

Using https URLs would be inconsistent with the fact that delta-rs accepts URIs with a cloud specific scheme ("s3://" and "gs://") for the Amazon and Google backends.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use this non-standard uri schema for ADLS? Should either use abfss if the underlying storage layer "knows" adls or use https with the appropriate blob storage endpoint that underlies every adls account.

delta-rs relies on the atomic rename functionality of ADLS Gen2 and cannot work with "plain" Blob Storage.

thovoll
thovoll previously approved these changes Feb 17, 2022
@dgcaron
Copy link
Contributor Author

dgcaron commented Feb 22, 2022

Abfss was what I tried initially, unsuspecting users that will not read the manual will most likely do the same. After failing and not finding the right scheme I went into the code to see what it expects. To save anyone else the search I thought it was a good thing to add it to the docs.

houqp
houqp previously approved these changes Feb 23, 2022
@houqp
Copy link
Member

houqp commented Feb 23, 2022

@dgcaron do you want to include the python usage docs update in this PR as well?

@dgcaron
Copy link
Contributor Author

dgcaron commented Feb 23, 2022 via email

@zeevm
Copy link

zeevm commented Feb 23, 2022

I'd think fixing the docs isn't the issue, the implementation should be fixed to use either abfss if delta-rs knows how to talk to the adls endpoint or https if it talks to the blob storage endpoint, 'adls2' is never the right scheme to use

@thovoll
Copy link
Contributor

thovoll commented Feb 23, 2022

I'd think fixing the docs isn't the issue, the implementation should be fixed to use either abfss if delta-rs knows how to talk to the adls endpoint or https if it talks to the blob storage endpoint, 'adls2' is never the right scheme to use

Hi @zeevm, I addressed all these points above, stated again here with some more elaboration:

  1. abfs[s]:// refers to the Hadoop File System driver, which is not being used by delta-rs. From the Azure docs: "The Hadoop Filesystem driver that is compatible with Azure Data Lake Storage Gen2 is known by its scheme identifier abfs (Azure Blob File System). Consistent with other Hadoop Filesystem drivers, the ABFS driver employs a URI format to address files and directories within a Data Lake Storage Gen2 capable account."

  2. Using https URLs would be inconsistent with the fact that delta-rs accepts URIs with a cloud specific scheme (s3:// and gs://) for the Amazon and Google backends.

Based on those two points, it makes sense to use a cloud specific URI scheme in delta-rs, but not abfs[s]:// - hence: adls2://.

Also: the delta-rs Azure backend relies on the atomic rename functionality of Azure Data Lake Storage (ADLS) Gen2 and doesn't work with Azure Blob Storage.

The adls2:// URI scheme is an invention of delta-rs, but there are other libraries that have done something similar for Azure storage. The root cause of this need is that ADLS Gen2 doesn't define a URI scheme other than in the context of Hadoop. It's worth noting that the S3 Hadoop File System driver doesn't use s3:// but s3a:// - the reasons are complicated but this supports introducing adls2:// to distinguish from abfs[s]://.

@dgcaron
Copy link
Contributor Author

dgcaron commented Mar 1, 2022

did i break the build here?

@thovoll
Copy link
Contributor

thovoll commented Mar 1, 2022

did i break the build here?

Clippy errors can show up because of toolchain updates. Take a look at the error, it has a suggestion to try and fix the clippy error.

My PR #556 has the same clippy error so I will be fixing it there if you want to wait for my merge.

@dgcaron
Copy link
Contributor Author

dgcaron commented Mar 1, 2022

yeah, i'll wait for your merge as it is a linter error for files I didn't touch at all.

@thovoll
Copy link
Contributor

thovoll commented Mar 1, 2022

yeah, i'll wait for your merge as it is a linter error for files I didn't touch at all.

Actually, @houqp already fixed these yesterday. Just fetch upstream on your branch.

@dgcaron
Copy link
Contributor Author

dgcaron commented Mar 7, 2022

do i need to do something additional to get this merged?

@thovoll
Copy link
Contributor

thovoll commented Mar 7, 2022

do i need to do something additional to get this merged?

No, someone with merge privileges has to review, approve, and merge.

@houqp houqp merged commit 6ea0480 into delta-io:main Mar 7, 2022
@houqp
Copy link
Member

houqp commented Mar 7, 2022

Sorry, there were some intermittent CI errors, I was waiting for the rerun of the CI before doing a merge. Thanks again @dgcaron for the fix! And than kyou @thovoll and @wjones127 for the review :)

@thovoll
Copy link
Contributor

thovoll commented Mar 7, 2022

Thanks everyone! @dgcaron @wjones127 @houqp

@dgcaron dgcaron deleted the patch-1 branch March 9, 2022 15:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants