Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support WASB scheme in ADLSFileIO #11504

Merged
merged 3 commits into from
Nov 14, 2024

Conversation

mrcnc
Copy link
Contributor

@mrcnc mrcnc commented Nov 9, 2024

This is a second attempt to resolve #10127 but avoids using java.net.URI for parsing, which was found to be problematic after the first attempt was merged.

Additionally I've refactored to minimize the number of lines changed and clarify the usage of the storageAccount variable, which was previously storing the endpoint host and now will only store the storage account name as the subdomain.

Copy link
Member

@jbonofre jbonofre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I just wonder about the suffix appended by default to account.

@@ -93,7 +93,7 @@ public void applyClientConfiguration(String account, DataLakeFileSystemClientBui
if (connectionString != null && !connectionString.isEmpty()) {
builder.endpoint(connectionString);
} else {
builder.endpoint("https://" + account);
builder.endpoint("https://" + account + ".dfs.core.windows.net");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we really change add .def.core.windows.net by default ?
I mean that use currently include the suffix in account so it should be clearly state that we append the suffix by default now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the storageAccount of ADLSLocation should be used for storing the storage account name only. And since the ADLS APIs are accessed via dfs.core.windows.net by default, I think it's appropriate to append it to the storage account name here as the default. If users want to specify a different hostname, they can use the adls.connection-string property.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the right thing to do, I would just throw a javadoc on this method though

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This deviates from the abfs[s] scheme defined by Hadoop, where the domain is specified in the URI (docs). I feel we should check if the account already ends with the domain and only append then to support that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, you are stripping off the domain below, so this isn't an issue. You can ignore the above.

Copy link
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bryanck + @danielcweeks - Please check, I was thinking this could be another candidate to put in 1.7.1 as well

@Fokko Fokko added this to the Iceberg 1.7.1 milestone Nov 13, 2024
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mrcnc , yeah I think supporting WASB despite it being deprecated by Azure is worth it and I think the way this PR handles it is correct.

@bryanck
Copy link
Contributor

bryanck commented Nov 14, 2024

Thanks for the PR @mrcnc ! And for the reviews @RussellSpitzer @amogh-jahagirdar @jbonofre !

@bryanck bryanck merged commit 0963485 into apache:main Nov 14, 2024
49 checks passed
@mrcnc mrcnc deleted the support-wasb-scheme-for-adlsfileio branch November 15, 2024 17:06
bryanck pushed a commit that referenced this pull request Nov 19, 2024
@bryanck bryanck mentioned this pull request Nov 19, 2024
bryanck pushed a commit that referenced this pull request Nov 19, 2024
bryanck pushed a commit that referenced this pull request Nov 20, 2024
bryanck pushed a commit that referenced this pull request Nov 20, 2024
@ajreid21
Copy link
Contributor

ajreid21 commented Dec 16, 2024

I think this unfortunately is/was a potentially breaking change 😞 - related to the concern @jbonofre brought up and anyone supporting ADLSFileIO creds through sasToken will break upon 1.7.1 upgrade.

@ajreid21
Copy link
Contributor

AzureProperties builds a map of account -> sasToken here when you create ADLSFileIO using adls.sas-token. as the credential mechanism.

Prior to this change, the account passed in here (which is used to lookup the sasToken from the map mentioned above) was parsed to include .dfs.core.windows.net so when generating the adls.sas-token. property to pass into ADLSFileIO, you needed to include .dfs.core.windows.net as part of the adls.sas-token. property name. After this change, we are parsing the ADLSLocation account to NOT include adls.sas-token., so now the sasToken lookup from the map doesn't find the sasToken (unless the user updates their code to not include it in the property name anymore)...

@bryanck
Copy link
Contributor

bryanck commented Dec 17, 2024

In retrospect this is probably a feature change we shouldn't have added in a patch release. We should probably revert this and have another patch release.

mrcnc added a commit to mrcnc/iceberg that referenced this pull request Dec 18, 2024
bryanck pushed a commit that referenced this pull request Dec 18, 2024
mrcnc added a commit to mrcnc/iceberg that referenced this pull request Dec 18, 2024
amogh-jahagirdar pushed a commit that referenced this pull request Dec 19, 2024
mrcnc added a commit to mrcnc/iceberg that referenced this pull request Dec 19, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable reading WASB and WASBS file paths with ABFS and ABFSS
7 participants