Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow configuring endpoint for Azure FS #23071

Merged
merged 1 commit into from
Aug 27, 2024
Merged

Allow configuring endpoint for Azure FS #23071

merged 1 commit into from
Aug 27, 2024

Conversation

electrum
Copy link
Member

@electrum electrum commented Aug 19, 2024

Additional context and related issues

Fixes #21307

Release notes

(x) Release notes are required, with the following suggested text:

# Delta Lake
* Allow configuring endpoint for the native Azure filesystem. ({issue}`23071`)

# Hive
* Allow configuring endpoint for the native Azure filesystem. ({issue}`23071`)

# Hudi
* Allow configuring endpoint for the native Azure filesystem. ({issue}`23071`)

# Iceberg
* Allow configuring endpoint for the native Azure filesystem. ({issue}`23071`)

@cla-bot cla-bot bot added the cla-signed label Aug 19, 2024
@electrum electrum force-pushed the electrum/azurehost branch from 687c30c to c463301 Compare August 19, 2024 20:40
@github-actions github-actions bot added the docs label Aug 19, 2024
Copy link
Member

@mosabua mosabua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me now.

Copy link
Member

@anusudarsan anusudarsan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % one question

Copy link

@ms1111 ms1111 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you think about making the entire endpoint URL configurable, so instead of hardcoding https://<account name> + .dfs. + <endpoint>, a user could configure the endpoint to be "https://<accountname>.dfs.<endpoint>" explicitly?

That would open things up to test with use other implementations like the Azurite emulator in the future, where you're likely to need something like http://<ip or container name or localhost>:some port, with no .dfs.

@electrum electrum force-pushed the electrum/azurehost branch from c463301 to efeb100 Compare August 22, 2024 16:10
@electrum
Copy link
Member Author

@ms1111 That's a good idea, which I considered, but it's complicated due to the ABFS URI syntax containing the storage account. @dain tried using Azurite when he wrote the Azure FS, but ran into various issues and incompatibilities with the real Azure storage, which is why we don't use it for testing.

We can extend the config in the future when there is a real use case or viable alternate implementation. Let us know if you have a specific use case in mind. The goal of the current change is to support all of the Azure clouds while being easy for users to configure.

@electrum electrum requested a review from dain August 22, 2024 16:21
@ms1111
Copy link

ms1111 commented Aug 22, 2024

@electrum Yeah, I was following some of the Azurite issues in Azure/Azurite#553 and I figure if they're solved in the future, then it would require changing the Trino configuration again to allow specifying a local hostname, port, and scheme (http).

I was just thinking that might be a hard sell later, because you'd end up with two different historical ways to override the endpoint - the one being added here, and a new way to allow overriding the other pieces.

Being able to use Azurite is handy for local & CI testing if you have a software project that uses Trino and other components that all need to talk to a data lake. You can write integration tests that hit a few different containers and they all can touch the same tables. (Sometimes you can do this in with file:// URIs instead of emulating abfs:// or s3a://, but not always - there's always some weird limitation, like for example CircleCI's remote docker executor can't do bind mounts, so you can't share files between the primary and secondary containers.)

Because of the hardcoded endpoint URL in Trino, we're trying with S3/MinIO instead, but given we're an Azure shop, it'd be nice to be able to do it all with the Azure filesystem eventually.

@electrum
Copy link
Member Author

@ms1111 We can extend the config in the future to accept a URL. This won't conflict with the existing configuration.

However, I don't think that Azurite integration would be as simple as allowing to configure a URL endpoint, as the way storage accounts are handled when constructing URLs is different. We need to take the storage account from the ABFS location and generate an endpoint URL, which has a different format for real Azure vs Azurite. So if Azurite ever actually supports ADLS, we will likely need to add specific support for it.

@electrum
Copy link
Member Author

We test the S3 file system against MinIO, S3Mock and LocalStack. I agree the local storage tests are very useful and we will certainly support Azurite if they ever implement ADLS support and make it actually compatible with Azure.

@mosabua
Copy link
Member

mosabua commented Aug 26, 2024

Lets ship this @electrum

@mosabua
Copy link
Member

mosabua commented Aug 27, 2024

I am merging after chatting with @anusudarsan

@mosabua mosabua merged commit 475e1d1 into master Aug 27, 2024
69 of 70 checks passed
@mosabua mosabua deleted the electrum/azurehost branch August 27, 2024 16:28
@github-actions github-actions bot added this to the 455 milestone Aug 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Allow Azure govcloud for native Azure file system
5 participants