Support writes to Azure Storage #489
Comments
I am considering looking into this once (if :)) #486 gets merged. My assumption would be that |
Thanks @roeap, that would be awesome. Your assumptions above are all correct. Only HNS supports atomic renames, which will obviate the need for the lock service that we have to use for S3. |
@roeap some things to be aware of:
Great to have a collaborator in making delta-rs work with Azure! |
@thovoll, thanks for the pointers. I'm excited to see that the azure-sdk-for-rust repo has seen increased activity recently, I'm guessing in a push toward a release of the crates. Hopefully we can bring Azure support throughout the arrow-rust-data-ecosystem on par with S3 :). |
Keep track of ADLS Gen2 support in the Azure Rust SDK here: Azure/azure-sdk-for-rust#496 |
Do we know whether the atomic rename in the Gen2 API is a cheap pointer change, or does it actually require a data copy behind the scenes? If it's the latter, we might be better off using simple put-if-absent semantics instead. @blogle started a good discussion around this redesign in this slack thread for the GCS backend. |
Yes, file rename is a cheap and atomic operation in ADLS Gen2. |
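To make the two commit primitives from this exchange concrete, here is a minimal in-memory sketch of "rename without replace" next to "put if absent". Everything in it (MemStore, StoreError, the method names) is hypothetical and only illustrates the semantics; it is not the delta-rs storage trait or any Azure SDK call, and a real backend would rely on the service to enforce the existence check atomically.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for an object store (illustration only).
#[derive(Default)]
struct MemStore {
    objects: HashMap<String, Vec<u8>>,
}

#[derive(Debug, PartialEq)]
enum StoreError {
    AlreadyExists,
    NotFound,
}

impl MemStore {
    /// Unconditional write, e.g. for a temporary commit file.
    fn put(&mut self, path: &str, data: &[u8]) {
        self.objects.insert(path.to_string(), data.to_vec());
    }

    /// Rename that refuses to overwrite an existing destination.
    /// `&mut self` makes check-and-move exclusive here; a remote store
    /// has to provide that guarantee itself.
    fn rename_noreplace(&mut self, src: &str, dst: &str) -> Result<(), StoreError> {
        if self.objects.contains_key(dst) {
            return Err(StoreError::AlreadyExists);
        }
        let data = self.objects.remove(src).ok_or(StoreError::NotFound)?;
        self.objects.insert(dst.to_string(), data);
        Ok(())
    }

    /// Put-if-absent: write directly to the destination, but only when
    /// nothing exists there yet. No temporary file is needed.
    fn put_if_absent(&mut self, dst: &str, data: &[u8]) -> Result<(), StoreError> {
        if self.objects.contains_key(dst) {
            return Err(StoreError::AlreadyExists);
        }
        self.objects.insert(dst.to_string(), data.to_vec());
        Ok(())
    }
}
```

Either primitive lets exactly one writer win a given commit version; the difference is whether the data is first written to a temporary path and then moved, or written once under a precondition.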
We now have the required operation in the Azure Rust SDK that enables us to implement We should be able to implement I also know #486 is ready to merge now, so we will be able to start working on supporting writes to Azure Storage soon. |
Is this 3rd party Azure SDK implementation found here - https://crates.io/crates/azure_sdk_for_rust - the same one? It has crates already released. |
I think these crates are actually the foundation for the development going on under the Azure github org right now. If you go to the repo, you'll see a notice that they migrated. Looking at the efforts in the new repo, they are working hard on stabilizing APIs and hopefully soon cutting a 0.1 release. |
@roeap @houqp what do you think about the auth methods we should support? After #486 we have:
None of these should be used in production applications in my opinion, for various reasons. The static methods present key rotation and authz challenges. DefaultAzureCredential is just a convenience abstraction with non-deterministic behavior. I propose we use managed identity credentials (AMI). The downside is that they are hard to use during development, but we can support Making these work at runtime requires an identity in Azure (which the key based methods don't), but this is the right direction since it better supports key rotation and authz. I realize we should probably align this with how we do things for AWS and GCP, but I wanted to present this unadulterated argument first. Any thoughts? |
Thanks for starting this discussion @thovoll . I think we should definitely have first class support for AMI due to production requirements. @thovoll does Azure sdk support the concept of credential chain? Basically the way it works for AWS is the client will automatically identify the best auth method based on provided environment variable and credential config file across different languages and platforms: see https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/credentials.html and https://github.com/rusoto/rusoto/blob/master/AWS-CREDENTIALS.md. If we have something similar provided by the azure sdk out of the box, that would be ideal. If not, we can implement a temp workaround within delta-rs to simulate something similar. For example, go for AMI if it's available, if not try other routes that are easier to prepare for development environments. |
The DefaultCredential we use is a credential chain. Only authorization via key or SAS is not part of this chain. I am not sure about the priorities within the chained credential, but I assume this can either be configured or achieved with a simple custom credential. |
Correct, the default credential chain from the Azure Rust SDK can be configured, but it's still rudimentary, including in the methods it supports. There's no custom chain in the Azure Rust SDK yet, but the Azure .NET SDK, for example, has more complete default and custom credential chains, which will be implemented in the Azure Rust SDK at some point. Regardless, my argument is that credential chains are convenient but shouldn't be used for production code. I'd rather explicitly state which credentials I want to use than fall through a hidden switch statement based on environmental conditions. The potential 'decoy' log entries alone are distracting. Granted, if we want to support all the flavors, it would be a little tedious to implement a deterministic approach, but I'd probably still prefer it. Since delta-rs is a library, we should probably support all the credential types (but without using a chain). The AWS docs that @houqp linked recommend using the default chain but also mention the other approaches:
The Azure docs at least hint at the tradeoff:
What I'm arguing here is that we should want more control. An open question is "where is the right place?" Maybe delta-rs, maybe leave it up to consumers like kafka-delta-ingest. |
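As a rough illustration of the "explicit over chained" argument, the sketch below selects exactly one credential type from configuration and fails loudly when it is missing or incomplete, instead of falling through to the next option. The AzureAuth enum, the config keys (auth_method, tenant_id, sas_token, and so on), and auth_from_config are all hypothetical names invented for this example; none of them come from delta-rs or the Azure Rust SDK.

```rust
use std::collections::HashMap;

/// Hypothetical credential options (illustrative, not SDK types).
#[derive(Debug, PartialEq)]
enum AzureAuth {
    ManagedIdentity,
    ClientSecret { tenant_id: String, client_id: String, client_secret: String },
    SharedKey { account_key: String },
    Sas { token: String },
}

#[derive(Debug, PartialEq)]
enum AuthConfigError {
    MissingMethod,
    UnknownMethod(String),
    MissingField(&'static str),
}

/// Pick one credential type deterministically from explicit configuration.
/// There is no fallback: a missing or unknown method is an error.
fn auth_from_config(cfg: &HashMap<String, String>) -> Result<AzureAuth, AuthConfigError> {
    // Helper that turns an absent key into a descriptive error.
    let get = |key: &'static str| cfg.get(key).cloned().ok_or(AuthConfigError::MissingField(key));

    let method = cfg.get("auth_method").ok_or(AuthConfigError::MissingMethod)?;
    match method.as_str() {
        "managed_identity" => Ok(AzureAuth::ManagedIdentity),
        "client_secret" => Ok(AzureAuth::ClientSecret {
            tenant_id: get("tenant_id")?,
            client_id: get("client_id")?,
            client_secret: get("client_secret")?,
        }),
        "shared_key" => Ok(AzureAuth::SharedKey { account_key: get("account_key")? }),
        "sas" => Ok(AzureAuth::Sas { token: get("sas_token")? }),
        other => Err(AuthConfigError::UnknownMethod(other.to_string())),
    }
}
```

Whether such a selector belongs in delta-rs or in a consumer like kafka-delta-ingest is exactly the open question raised above; the sketch only shows that the deterministic variant is a small amount of code.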
As for a concrete long-term proposal (long-term because the Azure Rust SDK doesn't fully support this yet):
Again, this is a long-term proposal but we can still discuss this in principle I think. If we agree, we can plot a path towards this future across all 3 projects ( |
As for a concrete short-term proposal:
I'll raise a draft PR for easier discussion. |
Ha, I see what I was missing now. In AWS, it's common to use the default credential chain in production, but it looks like that's not the case in Azure. That said, for S3, we do have an option to let users explicitly provide a preconfigured client when creating the storage backend instance: delta-rs/rust/src/storage/s3/mod.rs Line 413 in cba4e3d
It's definitely doable and I'm sure many people use credential chains in Azure as well. Irrespective of how common it is, what do you think about the argument otherwise? For S3, I think you meant this one? delta-rs/rust/src/storage/s3/mod.rs Lines 425 to 429 in cba4e3d
I'll take a closer look and try to keep it close. Do we support auto-refreshing credentials in S3 and GCP? |
Yeah, sorry @thovoll , |
The way we implemented the Azure credentials is actually heavily inspired by the internals of the rusoto s3 crate 🙂. |
Do the AWS and GCP SDKs support auto-refresh? |
Yes, as I remember it, the AWS SDK has an auto-refreshing credential provider built in. |
I'm looking into how to best implement auto-refresh. |
Ok, so auto-refresh in the Azure Rust SDK will take some time, filed an issue here: Azure/azure-sdk-for-rust#543 In the meantime, we'll have to implement auto-refresh in |
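For discussion, a minimal sketch of what client-side auto-refresh could look like: cache the token and fetch a new one once the current time comes within a safety margin of expiry. AccessToken, RefreshingToken, and the fetch closure are hypothetical names for this example and do not mirror the Azure Rust SDK's token types; the current time is passed in explicitly so the behavior is easy to test.

```rust
use std::time::{Duration, Instant};

/// Hypothetical token shape (the real SDK's token response differs).
#[derive(Clone, Debug)]
struct AccessToken {
    secret: String,
    expires_at: Instant,
}

/// Caches a token and refetches it shortly before it expires.
struct RefreshingToken<F: FnMut() -> AccessToken> {
    fetch: F,
    cached: Option<AccessToken>,
    refresh_margin: Duration,
}

impl<F: FnMut() -> AccessToken> RefreshingToken<F> {
    fn new(fetch: F, refresh_margin: Duration) -> Self {
        Self { fetch, cached: None, refresh_margin }
    }

    /// Return a token valid at `now`, calling `fetch` only when there is
    /// no cached token or the cached one is within the refresh margin.
    fn get(&mut self, now: Instant) -> &AccessToken {
        let stale = match &self.cached {
            Some(tok) => now + self.refresh_margin >= tok.expires_at,
            None => true,
        };
        if stale {
            self.cached = Some((self.fetch)());
        }
        self.cached.as_ref().expect("token was just cached")
    }
}
```

A production version would additionally need to be shareable across threads and async tasks (for example behind a mutex) and to handle fetch failures, both of which this sketch leaves out.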
@thovoll - this is done by now, right? |
Description
The delta-rs Azure Storage backend currently doesn't support write operations:
delta-rs/rust/src/storage/azure.rs
Line 1 in 50a53ca
As a result, Delta Tables cannot be written to Azure Storage.
Several methods need to be implemented:
delta-rs/rust/src/storage/azure.rs
Line 192 in 50a53ca
delta-rs/rust/src/storage/azure.rs
Line 196 in 50a53ca
delta-rs/rust/src/storage/azure.rs
Line 200 in 50a53ca
The delta-rs Azure Storage backend currently uses the Blob API to access the storage account (ContainerClient), which does not support atomic rename (rename_obj_noreplace). We need to use the ADLS Gen2 API, which does support atomic rename.
Both APIs can be used in parallel with ADLS Gen2 storage accounts, but we may want to simply switch all operations over to the ADLS Gen2 API unless there is an advantage to using the Blob API for some operations.
The other operations required by the delta-rs Azure Storage backend are:
delta-rs/rust/src/storage/azure.rs
Line 122 in 50a53ca
delta-rs/rust/src/storage/azure.rs
Line 141 in 50a53ca
delta-rs/rust/src/storage/azure.rs
Line 157 in 50a53ca
For atomic rename to work, the actual Azure Storage Account resource being used must have hierarchical namespaces enabled, which is what makes it an "ADLS Gen2" storage account.
Apologies for any remaining confusion, this is a bit tricky to explain.
Also, the Azure Rust SDK doesn't yet support write operations via the ADLS Gen2 API, but work is underway.
Use Case
Write Delta Tables to Azure Storage
Related Issue(s)
None