Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support writes to Azure Storage #489

Closed
thovoll opened this issue Nov 5, 2021 · 26 comments · Fixed by #761
Closed

Support writes to Azure Storage #489

thovoll opened this issue Nov 5, 2021 · 26 comments · Fixed by #761
Labels
binding/rust Issues for the Rust crate enhancement New feature or request storage/azure Azure Blog storage related

Comments

@thovoll
Copy link
Contributor

thovoll commented Nov 5, 2021

Description

The delta-rs Azure Storage backend currently doesn't support write operations:

//! The Azure Data Lake Storage Gen2 storage backend. It currently only supports read operations.

As a result, Delta Tables can not be written to Azure Storage.

Several methods need to be implemented:

async fn put_obj(&self, _path: &str, _obj_bytes: &[u8]) -> Result<(), StorageError> {

async fn rename_obj_noreplace(&self, _src: &str, _dst: &str) -> Result<(), StorageError> {

async fn delete_obj(&self, _path: &str) -> Result<(), StorageError> {

The delta-rs Azure Storage backend currently uses the Blob API to access the storage account (ContainerClient), which does not support atomic rename (rename_obj_noreplace). We need to use the ADLS Gen2 API, which does support atomic rename.

Both APIs can be used in parallel with ADLS Gen2 storage accounts, but we may want to simply switch all operations over to the ADLS Gen2 API unless there is an advantage to using the Blob API for some operations.

The other operations required by the delta-rs Azure Storage backend are:

async fn head_obj(&self, path: &str) -> Result<ObjectMeta, StorageError> {

async fn get_obj(&self, path: &str) -> Result<Vec<u8>, StorageError> {

async fn list_objs<'a>(

For atomic rename to work, the actual Azure Storage Account resource being used must have hierarchical namespaces enabled, which is what makes it an "ADLS Gen2" storage account.

Apologies for any remaining confusion, this is a bit tricky to explain.

Also, the Azure Rust SDK doesn't yet support write operations via the ADLS Gen API, but work is underway.

Use Case
Write Delta Tables to Azure Storage

Related Issue(s)
None

@thovoll thovoll added the enhancement New feature or request label Nov 5, 2021
@roeap
Copy link
Collaborator

roeap commented Nov 9, 2021

I am considering looking into this once (if :)) #486 gets merged. MY assumption would be that delete_obj and put_obj could be implemented without the need for hierarchical namespaces enabled. However I am still a bit confused what constitutes gen2... As far as I understand, the *.dfs.core.windows.net endpoint is available on all recent (v2) storage accounts, if HNS is enabled, it becomes ADLS Gen 2, and the file_system APIs / directories become available? If HNS is a requirement, likely #60 should then be addressed as well somewhere.

@thovoll
Copy link
Contributor Author

thovoll commented Nov 9, 2021

Thanks @roeap, that would be awesome. Your assumptions above are all correct. Only HNS supports atomic renames, which will obviate the need for the lock service that we have to use for S3.

@thovoll
Copy link
Contributor Author

thovoll commented Nov 9, 2021

@roeap some things to be aware of:

  • The Azure Rust SDK doesn't have any crate releases so can't be included in a delta-rs crate release yet (hopefully soon)
  • The Azure Rust SDK doesn't have the ADLS Gen2 atomic rename operation yet, but I've been working up to adding it. I just added a bunch of operations and rename_file_if_not_exists is next.

Great to have a collaborator in making delta-rs work with Azure!

@thovoll
Copy link
Contributor Author

thovoll commented Nov 9, 2021

PR for #60 is #492

@roeap
Copy link
Collaborator

roeap commented Nov 9, 2021

@thovoll, thanks for the pointers, and excited to see that the azure-sdk-for-rust repo seems to see some increased activity recently - I'm guessing pushing for a release of the crates... hopefully we can bring the azure support throughout the arrow-rust-data-ecosystem en par with S3 :).

@thovoll
Copy link
Contributor Author

thovoll commented Nov 10, 2021

Keep track of ADLS Gen2 support in the Azure Rust SDK here: Azure/azure-sdk-for-rust#496

@houqp
Copy link
Member

houqp commented Nov 15, 2021

do we know if the atomic rename in gen2 api is a cheap pointer change or will actually require data copy behind the scene? If it's the latter, we might be better off using simple put if absent semantic instead. @blogle started a good discussion around this redesign in this slack thread for the GCS backend.

@thovoll
Copy link
Contributor Author

thovoll commented Nov 15, 2021

Yes, file rename is a cheap and atomic operation in ADLS Gen2.

@thovoll
Copy link
Contributor Author

thovoll commented Nov 15, 2021

We now have the required operation in the Azure Rust SDK that enables us to implement rename_obj_noreplace in the delta-rs Azure Storage backend: Azure/azure-sdk-for-rust#506

We should be able to implement put_obj and delete_obj using the Blob API, just like head_obj, get_obj, and list_objs are implemented. Initially, only rename_obj_noreplace will need to use the ADLS Gen2 API. To reiterate, we should be able to use both Blob API and ADLS Gen2 API in parallel. We can decide later if we want to exclusively use the ADLS Gen2 API (as of now, not all required operations are implemented there).

I also know #486 is ready to merge now, so we will be able to start working on supporting writes to Azure Storage soon.

@andrei-ionescu
Copy link
Contributor

andrei-ionescu commented Nov 30, 2021

Is this 3rd party Azure SDK implementation found here - https://crates.io/crates/azure_sdk_for_rust - the same one? It has crates already released.

@roeap
Copy link
Collaborator

roeap commented Dec 1, 2021

I think these crates are actually the foundation for the development going on under the Azure github org right now. If you go to the repo, you'll see a notice that they migrated. Looking at the efforts in the new repo, they are working hard on stabilizing APIs and hopefully soon cutting a 0.1 release.

@thovoll
Copy link
Contributor Author

thovoll commented Dec 8, 2021

@roeap @houqp what do you think about the auth methods we should support?

After #486 we have:

  • Static auth: shared key, shared access signature (SAS), and connection string (shared key)
  • Auto-refreshing auth based on DefaultAzureCredential

None of these should be used in production applications in my opinion, for various reasons. The static methods present key rotation and authz challenges. DefaultAzureCredential is just a convenience abstraction with non-deterministic behavior.

I propose we use managed identity credentials (AMI). The downside is that they are hard to use during development, but we can support az cli credentials for this and activate that path using a suitable switch in delta-rs, maybe an environment variable.

Making these work at runtime requires an identity in Azure (which the key based methods don't), but this is the right direction since it better supports key rotation and authz.

I realize we should probably align this with how we do things for AWS and GCP, but I wanted to present this unadulterated argument first.

Any thoughts?

@houqp
Copy link
Member

houqp commented Dec 9, 2021

Thanks for starting this discussion @thovoll . I think we should definitely have first class support for AMI due to production requirements. @thovoll does Azure sdk support the concept of credential chain? Basically the way it works for AWS is the client will automatically identify the best auth method based on provided environment variable and credential config file across different languages and platforms: see https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/credentials.html and https://github.com/rusoto/rusoto/blob/master/AWS-CREDENTIALS.md. If we have something similar provided by the azure sdk out of the box, that would be ideal. If not, we can implement a temp workaround within delta-rs to simulate something similar. For example, go for AMI if it's available, if not try other routes that are easier to prepare for development environments.

@roeap
Copy link
Collaborator

roeap commented Dec 9, 2021

The DefaultCredential we use is an credential chain. Only authorization via KEY or or SAS is not part of this chain. I am not sure about priorities in the chained credential, but assume this can be either configured or achieved with a simple custom credential.

@thovoll
Copy link
Contributor Author

thovoll commented Dec 9, 2021

Correct, default credential chain from the Azure Rust SDK can be configured but it's still rudimentary, also in terms of the methods it supports. There's no custom chain in the Azure Rust SDK yet but the Azure .Net SDK for example has more complete default and custom credential chains which will be implemented in the Azure Rust SDK at some point.

Regardless, my argument is that credential chains are convenient but shouldn't be used for production code. I'd rather explicitly state what credentials I want to use than falling through a hidden switch statement based on environmental conditions. The potential 'decoy' log entries alone are distracting.

Granted, if we want to support all the flavors it would be a little tedious to implement a deterministic approach - but I'd probably still prefer it. Since delta-rs is a library we should probably support all the credential types (but without using a chain).

The AWS docs that @houqp linked recommend using the default chain but also mention the other approaches:

  • Use the default credential provider chain (recommended).
  • Use a specific credential provider or provider chain (or create your own).
  • Supply the credentials yourself. These can be root account credentials, IAM credentials, or temporary credentials retrieved from AWS STS.

The Azure docs at least hint at the tradeoff:

Note: DefaultAzureCredential is intended to simplify getting started with the SDK by handling common scenarios with reasonable default behaviors. Developers who want more control or whose scenario isn't served by the default settings should use other credential types.

What I'm arguing here is that we should want more control. An open question is "where is the right place?" Maybe delta-rs, maybe leave it up to consumers like kafka-delta-ingest.

@thovoll
Copy link
Contributor Author

thovoll commented Dec 9, 2021

As for a concrete long-term proposal (long-term because the Azure Rust SDK doesn't fully support this yet):

  • Take TokenCredential as a parameter in delta-rs and pass it to the DataLakeClient (not possible yet but here is what that looks like in the Azure .Net SDK).
  • This still allows delta-rs users to specify a chain since DefaultAzureCredential is a TokenCredential. But it doesn't allow delta-rs users to use other methods like shared keys or connection strings (which the Azure SDK supports but we would be hiding from delta-rs users, indicating our industrial strength intent).
  • Maybe we should support SAS but this method is intended for "limited delegated access" (potentially a scenario that delta-rs would be used in).
  • Also need to revisit how we can auto-refresh the credential.
  • In kafka-delta-ingest, don't pass DefaultAzureCredential to delta-rs but implement a more opinionated and deterministic approach. Only support AMI (for prod) and CLI (for dev), possibly controlled by a command line argument or environment variable.

Again, this is a long-term proposal but we can still discuss this in principle I think. If we agree, we can plot a path towards this future across all 3 projects (kafka-delta-ingest, delta-rs, and azure-sdk-for-rust).

@thovoll
Copy link
Contributor Author

thovoll commented Dec 9, 2021

As for a concrete short-term proposal:

  • Obtain a token (via DefaultAzureCredential) in delta-rs
  • Pass the token to DataLakeClient (this is the only method supported by the Azure Rust SDK at the moment)
  • This is the quickest way to enable writes to Azure Storage.
  • Improve from there, i.e. re-implement auto-refreshing credentials and everything else mentioned above over time.

I'll raise a draft PR for easier discussion.

@houqp
Copy link
Member

houqp commented Dec 9, 2021

Ha, I see where I am missing now. In AWS, it's common to use default credential chain in production, but looks like it's not the case in Azure. That said, for s3, we do have an option to let users explicitly provide a preconfigured client when creating the storage backend instance:

pub fn new_from_options(options: S3StorageOptions) -> Result<Self, StorageError> {
.

@thovoll
Copy link
Contributor Author

thovoll commented Dec 9, 2021

It's definitely doable and I'm sure many people use credential chains in Azure as well. Irrespective of how common it is, what do you think about the argument otherwise?

For S3, I think you meant this one?

pub fn new_with(
client: rusoto_s3::S3Client,
lock_client: Option<Box<dyn LockClient>>,
options: S3StorageOptions,
) -> Self {

I'll take a closer look and try to keep it close.

Do we support auto-refreshing credentials in S3 and GCP?

@houqp
Copy link
Member

houqp commented Dec 10, 2021

Yeah, sorry @thovoll , new_with is the method that I was trying to use as an example. Auto refresh of credential should be supported by cloud SDK if possible.

@roeap
Copy link
Collaborator

roeap commented Dec 10, 2021

The way we implemented for the Azure credentials is actually heavily inspired by the internals of the rusto s3 crate 🙂.

@thovoll
Copy link
Contributor Author

thovoll commented Dec 10, 2021

Does the AWS and GCP SDK support auto-refresh?

@houqp
Copy link
Member

houqp commented Dec 11, 2021

Yes, I remember the AWS SDK has an auto refreshing credential provider builtin.

@thovoll
Copy link
Contributor Author

thovoll commented Dec 13, 2021

I'm looking into how to best implement auto-refresh.

@thovoll
Copy link
Contributor Author

thovoll commented Dec 16, 2021

Ok, so auto-refresh in the Azure Rust SDK will take some time, filed an issue here: Azure/azure-sdk-for-rust#543

In the meantime, we'll have to implement auto-refresh in delta-rs.

@rtyler rtyler added binding/rust Issues for the Rust crate storage/azure Azure Blog storage related labels Dec 16, 2021
@roeap
Copy link
Collaborator

roeap commented May 15, 2022

@thovoll - this is done by now, right?

@roeap roeap mentioned this issue Aug 22, 2022
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate enhancement New feature or request storage/azure Azure Blog storage related
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants