GH-40028: [C++][FS][Azure] Add AzureFileSystem support to FileSystemFromUri() #40325
@Tom-Newton Are |
I would say yes. I think originally they came from Hadoop with the extra "s" indicating secure but other filesystem implementations seem to have adopted it and they seem to be used mostly interchangeably. |
From the documentation if it helps
Thanks for the note. I've added the documentation URL as a comment. |
cpp/src/arrow/filesystem/azurefs.cc
Result<AzureOptions> AzureOptions::FromUri(const arrow::internal::Uri& uri,
                                           std::string* out_path) {
A PR by @bkietz is moving Uri out of internal, so we should be careful with the merges.
Thanks for the info!
#39067
cpp/src/arrow/filesystem/azurefs.cc
if (container.empty()) {
  return Status::Invalid("Missing container name in Azure Blob File System URI");
}
Why do you need a container name if the filesystem wraps the entire storage account?
Ah, we don't need this check. I'll remove this.
(I used GcsOptions::FromUri() as a base implementation and forgot to remove this check.)
cpp/src/arrow/filesystem/azurefs.cc
std::unordered_map<std::string, std::string> options_map;
ARROW_ASSIGN_OR_RAISE(const auto options_items, uri.query_items());
for (const auto& kv : options_items) {
  options_map.emplace(kv.first, kv.second);
}
There is no need to build the map if you're going to iterate over the kv pairs and switch on the key. Building it only randomizes the iteration order.
You're right. I borrowed this implementation from GcsOptions::FromUri() but I should have noticed this.
(We should remove this conversion in GcsOptions::FromUri() too later.)
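The suggestion above can be sketched as follows. This is a minimal, standard-library-only illustration, not the code from the PR: `SketchOptions`, `QueryItems` and `ApplyQueryItems` are hypothetical names, and the vector of pairs stands in for whatever `uri.query_items()` yields. The point is only that the key/value pairs are handled in URI order, without an intermediate `std::unordered_map`.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the options struct; not Arrow's AzureOptions.
struct SketchOptions {
  std::string blob_storage_scheme = "https";
  std::string dfs_storage_scheme = "https";
};

// Hypothetical alias for the already-decoded query items of a URI.
using QueryItems = std::vector<std::pair<std::string, std::string>>;

// Applies each query item in the order it appears in the URI.
// Returns an empty string on success, or an error message for the
// first key that is not recognized.
std::string ApplyQueryItems(const QueryItems& items, SketchOptions* out) {
  for (const auto& kv : items) {
    if (kv.first == "blob_storage_scheme") {
      out->blob_storage_scheme = kv.second;
    } else if (kv.first == "dfs_storage_scheme") {
      out->dfs_storage_scheme = kv.second;
    } else {
      return "Unexpected query parameter: '" + kv.first + "'";
    }
  }
  return std::string();
}
```

Iterating the vector directly also keeps any error about an unknown parameter deterministic, because the first offending key in the URI is always the one reported.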
  options.blob_storage_scheme = kv.second;
} else if (kv.first == "dfs_storage_scheme") {
  options.dfs_storage_scheme = kv.second;
} else if (kv.first == "credential_kind") {
credential_kind_ should be inferred from what you find on the URI without the user having to set both the credential kind and the credentials.
Does it mean that we should use ConfigureClientSecretCredential() if tenant_id, client_id and client_secret are specified but credential_kind=client_secret isn't specified?
credential_kind should never be specified and we should validate the URI to keep the invariant that it doesn't configure two different auth methods. And when nothing is provided, we use the default auth chain provided by the SDK.
Hmm, how can we distinguish ConfigureAnonymousCredential(), ConfigureWorkloadIdentityCredential() and ConfigureDefaultCredential()? None of them requires additional information.
The parameter-less auth methods can each have a dedicated query parameter. These are the valid configurations regarding auth:
- nothing (use default auth chain)
- ?anonymous
- ?use_workload_identity
- ?account_key=<ACCOUNT_KEY>
- ?tenant_id=<TENANT_ID>&client_id=<CLIENT_ID>&client_secret=<CLIENT_SECRET>
- ?client_id=<CLIENT_ID> (client_id alone means managed identity credential)
?anonymous and ?use_workload_identity are conflicting parameters. (We can't specify both of them at once.) I think that it's better to use the same parameter name for the type (XXX={anonymous,workload_identity}). If we use it, users can't specify both of them at once. (I know that the URI spec accepts XXX=anonymous&XXX=workload_identity.)
How about accepting only (default, ) anonymous and use_workload_identity as valid credential_kind parameter?
- nothing (use default auth chain) -> nothing or ?credential_kind=default
- ?anonymous -> ?credential_kind=anonymous
- ?use_workload_identity -> ?credential_kind=workload_identity
- ?account_key=<ACCOUNT_KEY> -> not changed (?credential_kind=storage_shared_key is invalid)
- ?tenant_id=<TENANT_ID>&client_id=<CLIENT_ID>&client_secret=<CLIENT_SECRET> -> not changed (?credential_kind=client_secret is invalid)
- ?client_id=<CLIENT_ID> (client_id alone means managed identity credential) -> not changed (?credential_kind=managed_identity is invalid)
Ah, we don't need ?account_key=<ACCOUNT_KEY> because we can get it from the URI's password part.
How about accepting only (default, ) anonymous and use_workload_identity as valid credential_kind parameter?
Sure. That looks good.
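The rules agreed in this thread can be sketched roughly as below. This is an illustration under stated assumptions, not the PR's implementation: `AuthInputs`, `CredentialKind` and `InferCredentialKind` are hypothetical names, the account key is assumed to arrive from the URI's password part, and treating an explicit `credential_kind` combined with other credentials as an error is one reading of the "no two auth methods" invariant. Rejecting an incomplete `tenant_id`/`client_secret` set is likewise an assumption.

```cpp
#include <string>

// Hypothetical enumeration of the auth methods discussed in the thread.
enum class CredentialKind {
  kDefault,
  kAnonymous,
  kWorkloadIdentity,
  kStorageSharedKey,
  kClientSecret,
  kManagedIdentity,
};

// Hypothetical bundle of the auth-related pieces extracted from a URI.
struct AuthInputs {
  std::string credential_kind;  // ?credential_kind=...
  std::string account_key;      // URI password part
  std::string tenant_id;        // ?tenant_id=...
  std::string client_id;        // ?client_id=...
  std::string client_secret;    // ?client_secret=...
};

// Returns false when the inputs are invalid or configure more than one
// auth method; otherwise stores the inferred kind in *out.
bool InferCredentialKind(const AuthInputs& in, CredentialKind* out) {
  int configured = 0;
  CredentialKind kind = CredentialKind::kDefault;  // nothing given: default chain
  if (!in.credential_kind.empty()) {
    // Only the parameter-less methods may be named explicitly.
    if (in.credential_kind == "default") {
      kind = CredentialKind::kDefault;
    } else if (in.credential_kind == "anonymous") {
      kind = CredentialKind::kAnonymous;
    } else if (in.credential_kind == "workload_identity") {
      kind = CredentialKind::kWorkloadIdentity;
    } else {
      return false;
    }
    ++configured;
  }
  if (!in.account_key.empty()) {
    kind = CredentialKind::kStorageSharedKey;
    ++configured;
  }
  if (!in.tenant_id.empty() || !in.client_secret.empty()) {
    // Client secret auth needs all three values (assumption: partial sets are errors).
    if (in.tenant_id.empty() || in.client_id.empty() || in.client_secret.empty()) {
      return false;
    }
    kind = CredentialKind::kClientSecret;
    ++configured;
  } else if (!in.client_id.empty()) {
    // client_id alone means managed identity.
    kind = CredentialKind::kManagedIdentity;
    ++configured;
  }
  if (configured > 1) return false;
  *out = kind;
  return true;
}
```

Counting the configured methods in one place keeps the invariant check separate from the per-method parsing, so adding another method later would not require touching the conflict logic.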
@kou @Tom-Newton can we use |
I have quite a strong opinion that we should support the |
Fair enough, then what are the semantics of each URI format and how they map to this implementation? I will start with one rule: both |
That is fine with me.
For Hadoop format URIs I would probably just extract the storage account name. That means there is a lot of redundant information in the URI but I don't think that is really a problem and it gives us compatibility which I think is important. If we want to support other URIs formats too that would be useful to some people. Some examples I've seen on other filesystems: https://docs.rs/object_store/latest/object_store/azure/struct.MicrosoftAzureBuilder.html#method.with_url https://github.com/fsspec/adlfs |
Thank you @Tom-Newton! This list from the Rust crate defines formats that allow us to express URIs that refer to the entire storage account and not just a specific filesystem. Plus, simple and short URIs that allow us to simply use the default endpoints. cc @kou |
URI list from https://docs.rs/object_store/latest/object_store/azure/struct.MicrosoftAzureBuilder.html#method.with_url :
We can use |
@kou let's implement only
Supporting |
Can we also support one more format for Azurite?
If |
Sure. I think that can work well. |
OK. I'll implement the discussed spec. |
Implemented:
Supported formats:
1. abfs[s]://[:<password>@]<account>.blob.core.windows.net[/<container>[/<path>]]
2. abfs[s]://<container>[:<password>]@<account>.dfs.core.windows.net[/path]
3. abfs[s]://[<account[:<password>]@]<host[.domain]>[<:port>][/<container>[/path]]
4. abfs[s]://[<account[:<password>]@]<container>[/path]
Added query parameters:
* enable_tls: It replaces blob_storage_scheme and dfs_storage_scheme parameters.
Removed query parameters:
* blob_storage_scheme: Replaced with enable_tls.
* dfs_storage_scheme: Replaced with enable_tls.
Changed query parameters:
* credential_kind: Accepts only "default", "anonymous" and "workload_identity".
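To make the four formats concrete, here is a rough sketch of how already-split URI components might be mapped to an account name and a container-prefixed path. It is not the PR's code: `UriParts`, `Target` and `Classify` are hypothetical names, and the rule used to tell format 3 from format 4 (a dot in the host or an explicit port means format 3) is an assumption made for this illustration, since the format list alone does not say how the ambiguity is resolved.

```cpp
#include <string>

// Hypothetical, already-split URI components (scheme and password omitted).
struct UriParts {
  std::string user;  // userinfo before ':' or '@'
  std::string host;
  std::string port;  // empty when absent
  std::string path;  // with leading '/', or empty
};

// Hypothetical result: which format matched and what was extracted.
struct Target {
  int format = 0;
  std::string account;
  std::string authority;  // only set for format 3 (custom endpoint)
  std::string path;       // "<container>[/<path>]"
};

static bool EndsWith(const std::string& s, const std::string& suffix) {
  return s.size() >= suffix.size() &&
         s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Joins a container (may be empty) with a URI path, dropping the leading '/'.
static std::string Join(const std::string& head, const std::string& tail) {
  std::string rest = (!tail.empty() && tail[0] == '/') ? tail.substr(1) : tail;
  if (head.empty()) return rest;
  if (rest.empty()) return head;
  return head + "/" + rest;
}

Target Classify(const UriParts& u) {
  const std::string kBlob = ".blob.core.windows.net";
  const std::string kDfs = ".dfs.core.windows.net";
  Target t;
  if (EndsWith(u.host, kBlob)) {
    t.format = 1;  // account in host, container in path
    t.account = u.host.substr(0, u.host.size() - kBlob.size());
    t.path = Join("", u.path);
  } else if (EndsWith(u.host, kDfs)) {
    t.format = 2;  // account in host, container in userinfo
    t.account = u.host.substr(0, u.host.size() - kDfs.size());
    t.path = Join(u.user, u.path);
  } else if (u.host.find('.') != std::string::npos || !u.port.empty()) {
    t.format = 3;  // custom endpoint such as Azurite (assumed rule)
    t.account = u.user;
    t.authority = u.port.empty() ? u.host : u.host + ":" + u.port;
    t.path = Join("", u.path);
  } else {
    t.format = 4;  // short form: host position holds the container
    t.account = u.user;
    t.path = Join(u.host, u.path);
  }
  return t;
}
```

For example, under these assumptions a URI with user `devstoreaccount1`, host `127.0.0.1` and port `10000` would be treated as format 3, while one with user `myaccount` and host `data` would be treated as format 4 with `data` as the container.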
I'll merge this next week if nobody objects.
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 605f8a7. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 6 possible false positives for unstable benchmarks that are known to sometimes produce them. |
### Rationale for this change
Failure to rebase and build when merging #39067 (which renamed `internal::Uri` -> `util::Uri`) led to a merge conflict since #40325 added more usages of `internal::Uri`
### What changes are included in this PR?
Rename internal::Uri -> util::Uri
### Are these changes tested?
Yes
### Are there any user-facing changes?
No
* GitHub Issue: #40562
Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
Rationale for this change
FileSystemFromUri() is a common API to create a file system object. FileSystemFromUri() should be able to create an AzureFileSystem object.
What changes are included in this PR?
Add AzureOptions::FromUri() and use it from FileSystemFromUri().
See the AzureOptions::FromUri()'s docstring about the supported formats.
Are these changes tested?
Yes.
Are there any user-facing changes?
Yes.