Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-40028: [C++][FS][Azure] Add AzureFileSystem support to FileSystemFromUri() #40325

Merged
merged 9 commits into from
Mar 11, 2024

Conversation

kou
Copy link
Member

@kou kou commented Mar 3, 2024

Rationale for this change

FileSystemFromUri() is a common API to create a file system object. FileSystemFromUri() should be able to create an AzureFileSystem object.

What changes are included in this PR?

Add AzureOptions::FromUri() and use it from FileSystemFromUri().

See the AzureOptions::FromUri()'s docstring about the supported formats.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

@kou kou requested a review from felipecrv March 3, 2024 09:34
Copy link

github-actions bot commented Mar 3, 2024

⚠️ GitHub issue #40028 has been automatically assigned in GitHub to PR creator.

@kou
Copy link
Member Author

kou commented Mar 3, 2024

@Tom-Newton Are abfs:// and abfss:// natural for AzureFileSystem?

@Tom-Newton
Copy link
Contributor

@Tom-Newton Are abfs:// and abfss:// natural for AzureFileSystem?

I would say yes. I think originally they came from Hadoop with the extra "s" indicating secure but other filesystem implementations seem to have adopted it and they seem to be used mostly interchangeably.

@nosterlu
Copy link

nosterlu commented Mar 3, 2024

@Tom-Newton Are abfs:// and abfss:// natural for AzureFileSystem?

I would say yes. I think originally they came from Hadoop with the extra "s" indicating secure but other filesystem implementations seem to have adopted it and they seem to be used mostly interchangeably.

From the documentation if it helps

The abfs protocol is used as the scheme identifier. If you add an s at the end (abfss) then the ABFS Hadoop client driver will always use Transport Layer Security (TLS) irrespective of the authentication method chosen. If you choose OAuth as your authentication, then the client driver will always use TLS even if you specify abfs instead of abfss because OAuth solely relies on the TLS layer. Finally, if you choose to use the older method of storage account key, then the client driver interprets abfs to mean that you don't want to use TLS.

@kou
Copy link
Member Author

kou commented Mar 3, 2024

Thanks for the note. I've added the documentation URL as a comment.

Comment on lines 68 to 69
Result<AzureOptions> AzureOptions::FromUri(const arrow::internal::Uri& uri,
std::string* out_path) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A PR by @bkietz is moving Uri out of internal so we should be careful with the merges.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the info!
#39067

Comment on lines 108 to 110
if (container.empty()) {
return Status::Invalid("Missing container name in Azure Blob File System URI");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why you need a container name if the filesystem wraps the entire storage account?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, we don't need this check. I'll remove this.
(I used GcsOptions::FromUri() as a base implementation and forgot to remove this check.)

Comment on lines 115 to 119
std::unordered_map<std::string, std::string> options_map;
ARROW_ASSIGN_OR_RAISE(const auto options_items, uri.query_items());
for (const auto& kv : options_items) {
options_map.emplace(kv.first, kv.second);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to build the map if you're going to iterate over the kv pairs and switch. This is just randomizing the iteration order.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. I borrowed this implementation from GcsOptions::FromUri() but I should have noticed this.
(We should remove this conversion in GcsOptions::FromUri() too later.)

options.blob_storage_scheme = kv.second;
} else if (kv.first == "dfs_storage_scheme") {
options.dfs_storage_scheme = kv.second;
} else if (kv.first == "credential_kind") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

credential_kind_ should be inferred from what you find on the URI without the user having to set both the credential kind and the credentials.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it mean that we should use ConfigureClientSecretCredential() if tenant_id, client_id and client_secret are specified but credential_kind=client_secret isn't specified?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

credential_kind should never be specified and we should validate the URI to keep the invariant that it doesn't configure two different auth methods. And when nothing is provided, we use the default auth chain provided by the SDK.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, how can we distinguish ConfigureAnonymousCredential(), ConfigureWorkloadIdentityCredential() and ConfigureDefaultCredential()? All of them don't require additional information.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter-less auth methods can have dedicated query params for each. These being the valid configurations regarding auth:

  • nothing (use default auth chain)
  • ?anonymous
  • ?use_workload_identity
  • ?account_key=<ACCOUNT_KEY>
  • ?tenant_id=<TENANT_ID>&client_id=<CLIENT_ID>&client_secret=<CLIENT_SECRET>
  • ?client_id=<CLIENT_ID> (client_id alone means managed identity credential)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?anonymous and ?use_workaround_identity are conflicted parameters. (We can't specify both of them at once.) I think that it's better that we use the same parameter name for the type (XXX={anonymous,workload_identity}). If we use it, users can't specify both of them at once. (I know that URI spec accepts XXX=anonymous&XXX=workload_identity.)

How about accepting only (default, ) anonymous and use_workload_identity as valid credential_kind parameter?

  • nothing (use default auth chain) -> nothing or ?credential_kind=default
  • ?anonymous -> ?credential_kind=anonymous
  • ?use_workload_identity -> ?credential_kind=workload_identity
  • ?account_key=<ACCOUNT_KEY> -> not changed (?credential_kind=storage_shared_key is invalid)
  • ?tenant_id=<TENANT_ID>&client_id=<CLIENT_ID>&client_secret=<CLIENT_SECRET> -> not changed (?credential_kind=client_secret is invalid)
  • ?client_id=<CLIENT_ID> (client_id alone means managed identity credential) -> not changed (?credential_kind=managed_identity is invalid)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, we don't need ?account_key=<ACCOUNT_KEY> because we can get it from the URI's password part.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about accepting only (default, ) anonymous and use_workload_identity as valid credential_kind parameter?

Sure. That looks good.

cpp/src/arrow/filesystem/azurefs.h Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/azurefs.cc Outdated Show resolved Hide resolved
cpp/src/arrow/filesystem/filesystem.cc Outdated Show resolved Hide resolved
@felipecrv
Copy link
Contributor

@kou @Tom-Newton can we use azfs (only https and by default) because this implementation uses both Blobs and File System APIs?

@Tom-Newton
Copy link
Contributor

Tom-Newton commented Mar 4, 2024

@kou @Tom-Newton can we use azfs (only https and by default) because this implementation uses both Blobs and File System APIs?

I have quite a strong opinion that we should support the abfss:// and abfs:// Hadoop path formats. As a user I don't want to use different path formats when I use different filesystem implementations.

@felipecrv
Copy link
Contributor

felipecrv commented Mar 4, 2024

@kou @Tom-Newton can we use azfs (only https and by default) because this implementation uses both Blobs and File System APIs?

I have quite a strong opinion that we should support the abfss:// and abfs:// Hadoop path formats. As a user I don't want to use different path formats when I use different filesystem implementations.

Fair enough, then what are the semantics of each URI format and how they map to this implementation?

I will start with one rule: both abfss and abfs map to scheme=https in AzureOptions cause people running pyarrow on insecure Wi-Fi networks shouldn't have their credentials leaked.

@Tom-Newton
Copy link
Contributor

Tom-Newton commented Mar 4, 2024

I will start with one rule: both abfss and abfs map to scheme=https in AzureOptions

That is fine with me.

what are the semantics of each URI format and how they map to this implementation?

For Hadoop format URIs I would probably just extract the storage account name. That means there is a lot of redundant information in the URI but I don't think that is really a problem and it gives us compatibility which I think is important.

If we want to support other URIs formats too that would be useful to some people. Some examples I've seen on other filesystems: https://docs.rs/object_store/latest/object_store/azure/struct.MicrosoftAzureBuilder.html#method.with_url https://github.com/fsspec/adlfs

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Mar 5, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 5, 2024
@felipecrv
Copy link
Contributor

If we want to support other URIs formats too that would be useful to some people. Some examples I've seen on other filesystems: https://docs.rs/object_store/latest/object_store/azure/struct.MicrosoftAzureBuilder.html#method.with_url https://github.com/fsspec/adlfs

Thank you @Tom-Newton! This list from the Rust crate defines formats that allow us to express URIs that refer to the entire storage account and not just a specific filesystem. Plus, simple and short URIs that allow us to simply use the default endpoints. cc @kou

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Mar 6, 2024
@kou
Copy link
Member Author

kou commented Mar 6, 2024

URI list from https://docs.rs/object_store/latest/object_store/azure/struct.MicrosoftAzureBuilder.html#method.with_url :

  • abfs[s]://<container>/<path> (according to fsspec)
  • abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>
  • abfs[s]://<file_system>@<account_name>.dfs.fabric.microsoft.com/<path>
  • az://<container>/<path>
  • adl://<container>/<path>
  • azure://<container>/<path>
  • https://<account>.dfs.core.windows.net
  • https://<account>.blob.core.windows.net
  • https://<account>.blob.core.windows.net/<container>
  • https://<account>.dfs.fabric.microsoft.com
  • https://<account>.dfs.fabric.microsoft.com/<container>
  • https://<account>.blob.fabric.microsoft.com
  • https://<account>.blob.fabric.microsoft.com/<container>

We can use abfs/abfss/az/adl/azure schemes with the current FileSystemFromUri() mechanism. But https is a bit tricky. We need to check not only the scheme part but also the host part. This means that we need to put Azure related URI parsing code to filesystem.cc and azurefs.cc. And this isn't suitable for #39067 . #39067 only uses the scheme part to dispatch filesystem implementation.

@felipecrv
Copy link
Contributor

felipecrv commented Mar 6, 2024

@kou let's implement only abfs[s] for now. Not even use the az, adl, azure schemes.

abfs[s]://<container>/<path> (according to [fsspec](https://github.com/fsspec/adlfs))
abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>
abfs[s]://<file_system>@<account_name>.dfs.fabric.microsoft.com/<path>

Supporting https:// would onlye make sense if we were creating an Object Store API, but we are implementing filesystem abstractions on top of object stores accessible via HTTP, so it makes more sense to have them named something that is not http[s].

@kou
Copy link
Member Author

kou commented Mar 6, 2024

Can we also support one more format for Azurite?

abfs[s]://<host>:<port>/<container>/<path>

If <host> doesn't have . and the <port> part doesn't exist, we will interpret the given URI as abfs[s]://<container>/<path>.

@felipecrv
Copy link
Contributor

Can we also support one more format for Azurite?

abfs[s]://<host>:<port>/<container>/<path>

If <host> doesn't have . and the <port> part doesn't exist, we will interpret the given URI as abfs[s]://<container>/<path>.

Sure. I think that can work well.

@kou
Copy link
Member Author

kou commented Mar 7, 2024

OK. I'll implement the discussed spec.

Supported formats:

1. abfs[s]://[:<password>@]<account>.blob.core.windows.net[/<container>[/<path>]]
2. abfs[s]://<container>[:<password>]@<account>.dfs.core.windows.net[/path]
3. abfs[s]://[<account[:<password>]@]<host[.domain]>[<:port>][/<container>[/path]]
4. abfs[s]://[<account[:<password>]@]<container>[/path]

Added query parameters:

* enable_tls: It replaces blob_storage_scheme and dfs_storage_scheme
  parameters.

Removed query parameters:

* blob_storage_scheme: Replaced with enable_tls.
* dfs_storage_scheme: Replaced with enable_tls.

Changed query parameters:

* credential_kind: Accepts only "default", "anonymous" and
  "workload_identity".
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 7, 2024
@kou
Copy link
Member Author

kou commented Mar 7, 2024

Implemented:

Supported formats:

  1. abfs[s]://[:<password>@]<account>.blob.core.windows.net[/<container>[/<path>]]
  2. abfs[s]://<container>[:<password>]@<account>.dfs.core.windows.net[/path]
  3. abfs[s]://[<account[:<password>]@]<host[.domain]>[<:port>][/<container>[/path]]
  4. abfs[s]://[<account[:<password>]@]<container>[/path]

Added query parameters:

  • enable_tls: It replaces blob_storage_scheme and dfs_storage_scheme
    parameters.

Removed query parameters:

  • blob_storage_scheme: Replaced with enable_tls.
  • dfs_storage_scheme: Replaced with enable_tls.

Changed query parameters:

  • credential_kind: Accepts only default, anonymous and
    workload_identity.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Mar 7, 2024
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 8, 2024
@kou
Copy link
Member Author

kou commented Mar 8, 2024

I'll merge this in the next week if nobody objects it.

@kou kou merged commit 605f8a7 into apache:main Mar 11, 2024
34 of 35 checks passed
@kou kou deleted the cpp-azurefs-from-uri branch March 11, 2024 21:34
@kou kou removed the awaiting change review Awaiting change review label Mar 11, 2024
Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 605f8a7.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 6 possible false positives for unstable benchmarks that are known to sometimes produce them.

bkietz added a commit that referenced this pull request Mar 14, 2024
### Rationale for this change

Failure to rebase and build when merging #39067 (which renamed `internal::Uri` -> `util::Uri`) led to a merge conflict since #40325 added more usages of `internal::Uri`

### What changes are included in this PR?
Rename internal::Uri -> util::Uri

### Are these changes tested?
Yes

### Are there any user-facing changes?
No

* GitHub Issue: #40562

Authored-by: Benjamin Kietzman <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants