-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C++] Implement GetFileInfo for a single file in Azure filesystem #38335
Comments
It seems I don't actually have permissions to assign myself, without opening a PR but I'm going to start working on this. |
Initially I assumed that the correct behaviour would be to return Presumably we want to implement the same thing for Azure, and probably this should follow a slightly different code path if hierarchical namespace is enabled. With HNS we can avoid the extra listing operation, so that would provide a performance advantage. This makes things more complicated but this was inevitable at some point so I guess I may as well do it now. |
We can use "take" only comment for it: https://arrow.apache.org/docs/developers/bug_reports.html#issue-assignment |
I think this is ready for review now, so hopefully what I've done makes sense. #38505 Unfortunately I could not test all the hierarchical namespace related stuff with |
…lesystem (#38505) ### Rationale for this change `GetFileInfo` is an important part of an Arrow filesystem implementation. ### What changes are included in this PR? - Start `azurefs_internal` similar to GCS and S3 filesystems. - Implement `HierarchicalNamespaceDetector`. - This does not use the obvious and simple implementation. It uses a more complicated option inspired by `hadoop-azure` that avoids requiring the significantly elevated permissions needed for `blob_service_client->GetAccountInfo()`. - This can't be detected an initialisation time of the filesystem because it requires a `container_name`. Its packed into its only class so that the result can be cached. - Implement `GetFileInfo` for single paths. - Supports hierarchical or flat namespace accounts and takes advantage of hierarchical namespace where possible to avoid unnecessary extra calls to blob storage. The performance difference is actually noticeable just from running the `GetFileInfoObjectWithNestedStructure` test against real flat and hierarchical accounts. Its about 3 seconds with hierarchical namespace or 5 seconds with a flat namespace. - Update tests with TODO(GH-38335) to now use this implementation of `GetFileInfo` to replace the temporary direct Azure SDK usage. - Rename the main test fixture and introduce new ones for connecting to real blob storage. If details of real blob storage is not provided then the real blob storage tests will be skipped. ### Are these changes tested? Yes. There are new Azurite based tests for everything that can be tested with Azurite. There are also some tests that are designed to test against a real blob storage account. This is because [Azurite cannot emulate a hierarchical namespace account](Azure/Azurite#553). Additionally some of the behaviour used to detect a hierarchical namespace account is different on Azurite compared to a real flat namespace account. These tests will be automatically skipped unless environment variables are provided with details for connecting to the relevant real storage accounts. Initially I based the tests on the GCS filesystem but I added a few extras where I thought it was appropriate. ### Are there any user-facing changes? Yes. `GetFileInfo` is now supported on the Azure filesystem. * Closes: #38335 Lead-authored-by: Thomas Newton <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…ure filesystem (apache#38505) ### Rationale for this change `GetFileInfo` is an important part of an Arrow filesystem implementation. ### What changes are included in this PR? - Start `azurefs_internal` similar to GCS and S3 filesystems. - Implement `HierarchicalNamespaceDetector`. - This does not use the obvious and simple implementation. It uses a more complicated option inspired by `hadoop-azure` that avoids requiring the significantly elevated permissions needed for `blob_service_client->GetAccountInfo()`. - This can't be detected an initialisation time of the filesystem because it requires a `container_name`. Its packed into its only class so that the result can be cached. - Implement `GetFileInfo` for single paths. - Supports hierarchical or flat namespace accounts and takes advantage of hierarchical namespace where possible to avoid unnecessary extra calls to blob storage. The performance difference is actually noticeable just from running the `GetFileInfoObjectWithNestedStructure` test against real flat and hierarchical accounts. Its about 3 seconds with hierarchical namespace or 5 seconds with a flat namespace. - Update tests with TODO(apacheGH-38335) to now use this implementation of `GetFileInfo` to replace the temporary direct Azure SDK usage. - Rename the main test fixture and introduce new ones for connecting to real blob storage. If details of real blob storage is not provided then the real blob storage tests will be skipped. ### Are these changes tested? Yes. There are new Azurite based tests for everything that can be tested with Azurite. There are also some tests that are designed to test against a real blob storage account. This is because [Azurite cannot emulate a hierarchical namespace account](Azure/Azurite#553). Additionally some of the behaviour used to detect a hierarchical namespace account is different on Azurite compared to a real flat namespace account. These tests will be automatically skipped unless environment variables are provided with details for connecting to the relevant real storage accounts. Initially I based the tests on the GCS filesystem but I added a few extras where I thought it was appropriate. ### Are there any user-facing changes? Yes. `GetFileInfo` is now supported on the Azure filesystem. * Closes: apache#38335 Lead-authored-by: Thomas Newton <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…ure filesystem (apache#38505) ### Rationale for this change `GetFileInfo` is an important part of an Arrow filesystem implementation. ### What changes are included in this PR? - Start `azurefs_internal` similar to GCS and S3 filesystems. - Implement `HierarchicalNamespaceDetector`. - This does not use the obvious and simple implementation. It uses a more complicated option inspired by `hadoop-azure` that avoids requiring the significantly elevated permissions needed for `blob_service_client->GetAccountInfo()`. - This can't be detected an initialisation time of the filesystem because it requires a `container_name`. Its packed into its only class so that the result can be cached. - Implement `GetFileInfo` for single paths. - Supports hierarchical or flat namespace accounts and takes advantage of hierarchical namespace where possible to avoid unnecessary extra calls to blob storage. The performance difference is actually noticeable just from running the `GetFileInfoObjectWithNestedStructure` test against real flat and hierarchical accounts. Its about 3 seconds with hierarchical namespace or 5 seconds with a flat namespace. - Update tests with TODO(apacheGH-38335) to now use this implementation of `GetFileInfo` to replace the temporary direct Azure SDK usage. - Rename the main test fixture and introduce new ones for connecting to real blob storage. If details of real blob storage is not provided then the real blob storage tests will be skipped. ### Are these changes tested? Yes. There are new Azurite based tests for everything that can be tested with Azurite. There are also some tests that are designed to test against a real blob storage account. This is because [Azurite cannot emulate a hierarchical namespace account](Azure/Azurite#553). Additionally some of the behaviour used to detect a hierarchical namespace account is different on Azurite compared to a real flat namespace account. These tests will be automatically skipped unless environment variables are provided with details for connecting to the relevant real storage accounts. Initially I based the tests on the GCS filesystem but I added a few extras where I thought it was appropriate. ### Are there any user-facing changes? Yes. `GetFileInfo` is now supported on the Azure filesystem. * Closes: apache#38335 Lead-authored-by: Thomas Newton <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
Describe the enhancement requested
Implement
Result<FileInfo> AzureFileSystem::GetFileInfo(const std::string& path)
.Logic I suggest we apply:
AzurePath
.path.container.empty()
then this is the root of the storage account. Its a directory and we don't really have any more info to add.path.path_to_file.empty()
then this is the root of a container. Try to fetch properties of the container. If the container exists and notIsDeleted()
then its a directory and probably we can add the last modified time.All of this only needs the simple blob endpoints - nothing exclusive to hierarchical namespace enabled accounts.
The other GetFileInfo method requires doing directory listing which will be some additional complexity so I propose implementing that in a separate PR.
Related Issues:
Component(s)
C++
The text was updated successfully, but these errors were encountered: