GH-38335: [C++] Implement `GetFileInfo` for a single file in Azure filesystem #38505
544299f to 4142733
@@ -78,18 +81,17 @@ struct AzurePath {
"Expected an Azure object path of the form 'container/path...', got a URI: '",
s, "'");
}
const auto src = internal::RemoveTrailingSlash(s);
This was preventing `GetFileInfo` from working on directories. The other filesystems do not have this.
FileType::NotFound);

AssertFileInfo(fs_.get(), PreexistingContainerPath() + "test-empty-object-dir",
FileType::Directory);
Ideally I would have liked to add an assertion here confirming that, with the hierarchical namespace, there are no calls to `ListBlobs`. That would require patching an Azure container client, which I didn't know how to do. If anyone has any suggestions, they would be appreciated.
We can do it by adding an internal `ListBlobs` call counter and exporting it only for testing.
Or we may be able to provide `AzureFileSystem::GetStatistics()`, whose return value provides statistics including the number of `ListBlobs` calls.
(I think that we don't need to test it. If we want to test it, we can open a new issue for it and defer it as a separate task so that we can merge this as soon as possible.)
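The call-counter idea above could look roughly like the following. This is only a minimal sketch under stated assumptions: `BlobCallStatistics` and `CountingContainerClient` are hypothetical names, the client is an in-memory stand-in rather than the Azure SDK container client, and nothing here is part of Arrow's actual API.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical statistics holder; a test could read it to check how many
// ListBlobs calls were issued.
class BlobCallStatistics {
 public:
  void RecordListBlobs() { list_blobs_calls_.fetch_add(1, std::memory_order_relaxed); }
  int64_t list_blobs_calls() const {
    return list_blobs_calls_.load(std::memory_order_relaxed);
  }

 private:
  std::atomic<int64_t> list_blobs_calls_{0};
};

// Hypothetical in-memory stand-in for a container client. Every ListBlobs
// call is counted before the (fake) listing is performed.
class CountingContainerClient {
 public:
  CountingContainerClient(BlobCallStatistics* stats, std::vector<std::string> blobs)
      : stats_(stats), blobs_(std::move(blobs)) {
    assert(stats_ != nullptr);
  }

  // Returns the names of all blobs that start with `prefix`.
  std::vector<std::string> ListBlobs(const std::string& prefix) {
    stats_->RecordListBlobs();
    std::vector<std::string> matches;
    for (const auto& name : blobs_) {
      if (name.compare(0, prefix.size(), prefix) == 0) matches.push_back(name);
    }
    return matches;
  }

 private:
  BlobCallStatistics* stats_;  // non-owning; must outlive this client
  std::vector<std::string> blobs_;
};
```

A test would then run the operation under test and assert on `list_blobs_calls()` afterwards, for example expecting zero for a hierarchical namespace account.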
I think I'm happy to leave out such an assertion, at least initially. If it were Python I would have done it, but it seems like mocking in C++ would be more complicated, even if I did understand the language 😅
1ab87dc to 28357b0
Thanks for reviewing, kou. I have addressed most of the comments and I should be able to address the remaining ones this evening.
42e3d31 to 7fe94f1
+1
AzureOptions options_;
internal::HierarchicalNamespaceDetector hierarchical_namespace_;
It seems that `HierarchicalNamespaceDetector` is simple enough to move into `Impl`. (`HierarchicalNamespaceDetector::Enabled()` is the only important method in the class.)
How about moving `HierarchicalNamespaceDetector::Enabled()` to `Impl::IsHierarchicalNamespaceEnabled()` and removing `HierarchicalNamespaceDetector` (or something like that)?
If we do that, we can make `datalake_service_client_` a `std::unique_ptr`.
I made it separate because I wanted to keep the cached value `enabled_` private from the rest of `Impl`. I was a bit concerned that people might try to access the cached state directly without realising that everything should go through the `Enabled()` function. Additionally, making it a separate class made it easier to test.
I think one possibility is to use a non-smart pointer in `HierarchicalNamespaceDetector`, because `HierarchicalNamespaceDetector` will always be destructed at the same time as `Impl`. See https://stackoverflow.com/questions/7657718/when-to-use-shared-ptr-and-when-to-use-raw-pointers. I think that should allow us to use a `unique_ptr` for `datalake_service_client_`. I think this would be my preferred solution. What do you think?
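The ownership arrangement proposed above can be sketched as follows. This is an illustration only, not the code from the PR: `FakeDataLakeServiceClient`, `ProbeHierarchicalNamespace()`, `Init()` and `probe_calls()` are hypothetical stand-ins, and the sketch returns a plain `bool` rather than any error-carrying type.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <string>

// Hypothetical stand-in for the Data Lake service client.
struct FakeDataLakeServiceClient {
  bool hierarchical = false;
  int probe_calls = 0;

  // Pretends to ask the service whether the account is hierarchical.
  bool ProbeHierarchicalNamespace(const std::string& container_name) {
    (void)container_name;  // unused by the fake
    ++probe_calls;
    return hierarchical;
  }
};

class HierarchicalNamespaceDetector {
 public:
  void Init(FakeDataLakeServiceClient* client) { client_ = client; }

  // The only way to read the cached state: probes on first use, then
  // returns the cached answer on every later call.
  bool Enabled(const std::string& container_name) {
    assert(client_ != nullptr);
    if (!enabled_.has_value()) {
      enabled_ = client_->ProbeHierarchicalNamespace(container_name);
    }
    return *enabled_;
  }

 private:
  FakeDataLakeServiceClient* client_ = nullptr;  // non-owning raw pointer
  std::optional<bool> enabled_;                  // cache, hidden from Impl
};

class Impl {
 public:
  explicit Impl(bool hierarchical)
      : datalake_service_client_(std::make_unique<FakeDataLakeServiceClient>()) {
    datalake_service_client_->hierarchical = hierarchical;
    hierarchical_namespace_.Init(datalake_service_client_.get());
  }

  bool IsHierarchical(const std::string& container_name) {
    return hierarchical_namespace_.Enabled(container_name);
  }
  int probe_calls() const { return datalake_service_client_->probe_calls; }

 private:
  // Declared before the detector, so it is destroyed after the detector
  // and the raw pointer never dangles while the detector is alive.
  std::unique_ptr<FakeDataLakeServiceClient> datalake_service_client_;
  HierarchicalNamespaceDetector hierarchical_namespace_;
};
```

Because the detector is a member of `Impl` and only borrows the client, `Impl` can be the sole owner through `std::unique_ptr`, while `enabled_` stays reachable only via `Enabled()`.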
I decided to just make my preferred change. If you think it's a bad idea, I'm happy to change it again to something else.
OK. Let's use the approach.
The lint failure was fixed by #38639.
…space check in the case that the result is cached.
737d926 to 0659a39
I'll merge this.
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 75a0403. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.
…ure filesystem (apache#38505)

### Rationale for this change

`GetFileInfo` is an important part of an Arrow filesystem implementation.

### What changes are included in this PR?

- Start `azurefs_internal` similar to GCS and S3 filesystems.
- Implement `HierarchicalNamespaceDetector`.
  - This does not use the obvious and simple implementation. It uses a more complicated option inspired by `hadoop-azure` that avoids requiring the significantly elevated permissions needed for `blob_service_client->GetAccountInfo()`.
  - This can't be detected at initialisation time of the filesystem because it requires a `container_name`. It's packed into its own class so that the result can be cached.
- Implement `GetFileInfo` for single paths.
  - Supports hierarchical or flat namespace accounts and takes advantage of hierarchical namespace where possible to avoid unnecessary extra calls to blob storage. The performance difference is actually noticeable just from running the `GetFileInfoObjectWithNestedStructure` test against real flat and hierarchical accounts. It's about 3 seconds with hierarchical namespace or 5 seconds with a flat namespace.
- Update tests with TODO(apacheGH-38335) to now use this implementation of `GetFileInfo` to replace the temporary direct Azure SDK usage.
- Rename the main test fixture and introduce new ones for connecting to real blob storage. If details of real blob storage are not provided then the real blob storage tests will be skipped.

### Are these changes tested?

Yes. There are new Azurite based tests for everything that can be tested with Azurite.

There are also some tests that are designed to test against a real blob storage account. This is because [Azurite cannot emulate a hierarchical namespace account](Azure/Azurite#553). Additionally, some of the behaviour used to detect a hierarchical namespace account is different on Azurite compared to a real flat namespace account. These tests will be automatically skipped unless environment variables are provided with details for connecting to the relevant real storage accounts.

Initially I based the tests on the GCS filesystem but I added a few extras where I thought it was appropriate.

### Are there any user-facing changes?

Yes. `GetFileInfo` is now supported on the Azure filesystem.

* Closes: apache#38335

Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
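### Rationale for this change

`GetFileInfo` is an important part of an Arrow filesystem implementation.

### What changes are included in this PR?

- Start `azurefs_internal` similar to GCS and S3 filesystems.
- Implement `HierarchicalNamespaceDetector`.
  - This does not use the obvious and simple implementation. It uses a more complicated option inspired by `hadoop-azure` that avoids requiring the significantly elevated permissions needed for `blob_service_client->GetAccountInfo()`.
  - This can't be detected at initialisation time of the filesystem because it requires a `container_name`. It's packed into its own class so that the result can be cached.
- Implement `GetFileInfo` for single paths.
  - Supports hierarchical or flat namespace accounts and takes advantage of hierarchical namespace where possible to avoid unnecessary extra calls to blob storage. The performance difference is actually noticeable just from running the `GetFileInfoObjectWithNestedStructure` test against real flat and hierarchical accounts. It's about 3 seconds with hierarchical namespace or 5 seconds with a flat namespace.
- Update tests with TODO(apacheGH-38335) to now use this implementation of `GetFileInfo` to replace the temporary direct Azure SDK usage.
- Rename the main test fixture and introduce new ones for connecting to real blob storage. If details of real blob storage are not provided then the real blob storage tests will be skipped.

### Are these changes tested?

Yes. There are new Azurite based tests for everything that can be tested with Azurite.

There are also some tests that are designed to test against a real blob storage account. This is because Azurite cannot emulate a hierarchical namespace account. Additionally, some of the behaviour used to detect a hierarchical namespace account is different on Azurite compared to a real flat namespace account. These tests will be automatically skipped unless environment variables are provided with details for connecting to the relevant real storage accounts.

Initially I based the tests on the GCS filesystem but I added a few extras where I thought it was appropriate.

### Are there any user-facing changes?

Yes. `GetFileInfo` is now supported on the Azure filesystem.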
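To illustrate why a hierarchical namespace can save calls in `GetFileInfo`, here is a deliberately simplified sketch. It assumes that on a hierarchical account directories are real entries, so one properties lookup is enough, whereas on a flat account a directory may exist only implicitly as a prefix of blob names and needs an extra listing call. `FakeStore`, `GetProperties()` and `AnyBlobWithPrefix()` are hypothetical in-memory stand-ins, not Azure SDK or Arrow APIs, and the real implementation handles more cases than this.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>

enum class FileType { NotFound, File, Directory };

// Hypothetical in-memory blob store that counts the calls made against it.
struct FakeStore {
  FakeStore(bool hierarchical_in, std::map<std::string, bool> entries_in)
      : hierarchical(hierarchical_in), entries(std::move(entries_in)) {}

  // Returns whether `path` is a directory, or nullopt if no entry exists.
  std::optional<bool> GetProperties(const std::string& path) {
    ++get_properties_calls;
    auto it = entries.find(path);
    if (it == entries.end()) return std::nullopt;
    return it->second;
  }

  // Stands in for a listing call limited to a single result.
  bool AnyBlobWithPrefix(const std::string& prefix) {
    ++list_calls;
    auto it = entries.lower_bound(prefix);
    return it != entries.end() && it->first.compare(0, prefix.size(), prefix) == 0;
  }

  bool hierarchical;
  std::map<std::string, bool> entries;  // path -> is_directory
  int get_properties_calls = 0;
  int list_calls = 0;
};

FileType GetFileInfo(FakeStore& store, const std::string& path) {
  if (auto is_dir = store.GetProperties(path)) {
    return *is_dir ? FileType::Directory : FileType::File;
  }
  // Hierarchical namespace: a missing entry means the path does not exist,
  // so no listing call is needed.
  if (store.hierarchical) return FileType::NotFound;
  // Flat namespace: the path may still be an implied directory.
  return store.AnyBlobWithPrefix(path + "/") ? FileType::Directory
                                             : FileType::NotFound;
}
```

In this sketch a lookup of an implied directory costs two calls on the flat store and one on the hierarchical store, which is the kind of saving the description refers to.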