Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Implement GetFileInfo for a single file in Azure filesystem #38335

Closed
Tom-Newton opened this issue Oct 18, 2023 · 4 comments · Fixed by #38505
Closed

[C++] Implement GetFileInfo for a single file in Azure filesystem #38335

Tom-Newton opened this issue Oct 18, 2023 · 4 comments · Fixed by #38505

Comments

@Tom-Newton
Copy link
Contributor

Tom-Newton commented Oct 18, 2023

Describe the enhancement requested

Implement Result<FileInfo> AzureFileSystem::GetFileInfo(const std::string& path).

Logic I suggest we apply:

  1. Parse the string path to AzurePath.
  2. If path.container.empty() then this is the root of the storage account. Its a directory and we don't really have any more info to add.
  3. If path.path_to_file.empty() then this is the root of a container. Try to fetch properties of the container. If the container exists and not IsDeleted() then its a directory and probably we can add the last modified time.
  4. Call BlobClient::GetProperties(). If the storage account is hierarchical namespace it possible that we will get the properties for a directory, so we should always check this https://github.com/Azure/azure-sdk-for-cpp/blob/dd236311193c6a3debf3b12c47f14e49a20c72c7/sdk/storage/azure-storage-files-datalake/src/datalake_utilities.cpp#L86-L91.

All of this only needs the simple blob endpoints - nothing exclusive to hierarchical namespace enabled accounts.

The other GetFileInfo method requires doing directory listing which will be some additional complexity so I propose implementing that in a separate PR.

Related Issues:

Component(s)

C++

@Tom-Newton
Copy link
Contributor Author

It seems I don't actually have permissions to assign myself, without opening a PR but I'm going to start working on this.

@Tom-Newton
Copy link
Contributor Author

Tom-Newton commented Oct 29, 2023

Initially I assumed that the correct behaviour would be to return FileType::NotFound if the specified blob does not exist, given that blob storage does really have directories. However having looked closer at the s3 and gcs implementations it looks like they try listing the path and consider the path to be a "directory" if listing returns at least one result.

Presumably we want to implement the same thing for Azure, and probably this should follow a slightly different code path if hierarchical namespace is enabled. With HNS we can avoid the extra listing operation, so that would provide a performance advantage. This makes things more complicated but this was inevitable at some point so I guess I may as well do it now.

@kou
Copy link
Member

kou commented Oct 30, 2023

It seems I don't actually have permissions to assign myself, without opening a PR but I'm going to start working on this.

We can use "take" only comment for it: https://arrow.apache.org/docs/developers/bug_reports.html#issue-assignment

Tom-Newton added a commit to Tom-Newton/arrow that referenced this issue Nov 2, 2023
@Tom-Newton
Copy link
Contributor Author

Tom-Newton commented Nov 5, 2023

I think this is ready for review now, so hopefully what I've done makes sense. #38505

Unfortunately I could not test all the hierarchical namespace related stuff with azurite. I created some tests for testing against real blob storage accounts which will be automatically skipped if the required storage account details are not provided. If anyone has any ideas on how to handle this better, that would be great. Potentially CI could be configured to run these but for now at least its possible for developers to run them manually.

Tom-Newton added a commit to Tom-Newton/arrow that referenced this issue Nov 5, 2023
Tom-Newton added a commit to Tom-Newton/arrow that referenced this issue Nov 7, 2023
kou pushed a commit to Tom-Newton/arrow that referenced this issue Nov 9, 2023
@kou kou closed this as completed in #38505 Nov 9, 2023
kou added a commit that referenced this issue Nov 9, 2023
…lesystem (#38505)

### Rationale for this change

`GetFileInfo` is an important part of an Arrow filesystem implementation. 

### What changes are included in this PR?
- Start `azurefs_internal` similar to GCS and S3 filesystems. 
- Implement `HierarchicalNamespaceDetector`. 
  - This does not use the obvious and simple implementation. It uses a more complicated option inspired by `hadoop-azure` that avoids requiring the significantly elevated permissions needed for `blob_service_client->GetAccountInfo()`.
  - This can't be detected an initialisation time of the filesystem because it requires a `container_name`.  Its packed into its only class so that the result can be cached. 
- Implement `GetFileInfo` for single paths. 
  - Supports hierarchical or flat namespace accounts and takes advantage of hierarchical namespace where possible to avoid unnecessary extra calls to blob storage. The performance difference is actually noticeable just from running the `GetFileInfoObjectWithNestedStructure` test against real flat and hierarchical accounts.  Its about 3 seconds with hierarchical namespace or 5 seconds with a flat namespace.
- Update tests with TODO(GH-38335) to now use this implementation of `GetFileInfo` to replace the temporary direct Azure SDK usage.
- Rename the main test fixture and introduce new ones for connecting to real blob storage. If details of real blob storage is not provided then the real blob storage tests will be skipped. 

### Are these changes tested?

Yes. There are new Azurite based tests for everything that can be tested with Azurite. 

There are also some tests that are designed to test against a real blob storage account. This is because [Azurite cannot emulate a hierarchical namespace account](Azure/Azurite#553). Additionally some of the behaviour used to detect a hierarchical namespace account is different on Azurite compared to a real flat namespace account. These tests will be automatically skipped unless environment variables are provided with details for connecting to the relevant real storage accounts. 

Initially I based the tests on the GCS filesystem but I added a few extras where I thought it was appropriate. 

### Are there any user-facing changes?
Yes. `GetFileInfo` is now supported on the Azure filesystem. 

* Closes: #38335

Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
@kou kou added this to the 15.0.0 milestone Nov 9, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…ure filesystem (apache#38505)

### Rationale for this change

`GetFileInfo` is an important part of an Arrow filesystem implementation. 

### What changes are included in this PR?
- Start `azurefs_internal` similar to GCS and S3 filesystems. 
- Implement `HierarchicalNamespaceDetector`. 
  - This does not use the obvious and simple implementation. It uses a more complicated option inspired by `hadoop-azure` that avoids requiring the significantly elevated permissions needed for `blob_service_client->GetAccountInfo()`.
  - This can't be detected an initialisation time of the filesystem because it requires a `container_name`.  Its packed into its only class so that the result can be cached. 
- Implement `GetFileInfo` for single paths. 
  - Supports hierarchical or flat namespace accounts and takes advantage of hierarchical namespace where possible to avoid unnecessary extra calls to blob storage. The performance difference is actually noticeable just from running the `GetFileInfoObjectWithNestedStructure` test against real flat and hierarchical accounts.  Its about 3 seconds with hierarchical namespace or 5 seconds with a flat namespace.
- Update tests with TODO(apacheGH-38335) to now use this implementation of `GetFileInfo` to replace the temporary direct Azure SDK usage.
- Rename the main test fixture and introduce new ones for connecting to real blob storage. If details of real blob storage is not provided then the real blob storage tests will be skipped. 

### Are these changes tested?

Yes. There are new Azurite based tests for everything that can be tested with Azurite. 

There are also some tests that are designed to test against a real blob storage account. This is because [Azurite cannot emulate a hierarchical namespace account](Azure/Azurite#553). Additionally some of the behaviour used to detect a hierarchical namespace account is different on Azurite compared to a real flat namespace account. These tests will be automatically skipped unless environment variables are provided with details for connecting to the relevant real storage accounts. 

Initially I based the tests on the GCS filesystem but I added a few extras where I thought it was appropriate. 

### Are there any user-facing changes?
Yes. `GetFileInfo` is now supported on the Azure filesystem. 

* Closes: apache#38335

Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…ure filesystem (apache#38505)

### Rationale for this change

`GetFileInfo` is an important part of an Arrow filesystem implementation. 

### What changes are included in this PR?
- Start `azurefs_internal` similar to GCS and S3 filesystems. 
- Implement `HierarchicalNamespaceDetector`. 
  - This does not use the obvious and simple implementation. It uses a more complicated option inspired by `hadoop-azure` that avoids requiring the significantly elevated permissions needed for `blob_service_client->GetAccountInfo()`.
  - This can't be detected an initialisation time of the filesystem because it requires a `container_name`.  Its packed into its only class so that the result can be cached. 
- Implement `GetFileInfo` for single paths. 
  - Supports hierarchical or flat namespace accounts and takes advantage of hierarchical namespace where possible to avoid unnecessary extra calls to blob storage. The performance difference is actually noticeable just from running the `GetFileInfoObjectWithNestedStructure` test against real flat and hierarchical accounts.  Its about 3 seconds with hierarchical namespace or 5 seconds with a flat namespace.
- Update tests with TODO(apacheGH-38335) to now use this implementation of `GetFileInfo` to replace the temporary direct Azure SDK usage.
- Rename the main test fixture and introduce new ones for connecting to real blob storage. If details of real blob storage is not provided then the real blob storage tests will be skipped. 

### Are these changes tested?

Yes. There are new Azurite based tests for everything that can be tested with Azurite. 

There are also some tests that are designed to test against a real blob storage account. This is because [Azurite cannot emulate a hierarchical namespace account](Azure/Azurite#553). Additionally some of the behaviour used to detect a hierarchical namespace account is different on Azurite compared to a real flat namespace account. These tests will be automatically skipped unless environment variables are provided with details for connecting to the relevant real storage accounts. 

Initially I based the tests on the GCS filesystem but I added a few extras where I thought it was appropriate. 

### Are there any user-facing changes?
Yes. `GetFileInfo` is now supported on the Azure filesystem. 

* Closes: apache#38335

Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants