[C++][FS][Azure] Should GetFileInfo() against a directory always return true without hierarchical namespace support? #38772
Comments
I thought a bit more about this idea and realized always returning I think our approach of doing nothing on
Status CreateEmptyDir(const std::string& bucket, const std::string& key) {
  DCHECK(!key.empty());
  return CreateEmptyObject(bucket, key + kSep);
}
cc @Tom-Newton |
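To make the directory-marker convention in the snippet above concrete, here is a minimal sketch against an in-memory stand-in for a bucket. `FlatBucket`, `HasDirMarker`, and this version of `CreateEmptyObject` are hypothetical names made up for illustration, not the Arrow S3 implementation; only the `key + kSep` convention comes from the snippet.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical stand-in for a flat object store: just a sorted set of keys.
struct FlatBucket {
  std::set<std::string> keys;
};

constexpr char kSep = '/';

// Record a zero-byte object under `key` (object contents are not modelled).
inline void CreateEmptyObject(FlatBucket& bucket, const std::string& key) {
  bucket.keys.insert(key);
}

// Same convention as the snippet above: a directory is represented by an
// empty object whose key ends with the separator.
inline void CreateEmptyDir(FlatBucket& bucket, const std::string& key) {
  assert(!key.empty());
  CreateEmptyObject(bucket, key + kSep);
}

// A directory marker exists if the key with a trailing separator is present.
inline bool HasDirMarker(const FlatBucket& bucket, const std::string& key) {
  return bucket.keys.count(key + kSep) > 0;
}
```

With this model, calling `CreateEmptyDir(bucket, "data")` stores the key `data/`, so a later lookup can report `data` as a directory even though it contains no files. Any other client listing the bucket would see `data/` as an ordinary empty object.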
Yeah, I saw the "directory markers" thing on the S3 filesystem. Does anyone know if that is an Arrow thing or a more common pattern with S3? Personally, I assumed it was a common pattern in S3, and I'm not aware of the same pattern existing in Azure. My only concern with this approach would be when listing blobs. We could handle it in the Arrow filesystem, but anybody who lists with a different blob storage client might get a confusing result. |
It looks like the Arrow GCS filesystem also creates directory markers in the same way as S3: arrow/cpp/src/arrow/filesystem/gcsfs.cc, line 407 in c1b12ca.
If we chose to create directory markers it would probably be good to add some metadata to the blob to indicate that it's actually a directory. I believe hierarchical namespace accounts add |
We're trying to impose filesystem semantics on top of a blob store and filesystem semantics require that after a
This might not be very relevant here because the goal of the Arrow Filesystem API is to provide a uniform interface on top of regular filesystems and blob storage systems. It's pretty clear that the Azure fsspec is violating the filesystem semantics of directory management in this case. |
If the Azure fsspec has ways of providing believable filesystem semantics, then we could adopt their patterns. The main thing is: we should provide filesystem semantics and no-op for |
I think I'm in agreement. If we want to be rigorous we need to add directory markers in I think the current implementation of Probably I should also have added checks for metadata indicating a blob is a directory marker on |
@felipecrv @Tom-Newton Do either of you want to work on this? |
I might be able to do it in about a week |
I will take this one. |
…ccount doesn't support HNS (#39361)

### Rationale for this change

The `FileSystem` implementation based on Azure Blob Storage should implement directory operations according to filesystem semantics. When Hierarchical Namespace (HNS) is enabled, we can rely on Azure Data Lake Storage Gen 2 APIs implementing the filesystem semantics for us, but when all we have is the Blobs API, we should emulate it.

### What changes are included in this PR?

- Skip fewer tests
- Re-implement `GetFileInfo` using `ListBlobsByHierarchy` instead of `ListBlobs`
- Re-implement `CreateDir` with an upfront HNS support check instead of falling back to Blobs API after an error
- Add comprehensive tests to `CreateDir`
- Add `HasSubmitBatchBug` to check if a test inside any scenario is affected by a certain Azurite issue
- Implement `DeleteDir` to work properly on flat namespace storage accounts (non-HNS accounts)

### Are these changes tested?

Yes. By existing and new tests added by this PR itself.

* Closes: #38772

Authored-by: Felipe Oliveira Carvalho <[email protected]>
Signed-off-by: Felipe Oliveira Carvalho <[email protected]>
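One way to picture what emulating filesystem semantics on a flat namespace means for a `GetFileInfo`-style lookup is the sketch below. It is an illustration under stated assumptions, not the code from the PR: `Container`, `FileType`, `AnyBlobWithPrefix`, and `GetFileType` are hypothetical names, and a sorted in-memory set of blob names stands in for a prefix listing against the Blobs API.

```cpp
#include <cassert>
#include <set>
#include <string>

enum class FileType { NotFound, File, Directory };

// Hypothetical stand-in for a flat-namespace container: sorted blob names.
using Container = std::set<std::string>;

// True if at least one blob name starts with `prefix`. Names sharing a prefix
// are contiguous in sorted order, so checking the first candidate is enough.
inline bool AnyBlobWithPrefix(const Container& blobs, const std::string& prefix) {
  auto it = blobs.lower_bound(prefix);
  return it != blobs.end() && it->compare(0, prefix.size(), prefix) == 0;
}

// Assumes `path` is non-empty and has no trailing slash.
inline FileType GetFileType(const Container& blobs, const std::string& path) {
  assert(!path.empty() && path.back() != '/');
  // An exact match is a regular blob, i.e. a file.
  if (blobs.count(path) > 0) return FileType::File;
  // Anything under "path/" (including a "path/" marker blob) implies a directory.
  if (AnyBlobWithPrefix(blobs, path + "/")) return FileType::Directory;
  return FileType::NotFound;
}
```

For a container holding `a/b/file.txt`, `empty/`, and `top.txt`, this sketch reports `a` and `a/b` as directories implied by the blob beneath them, `empty` as a directory because of its marker, and `missing` as not found. When a blob named `path` and blobs under `path/` both exist, the sketch reports a file; a real implementation would have to decide how to treat that case.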
Describe the enhancement requested
Because `CreateDir()` without hierarchical namespace support does nothing and returns `arrow::Status::OK()`.
See also the discussion: #38708 (comment)
Component(s)
C++