-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C++] Implement file writes for Azure filesystem #38333
Comments
take |
This implements
Do we need a sub issue for |
Done: See the GH-18014 description for each sub issue. |
I was planning to do it in this one. I think the only difference is that append stream doesn't truncate the file if it already exists. |
I see! |
### Rationale for this change Writing files is an important part of the filesystem ### What changes are included in this PR? Implements `OpenOutputStream` and `OpenAppendStream` for Azure. - Initially I started with the implementation from #12914 but I made quite a few changes: - Removed the different code path for hierarchical namespace accounts. There should not be any performance advantage to using special APIs only available on hierachical namespace accounts. - Only implement `ObjectAppendStream`, not `ObjectOutputStream`. `OpenOutputStream` is implemented by truncating the existing file then returning a `ObjectAppendStream`. - More precise use of `try` `catch`. Every call to Azure is wrapped in a `try` `catch` and should return a descriptive error status. - Avoid unnecessary calls to Azure. For example we now maintain the block list in memory and commit it only once on flush. #12914 committed the block list after each block that was staged and on flush queried Azure to get the list of uncommitted blocks. The new approach is consistent with the Azure fsspec implementation https://github.com/fsspec/adlfs/blob/092685f102c5cd215550d10e8347e5bce0e2b93d/adlfs/spec.py#L2009 - Adjust the block_ids slightly to minimise the risk of them conflicting with blocks written by other blob storage clients. - Implement metadata writes. Includes adding default metadata to `AzureOptions`. - Tests are based on the `gscfs_test.cc` but I added a couple of extra. - Handle the TODO(GH-38780) comments for using the Azure fs to write data in tests ### Are these changes tested? Yes. Everything should be covered by azurite tests ### Are there any user-facing changes? Yes. The Azure filesystem now supports file writes. * Closes: #38333 Lead-authored-by: Thomas Newton <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
### Rationale for this change Writing files is an important part of the filesystem ### What changes are included in this PR? Implements `OpenOutputStream` and `OpenAppendStream` for Azure. - Initially I started with the implementation from apache#12914 but I made quite a few changes: - Removed the different code path for hierarchical namespace accounts. There should not be any performance advantage to using special APIs only available on hierachical namespace accounts. - Only implement `ObjectAppendStream`, not `ObjectOutputStream`. `OpenOutputStream` is implemented by truncating the existing file then returning a `ObjectAppendStream`. - More precise use of `try` `catch`. Every call to Azure is wrapped in a `try` `catch` and should return a descriptive error status. - Avoid unnecessary calls to Azure. For example we now maintain the block list in memory and commit it only once on flush. apache#12914 committed the block list after each block that was staged and on flush queried Azure to get the list of uncommitted blocks. The new approach is consistent with the Azure fsspec implementation https://github.com/fsspec/adlfs/blob/092685f102c5cd215550d10e8347e5bce0e2b93d/adlfs/spec.py#L2009 - Adjust the block_ids slightly to minimise the risk of them conflicting with blocks written by other blob storage clients. - Implement metadata writes. Includes adding default metadata to `AzureOptions`. - Tests are based on the `gscfs_test.cc` but I added a couple of extra. - Handle the TODO(apacheGH-38780) comments for using the Azure fs to write data in tests ### Are these changes tested? Yes. Everything should be covered by azurite tests ### Are there any user-facing changes? Yes. The Azure filesystem now supports file writes. * Closes: apache#38333 Lead-authored-by: Thomas Newton <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
Describe the enhancement requested
The same process as for #37511 but for file writes this time.
So probably we need an azure implementation of
io::OutputStream
then that can be used to implement theOpenOutputStream
methods of the AzureFileSystem.#12914 implemented this so that is a good starting point. However I would suggest that we avoid differing logic based on whether the hierarchical namespace (ADLS gen2) is used. Without the convenient
Append
API that only works with hierarchical namespace we will need to upload new blocks then update the block list of the block blob, to implement the same functionality. This method should work regardless of hierarchical namespace, its testable with azurite and and performance should be no different compared to if we use theAppend
API.Related Issues:
Component(s)
C++
The text was updated successfully, but these errors were encountered: