Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Implement file writes for Azure filesystem #38333

Closed
Tom-Newton opened this issue Oct 18, 2023 · 5 comments · Fixed by #38780
Closed

[C++] Implement file writes for Azure filesystem #38333

Tom-Newton opened this issue Oct 18, 2023 · 5 comments · Fixed by #38780

Comments

@Tom-Newton
Copy link
Contributor

Describe the enhancement requested

The same process as for #37511 but for file writes this time.

So probably we need an azure implementation of io::OutputStream then that can be used to implement the OpenOutputStream methods of the AzureFileSystem.

#12914 implemented this so that is a good starting point. However I would suggest that we avoid differing logic based on whether the hierarchical namespace (ADLS gen2) is used. Without the convenient Append API that only works with hierarchical namespace we will need to upload new blocks then update the block list of the block blob, to implement the same functionality. This method should work regardless of hierarchical namespace, its testable with azurite and and performance should be no different compared to if we use the Append API.

Related Issues:

Component(s)

C++

@Tom-Newton
Copy link
Contributor Author

take

@kou
Copy link
Member

kou commented Nov 14, 2023

This implements AzureFileSystem::OpenOuptutStream(), right?
I'll create new sub issues for the following APIs:

  • AzureFileSystem::CreateDir()
  • AzureFileSystem::DeleteDir()
  • AzureFileSystem::DeleteDirContents()
  • AzureFileSystem::DeleteRootDirContents()
  • AzureFileSystem::DeleteFile()
  • AzureFileSystem::Move()
  • AzureFileSystem::CopyFile()

Do we need a sub issue for AzureFileSystem::OpenAppendStream()?

@kou
Copy link
Member

kou commented Nov 14, 2023

Done: See the GH-18014 description for each sub issue.

@Tom-Newton
Copy link
Contributor Author

Do we need a sub issue for AzureFileSystem::OpenAppendStream()?

I was planning to do it in this one. I think the only difference is that append stream doesn't truncate the file if it already exists.

@kou
Copy link
Member

kou commented Nov 14, 2023

I see!

kou added a commit that referenced this issue Nov 21, 2023
### Rationale for this change
Writing files is an important part of the filesystem

### What changes are included in this PR?
Implements `OpenOutputStream` and `OpenAppendStream` for Azure.

- Initially I started with the implementation from #12914 but I made quite a few changes:
  - Removed the different code path for hierarchical namespace accounts. There should not be any performance advantage to using special APIs only available on hierachical namespace accounts. 
  - Only implement `ObjectAppendStream`, not `ObjectOutputStream`. `OpenOutputStream` is implemented by truncating the existing file then returning a `ObjectAppendStream`.
  - More precise use of `try` `catch`. Every call to Azure is wrapped in a `try` `catch` and should return a descriptive error status. 
  - Avoid unnecessary calls to Azure. For example we now maintain the block list in memory and commit it only once on flush. #12914 committed the block list after each block that was staged and on flush queried Azure to get the list of uncommitted blocks. The new approach is consistent with the Azure fsspec implementation https://github.com/fsspec/adlfs/blob/092685f102c5cd215550d10e8347e5bce0e2b93d/adlfs/spec.py#L2009
  - Adjust the block_ids slightly to minimise the risk of them conflicting with blocks written by other blob storage clients. 
  - Implement metadata writes. Includes adding default metadata to `AzureOptions`.
- Tests are based on the `gscfs_test.cc` but I added a couple of extra. 
- Handle the TODO(GH-38780) comments for using the Azure fs to write data in tests

### Are these changes tested?
Yes. Everything should be covered by azurite tests

### Are there any user-facing changes?
Yes. The Azure filesystem now supports file writes. 

* Closes: #38333

Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
@kou kou added this to the 15.0.0 milestone Nov 21, 2023
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
### Rationale for this change
Writing files is an important part of the filesystem

### What changes are included in this PR?
Implements `OpenOutputStream` and `OpenAppendStream` for Azure.

- Initially I started with the implementation from apache#12914 but I made quite a few changes:
  - Removed the different code path for hierarchical namespace accounts. There should not be any performance advantage to using special APIs only available on hierachical namespace accounts. 
  - Only implement `ObjectAppendStream`, not `ObjectOutputStream`. `OpenOutputStream` is implemented by truncating the existing file then returning a `ObjectAppendStream`.
  - More precise use of `try` `catch`. Every call to Azure is wrapped in a `try` `catch` and should return a descriptive error status. 
  - Avoid unnecessary calls to Azure. For example we now maintain the block list in memory and commit it only once on flush. apache#12914 committed the block list after each block that was staged and on flush queried Azure to get the list of uncommitted blocks. The new approach is consistent with the Azure fsspec implementation https://github.com/fsspec/adlfs/blob/092685f102c5cd215550d10e8347e5bce0e2b93d/adlfs/spec.py#L2009
  - Adjust the block_ids slightly to minimise the risk of them conflicting with blocks written by other blob storage clients. 
  - Implement metadata writes. Includes adding default metadata to `AzureOptions`.
- Tests are based on the `gscfs_test.cc` but I added a couple of extra. 
- Handle the TODO(apacheGH-38780) comments for using the Azure fs to write data in tests

### Are these changes tested?
Yes. Everything should be covered by azurite tests

### Are there any user-facing changes?
Yes. The Azure filesystem now supports file writes. 

* Closes: apache#38333

Lead-authored-by: Thomas Newton <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants