
feat: add support for AsyncWrite #16

Closed
wants to merge 5 commits into from


yjshen (Contributor) commented Jul 26, 2023

Discovered the remarkable pull request apache/datafusion#6987, which enables writing data through the Object Store API with AsyncWriter.

We can support writing directly to HDFS once we add support for the put_multipart and abort_multipart APIs.

yjshen changed the title from "feat: update object_store to 0.6.1, add support for AsyncWrite" to "feat: add support for AsyncWrite" on Jul 26, 2023

struct HdfsMultiPartUpload {
    location: Path,
    hdfs: Arc<HdfsFs>,
    content: Arc<Mutex<HashMap<usize, Vec<u8>>>>,

Collaborator


This may create too much memory pressure, since all of the content has to be kept in memory before it is sent to HDFS.

yahoNanJing (Collaborator) commented

It would be better to add a consumer that continuously takes the lowest-indexed part from the cache and appends it to the HDFS file, which reduces the memory pressure.

To achieve this, we can track a current minimum part index in HdfsMultiPartUpload. When put_multipart_part is invoked, we check whether the incoming part is the one at the current minimum index. If so, we notify the consumer. The consumer takes that part from the cache, appends it to the HDFS file, and advances the current minimum part index. It repeats this step until there is no cached part for the current minimum part index.
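
The ordering logic suggested above could look roughly like the following. This is only a minimal, synchronous sketch using the standard library, not code from this PR: `OrderedPartWriter`, `put_part`, and `buffered_parts` are hypothetical names, and the generic `sink` stands in for whatever append handle the HDFS client exposes. The "notify the consumer" step is collapsed into a loop inside `put_part`; in an async implementation the consumer would more likely be a separate task woken through a channel or notifier.

```rust
use std::collections::HashMap;
use std::io::{self, Write};

/// Hypothetical helper (not part of this PR): buffers parts that arrive
/// out of order and writes them to `sink` strictly in part-index order.
struct OrderedPartWriter<W: Write> {
    sink: W,                          // stand-in for an HDFS append handle (assumption)
    next_part: usize,                 // current minimum part index not yet written
    pending: HashMap<usize, Vec<u8>>, // parts that arrived ahead of `next_part`
}

impl<W: Write> OrderedPartWriter<W> {
    fn new(sink: W) -> Self {
        Self { sink, next_part: 0, pending: HashMap::new() }
    }

    /// Accepts one part, then drains every contiguous cached part
    /// starting at `next_part`, so memory only holds the out-of-order tail.
    fn put_part(&mut self, part_idx: usize, data: Vec<u8>) -> io::Result<()> {
        if part_idx < self.next_part {
            // This part index has already been appended; reject the duplicate.
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "part already written",
            ));
        }
        self.pending.insert(part_idx, data);
        while let Some(buf) = self.pending.remove(&self.next_part) {
            self.sink.write_all(&buf)?;
            self.next_part += 1;
        }
        Ok(())
    }

    /// Number of parts still held in memory.
    fn buffered_parts(&self) -> usize {
        self.pending.len()
    }
}
```

With this shape, a part that arrives early stays in `pending` only until every lower-indexed part has been appended, so the worst-case buffer is the out-of-order window rather than the whole file.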

yjshen (Contributor, Author) commented Aug 24, 2023

I am addressing these comments in another PR, #17, so I am closing this one.

yjshen closed this on Aug 24, 2023