Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rust writer in operations makes a lot of data copies #1394

Closed
wjones127 opened this issue May 27, 2023 · 3 comments · Fixed by #1397
Closed

Rust writer in operations makes a lot of data copies #1394

wjones127 opened this issue May 27, 2023 · 3 comments · Fixed by #1397
Assignees
Labels
binding/rust Issues for the Rust crate bug Something isn't working

Comments

@wjones127
Copy link
Collaborator

Environment

We keep the Parquet file being written as a Vec<u8>. The write_batch method seems to clone the vec each time it is called, and the batch size for many operations is quite small, defaulting at 1024 rows.

fn write_batch(&mut self, batch: &RecordBatch) -> DeltaResult<()> {
// copy current cursor bytes so we can recover from failures
// TODO is copying this something we should be doing?
let buffer_bytes = self.buffer.to_vec();
match self.arrow_writer.write(batch) {
Ok(_) => {
apply_null_counts(&batch.clone().into(), &mut self.null_counts, 0);
Ok(())
}
Err(err) => {
// if a write fails we need to reset the state of the PartitionWriter
warn!("error writing to arrow buffer, resetting writer state.");
self.replace_arrow_buffer(buffer_bytes)?;
Err(err.into())
}
}
}

@roeap Do you remember why we do this? I'm wondering whether we can just rip out this error handling, or if we actually do some sort of retry with this.

@wjones127 wjones127 added bug Something isn't working binding/rust Issues for the Rust crate labels May 27, 2023
@roeap
Copy link
Collaborator

roeap commented May 27, 2023

I do remember wondering the same thing (hence the TODO 😆). This just got carried over from previous implementations and I think originated somewhere in kafaka-delta ingest. I guess back then the write API exposed was much lower level - i.e. user would have to use the writer structs directly.

In our codebase here I think we are not retrying anything. Also did a quick dive into what errors we may actually see there, and I think we either have an IO issue (where we do retry in object store) or somehow malformed data (did not go too deep into the parquet crate). So my vote would be for ripping it out.

@wjones127 wjones127 self-assigned this May 27, 2023
@wjones127
Copy link
Collaborator Author

Okay. I'm going to refactor that module now.

@roeap
Copy link
Collaborator

roeap commented May 27, 2023

I just opened #1396, which may be relevant in that context, as it also affects the write path.

wjones127 added a commit that referenced this issue May 28, 2023
# Description

* Removed the data copies in a tight loop, which were extremely bad for
performance when writing files > 100MB.
* Rewrote statistics handling to collect null values from metadata, just
like min and max.
* Added support for more types in statistics.

# Related Issue(s)

- closes #1394
- closes #1209 
- closes #1208


# Documentation

<!---
Share links to useful documentation
--->
roeap pushed a commit to roeap/delta-rs that referenced this issue Jun 2, 2023
# Description

* Removed the data copies in a tight loop, which were extremely bad for
performance when writing files > 100MB.
* Rewrote statistics handling to collect null values from metadata, just
like min and max.
* Added support for more types in statistics.

# Related Issue(s)

- closes delta-io#1394
- closes delta-io#1209 
- closes delta-io#1208


# Documentation

<!---
Share links to useful documentation
--->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants