Transaction log parsing performance regression #2760
Comments
interesting find, thanks for the report! 👋
@Tom-Newton If I am bored next weekend, I can take a look at the second issue ;)
@Tom-Newton does this make a noticeable change for you? #2764
Thanks @ion-elgreco, it looks like this did provide a measurable improvement. It looks like it's still about 41-55% slower than 0.10.1, though.
@Tom-Newton awesome! Thanks for checking this! Yeah, so like you mentioned before, we read the files twice. I was thinking of perhaps modifying the commit_stream/checkpoint_stream so that it caches the result prior to returning. Then the stream function should also be able to drain the cache when it's executed for the second time.
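The caching idea above can be sketched with plain Python iterators standing in for the async streams in delta-rs (the class and names here are hypothetical, not the actual delta-rs API):

```python
class CachingStream:
    """Wraps an iterable: the first pass caches every item as it is
    yielded, and later passes replay ("drain") the cache instead of
    re-reading the underlying source."""

    def __init__(self, source):
        self._source = iter(source)
        self._cache = []
        self._exhausted = False

    def __iter__(self):
        if self._exhausted:
            # Second pass: serve cached items, no extra storage I/O.
            yield from self._cache
            return
        for item in self._source:
            self._cache.append(item)
            yield item
        self._exhausted = True


# Hypothetical example: pretend each item is a log file read from storage.
stream = CachingStream(["00000.json", "00001.json", "00002.json"])
first = list(stream)   # reads from the "storage"
second = list(stream)  # replays the cache
assert first == second
```

In an async Rust implementation the same shape would apply to a `Stream`, with the cache populated as items are polled on the first consumption.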
@Tom-Newton I'm curious if you can run your test with the latest 0.19.0 that just released. I'm very curious how well @roeap's optimizations might have also helped your use case 😄
It's looking good. I assume it was #2772 that made the difference, but I'm a bit confused about how.
✨ magic! ✨ More efficient use of the checkpoints and other metadata in the log is likely causing fewer log files to be fetched from storage. Either way, thank you so much for providing pretty graphs! I'm going to stick a fork in this turkey and call it done!
@rtyler I do believe we can squeeze more performance out, though.
I think we are still reading the checkpoint and subsequent log files twice, but with the new optimisations we're taking advantage of the features of parquet to not download columns and row groups of the checkpoint that we don't need. This seems to be enough to edge out the 0.10.1 performance even when doing that twice. Anyway, thanks everyone 🙂. I'm impressed how quickly this was resolved.
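The effect described above can be illustrated with a toy model of a checkpoint. In real Parquet, the footer metadata lets a reader fetch only the byte ranges for the columns and row groups it needs; the data structures and column names below are invented purely for illustration:

```python
# Toy "checkpoint": a list of row groups, each holding named columns.
checkpoint = [
    {"add.path": ["a.parquet", "b.parquet"],
     "add.stats": ["{...}", "{...}"],
     "remove.path": [None, None]},
    {"add.path": ["c.parquet"],
     "add.stats": ["{...}"],
     "remove.path": [None]},
]

def read_columns(row_groups, wanted):
    """Read only the requested columns; other columns' bytes are
    never touched, mirroring Parquet column projection."""
    out = {name: [] for name in wanted}
    for rg in row_groups:
        for name in wanted:
            out[name].extend(rg[name])
    return out

# Only 'add.path' is fetched; 'add.stats' and 'remove.path' are skipped.
paths = read_columns(checkpoint, ["add.path"])
assert paths == {"add.path": ["a.parquet", "b.parquet", "c.parquet"]}
```

So even if the checkpoint is scanned twice, each scan moves far fewer bytes than a full read, which is consistent with the measured win over 0.10.1.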
I'll leave it open though, so we don't forget to look into this. Actually, I'll just create a new issue instead.
Saw this late, but one thing that may have helped also was #2717. This should result in only reading the columns of parquet necessary from the checkpoint for the actions being queried.
Environment
Delta-rs version: 0.18.2
Binding: Python
Environment:
Bug
What happened:
Performance regression in transaction log parsing compared to deltalake 0.10.1
(Benchmark graph omitted; Y axis is time in seconds.)
What you expected to happen:
New versions of `deltalake` to have the same or better performance than older versions.
How to reproduce it:
Compare the time taken for this with the same thing when using `deltalake` 0.10.1.
More details:
I think I have identified the 2 reasons for the performance regression. `deltalake` now relies on `ObjectStore.list_with_offset`, which uses an inefficient implementation for Azure but is probably advantageous on GCS or S3; apache/arrow-rs#6174 (comment) is trying to solve it for the `MicrosoftAzure` store, and we're trying to get Azure to help with that. Modifying `deltalake` so that it iteratively checks whether commit versions exist, instead of using a list operation, gives us the "0.18.2 modified" results in my graph above. `list_with_offset` was introduced by feat: buffered reading of transaction logs #1549.
(Reporting as a bug seemed more appropriate than a feature request, but is not ideal.)
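A rough sketch of the probing workaround, with an in-memory dict standing in for the object store (the function name is hypothetical; Delta commit files really are named with zero-padded 20-digit version numbers):

```python
# Instead of one LIST call over the _delta_log prefix, probe successive
# commit files by name (HEAD-like existence checks) until one is missing.
store = {f"_delta_log/{v:020d}.json": b"{}" for v in range(5)}

def latest_version_by_probing(store, start=0):
    """Walk versions start, start+1, ... with existence checks
    instead of a list operation."""
    version = start
    while f"_delta_log/{version:020d}.json" in store:
        version += 1
    return version - 1  # last version that exists, or start-1 if none

assert latest_version_by_probing(store) == 4
```

Each probe is a cheap point lookup, so on stores where listing with an offset is slow (as described for Azure above), a handful of existence checks can beat a single list call.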