Non-linear performance for large delta logs #442
Comments
I'm willing to look into this tomorrow, this seems relatively easy to fix (based on the assumption that my analysis above is correct).
I disabled the line containing the
I was thinking about the same thing while reviewing #431. We should definitely apply action state changes in a batch instead of one at a time.
I think you just found a bug in our checkpoint implementation ;) Based on https://github.com/delta-io/delta/blob/master/PROTOCOL.md#action-reconciliation, this should never happen. cc @xianwill @mosyp: this unnecessarily increased our checkpoint size.
If an add and a remove for the same file could be in one checkpoint, the final state would depend on the order in which we observe them. I'm glad the spec does not allow that. I completely missed #431 btw, good that there's already something done. I repeated my tests with the newest revision, but it doesn't seem to change anything, probably because none of the remove actions actually matches any of the add actions, so we walk the vec to the very end for each tombstone anyway. I suspect that it improves performance for other use cases, though 👍
That's quite strange. This means we commit a remove action immediately after a file has been added? Anyway, I think this won't be a problem anymore once we move to batch update. BTW, @dispanser I am not actively working on these performance issues because I assume you are interested in solving them. If that's not the case, please let me know, I will be more than happy to help.
Let me try to formulate it better: when loading a checkpoint, we never have a "remove" that matches an "add", because then both would be part of the same checkpoint, which the spec forbids, so the optimization from #431 can't be applied. Every add is kept, so the search (
Indeed. I'm having a day off today, and I don't see a better way to spend my time than learning some rust ;). I'll try to come up with something today.
Oh, I see. Yeah, for valid checkpoints, this optimization would have no effect. It only works for incremental version upgrades.
Improve tombstone handling by applying the remove actions only as a batch at the end of a load or incremental update operation. For checkpoint loading, we skip interpretation of the remove actions entirely because, according to the delta spec, a remove must appear in a later revision than the associated add, so they can never both be in the same checkpoint.
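A rough sketch of that approach, using simplified, hypothetical types and names (`Action`, `TableState`, `apply_segment`, `pending_removes`) rather than the actual delta-rs structs:

```rust
use std::collections::HashSet;

// Hypothetical, simplified stand-ins for the real delta-rs types; the field
// and function names here are illustrative only.
enum Action {
    Add { path: String },
    Remove { path: String },
}

#[derive(Default)]
struct TableState {
    files: Vec<String>,           // paths from add actions
    pending_removes: Vec<String>, // remove paths collected during replay
}

impl TableState {
    // Replay one log segment. For a checkpoint we never try to match removes
    // against `files`, because the spec forbids an add and its matching remove
    // from appearing in the same checkpoint.
    fn apply_segment(&mut self, actions: Vec<Action>, is_checkpoint: bool) {
        for action in actions {
            match action {
                Action::Add { path } => self.files.push(path),
                Action::Remove { path } if !is_checkpoint => self.pending_removes.push(path),
                // Not matched against `files`; a real implementation would still
                // record it as a tombstone for vacuum.
                Action::Remove { .. } => {}
            }
        }
        // Apply all collected removes in one batch at the end of the segment.
        let removed: HashSet<&String> = self.pending_removes.iter().collect();
        self.files.retain(|p| !removed.contains(p));
        // A real implementation would keep these as tombstones rather than drop them.
        self.pending_removes.clear();
    }
}
```

The key point is that replaying a segment never scans `files` per remove action; the matching happens once, against a hash set, after the whole segment has been read.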
While trying to create a delta table for #425, I attempted to load a relatively large delta log containing ~700k add actions and 9.5 million remove actions.
Load performance is extremely slow, and it gets worse the farther we progress into the load. I've logged the time it takes to handle each separate checkpoint file:
It seems that the load becomes slower and slower the more actions are already loaded. As this is not the case for a table that only contains add actions (see #435: 2 minutes for 5 million files, no tombstones), the culprit must be the remove actions.
This brings non-linearity: for each remove action we seem to make a linear scan through the entire set of adds. The more add actions we have, the longer each scan takes.
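To illustrate the pattern being described (a sketch, not the literal delta-rs code): if every remove action triggers its own scan of the add list, the total work grows with `removes × adds`:

```rust
// Illustrative sketch only: applying each remove individually rescans the
// whole add list, i.e. O(files.len()) work per remove action.
fn process_remove(files: &mut Vec<String>, removed_path: &str) {
    files.retain(|p| p.as_str() != removed_path);
}
```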
One possible solution would be to collect all actions first, build a set of all files that must be removed, and apply the removals in one pass at the very end.
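A minimal sketch of that one-pass variant, assuming the adds live in a `Vec<String>` and the remove paths have already been collected into a slice; `apply_removes` is a hypothetical helper name:

```rust
use std::collections::HashSet;

// Build a set of removed paths once, then drop matching adds in a single pass.
fn apply_removes(files: &mut Vec<String>, removes: &[String]) {
    let removed: HashSet<&str> = removes.iter().map(String::as_str).collect();
    files.retain(|p| !removed.contains(p.as_str()));
}
```

With a hash set this is roughly O(adds + removes) instead of O(adds × removes), i.e. a couple of passes over the data rather than trillions of comparisons for the table sizes above.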
Also, it seems counter-intuitive for a checkpoint to contain both an add and a remove for the same file; I suspect that this matching can be skipped entirely when loading a checkpoint, and only needs to be applied to the (relatively small) incremental update operations.