-
Notifications
You must be signed in to change notification settings - Fork 754
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow concurrent header metadata reads #4920
Conversation
The CI pipeline was cancelled due to failure one of the required jobs. |
Ah... it actually need mutable access. Any concerns about replacing |
Would probably defeat the purpose of having lru cache internally. Maybe having some thread local cache that gets merged back to the main cache could work. |
It'll essentially store recently inserted instead of recently used data, which might still be useful.
Sounds complex 😅 |
This PR largely fixes #4903 by addressing it from a few different directions. The high-level observation is that complexity of finalization was unfortunately roughly `O(n^3)`. Not only `displaced_leaves_after_finalizing` was extremely inefficient on its own, especially when large ranges of blocks were involved, it was called once upfront and then on every single block that was finalized over and over again. The first commit refactores code adjacent to `displaced_leaves_after_finalizing` to optimize memory allocations. For example things like `BTreeMap<_, Vec<_>>` were very bad in terms of number of allocations and after analyzing code paths was completely unnecessary and replaced with `Vec<(_, _)>`. In other places allocations of known size were not done upfront and some APIs required unnecessary cloning of vectors. I checked invariants and didn't find anything that was violated after refactoring. Second commit completely replaces `displaced_leaves_after_finalizing` implementation with a much more efficient one. In my case with ~82k blocks and ~13k leaves it takes ~5.4s to finish `client.apply_finality()` now. The idea is to avoid querying the same blocks over and over again as well as introducing temporary local cache for blocks related to leaves above block that is being finalized as well as local cache of the finalized branch of the chain. I left some comments in the code and wrote tests that I belive should check all code invariants for correctness. `lowest_common_ancestor_multiblock` was removed as unnecessary and not great in terms of performance API, domain-specific code should be written instead like done in `displaced_leaves_after_finalizing`. After these changes I noticed finalization is still horribly slow, turned out that even though `displaced_leaves_after_finalizing` was way faster that before (probably order of magnitude), it was called for every single of those 82k blocks 🤦 The quick hack I came up with in the third commit to handle this edge case was to not call it when finalizing multiple blocks at once until the very last moment. It works and allows to finish the whole finalization in just 14 seconds (5.4+5.4 of which are two calls to `displaced_leaves_after_finalizing`). I'm really not happy with the fact that `displaced_leaves_after_finalizing` is called twice, but much heavier refactoring would be necessary to get rid of second call. --- Next steps: * assuming the changes are acceptable I'll write prdoc * #4920 or something similar in spirit should be implemented to unleash efficient parallelsm with rayon in `displaced_leaves_after_finalizing`, which will allow to further (and significant!) scale its performance rather that being CPU-bound on a single core, also reading database sequentially should ideally be avoided * someone should look into removal of the second `displaced_leaves_after_finalizing` call * further cleanups are possible if `undo_finalization` can be removed --- Polkadot Address: 1vSxzbyz2cJREAuVWjhXUT1ds8vBzoxn2w4asNpusQKwjJd --------- Co-authored-by: Sebastian Kunert <[email protected]>
This PR largely fixes paritytech#4903 by addressing it from a few different directions. The high-level observation is that complexity of finalization was unfortunately roughly `O(n^3)`. Not only `displaced_leaves_after_finalizing` was extremely inefficient on its own, especially when large ranges of blocks were involved, it was called once upfront and then on every single block that was finalized over and over again. The first commit refactores code adjacent to `displaced_leaves_after_finalizing` to optimize memory allocations. For example things like `BTreeMap<_, Vec<_>>` were very bad in terms of number of allocations and after analyzing code paths was completely unnecessary and replaced with `Vec<(_, _)>`. In other places allocations of known size were not done upfront and some APIs required unnecessary cloning of vectors. I checked invariants and didn't find anything that was violated after refactoring. Second commit completely replaces `displaced_leaves_after_finalizing` implementation with a much more efficient one. In my case with ~82k blocks and ~13k leaves it takes ~5.4s to finish `client.apply_finality()` now. The idea is to avoid querying the same blocks over and over again as well as introducing temporary local cache for blocks related to leaves above block that is being finalized as well as local cache of the finalized branch of the chain. I left some comments in the code and wrote tests that I belive should check all code invariants for correctness. `lowest_common_ancestor_multiblock` was removed as unnecessary and not great in terms of performance API, domain-specific code should be written instead like done in `displaced_leaves_after_finalizing`. After these changes I noticed finalization is still horribly slow, turned out that even though `displaced_leaves_after_finalizing` was way faster that before (probably order of magnitude), it was called for every single of those 82k blocks 🤦 The quick hack I came up with in the third commit to handle this edge case was to not call it when finalizing multiple blocks at once until the very last moment. It works and allows to finish the whole finalization in just 14 seconds (5.4+5.4 of which are two calls to `displaced_leaves_after_finalizing`). I'm really not happy with the fact that `displaced_leaves_after_finalizing` is called twice, but much heavier refactoring would be necessary to get rid of second call. --- Next steps: * assuming the changes are acceptable I'll write prdoc * paritytech#4920 or something similar in spirit should be implemented to unleash efficient parallelsm with rayon in `displaced_leaves_after_finalizing`, which will allow to further (and significant!) scale its performance rather that being CPU-bound on a single core, also reading database sequentially should ideally be avoided * someone should look into removal of the second `displaced_leaves_after_finalizing` call * further cleanups are possible if `undo_finalization` can be removed --- Polkadot Address: 1vSxzbyz2cJREAuVWjhXUT1ds8vBzoxn2w4asNpusQKwjJd --------- Co-authored-by: Sebastian Kunert <[email protected]>
In the current form, we can not merge this. A proper implementation will also require more thinking etc. Thus, I'm closing this. |
This PR largely fixes paritytech#4903 by addressing it from a few different directions. The high-level observation is that complexity of finalization was unfortunately roughly `O(n^3)`. Not only `displaced_leaves_after_finalizing` was extremely inefficient on its own, especially when large ranges of blocks were involved, it was called once upfront and then on every single block that was finalized over and over again. The first commit refactores code adjacent to `displaced_leaves_after_finalizing` to optimize memory allocations. For example things like `BTreeMap<_, Vec<_>>` were very bad in terms of number of allocations and after analyzing code paths was completely unnecessary and replaced with `Vec<(_, _)>`. In other places allocations of known size were not done upfront and some APIs required unnecessary cloning of vectors. I checked invariants and didn't find anything that was violated after refactoring. Second commit completely replaces `displaced_leaves_after_finalizing` implementation with a much more efficient one. In my case with ~82k blocks and ~13k leaves it takes ~5.4s to finish `client.apply_finality()` now. The idea is to avoid querying the same blocks over and over again as well as introducing temporary local cache for blocks related to leaves above block that is being finalized as well as local cache of the finalized branch of the chain. I left some comments in the code and wrote tests that I belive should check all code invariants for correctness. `lowest_common_ancestor_multiblock` was removed as unnecessary and not great in terms of performance API, domain-specific code should be written instead like done in `displaced_leaves_after_finalizing`. After these changes I noticed finalization is still horribly slow, turned out that even though `displaced_leaves_after_finalizing` was way faster that before (probably order of magnitude), it was called for every single of those 82k blocks 🤦 The quick hack I came up with in the third commit to handle this edge case was to not call it when finalizing multiple blocks at once until the very last moment. It works and allows to finish the whole finalization in just 14 seconds (5.4+5.4 of which are two calls to `displaced_leaves_after_finalizing`). I'm really not happy with the fact that `displaced_leaves_after_finalizing` is called twice, but much heavier refactoring would be necessary to get rid of second call. --- Next steps: * assuming the changes are acceptable I'll write prdoc * paritytech#4920 or something similar in spirit should be implemented to unleash efficient parallelsm with rayon in `displaced_leaves_after_finalizing`, which will allow to further (and significant!) scale its performance rather that being CPU-bound on a single core, also reading database sequentially should ideally be avoided * someone should look into removal of the second `displaced_leaves_after_finalizing` call * further cleanups are possible if `undo_finalization` can be removed --- Polkadot Address: 1vSxzbyz2cJREAuVWjhXUT1ds8vBzoxn2w4asNpusQKwjJd --------- Co-authored-by: Sebastian Kunert <[email protected]>
This PR largely fixes paritytech#4903 by addressing it from a few different directions. The high-level observation is that complexity of finalization was unfortunately roughly `O(n^3)`. Not only `displaced_leaves_after_finalizing` was extremely inefficient on its own, especially when large ranges of blocks were involved, it was called once upfront and then on every single block that was finalized over and over again. The first commit refactores code adjacent to `displaced_leaves_after_finalizing` to optimize memory allocations. For example things like `BTreeMap<_, Vec<_>>` were very bad in terms of number of allocations and after analyzing code paths was completely unnecessary and replaced with `Vec<(_, _)>`. In other places allocations of known size were not done upfront and some APIs required unnecessary cloning of vectors. I checked invariants and didn't find anything that was violated after refactoring. Second commit completely replaces `displaced_leaves_after_finalizing` implementation with a much more efficient one. In my case with ~82k blocks and ~13k leaves it takes ~5.4s to finish `client.apply_finality()` now. The idea is to avoid querying the same blocks over and over again as well as introducing temporary local cache for blocks related to leaves above block that is being finalized as well as local cache of the finalized branch of the chain. I left some comments in the code and wrote tests that I belive should check all code invariants for correctness. `lowest_common_ancestor_multiblock` was removed as unnecessary and not great in terms of performance API, domain-specific code should be written instead like done in `displaced_leaves_after_finalizing`. After these changes I noticed finalization is still horribly slow, turned out that even though `displaced_leaves_after_finalizing` was way faster that before (probably order of magnitude), it was called for every single of those 82k blocks 🤦 The quick hack I came up with in the third commit to handle this edge case was to not call it when finalizing multiple blocks at once until the very last moment. It works and allows to finish the whole finalization in just 14 seconds (5.4+5.4 of which are two calls to `displaced_leaves_after_finalizing`). I'm really not happy with the fact that `displaced_leaves_after_finalizing` is called twice, but much heavier refactoring would be necessary to get rid of second call. --- Next steps: * assuming the changes are acceptable I'll write prdoc * paritytech#4920 or something similar in spirit should be implemented to unleash efficient parallelsm with rayon in `displaced_leaves_after_finalizing`, which will allow to further (and significant!) scale its performance rather that being CPU-bound on a single core, also reading database sequentially should ideally be avoided * someone should look into removal of the second `displaced_leaves_after_finalizing` call * further cleanups are possible if `undo_finalization` can be removed --- Polkadot Address: 1vSxzbyz2cJREAuVWjhXUT1ds8vBzoxn2w4asNpusQKwjJd --------- Co-authored-by: Sebastian Kunert <[email protected]>
Looked like a bug to me that defeated the purpose of using
RwLock