
Allow concurrent header metadata reads #4920

Closed
wants to merge 2 commits

Conversation

nazar-pc commented Jul 1, 2024

Looked like a bug to me that defeated the purpose of using `RwLock`.
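
For context, a minimal sketch of the pattern the PR title points at, with a plain `HashMap` standing in for the real header-metadata cache: lookups take the shared read lock and can run concurrently, while only cache misses take the write lock. (As the discussion below shows, the actual cache is an LRU whose `get` needs `&mut`, which is why this ended up being less straightforward.)

```rust
use std::collections::HashMap;
use std::sync::RwLock;

#[derive(Clone)]
struct CachedHeaderMetadata {
    number: u64,
    parent: u64,
}

struct HeaderMetadataCache {
    // With a RwLock, lookups that hit the cache can run concurrently;
    // only misses (which insert) need exclusive access.
    cache: RwLock<HashMap<u64, CachedHeaderMetadata>>,
}

impl HeaderMetadataCache {
    fn get_or_insert_with(
        &self,
        hash: u64,
        load: impl FnOnce() -> CachedHeaderMetadata,
    ) -> CachedHeaderMetadata {
        // Fast path: shared read lock, many readers at once.
        if let Some(meta) = self.cache.read().expect("not poisoned").get(&hash) {
            return meta.clone();
        }
        // Slow path: exclusive write lock, re-check, then insert.
        let mut cache = self.cache.write().expect("not poisoned");
        cache.entry(hash).or_insert_with(load).clone()
    }
}
```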

@paritytech-cicd-pr

The CI pipeline was cancelled due to the failure of one of the required jobs.
Job name: cargo-clippy
Logs: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/6596046


nazar-pc commented Jul 1, 2024

Ah... it actually needs mutable access. Any concerns about replacing `.get()` with `.peek()`? I suspect newer blocks will be inserted later and maybe that is what is desired in this case? Otherwise `RwLock` is not beneficial at all and would be better replaced with a `Mutex`, though I'd argue concurrent reads are very important.
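
For reference, the distinction being discussed, illustrated with the `lru` crate's API (the cache type actually used in the code may differ): `get` refreshes recency and therefore needs `&mut self`, while `peek` is a read-only lookup that leaves the LRU order untouched.

```rust
use std::num::NonZeroUsize;

use lru::LruCache;

fn main() {
    let mut cache: LruCache<u32, &str> = LruCache::new(NonZeroUsize::new(2).unwrap());
    cache.put(1, "one");
    cache.put(2, "two");

    // `get` needs `&mut self` because it moves the entry to the front of the
    // LRU order -- behind a RwLock this forces a write lock even for reads.
    assert_eq!(cache.get(&1), Some(&"one"));

    // `peek` takes `&self` and does not touch the recency order, so it could
    // be used under a read lock -- at the cost of the cache behaving more
    // like "recently inserted" than "recently used".
    assert_eq!(cache.peek(&2), Some(&"two"));

    // Inserting a third entry now evicts the least recently *used* entry,
    // which is 2 (since `peek` did not refresh it).
    cache.put(3, "three");
    assert!(cache.peek(&2).is_none());
    assert!(cache.peek(&1).is_some());
}
```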


bkchr commented Jul 1, 2024

Ah... it actually needs mutable access. Any concerns about replacing `.get()` with `.peek()`?

Would probably defeat the purpose of having an LRU cache internally. Maybe having some thread-local cache that gets merged back into the main cache could work.


nazar-pc commented Jul 1, 2024

Would probably defeat the purpose of having an LRU cache internally

It'll essentially store recently inserted instead of recently used data, which might still be useful.

Maybe having some thread-local cache that gets merged back into the main cache could work

Sounds complex 😅
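
For what it's worth, a very rough sketch of the thread-local idea (placeholder types, plain `HashMap`s instead of real LRU caches): each thread looks up and fills a local map first, and the local entries are occasionally merged back into the shared cache.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::sync::Mutex;

type Hash = u64;
type Meta = u64; // placeholder for cached header metadata

thread_local! {
    // Per-thread cache: lookups here take no shared lock at all.
    static LOCAL: RefCell<HashMap<Hash, Meta>> = RefCell::new(HashMap::new());
}

fn lookup(shared: &Mutex<HashMap<Hash, Meta>>, key: Hash) -> Option<Meta> {
    if let Some(meta) = LOCAL.with(|l| l.borrow().get(&key).copied()) {
        return Some(meta);
    }
    // Miss: consult the shared cache and remember the result locally.
    let found = shared.lock().expect("not poisoned").get(&key).copied();
    if let Some(meta) = found {
        LOCAL.with(|l| l.borrow_mut().insert(key, meta));
    }
    found
}

fn merge_back(shared: &Mutex<HashMap<Hash, Meta>>) {
    // Called occasionally so the main cache sees what this thread touched.
    LOCAL.with(|l| {
        let mut shared = shared.lock().expect("not poisoned");
        for (k, v) in l.borrow_mut().drain() {
            shared.insert(k, v);
        }
    });
}
```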

github-merge-queue bot pushed a commit that referenced this pull request Jul 5, 2024
This PR largely fixes
#4903 by addressing it
from a few different directions.

The high-level observation is that the complexity of finalization was
unfortunately roughly `O(n^3)`. Not only was
`displaced_leaves_after_finalizing` extremely inefficient on its own,
especially when large ranges of blocks were involved, it was also called
once upfront and then again on every single block that was finalized,
over and over.

The first commit refactors code adjacent to
`displaced_leaves_after_finalizing` to optimize memory allocations. For
example, things like `BTreeMap<_, Vec<_>>` were very bad in terms of the
number of allocations and, after analyzing the code paths, turned out to
be completely unnecessary, so they were replaced with `Vec<(_, _)>`. In
other places, allocations of known size were not done upfront, and some
APIs required unnecessary cloning of vectors.
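
For illustration, a minimal sketch of the kind of allocation-focused change described above, with made-up numeric types rather than the actual polkadot-sdk ones:

```rust
use std::collections::BTreeMap;

// Old shape: one `Vec` allocation per distinct key on top of the tree nodes.
fn grouped(displaced: &[(u32, u64)]) -> BTreeMap<u32, Vec<u64>> {
    let mut map: BTreeMap<u32, Vec<u64>> = BTreeMap::new();
    for &(number, hash) in displaced {
        map.entry(number).or_default().push(hash);
    }
    map
}

// New shape: a single flat allocation of known size, done upfront.
fn flat(displaced: &[(u32, u64)]) -> Vec<(u32, u64)> {
    let mut out = Vec::with_capacity(displaced.len());
    out.extend_from_slice(displaced);
    out
}
```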

I checked invariants and didn't find anything that was violated after
refactoring.

The second commit completely replaces the
`displaced_leaves_after_finalizing` implementation with a much more
efficient one. In my case, with ~82k blocks and ~13k leaves,
`client.apply_finality()` now finishes in ~5.4s.

The idea is to avoid querying the same blocks over and over again by
introducing a temporary local cache for blocks related to leaves above
the block that is being finalized, as well as a local cache of the
finalized branch of the chain. I left some comments in the code and
wrote tests that I believe should check all code invariants for
correctness. `lowest_common_ancestor_multiblock` was removed as
unnecessary and not a great API performance-wise; domain-specific code
should be written instead, like it is done in
`displaced_leaves_after_finalizing`.
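
A rough, self-contained sketch of that caching idea (a plain hash-to-parent map stands in for the database and the helper names are made up, this is not the actual polkadot-sdk code): blocks already classified as descending from the finalized block are memoized, so shared ancestors are walked once instead of once per leaf.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical parent lookup that would normally hit the database.
fn parent_of(db: &HashMap<u64, u64>, hash: u64) -> Option<u64> {
    db.get(&hash).copied()
}

/// Leaves that do not descend from `finalized`.
fn displaced_leaves(
    db: &HashMap<u64, u64>,
    leaves: &[u64],
    finalized: u64,
    finalized_chain: &HashSet<u64>, // blocks below (and excluding) `finalized`
) -> Vec<u64> {
    let mut displaced = Vec::new();
    let mut known_valid: HashSet<u64> = HashSet::new();

    'leaf: for &leaf in leaves {
        let mut current = leaf;
        let mut path = Vec::new();
        loop {
            if current == finalized || known_valid.contains(&current) {
                // Leaf descends from the finalized block; memoize the path
                // so other leaves sharing these ancestors stop here.
                known_valid.extend(path);
                continue 'leaf;
            }
            if finalized_chain.contains(&current) {
                // Branched off below the finalized block => displaced.
                displaced.push(leaf);
                continue 'leaf;
            }
            path.push(current);
            match parent_of(db, current) {
                Some(parent) => current = parent,
                // Unknown parent: treat the branch as detached in this sketch.
                None => {
                    displaced.push(leaf);
                    continue 'leaf;
                }
            }
        }
    }
    displaced
}
```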

After these changes I noticed finalization was still horribly slow. It
turned out that even though `displaced_leaves_after_finalizing` was way
faster than before (probably by an order of magnitude), it was called
for every single one of those 82k blocks 🤦

The quick hack I came up with in the third commit to handle this edge
case was, when finalizing multiple blocks at once, to not call it until
the very last moment. It works and lets the whole finalization finish
in just 14 seconds (5.4+5.4 of which are the two calls to
`displaced_leaves_after_finalizing`). I'm really not happy with the fact
that `displaced_leaves_after_finalizing` is called twice, but much
heavier refactoring would be necessary to get rid of the second call.
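
The deferral described above, reduced to a tiny hypothetical driver loop (names are illustrative, not the actual client code): the expensive displaced-leaves computation only runs for the last block of the batch.

```rust
/// Hypothetical batch finalization: finality is applied block by block, but
/// the expensive displaced-leaves computation only runs for the last block.
fn finalize_batch(blocks: &[u64]) {
    for (i, &block) in blocks.iter().enumerate() {
        let compute_displaced = i + 1 == blocks.len();
        finalize_one(block, compute_displaced);
    }
}

fn finalize_one(block: u64, compute_displaced: bool) {
    // ... mark `block` as finalized in the backend ...
    if compute_displaced {
        // Previously this ran for every block in the batch:
        // displaced_leaves_after_finalizing(block);
    }
    let _ = block;
}
```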

---

Next steps:
* assuming the changes are acceptable I'll write a prdoc
* #4920 or something similar in spirit should be implemented to unleash
efficient parallelism with rayon in `displaced_leaves_after_finalizing`
(see the sketch after this list), which would allow its performance to
scale further (and significantly!) rather than being CPU-bound on a
single core; reading the database sequentially should ideally be avoided
as well
* someone should look into removing the second
`displaced_leaves_after_finalizing` call
* further cleanups are possible if `undo_finalization` can be removed
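
A rough sketch of the rayon parallelism hinted at in the second bullet above, assuming leaf processing only needs shared read access to the header metadata (which is what this PR was trying to enable); the map type and names are placeholders:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

use rayon::prelude::*;

/// Leaves that do not descend from `finalized`, checked in parallel.
fn displaced_in_parallel(
    leaves: &[u64],
    headers: &RwLock<HashMap<u64, u64>>, // hash -> parent hash, placeholder
    finalized: u64,
) -> Vec<u64> {
    leaves
        .par_iter()
        .copied()
        .filter(|&leaf| {
            // Every worker only takes the read lock, so lookups for
            // different leaves proceed concurrently.
            let headers = headers.read().expect("not poisoned");
            let mut current = leaf;
            while let Some(&parent) = headers.get(&current) {
                if current == finalized {
                    return false; // descends from the finalized block
                }
                current = parent;
            }
            current != finalized
        })
        .collect()
}
```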

---

Polkadot Address: 1vSxzbyz2cJREAuVWjhXUT1ds8vBzoxn2w4asNpusQKwjJd

---------

Co-authored-by: Sebastian Kunert <[email protected]>
TomaszWaszczyk pushed a commit to TomaszWaszczyk/polkadot-sdk that referenced this pull request Jul 7, 2024

bkchr commented Jul 17, 2024

In the current form, we can not merge this. A proper implementation will also require more thinking etc. Thus, I'm closing this.

bkchr closed this Jul 17, 2024
nazar-pc deleted the patch-1 branch July 18, 2024 09:31
TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this pull request Aug 2, 2024
sfffaaa pushed a commit to peaqnetwork/polkadot-sdk that referenced this pull request Dec 27, 2024