Finalization hangs in 1.13 #4903

Closed
2 tasks done
nazar-pc opened this issue Jun 27, 2024 · 11 comments · Fixed by #4922
Labels
I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known.

Comments

@nazar-pc
Contributor

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

After upgrading Substrate from 1.10.1 to 1.13.0, finalizing a large-ish range of blocks (2056100..=2071630) seems to just hang. By "hang" I mean 30+ minutes (I didn't wait much longer yet, it is already ridiculous).

I see this message in logs:

2024-06-27T18:09:43.043948Z WARN Consensus: sp_blockchain::backend: The distance between leafs and finalized block is large. Finalization can take a long time.

This comes from #4721, which was supposed to introduce some optimizations.

This seems like a significant problem. I see one of the threads using 100% of a CPU core, but it is an anonymous tokio thread, so I'm not yet sure what it is or what it is doing.

cc @shamil-gadelshin

Steps to reproduce

No response

@nazar-pc nazar-pc added I10-unconfirmed Issue might be valid, but it's not yet known. I2-bug The node fails to follow expected behavior. labels Jun 27, 2024
@bkchr
Member

bkchr commented Jun 28, 2024

Yeah, the underlying problem is still #3962.

For a proper solution we need to record when a fork started. That way we don't need to compute the costly tree route.
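
A minimal sketch of what recording the fork start could look like (the types and names below are hypothetical stand-ins, not the actual `sp_blockchain` API): if each leaf remembers the first block of its fork, the displacement check on finalization becomes a single canonical-hash lookup per leaf instead of a tree route walk.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a real block hash.
type Hash = [u8; 32];

struct ForkStart {
    /// First block of the fork, i.e. the child of the common ancestor with
    /// the canonical chain, recorded at import time.
    number: u64,
    hash: Hash,
}

struct LeafIndex {
    /// Leaf hash -> where its fork started.
    fork_start: HashMap<Hash, ForkStart>,
}

impl LeafIndex {
    /// Leaves displaced by finalizing the block at `finalized_number`.
    /// `canonical_hash_at(n)` returns the hash of the canonical chain at
    /// height `n` (one lookup), so no per-leaf tree route is needed.
    fn displaced_leaves(
        &self,
        finalized_number: u64,
        canonical_hash_at: impl Fn(u64) -> Option<Hash>,
    ) -> Vec<Hash> {
        self.fork_start
            .iter()
            .filter(|(_leaf, start)| {
                // The fork branched off at or below the finalized height and
                // its first block is not the canonical block at that height,
                // so the whole branch (and therefore the leaf) is displaced.
                start.number <= finalized_number
                    && canonical_hash_at(start.number) != Some(start.hash)
            })
            .map(|(leaf, _)| *leaf)
            .collect()
    }
}
```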

@shamil-gadelshin
Contributor

Am I correct that the problem comes from calculating the enacted part of the tree_route? @skunert @arkpar @bkchr

If this is the case, we can calculate only the retracted part separately for displaced branches on finalization, which would remove the issue.

@bkchr
Member

bkchr commented Jun 28, 2024

The problem is more that we need to calculate the route between the last finalized block and every leaf. The retracted blocks are probably not such a big problem, because the leaves are probably around the head of the chain and thus there are probably not that many retracted blocks.

But you are using a different consensus which I don't know in detail, and you may have more and longer forks?

@shamil-gadelshin
Contributor

/// Tree route from R1 to En. Retracted is [R1 - R3], Common is C, enacted [E1 - En]
///   <- R3 <- R2 <- R1
///  /
/// C
///  \-> E1 -> E2 -> ... -> En

I understand the problem as follows:

  1. When we sync from genesis and finalize the tip, finalization invokes the fork calculation (`displaced_leaves_after_finalizing`), which invokes `tree_route` for each leaf.
  2. `tree_route` calculates both the retracted and the enacted part, and the enacted part for each leaf is very long.

However, during fork pruning we don't need to calculate the enacted part. If we calculate only the retracted part, either separately or by extending `tree_route`'s parameters, we avoid costly and redundant calculations for each leaf.

Please correct me if I'm missing something.
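
A sketch of such a retracted-only walk, assuming a `canonical_at(number) -> hash` lookup is available (the types below are illustrative stand-ins, not the real `sp_blockchain` code): the walk stops as soon as it meets the canonical chain and never materializes the enacted branch.

```rust
use std::collections::HashMap;

// Illustrative stand-ins for the real hash and header metadata types.
type Hash = u64;

#[derive(Clone, Copy)]
struct HeaderMeta {
    number: u64,
    parent: Hash,
}

/// Walk from `leaf` towards the canonical chain and return only the retracted
/// blocks, i.e. the part of the leaf's branch that is not an ancestor of the
/// finalized block. The enacted branch is never enumerated.
fn retracted_part(
    headers: &HashMap<Hash, HeaderMeta>,
    leaf: Hash,
    canonical_at: impl Fn(u64) -> Option<Hash>,
) -> Vec<Hash> {
    let mut retracted = Vec::new();
    let mut current = leaf;
    loop {
        let meta = headers[&current];
        // Meeting the canonical chain means we found the common ancestor;
        // everything collected so far is the retracted branch.
        if canonical_at(meta.number) == Some(current) {
            return retracted;
        }
        retracted.push(current);
        if meta.number == 0 {
            // Reached genesis without meeting the canonical chain; bail out
            // rather than loop forever (shouldn't happen on a valid chain).
            return retracted;
        }
        current = meta.parent;
    }
}
```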

@skunert
Contributor

skunert commented Jun 28, 2024

@shamil-gadelshin The first attempt to fix this will give you insight into the problem: #4721

Basically, if we finalize a distance of 10,000 blocks with 3 leaves, we calculate the tree route between each leaf and the finalized block 10k times, once per finalized block. The first fix attempt made some assumptions that don't seem to hold.

As @bkchr said, the best fix would be to store the fork root and not compute the displaced forks via tree route.

@nazar-pc
Contributor Author

But you are using a different consensus which I don't know in detail, and you may have more and longer forks?

In our consensus we do have forks fairly regularly, but they are usually 1-2 blocks deep and I don't remember seeing anything over 3 blocks in a while, though we do not have monitoring for this.

@nazar-pc
Contributor Author

nazar-pc commented Jul 1, 2024

I started looking into where the slowness comes from and found that it is `displaced_leaves_after_finalizing`, which finds 12,980 leaves and processes them at a rate of a few per second.

I see that `displaced_leaves_after_finalizing` and the `tree_route` called from there both have a bunch of unoptimized allocations (no reserved capacity), but the slowest piece of code seems to be this block:

	while to.number > from.number {
		to_branch.push(HashAndNumber { number: to.number, hash: to.hash });

		to = backend.header_metadata(to.parent)?;
	}

Judging by the fact that `from` and `to` are set to `leaf_hash` and `finalized_block_hash` respectively, this makes a huge number of duplicated calls to the same function with the same arguments, over and over again. The above loop alone takes ~190ms to complete on my machine.

This test was trying to finalize ~82k blocks at once.

All of this could also be heavily parallelized, though I'm not sure whether that is welcome in this context.
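
For illustration, one way to deduplicate those repeated `header_metadata` reads across leaves (hypothetical types below, not the actual backend API): remember every block already visited, so later leaves stop after a few steps because forks share almost all of their ancestry.

```rust
use std::collections::{HashMap, HashSet};

// Illustrative stand-ins for the real hash and header metadata types.
type Hash = u64;

#[derive(Clone, Copy)]
struct HeaderMeta {
    number: u64,
    parent: Hash,
}

/// Collects every block above `finalized_number` reachable from some leaf.
/// Total header reads are bounded by the number of distinct blocks above the
/// finalized block, instead of (number of leaves) x (distance to finality).
fn blocks_above_finalized(
    headers: &HashMap<Hash, HeaderMeta>,
    leaves: &[Hash],
    finalized_number: u64,
) -> HashSet<Hash> {
    let mut visited = HashSet::new();
    for &leaf in leaves {
        let mut current = leaf;
        loop {
            let meta = headers[&current];
            if meta.number <= finalized_number {
                break; // reached the finalized height, stop this walk
            }
            if !visited.insert(current) {
                break; // already walked from another leaf, the rest is shared
            }
            current = meta.parent;
        }
    }
    visited
}
```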

UPD: This is an interesting assumption that is used by default and is not really customizable at the `BlockchainDb` level:

/// Set to the expected max difference between `best` and `finalized` blocks at sync.
pub(crate) const LRU_CACHE_SIZE: u32 = 5_000;

@bkchr
Member

bkchr commented Jul 1, 2024

Part of the problem is for sure that we finalize every block on its own. We do not just jump to the finalized block, and thus we call all this code multiple times. If you like, feel free to open a pull request to make the `tree_route` function more memory-efficient. I have also started looking into this, so that we can improve the whole operation.

@nazar-pc
Contributor Author

nazar-pc commented Jul 1, 2024

I tried increasing `LRU_CACHE_SIZE` to 100k just to see what would happen, and it immediately improved the iteration of the above loop by an order of magnitude.

I'll look at optimizing those functions next; it looks like there is some low-hanging fruit there.

@nazar-pc
Contributor Author

nazar-pc commented Jul 1, 2024

I'll send a PR later today or tomorrow with a fix for this issue. In the meantime, since I'm touching code introduced back in 2019, I'm wondering whether `undo_finalization` is still needed. It is not used, there are no tests for it, and I'm wondering whether there is a sane reason to ever call it at all.

@nazar-pc
Contributor Author

nazar-pc commented Jul 2, 2024

PR with fix: #4922
There are also next steps listed there that it would be great to get feedback on.

github-merge-queue bot pushed a commit that referenced this issue Jul 5, 2024
This PR largely fixes
#4903 by addressing it
from a few different directions.

The high-level observation is that the complexity of finalization was
unfortunately roughly `O(n^3)`. Not only was
`displaced_leaves_after_finalizing` extremely inefficient on its
own, especially when large ranges of blocks were involved, it was also called
once upfront and then again on every single block that was finalized.

The first commit refactors code adjacent to
`displaced_leaves_after_finalizing` to optimize memory allocations. For
example, things like `BTreeMap<_, Vec<_>>` were very bad in terms of the
number of allocations and, after analyzing the code paths, turned out to be
completely unnecessary, so they were replaced with `Vec<(_, _)>`. In other
places, allocations of known size were not done upfront, and some APIs
required unnecessary cloning of vectors.

I checked invariants and didn't find anything that was violated after
refactoring.

The second commit completely replaces the `displaced_leaves_after_finalizing`
implementation with a much more efficient one. In my case, with ~82k
blocks and ~13k leaves, `client.apply_finality()` now takes ~5.4s to
finish.

The idea is to avoid querying the same blocks over and over again by
introducing a temporary local cache for blocks related to leaves above the
block that is being finalized, as well as a local cache of the finalized
branch of the chain. I left some comments in the code and wrote tests that
I believe check all code invariants for correctness.
`lowest_common_ancestor_multiblock` was removed as unnecessary and as an API
that is not great performance-wise; domain-specific code should be written
instead, as done in `displaced_leaves_after_finalizing`.

After these changes I noticed finalization was still horribly slow. It
turned out that even though `displaced_leaves_after_finalizing` was way
faster than before (probably an order of magnitude), it was called for
every single one of those 82k blocks 🤦

The quick hack I came up with in the third commit to handle this edge
case was to not call it until the very last moment when finalizing multiple
blocks at once. It works and lets the whole finalization finish in just
14 seconds (5.4+5.4 of which are the two calls to
`displaced_leaves_after_finalizing`). I'm really not happy with the fact
that `displaced_leaves_after_finalizing` is called twice, but much
heavier refactoring would be necessary to get rid of the second call.

---

Next steps:
* assuming the changes are acceptable I'll write prdoc
* #4920 or something
similar in spirit should be implemented to unleash efficient parallelism
with rayon in `displaced_leaves_after_finalizing`, which would allow
further (and significant!) scaling of its performance rather than being
CPU-bound on a single core; reading the database sequentially should
ideally also be avoided
* someone should look into removal of the second
`displaced_leaves_after_finalizing` call
* further cleanups are possible if `undo_finalization` can be removed

---

Polkadot Address: 1vSxzbyz2cJREAuVWjhXUT1ds8vBzoxn2w4asNpusQKwjJd

---------

Co-authored-by: Sebastian Kunert <[email protected]>
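
A rough sketch of the batching described in the third commit above, using an illustrative stand-in API rather than the real client (the `finalize_block` callback below is hypothetical): the expensive displaced-fork computation runs once for the whole range instead of once per finalized block.

```rust
type Hash = u64; // illustrative stand-in for a real block hash

/// Finalize a range of blocks, deferring displaced-fork pruning (the
/// `displaced_leaves_after_finalizing` walk) to the last block of the range.
fn finalize_range(blocks: &[Hash], mut finalize_block: impl FnMut(Hash, bool)) {
    for (i, &hash) in blocks.iter().enumerate() {
        let prune_displaced = i + 1 == blocks.len();
        finalize_block(hash, prune_displaced);
    }
}

fn main() {
    let range: Vec<Hash> = (2_056_100..=2_071_630).collect();
    finalize_range(&range, |hash, prune| {
        // In the real client this would mark `hash` finalized and, only when
        // `prune` is true, compute and prune the displaced leaves.
        if prune {
            println!("finalized up to {hash}, pruning displaced forks once");
        }
    });
}
```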
TomaszWaszczyk pushed a commit to TomaszWaszczyk/polkadot-sdk that referenced this issue Jul 7, 2024
TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this issue Aug 2, 2024
sfffaaa pushed a commit to peaqnetwork/polkadot-sdk that referenced this issue Dec 27, 2024