Finalization stalling issues 2024-07 #1104

MOZGIII · 2024-07-26T07:04:22Z

This is a tracking issue for the recurring finalization stalling incidents.

So far, here are the pointers we have.

Commits that are affecting the finalization directly:

Fix longest chain finalization target lookup paritytech/substrate#13289
Should not the the cause though, as the issue is with the inability of grandpa to follow a reorg, which is cause by pruned blocks. Or, rather, it could be the cause, but this one is not really where the bug we'd like to fix is.
Fix block pruning after long re-org paritytech/substrate#13323 - ~~could this be the reason?~~ no, this one works as intended
grandpa: don't error if best block and finality target are inconsistent paritytech/substrate#13364 - this looks like it directly messes with what's causing the bug, but more like it fixes the aftermath of the one above

Notable historical commits that might've affected this incident:

Epoch-Changes tree pruning was lagging by one epoch paritytech/substrate#12567
babe: allow skipping over empty epochs paritytech/substrate#11727 (unlikely though)
Remove uncles related code paritytech/substrate#13216 (actually looks somewhat interesting)
Notification-based block pinning paritytech/substrate#13157 (maybe pinning the block that is used as a finality based is a good idea)

Probably irrelevant:

Related things from the future:

Change forks pruning algorithm. paritytech/polkadot-sdk#3962 - new fork pruning algorithm - might work better, but not sure
Finalization hangs in 1.13 paritytech/polkadot-sdk#4903 - irrelevant for our issue, as this issue is with the new fork pruning algorithm mentioned above being too slow

~~The thoughts we have so far is that the issue is caused by the broken block pruning (which is meaningful when an unfortunate best block selection happens).~~ (UPD: the blocks are not really pruned, we can still request them from the API; this means this is a selection issue) Grandpa tries to finalize the block that is pruned and not the part of the best chain anymore; upon receiving the precommits it fails to recognize them because the block can't be resolved (it's pruned). This is quite odd, since the nodes operate with --blocks-pruning archive, so it must mean they haven't seen the block in question as is it no never to be pruned if seen - but in reality we see a the reorg happening from the block in question to a new one, so the old block is definitely a known one.

We are currently on substrate 0.9.40.

Mainnet encountered this at least three times lately:

The text was updated successfully, but these errors were encountered:

MOZGIII · 2024-07-26T08:42:15Z

2024-07-26 02:31:13.467  WARN tokio-runtime-worker grandpa: Encountered error finding best chain containing 0xab3e065e87c2c2e3ac34cb07a0e05c1fa7dd7df2d782fee66a8c284bbd430fd0: couldn't find target block: Chain lookup failed: Requested a finality target using a base 0xab3e065e87c2c2e3ac34cb07a0e05c1fa7dd7df2d782fee66a8c284bbd430fd0 not in the best chain 0x325ab807d115074518d0c5bd764de875b0945d0166d0937753968f98b4c2805d

Explore-logs-2024-07-26 10_41_41.txt
Explore-logs-2024-07-26 10_53_12.txt
The logs from the node to hint what's going on

MOZGIII · 2024-07-26T09:44:12Z

Current plan for a workaround is to delay the finality for more than 2 blocks. Currently we are planning to delay the finality after by 5 7 blocks beyond the best block. This should help with reducing the chance of the reorg-caused race condition we are facing.

dmitrylavrenov · 2024-07-26T09:48:17Z

Added the test to reproduce the issue logic - humanode-network/polkadot-sdk@master...humanode-network:polkadot-sdk:grandpa-voting-and-babe-reorg-race-condition.

It shows the root cause that we face.

MOZGIII · 2024-07-26T10:10:29Z

We have analyzed the code, and it looks like the issue is with those two PRs:

Fix longest chain finalization target lookup paritytech/substrate#13289
grandpa: don't error if best block and finality target are inconsistent paritytech/substrate#13364

We think that the fix that is added at paritytech/substrate#13364 does not apply after the code added at paritytech/substrate#13289 returns an error. We'll investigate further.

We are still on course to apply the workaround mentioned above as a hotfix, but the long term solution is also in sight.

MOZGIII · 2024-07-26T10:47:44Z

To elaborate more on paritytech/substrate#13364:

LongestChain::finality_target will error out when the passed target block is not in the best chain (https://github.com/paritytech/substrate/pull/13289/files#diff-97e7e3622ec6f0b455d18c1bf8055ccf5cfd210efd50348df6b07fce4531baeeR115-R119). This also applies to the case where the block number of the target and best-chain block are the same.
The fixup code at https://github.com/paritytech/substrate/pull/13364/files#diff-5ecb4cf31ed389a0d12482137c8b234485b199c200b09446cf8876321527446aR1242 tries to use the best block as target block (best block to target block) but in practice that part of the fn is not reached due to how an error from LongestChain::finality_target above is handled.
As a result, the intended fix is not applied, and we see the error logged instead - but also the consensus stalls.

bkchr · 2024-07-26T11:18:01Z

Added the test to reproduce the issue logic - humanode-network/[email protected]:polkadot-sdk:grandpa-voting-and-babe-reorg-race-condition.

The test is succeeding?

dmitrylavrenov · 2024-07-26T11:24:04Z

Added the test to reproduce the issue logic - humanode-network/[email protected]:polkadot-sdk:grandpa-voting-and-babe-reorg-race-condition.

The test is succeeding?

Yep, we've implemented it in the way to verify our thoughts about the happened case.

dmitrylavrenov · 2024-07-26T11:25:10Z

We met the same error

	// `hashof8_a` should be finalized next based on last completed round data,
	// but it's not an valid finality target
	assert_matches!(
		env.select_chain.finality_target(hashof8_a, None).await.unwrap_err(),
		ConsensusError::ChainLookup(_)
	);

andresilva · 2024-07-26T11:25:18Z

I have only looked into this superficially, but I think @MOZGIII comment on the LongestChain::finality_target issue is correct. The root cause is that we shouldn't error when the given vote target is not in the best chain, otherwise we make it impossible for finality to proceed in case we have a round estimate that gets re-orged out. In principle what should happen is grandpa would finalize the fork that got re-orged out, and then the node would re-org again (after grandpa finalizes). I didn't think through all the consequences of such a change.

dmitrylavrenov · 2024-07-26T11:26:09Z

I have only looked into this superficially, but I think @MOZGIII comment on the LongestChain::finality_target issue is correct. The root cause is that we shouldn't error when the given vote target is not in the best chain, otherwise we make it impossible for finality to proceed in case we have a round estimate that gets re-orged out. In principle what should happen is grandpa would finalize the fork that got re-orged out, and then the node would re-org again (after grandpa finalizes). I didn't think through all the consequences of such a change.

The same thoughts we discussed with @MOZGIII

MOZGIII · 2024-07-26T11:30:57Z

@bkchr @andresilva should we expect any hotfix for this implemented at the parity side or rather work on this ourselves?

dmitrylavrenov · 2024-07-26T11:35:13Z

@bkchr @andresilva let me create PR of this test into our fork and describe it in more details

andresilva · 2024-07-26T11:39:06Z

paritytech/polkadot-sdk#5153

I think this should fix it.

MOZGIII · 2024-07-26T11:56:42Z

paritytech/polkadot-sdk#5153 looks good, thx!

This would still be risky to ship for us as it means the whole layer of logic dedicated to the grandpa-issued reorg was not used for a while now and may contain bugs or just be broken cause of being stale...

We'll think about how to test this. So far we have thought about replicating the scenario of having 1k+ validators on our load-test facilities - but in our prior load tests we haven't seen the finality stalls like we see in the live chain; that might be due to not running for enough time, or due to other conditions - so it unlikely we'll have an accurate test for this in the load testing environment.

Any thoughts on how to trigger this bug in the live-like environment?

andresilva · 2024-07-26T13:36:51Z

I think besides any significant changes to make sure the issue happens (e.g. changing vote logic to make sure validators vote on abandoned forks explicitly), the easiest way would be to make forks more likely to happen and hence make it more likely for the issue to surface. I think changing https://github.com/humanode-network/humanode/blob/master/crates/humanode-runtime/src/constants.rs#L49 to (1, 1) should make forks more prevalent, reducing block time might also help.

This shouldn't be changed on any live chain though, since these parameters can't be changed dynamically. You'd need to bootstrap a new test chain.

MOZGIII · 2024-07-26T13:42:30Z

Great idea! We will also change the block time from 6s to 1s for more stress...

Pretty sure it is fine, but just to double check - is it safe to ship f5bccf0 ?
Should not cause anything problematic afaik.

andresilva · 2024-07-26T13:54:49Z

Yes, don't see any problem with that (and should make the issue less likely to happen).

For the test above you can also remove the BeforeBestBlockBy rule to make sure the validators always vote for the latest block, which should make issues with reorgs even more likely!

MOZGIII · 2024-07-26T13:59:08Z

Great!

To summarize, we'll test your fix with the following:

(1, 1) primary at babe
1s block production time
removed BeforeBestBlockBy at voting rules

UPD: implemented at #1106

MOZGIII · 2024-07-27T11:56:11Z

Some new input, as the chain stalled again just now:

last finalized block 8745707 0x3c97148e16cc04c5df90b96f64f37d9dddf083468a6c4df2dd447f5396df58fd
epoch start 8745697 0xce0fac61989be9f69a27c5671e71b40f2adc579c231b933a613c5ab103fd5f7a (10 blocks away)
Reorg on #8745708,0x191a…c58d to #8745711,0xfb9c…21bc, common ancestor #8745707,0x3c97…58fd - to the future

MOZGIII · 2024-07-27T11:59:18Z

We did not push the fix just yet, only the bigger finalization delay

But we see this:

The issue seems to be that once the prevote for a round goes out it then can't progress and the grandpa stalls.

@andresilva @bkchr let me know if you need access to our infra

MOZGIII · 2024-07-27T12:10:44Z

Maybe some peers are not voting in the round because of the bug (and some do - like the one we have above), and thus the whole consensus stalls, as if the amount of online nodes is less that quorum...

Only 39% of the nodes have applied the finality delay update according to https://telemetry.humanode.io

andresilva · 2024-07-28T12:51:19Z

Yes, you'd need >66% of the nodes on the new version for it to effectively take effect.

MOZGIII · 2024-07-30T14:15:04Z

We are backporting the fix to our substrate fork; we are still working on reproducing the issue. @dmitrylavrenov will send an update when we have news on that front.

dmitrylavrenov · 2024-08-12T19:19:38Z

Status update

Still working on implementing our own chain-load-test k8s-based tool that allows to run substrate-based networks with different peers configs and some network chaos experiments (https://chaos-mesh.org).

Current plan is to run different load tests this week trying to reproduce faced issue.

bkchr · 2024-08-12T19:23:12Z

Still working on implementing our own chain-load-test k8s-based tool that allows to run substrate-based networks with different peers configs and some network chaos experiments (https://chaos-mesh.org).

Why not zombienet?

dmitrylavrenov · 2024-08-15T16:45:41Z

Still working on implementing our own chain-load-test k8s-based tool that allows to run substrate-based networks with different peers configs and some network chaos experiments (https://chaos-mesh.org).

Why not zombienet?

Will prepare the detailed answer a little bit later.

dmitrylavrenov · 2024-08-15T16:53:00Z

Status update

We were able to prepare a such config of our chain-load-test k8s-based tool that reproduces the exact issue we faced even with just 9 nodes.

Usually it requires less than 1 hour.

reproduced-grandpa-issue-logs-1.txt
reproduced-grandpa-issue-logs-2.txt
reproduced-grandpa-issue-logs.txt

Next step is preparing an image containing the fix and run tests again.

MOZGIII · 2024-08-17T06:24:14Z

On zombienet: in short, we tired it, and had issues with it.

The practical issues that we had was it rolled out too slow in our k8s testing environment, and took forever to deploy 1000 nodes (which we ran for testing some time ago).
Also, we dig deeper and found that it ~~does not provide features besides reading the logs and running assertions on those logs~~, and even that was heavily unoptimized for the scale of testing we needed.

We also did look under the hood of zombienet, and found it substantially overengineering to our taste: supporting multiple execution environments (besides k8s) is just not necessary for a tool like this, in our collective opinion.

Ultimately, we decided to go agains fixing zombienet issues and make our own thing. It is very basic so far, but much more convenient for us to use, and with it we were actually able to reproduce the finality stalls. Not from the get go, and we'd probably be able alter a zombienet setup in a similar way - but with zombienet it would be unlikely that we'd be able to integrate our k8s-specific setup into the upstream.
We are quite satisfied with the results so far, as we can proceed with the work on fixing this bug knowing we have means to reproduce the issue.

Compared to zombienet, our solution does not have the evens model just yet, and doesn't parse the logs - but we don't need it, although it will be not too difficult to add if we decide to.

We'll can do a showcase if you're interested, as it is in a private repo for now. We'll make it public soon (the soon tm kind of soon - no eta).

UPD: our initial motivation for trying zombienet was to run a test cluster with a lot of nodes (1000, 2000, 5000, ...) as we are preparing to grow our validators headcount.

dmitrylavrenov · 2024-08-19T11:29:49Z

On zombienet: in short, we tired it, and had issues with it.

The practical issues that we had was it rolled out too slow in our k8s testing environment, and took forever to deploy 1000 nodes (which we ran for testing some time ago). Also, we dig deeper and found that it ~~does not provide features besides reading the logs and running assertions on those logs~~, and even that was heavily unoptimized for the scale of testing we needed.

We also did look under the hood of zombienet, and found it substantially overengineering to our taste: supporting multiple execution environments (besides k8s) is just not necessary for a tool like this, in our collective opinion.

Ultimately, we decided to go agains fixing zombienet issues and make our own thing. It is very basic so far, but much more convenient for us to use, and with it we were actually able to reproduce the finality stalls. Not from the get go, and we'd probably be able alter a zombienet setup in a similar way - but with zombienet it would be unlikely that we'd be able to integrate our k8s-specific setup into the upstream. We are quite satisfied with the results so far, as we can proceed with the work on fixing this bug knowing we have means to reproduce the issue.

Compared to zombienet, our solution does not have the evens model just yet, and doesn't parse the logs - but we don't need it, although it will be not too difficult to add if we decide to.

We'll can do a showcase if you're interested, as it is in a private repo for now. We'll make it public soon (the soon tm kind of soon - no eta).

UPD: our initial motivation for trying zombienet was to run a test cluster with a lot of nodes (1000, 2000, 5000, ...) as we are preparing to grow our validators headcount.

I can add some important items from practical usage:

it was difficult to customize settings to make it working for networks with different SS58-prefix. Maybe it's already improved, but was needed to do some changes (the version I used in April-May) in the source code and rebuild by myself the tool.
It's not possible to rerun failed pods keeping running good ones. All time you need to rerun the network from clean. As a result, sometimes the test with 1000 nodes take almost forever as setup is terminated if an image for some pod were not able to be pulled for some networks reason.

dmitrylavrenov · 2024-08-19T11:32:27Z

Status update

@bkchr @andresilva we have great news!
We were able to test fixed version where we faced the same issue. But the consensus wasn't stucked and continued working as usual! Tested twice. You can check the logs :)

grandpa-fix-logs.txt
grandpa-fix-logs-1.txt

dmitrylavrenov · 2024-08-19T11:33:31Z

@MOZGIII I think we can close the issue and open the new one to backport the fix and ship to our mainnet?

andresilva · 2024-08-19T13:34:36Z

Thanks for the thorough testing 👌

MOZGIII · 2024-08-19T17:13:42Z

@MOZGIII I think we can close the issue and open the new one to backport the fix and ship to our mainnet?

Yes.

The investigation is now finished, and we were able to identify the bug, theorize and implement a fix, reproduce the bug reliably in a controlled environment, apply the fix and ensure the bug does not reproduce under the same conditions anymore, and that there are no other surprising effects from applying the fix.

We are now working on releasing the fix on the Humanode network.

Thanks everyone for taking part in this work!

MOZGIII added the tracking A tracking issue for a big multi-PR initiative label Jul 26, 2024

MOZGIII assigned MOZGIII and dmitrylavrenov Jul 26, 2024

MOZGIII mentioned this issue Jul 26, 2024

Finalisation stopped : Encountered error finding best chain paritytech/polkadot-sdk#3487

Closed

andresilva mentioned this issue Jul 26, 2024

grandpa: handle error from SelectChain::finality_target paritytech/polkadot-sdk#5153

Merged

This was referenced Jul 30, 2024

[backport] grandpa: handle error from SelectChain::finality_target (#5153) humanode-network/substrate#105

Merged

Apply substrate-backport: grandpa: handle error from SelectChain::finality_target (#5153) #1109

Merged

MOZGIII mentioned this issue Aug 2, 2024

humanode stay 8791754 nu #1119

Closed

This was referenced Aug 19, 2024

Backport fix for finalization stalling issue #1201

Closed

[backport] grandpa: handle error from SelectChain::finality_target humanode-network/substrate#108

Merged

MOZGIII closed this as completed Aug 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Finalization stalling issues 2024-07 #1104

Finalization stalling issues 2024-07 #1104

MOZGIII commented Jul 26, 2024 •

edited

Loading

MOZGIII commented Jul 26, 2024 •

edited

Loading

MOZGIII commented Jul 26, 2024 •

edited

Loading

dmitrylavrenov commented Jul 26, 2024

MOZGIII commented Jul 26, 2024 •

edited

Loading

MOZGIII commented Jul 26, 2024 •

edited

Loading

bkchr commented Jul 26, 2024

dmitrylavrenov commented Jul 26, 2024

dmitrylavrenov commented Jul 26, 2024

andresilva commented Jul 26, 2024

dmitrylavrenov commented Jul 26, 2024

MOZGIII commented Jul 26, 2024

dmitrylavrenov commented Jul 26, 2024 •

edited

Loading

andresilva commented Jul 26, 2024

MOZGIII commented Jul 26, 2024 •

edited

Loading

andresilva commented Jul 26, 2024 •

edited

Loading

MOZGIII commented Jul 26, 2024

andresilva commented Jul 26, 2024 •

edited

Loading

MOZGIII commented Jul 26, 2024 •

edited

Loading

MOZGIII commented Jul 27, 2024

MOZGIII commented Jul 27, 2024

MOZGIII commented Jul 27, 2024 •

edited

Loading

andresilva commented Jul 28, 2024 •

edited

Loading

MOZGIII commented Jul 30, 2024

dmitrylavrenov commented Aug 12, 2024 •

edited

Loading

bkchr commented Aug 12, 2024

dmitrylavrenov commented Aug 15, 2024

dmitrylavrenov commented Aug 15, 2024

MOZGIII commented Aug 17, 2024 •

edited

Loading

dmitrylavrenov commented Aug 19, 2024

dmitrylavrenov commented Aug 19, 2024

dmitrylavrenov commented Aug 19, 2024 •

edited

Loading

andresilva commented Aug 19, 2024

MOZGIII commented Aug 19, 2024

Finalization stalling issues 2024-07 #1104

Finalization stalling issues 2024-07 #1104

Comments

MOZGIII commented Jul 26, 2024 • edited Loading

MOZGIII commented Jul 26, 2024 • edited Loading

MOZGIII commented Jul 26, 2024 • edited Loading

dmitrylavrenov commented Jul 26, 2024

MOZGIII commented Jul 26, 2024 • edited Loading

MOZGIII commented Jul 26, 2024 • edited Loading

bkchr commented Jul 26, 2024

dmitrylavrenov commented Jul 26, 2024

dmitrylavrenov commented Jul 26, 2024

andresilva commented Jul 26, 2024

dmitrylavrenov commented Jul 26, 2024

MOZGIII commented Jul 26, 2024

dmitrylavrenov commented Jul 26, 2024 • edited Loading

andresilva commented Jul 26, 2024

MOZGIII commented Jul 26, 2024 • edited Loading

andresilva commented Jul 26, 2024 • edited Loading

MOZGIII commented Jul 26, 2024

andresilva commented Jul 26, 2024 • edited Loading

MOZGIII commented Jul 26, 2024 • edited Loading

MOZGIII commented Jul 27, 2024

MOZGIII commented Jul 27, 2024

MOZGIII commented Jul 27, 2024 • edited Loading

andresilva commented Jul 28, 2024 • edited Loading

MOZGIII commented Jul 30, 2024

dmitrylavrenov commented Aug 12, 2024 • edited Loading

bkchr commented Aug 12, 2024

dmitrylavrenov commented Aug 15, 2024

dmitrylavrenov commented Aug 15, 2024

MOZGIII commented Aug 17, 2024 • edited Loading

dmitrylavrenov commented Aug 19, 2024

dmitrylavrenov commented Aug 19, 2024

dmitrylavrenov commented Aug 19, 2024 • edited Loading

andresilva commented Aug 19, 2024

MOZGIII commented Aug 19, 2024

MOZGIII commented Jul 26, 2024 •

edited

Loading

MOZGIII commented Jul 26, 2024 •

edited

Loading

MOZGIII commented Jul 26, 2024 •

edited

Loading

MOZGIII commented Jul 26, 2024 •

edited

Loading

MOZGIII commented Jul 26, 2024 •

edited

Loading

dmitrylavrenov commented Jul 26, 2024 •

edited

Loading

MOZGIII commented Jul 26, 2024 •

edited

Loading

andresilva commented Jul 26, 2024 •

edited

Loading

andresilva commented Jul 26, 2024 •

edited

Loading

MOZGIII commented Jul 26, 2024 •

edited

Loading

MOZGIII commented Jul 27, 2024 •

edited

Loading

andresilva commented Jul 28, 2024 •

edited

Loading

dmitrylavrenov commented Aug 12, 2024 •

edited

Loading

MOZGIII commented Aug 17, 2024 •

edited

Loading

dmitrylavrenov commented Aug 19, 2024 •

edited

Loading