Kusama parachains block production slowed down #3314

Closed
2 tasks done
KarimJedda opened this issue Feb 13, 2024 · 64 comments
Labels
I2-bug The node fails to follow expected behavior. I10-unconfirmed Issue might be valid, but it's not yet known.

Comments

@KarimJedda commented Feb 13, 2024

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

  • Block production has slowed down on Kusama parachains
  • The issue was raised by Khala & Basilisk, with reports of 130s blocks on Basilisk
  • Kusama seems to be impacted as well

We're investigating, let me know how we can help. @eskimor would you mind adding the peer count screenshot here as well for reference please?

cc @lovelaced

Steps to reproduce

@KarimJedda added the I2-bug and I10-unconfirmed labels on Feb 13, 2024
@KarimJedda (Author)

Below is an extract from our data for the Kusama relay chain:

-- number of blocks that took more than 6002 ms to produce (limited to start of February 2024 until today)
SELECT 
  CAST(block_time as DATE) as day, 
  count(number) as slow_blocks
FROM `parity-data-infra-evaluation.dotlake_playground.kusama_feb_24_blocktimes` 
WHERE time_to_produce_ms > 6002 
GROUP BY day
ORDER BY day ASC;
day slow_blocks
2024-02-01 148
2024-02-02 142
2024-02-03 106
2024-02-04 136
2024-02-05 137
2024-02-06 118
2024-02-07 162
2024-02-08 139
2024-02-09 109
2024-02-10 112
2024-02-11 106
2024-02-12 106
2024-02-13 (17:30 CET) 75

For reference, here's the same thing but by day and since January 2024:

Screenshot from 2024-02-13 21-02-38

Here are the top 15, ordered by the time it took for them to be produced:

SELECT 
  number,
  block_time,
  time_to_produce_ms
FROM `parity-data-infra-evaluation.dotlake_playground.kusama_feb_24_blocktimes` 
ORDER BY time_to_produce_ms DESC
LIMIT 15;

(Subscan linked for convenience)

number block_time time_to_produce_ms
21806210 2024-02-09 19:31:54.000000 UTC 42000
21824720 2024-02-11 02:30:06.000000 UTC 41999
21758223 2024-02-06 11:12:48.001000 UTC 18001
21768615 2024-02-07 04:36:00.001000 UTC 18001
21818664 2024-02-10 16:20:54.000000 UTC 18000
21754427 2024-02-06 04:52:30.000000 UTC 18000
21727213 2024-02-04 07:21:00.000000 UTC 18000
21851077 2024-02-12 22:32:18.000000 UTC 17999
21766166 2024-02-07 00:29:30.000000 UTC 17999
21686302 2024-02-01 10:58:12.000000 UTC 17999
21778337 2024-02-07 20:53:48.000000 UTC 17992
21745818 2024-02-05 14:28:43.863000 UTC 13862
21711425 2024-02-03 04:58:12.481000 UTC 12480
21811860 2024-02-10 04:58:06.168000 UTC 12167
21700641 2024-02-02 10:58:06.137000 UTC 12137

And the list of slow blocks (that took longer than 6002 ms, February 2024) for reference: https://storage.googleapis.com/dashboards.data.paritytech.io/debug/slow_blocks.csv
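For anyone who wants to poke at the raw data, a quick sketch for pulling that CSV locally and inspecting it first (the exact column layout is whatever the export contains, so check the header before querying):

```
# Download the slow-blocks CSV referenced above and take a first look at it.
curl -s https://storage.googleapis.com/dashboards.data.paritytech.io/debug/slow_blocks.csv -o slow_blocks.csv
head -n 5 slow_blocks.csv   # inspect the header and column layout
wc -l slow_blocks.csv       # rough count of slow blocks (minus the header line)
```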

If there's anything else we can provide on the parachain side, please let us know.

@alexggh (Contributor) commented Feb 14, 2024

Investigation

We did some initial analysis. Looking at the dashboard, it seems that around 2024-02-12 18:00 UTC the average block times jumped slightly, see the graph below.

Screenshot 2024-02-14 at 13 18 04

Looking at the dashboards around the same time, we also see some errors in the pending-connection metrics on our Kusama validators. Not sure if it is related to the degradation, but I'm putting it here in case it rings a bell.

Screenshot 2024-02-14 at 09 40 51

Potential root cause (behaviour observed after 2024-02-12 18:00 UTC)

Looking at a few parachains, we observed that there are one-minute periods where no blocks get backed for the parachains, and after further investigation we found that the parachains stop producing blocks while they are assigned to one particular backing group of the session:

For example, in session 36926 the AssetHub and BridgeHub parachains don't produce blocks while they are allocated to group 20. This can be double-checked by looking at the CandidateIncluded events: both parachains produce blocks while on group 19, then have a gap while they are on group 20, and start producing blocks again once their assignment moves to group 21.

Parachain 1000: https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Frpc.dotters.network%2Fkusama#/explorer/query/21870979

Parachain 1002: https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Frpc.dotters.network%2Fkusama#/explorer/query/21870956

The same thing happens during session 36927, and this time group 44 is the bad one.

The potential root cause is that we have a bunch of validators that are either in a bad state or running a version that does not support the async-backing logic, so they can't participate in the backing process. If 2 out of 3 validators in a backing group are bad, the parachain won't produce blocks for the entire duration (1 minute) it is allocated to that backing group.

Looking at https://apps.turboflakes.io/?chain=kusama#/insights at session 36926, there seem to be around 17 out of 300 validators with 0 backing votes, so those might be the culprits for why we have a bad backing group every session.
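As a rough sanity check (a sketch, assuming backing groups of 3 with 2 votes required as described above, ~17 of 300 validators affected, and independent random group assignment), the chance that a given group contains at least 2 affected validators is:

```
# P(at least 2 of the 3 group members are affected), with p = 17/300
awk 'BEGIN { p = 17/300; printf "%.4f\n", 3*p*p*(1-p) + p*p*p }'   # ~0.009
```

With roughly 100 groups per session, that works out to about one bad group per session, which is consistent with the gaps observed above.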

The list of validators for session 36927 is here:

Screenshot 2024-02-14 at 13 37 15 Screenshot 2024-02-14 at 13 37 27

Note! This is just a snapshot for session 36927, so it is not an exhaustive list.

Next steps

  • Ping validators on Kusama lounge to see what versions the ones with 0 backing votes are running and what logs they see.

@ggwpez (Member) commented Feb 14, 2024

Ping validators on Kusama lounge to see what versions the ones with 0 backing votes are running.

You can try to query their version RPC.
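For example, something like the following would return the version string for a node whose JSON-RPC endpoint is actually reachable (a sketch; VALIDATOR_HOST and the default port 9944 are placeholders):

```
# system_version is the standard Substrate RPC method returning the node's version string.
curl -s -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":1,"method":"system_version","params":[]}' \
  http://VALIDATOR_HOST:9944
```

In practice validators shouldn't have this port exposed, which is the limitation pointed out below.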

@bkchr (Member) commented Feb 14, 2024

Ping validators on Kusama lounge to see what versions the ones with 0 backing votes are running.

You can try to query their version RPC.

I hope that you don't find anyone that has opened their rpc port to the outside 🙃

@bkchr (Member) commented Feb 14, 2024

What we should do is write some tool that connects to all validators on the p2p layer. We can query all validator addresses using the DHT and the initial message contains the version information about the node. Using this we could write some tool to generate a list of validators that run on an old version.

@ordian (Member) commented Feb 14, 2024

What we should do is write some tool that connects to all validators on the p2p layer. We can query all validator addresses using the DHT and the initial message contains the version information about the node. Using this we could write some tool to generate a list of validators that run on an old version.

I've heard someone already built this tool: https://forum.polkadot.network/t/experiment-p2p-network-visualization/4583/8

@bkchr (Member) commented Feb 14, 2024

Perfect! This is exactly what we need.

@sandreim (Contributor)

Would it be possible to implement a better incentive for validators to upgrade? One idea is to remove any validator from the active set if their version is lagging 2+ releases behind, but I am not sure how feasible that is to implement.

@bkchr (Member) commented Feb 14, 2024

Sounds like you are proposing to do hardforks :P Maybe a little bit softer than ETH hard forks, but yeah :P

Apparently people should already get less rewards and they don't care.

@BulatSaif (Contributor)

Here is what I was able to extract from the DHT with nebula:

    802 "Parity Polkadot/v1.7.0-97df9dd6554 
    405 "Parity Polkadot/v1.6.0-680adc78ab3 
    106 "Parity Polkadot/v1.7.0-2fe3145ab9b 
     53 "Parity Polkadot/v1.5.0-a3dc2f15f23 
     53 "Parity Polkadot/v1.4.0-a3e4fce9d3f 
     37 "Parity Polkadot/v1.6.0-481165d9229 
     36 "Parity Polkadot/v0.9.41-e203bfb396e 
     34 "Parity Polkadot/v1.3.0-7c9fd83805c 
     19 "Parity Polkadot/v1.0.0-1ed6e2e50a4 
     13 "Parity Polkadot/v1.6.0-85191e94b5e 
      9 "Parity Polkadot/v1.4.0-00395a23d12 
      7 "Parity Polkadot/v1.1.0-f60318f6868 
      7 "Parity Polkadot/v0.9.42-9b1fc27cec4 
      3 "Parity Polkadot/v1.0.0-c9ec8c5a159 
      3 "Parity Polkadot/v0.9.43-ba42b9ce51d 
      2 "Parity Polkadot/v1.7.0-unknown 
      2 "Parity Polkadot/v1.6.0-unknown 
      2 "Parity Polkadot/v1.6.0-ff2e7dbd355 
      2 "Parity Polkadot/v1.6.0-c415f8f0bd6 
      2 "Parity Polkadot/v0.9.40-a2b62fb872b 
      2 "Parity Polkadot/v0.9.32-c71e872afed 
      1 "Parity Polkadot/v1.6.0-b8f55d1b769 
      1 "Parity Polkadot/v1.6.0-481165d 
      1 "Parity Polkadot/v1.5.0-e5b2adac7ad 
      1 "Parity Polkadot/v1.5.0-64d52f225e1 
      1 "Parity Polkadot/v1.2.0-930cabbe271 
      1 "Parity Polkadot/v1.2.0-72c45356393 
      1 "Parity Polkadot/v0.9.43-52209dcfe54 
      1 "Parity Polkadot/v0.9.38-909567fe6da 
      1 "Parity Polkadot/v0.9.31-32dd0c9cfcd 
      1 "Parity Polkadot/v0.9.30-064536093f5 
      1 "Parity Polkadot/v0.9.25-5174e9ae75b 

Here is the script I used:

nebula crawl --network KUSAMA  --json-out /tmp/out > kusama.txt
cat kusama.txt | grep Agent | cut -d = -f3- | grep "Parity Polkadot"| cut -d'(' -f1| sort |  uniq -c |  sort -bgr
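A small variant of the script above (same kusama.txt crawl output assumed) that collapses the per-commit build strings into plain release numbers, for a quicker view of the version distribution:

```
# Count peers per release (vX.Y.Z), ignoring the trailing commit hash.
cat kusama.txt | grep Agent | cut -d = -f3- | grep "Parity Polkadot" \
  | grep -Eo 'v[0-9]+\.[0-9]+\.[0-9]+' | sort | uniq -c | sort -bgr
```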

@alexggh (Contributor) commented Feb 14, 2024

Here what I was able to extract from DHT with nebula:

Thank you, the numbers seem quite different from what I see on telemetry, any idea why?
Screenshot 2024-02-14 at 15 28 26

I see a lot more 1.3.0, 0.9.43 and 0.9.42, any idea where the difference is coming from?

@sandreim (Contributor)

Sounds like you are proposing to do hardforks :P Maybe a little bit softer than ETH hard forks, but yeah :P

There shouldn't be any forks :P

Apparently people should already get less rewards and they don't care.

I think they do care more if they get kicked out of the active set. Also, we won't waste valuable time investigating issues caused by ancient validators, which has happened in a few instances.

@BulatSaif (Contributor)

I see a lot more 1.3.0, 0.9.43 and 0.9.42, any idea where the difference is coming from?

I guess these are parachain nodes with an embedded relay-chain node; they are usually not configured and have a random name. Also, their p2p port is not open, so Nebula is not able to connect and check the version. The nodes I listed all have their p2p port open to the internet.

I forgot to add report summary:

crawlDuration=6m57.330136797s crawledPeers=9500 dialablePeers=2261 undialablePeers=7239

7239 nodes have their p2p port closed and may be running any version.

FYI, there are multiple telemetry instances: https://telemetry.w3f.community, https://telemetry.polkadot.io/

@alexggh (Contributor) commented Feb 14, 2024

Parachain block production is starting to look like its values from before 2024-02-12 18:00 UTC:

The parachain block production average
Screenshot 2024-02-14 at 16 10 23

AssetHub
Screenshot 2024-02-14 at 16 06 44

@alexggh (Contributor) commented Feb 14, 2024

FYI, there are multiple telemetry: https://telemetry.w3f.community/, https://telemetry.polkadot.io/

Any idea which one gives us the right picture?

@lovelaced

FYI, there are multiple telemetry: https://telemetry.w3f.community/, https://telemetry.polkadot.io/

Any idea which one gives us the right picture?

A combination of them and also none of them; the w3f telemetry is for the 1kv program, and the polkadot.io one is for validators who are connected to telemetry. Telemetry is on by default, but plenty of validators do not run with public telemetry enabled, so it's not possible to get a complete picture this way.

@bkchr (Member) commented Feb 14, 2024

I think they do care more if they get kicked out of the active set. Also, we won't waste valuable time investigating issues caused by ancient validators, which has happened in a few instances.

As already discussed in Element, we should change the rewards: decrease the rewards per block built to a minimum and put most of the rewards into doing useful work for the parachain consensus.

The tool for finding out what version a validator is running could also be used as a data source for informing nominators about which validators to nominate.

@JelliedOwl commented Feb 14, 2024

I see a lot more 1.3.0, 0.9.43 and 0.9.42, any idea where the difference is coming from?

I guess these are parachain nodes with an embedded relay-chain node; they are usually not configured and have a random name. Also, their p2p port is not open, so Nebula is not able to connect and check the version. The nodes I listed all have their p2p port open to the internet.

There's definitely some of this going on. For example one of my parachain collators shows up on telemetry as "Parity Polkadot v0.9.28-something". That one definitely isn't validating.

(And, yes, an upgrade is long overdue, and I worry about the chain stalling every time Kusama upgrades...)

@bkchr (Member) commented Feb 14, 2024

(And, yes, an upgrade is long overdue, and I worry about the chain stalling every time Kusama upgrades...)

Generally we are binary compatible back to Kusama launch. Also most of the networking protocols are still in place, so you don't need to worry that much ;)

@JelliedOwl commented Feb 14, 2024

Generally we are binary compatible back to Kusama launch. Also most of the networking protocols are still in place, so you don't need to worry that much ;)

It's more a nagging concern at the back of my mind than a full-on worry. I do hope to find the time / funding to upgrade it though. :-)

[Edit: sorry BulatSaif already said this.] Oh, and I think @alexggh might have been asking why there are far fewer nodes on telemetry than in the P2P-derived stats. Being on telemetry is optional so many nodes aren't on it (including 3 of mine that I thought were). There's also the 1KV program telemetry that probably has a list of nodes, only some of which are on the main telemetry.
https://telemetry.w3f.community/#list/0xb0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe

@alexggh (Contributor) commented Feb 15, 2024

Unfortunately, the situation seems to have drastically deteriorated overnight around 2024-02-15 22:00 UTC; the dashboards look even worse now.

Screenshot 2024-02-15 at 07 27 45

Runtime api requests jumped 20%
Screenshot 2024-02-15 at 07 28 13

Screenshot 2024-02-15 at 07 28 39

The only thing I see in the logs is that around that time we got some disputes, but they seem to have concluded. Around that time, this error was output during the dispute, then stopped.

2024-02-14  cannot query the runtime API version: Api called for an unknown Block: State already discarded for 0xda8c113bdaf76c305020386f430bc49ff6edcd28bd38c84ff1f04d7b7e67c1cb

Looking here: https://apps.turboflakes.io/?chain=kusama#/insights?mode=history, it seems that around that time a bunch of validators started getting very low reward rates (0) and they continue getting them for the following sessions, so it feels like something is putting them in a bad state.

@bkchr (Member) commented Feb 15, 2024

but they seem to have concluded. Around that time, this error was output during the dispute, then stopped.

Do you know in which release we included the fix to prevent disputes being raised for candidates in finalized blocks?

@alexggh (Contributor) commented Feb 15, 2024

but they seem to have concluded. Around that time, this error was output during the dispute, then stopped.

Do you know in which release we included the fix to prevent disputes being raised for candidates in finalized blocks?

Don't know which fix you are referring to.

@tdimitrov (Contributor)

@bkchr #1358 -> this one?
If I'm not mistaken it's included since 1.0.0

@bkchr (Member) commented Feb 15, 2024

Yeah this one and yeah it got included in 1.1.

The log lines of state being discarded for the dispute reminded me of this. We should probably still figure out why the state wasn't available anymore as this should not happen.

@Polkadot-Forum

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/polkadot-digest-15-feb-2024/6218/1

@eskimor (Member) commented Feb 15, 2024

@BulatSaif What is nebula and where can I find it?

@BulatSaif (Contributor)

@BulatSaif What is nebula and where can I find it?

It is a DHT crawler: https://github.com/dennis-tra/nebula

alexggh added a commit that referenced this issue Feb 26, 2024
…ng (#3419)

Add more debug logs to understand if statement-distribution is in a bad
state, should be useful for debugging
#3314 on production
networks.

Additionally, increase the number of parallel requests we make, since I
noticed that requests take around 100ms on Kusama, and the limit of 5
parallel requests was picked mostly at random; there is no reason why we
can't do more than that.

---------

Signed-off-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: ordian <[email protected]>
(cherry picked from commit b9c792a)
Signed-off-by: Alexandru Gheorghe <[email protected]>
EgorPopelyaev pushed a commit that referenced this issue Feb 26, 2024
…ng (#3476)

cherry picked from master commit
b9c792a

---------

Add more debug logs to understand if statement-distribution is in a bad
state, should be useful for debugging
#3314 on production
networks.

Additionally, increase the number of parallel requests we make, since I
noticed that requests take around 100ms on Kusama, and the limit of 5
parallel requests was picked mostly at random; there is no reason why we
can't do more than that.

Signed-off-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: ordian <[email protected]>
@burdges commented Mar 4, 2024

We should really drop outdated hosts before updates that require host updates. In principle, we could just record the minimum required and recommended host versions on-chain, make validators warn when behind the recommended version, and make them shut down when behind the required one. This doesn't solve the immediate problem however, so..

We could add host version into their NPoS candidacy, so then other validators would drop them in NPoS. I doubt anything more complex would be warranted.

It's perfectly reasonable byzantine behavior to run an old host and lie about version numbers though, so this probably doesn't solve the whole problem.

@sandreim (Contributor) commented Mar 4, 2024

This is fairly similar to what I had in mind and have been discussing offline, but @bkchr didn't fully support it at the time. These ancient hosts are clearly not getting rewards in the current situation, so they still have an incentive to upgrade. The only downside in this specific case is that they can degrade the network's performance, which suggests to me that we need some form of punishment to prevent it rather than relying only on positive rewards.

@bkchr (Member) commented Mar 4, 2024

Yeah, generally I'm not fully on board yet with reporting random versions to the runtime. I mean, the node can just report whatever the runtime expects it to report. We would also need to standardize these versions, as other node implementors would need to report the same version when they have implemented the same features etc.

I opened an issue for reworking the reward points distribution: polkadot-fellows/runtimes#214

If this isn't working, I'm open for what you both proposed here.

@JelliedOwl

However you implement this (if you ever do), please don't forget about us collators, running potentially pretty old versions on the Kusama side... :-). How old a version does the CI process test to make sure it can still sync?

@burdges commented Mar 4, 2024

the node can just report whatever the runtime expects it to report

Yes, but then if they change the code like this then it's their fault if they get slashed somehow.

We could easily design this so they must look at the code before reporting the false version, at which point they're doing more work than not upgrading. Imho, that's overkill since we're trying to help them really, and an unnecessary development burden (each version has an associated random u64).

Arguably, we should do this on Polkadot, but not immediately on Kusama, since we want some misbehavior on Kusama.

please don't forget about us collators

We're discussing validators mostly, but yes, we do need to take host upgrades into consideration for collators too. It'll be off-chain messaging (XCMP) that really brings the chaos here.

alexggh added a commit to alexggh/runtimes that referenced this issue Mar 7, 2024
This will enable async-backing subsystems on polkadot; the relaychain
would work in legacy mode where only candidates built on top of the
previous relay chain are activated.

- Had to bring in an unrelated change as well, `minimum_backing_votes`,
  because that was v6 and we can't skip versions. The value of that
  configuration on polkadot is 2, so that's what is going to be used
  after this runtime update. Changes have been running on kusama.

Async backing subsystems are a major change in the way the collator-protocol
and the backing work, so there are still some unknowns about whether this is
completely bug free. It has been running on kusama for a month
already, but not without issues:
 - Validators that did not upgrade to compatible versions will not be
   able to participate in backing, so if enough of those are randomly
   picked, that group won't be able to back anything. With
   backing_group_size = 5, minimum_backing_votes = 2 and 10% of validators
   not upgraded, that chance is about 2.5%.

 - Additionally, same un-upgraded groups won't be able to include the
   backing votes on chain when they author blocks, so 10% of the blocks
   won't back any candidates.

 - We are still not sure if item 2) from here paritytech/polkadot-sdk#3314 (comment)
   is caused by async backing, the observable issue is sometimes after restart/upgrade
   some validators are getting 0 rewards and apparently they are not
   backing anything.

Signed-off-by: Alexandru Gheorghe <[email protected]>
@tunkas commented Mar 11, 2024

Not sure if this will be helpful on this issue, but it might be, so I'm posting my logs.
It happened on my Kusama validator, in the first session I paravalidated after getting into the active set.
I was missing all the votes in that session, so 45 minutes into the active set a restart was performed. Not sure if this had any influence on resolving the matter, since I still missed all the votes within that session.
The next session went without missing a significant amount of votes.
Here are the logs for that session and the subsequent one.
I'm running the latest version, of course.
Kusama error report.txt

@dcolley (Contributor) commented Mar 12, 2024

https://apps.turboflakes.io/?chain=kusama#/validator/JKhBBSWkr8BJKh5eFBtRux4hsDq4sAxvvmMU426qUA9aqEQ

  1. What version you are running.
    1.8.0

  2. Logs around that time, the more logs you can provide the better.


2024-03-11 18:56:54 💤 Idle (12 peers), best: #22250622 (0x70e8…48f8), finalized #22250619 (0x0168…d6c3), ⬇ 2.9MiB/s ⬆ 2.6MiB/s
2024-03-11 18:56:59 💤 Idle (12 peers), best: #22250622 (0x70e8…48f8), finalized #22250620 (0x1248…9c9b), ⬇ 3.7MiB/s ⬆ 3.9MiB/s
2024-03-11 18:57:00 ✨ Imported #22250623 (0x44c9…5815)
2024-03-11 18:57:04 💤 Idle (12 peers), best: #22250623 (0x44c9…5815), finalized #22250620 (0x1248…9c9b), ⬇ 2.8MiB/s ⬆ 3.5MiB/s
2024-03-11 18:57:06 ✨ Imported #22250624 (0xf82d…5620)
2024-03-11 18:57:06 ✨ Imported #22250624 (0x6339…6a9e)
2024-03-11 18:57:09 💤 Idle (12 peers), best: #22250624 (0xf82d…5620), finalized #22250621 (0x1429…942a), ⬇ 2.2MiB/s ⬆ 3.0MiB/s
2024-03-11 18:57:12 ✨ Imported #22250625 (0xd906…805a)
2024-03-11 18:57:14 💤 Idle (12 peers), best: #22250625 (0xd906…805a), finalized #22250622 (0x70e8…48f8), ⬇ 1.6MiB/s ⬆ 2.1MiB/s
2024-03-11 18:57:18 ✨ Imported #22250626 (0x3d86…c5d6)
2024-03-11 18:57:18 ♻️  Reorg on #22250626,0x3d86…c5d6 to #22250626,0xa7af…d809, common ancestor #22250625,0xd906…805a
2024-03-11 18:57:18 ✨ Imported #22250626 (0xa7af…d809)
2024-03-11 18:57:19 💤 Idle (10 peers), best: #22250626 (0xa7af…d809), finalized #22250623 (0x44c9…5815), ⬇ 2.3MiB/s ⬆ 2.9MiB/s
2024-03-11 18:57:24 ✨ Imported #22250627 (0xffd6…0d62)
2024-03-11 18:57:24 ✨ Imported #22250627 (0x6aaf…cbca)
2024-03-11 18:57:24 ✨ Imported #22250627 (0x129f…5e18)
2024-03-11 18:57:24 💤 Idle (10 peers), best: #22250627 (0xffd6…0d62), finalized #22250624 (0xf82d…5620), ⬇ 2.1MiB/s ⬆ 2.8MiB/s
2024-03-11 18:57:29 💤 Idle (11 peers), best: #22250627 (0xffd6…0d62), finalized #22250625 (0xd906…805a), ⬇ 1.6MiB/s ⬆ 2.1MiB/s
2024-03-11 18:57:30 ✨ Imported #22250628 (0x41d1…dd49)
2024-03-11 18:57:30 ♻️  Reorg on #22250628,0x41d1…dd49 to #22250628,0xeb99…6bb5, common ancestor #22250627,0xffd6…0d62
2024-03-11 18:57:30 ✨ Imported #22250628 (0xeb99…6bb5)
2024-03-11 18:57:30 ✨ Imported #22250628 (0x1091…17dc)
2024-03-11 18:57:30 ✨ Imported #22250628 (0xd5e2…0382)
2024-03-11 18:57:34 💤 Idle (12 peers), best: #22250628 (0xeb99…6bb5), finalized #22250625 (0xd906…805a), ⬇ 2.1MiB/s ⬆ 3.5MiB/s
...
2024-03-11 18:57:48 ✨ Imported #22250631 (0xf38f…2bce)
2024-03-11 18:57:48 Cluster has too many pending statements, something wrong with our connection to our group peers

Restart might be needed if validator gets 0 backing rewards for more than 3-4 consecutive sessions pending_statements={ValidatorIndex(91): {(ValidatorIndex(91), CompactStatement::Seconded(0x5e3efe877a0496964281fe2fb92125aed001360d96d6521e0e0c9abc66ecdb6c))}, ValidatorIndex(92): {(ValidatorIndex(91), CompactStatement::Seconded(0x5e3efe877a0496964281fe2fb92125aed001360d96d6521e0e0c9abc66ecdb6c))}, ValidatorIndex(90): {(ValidatorIndex(91), CompactStatement::Seconded(0x5e3efe877a0496964281fe2fb92125aed001360d96d6521e0e0c9abc66ecdb6c))}} parent_hash=0xeb997bbb60ca75e90935d59c1b142fa57aac5402cd28b577fa2211cd7a026bb5
2024-03-11 18:57:48 Cluster has too many pending statements, something wrong with our connection to our group peers

Restart might be needed if validator gets 0 backing rewards for more than 3-4 consecutive sessions pending_statements={ValidatorIndex(90): {(ValidatorIndex(91), CompactStatement::Seconded(0x54ec5323d263a93d6fc973d7b5d7a426649ce7d3e87d33e447e09089f5e9b2bc))}, ValidatorIndex(91): {(ValidatorIndex(91), CompactStatement::Seconded(0x54ec5323d263a93d6fc973d7b5d7a426649ce7d3e87d33e447e09089f5e9b2bc))}, ValidatorIndex(92): {(ValidatorIndex(91), CompactStatement::Seconded(0x54ec5323d263a93d6fc973d7b5d7a426649ce7d3e87d33e447e09089f5e9b2bc))}} parent_hash=0x1091044e68e1e359c1518de03a776d0f14b81f7aed347fbc0fca7a49dfed17dc
2024-03-11 18:57:48 Cluster has too many pending statements, something wrong with our connection to our group peers

Restart might be needed if validator gets 0 backing rewards for more than 3-4 consecutive sessions pending_statements={ValidatorIndex(91): {(ValidatorIndex(91), CompactStatement::Seconded(0x43bd5e82bdc8ce77c77d3e68f983719bb94dd4e2e6b98af5d6e42523d9a1de5a))}, ValidatorIndex(92): {(ValidatorIndex(91), CompactStatement::Seconded(0x43bd5e82bdc8ce77c77d3e68f983719bb94dd4e2e6b98af5d6e42523d9a1de5a))}, ValidatorIndex(90): {(ValidatorIndex(91), CompactStatement::Seconded(0x43bd5e82bdc8ce77c77d3e68f983719bb94dd4e2e6b98af5d6e42523d9a1de5a))}} parent_hash=0xd5e2b78489e5eda3fabe2845417bce90215b94a7685d1419af1a75b6eacb0382
2024-03-11 18:57:48 Cluster has too many pending statements, something wrong with our connection to our group peers

Restart might be needed if validator gets 0 backing rewards for more than 3-4 consecutive sessions pending_statements={ValidatorIndex(90): {(ValidatorIndex(91), CompactStatement::Seconded(0x927a5f537c174e8d4790b7e2708a242bebba6572db71424dbb4ace9511b8d1c3))}, ValidatorIndex(91): {(ValidatorIndex(91), CompactStatement::Seconded(0x927a5f537c174e8d4790b7e2708a242bebba6572db71424dbb4ace9511b8d1c3))}, ValidatorIndex(92): {(ValidatorIndex(91), CompactStatement::Seconded(0x927a5f537c174e8d4790b7e2708a242bebba6572db71424dbb4ace9511b8d1c3))}} parent_hash=0x41d1cf34c40cdda18adf75b22ebf37bad22b623fad674d0d050d3531865ddd49
2024-03-11 18:57:49 💤 Idle (12 peers), best: #22250631 (0xf38f…2bce), finalized #22250628 (0x1091…17dc), ⬇ 3.4MiB/s ⬆ 3.4MiB/s
2024-03-11 18:57:54 ✨ Imported #222506
  3. Any event around that time, e.g. we upgraded/restarted the validator, we observed the validator entering the active set, anything that you think might be relevant.
    No intervention

  4. Anything relevant/special about your network setup
    Running in docker

  5. OS, does the validator pass the hardware requirement checks.
    yes

  6. Any action taken to get the validator out of this state, or did it recover by itself.
    recovered by itself, next session was A+

@eskimor (Member) commented Mar 12, 2024

Running in docker

I remember we had issues with Docker in the past. Are other affected validators also running on Docker?

@dcolley (Contributor) commented Mar 12, 2024

These are the startup logs:

│2024-03-04 20:09:38 This chain is not in any way                                                                                                 ┤
│2024-03-04 20:09:38       endorsed by the                                                                                                        ┤
│2024-03-04 20:09:38      KUSAMA FOUNDATION                                                                                                       ┤
│2024-03-04 20:09:38 ----------------------------                                                                                                 ┤
│2024-03-04 20:09:38 Parity Polkadot                                                                                                              ┤
│2024-03-04 20:09:38 ✌️  version 1.8.0-ec7817e5adc                                                                                                 ┤
│2024-03-04 20:09:38 ❤️  by Parity Technologies <[email protected]>, 2017-2024                                                                       ┤
│2024-03-04 20:09:38 📋 Chain specification: Kusama                                                                                               ┤
│2024-03-04 20:09:38 🏷  Node name: METASPAN3 (ALSO TRY POOL #50)                                                                                  ┤
│2024-03-04 20:09:38 👤 Role: AUTHORITY                                                                                                           ┤
│2024-03-04 20:09:38 💾 Database: ParityDb at /data/chains/ksmcc3/paritydb/full                                                                   ┤
│2024-03-04 20:09:40 🚀 Using prepare-worker binary at: "/usr/lib/polkadot/polkadot-prepare-worker"                                               ┤
│2024-03-04 20:09:40 🚀 Using execute-worker binary at: "/usr/lib/polkadot/polkadot-execute-worker"                                               ┤
│2024-03-04 20:09:40 Can't use warp sync mode with a partially synced database. Reverting to full sync mode.                                      ┤
│2024-03-04 20:09:40 🏷  Local node identity is: 12D3KooWPVSft536jRzKSGdiYLx4FQ2XekXrjkNjaSfxpxDCM1LW                                              ┤
│2024-03-04 20:09:40 Warp sync failed. Continuing with full sync.                                                                                 ┤
│2024-03-04 20:09:40 💻 Operating system: linux                                                                                                   ┤
│2024-03-04 20:09:40 💻 CPU architecture: x86_64                                                                                                  ┤
│2024-03-04 20:09:40 💻 Target environment: gnu                                                                                                   ┤
│2024-03-04 20:09:40 💻 CPU: 12th Gen Intel(R) Core(TM) i9-12900K                                                                                 ┤
│2024-03-04 20:09:40 💻 CPU cores: 16                                                                                                             ┤
│2024-03-04 20:09:40 💻 Memory: 128584MB                                                                                                          ┤
│2024-03-04 20:09:40 💻 Kernel: 5.15.0-89-generic                                                                                                 ┤
│2024-03-04 20:09:40 💻 Linux distribution: Ubuntu 22.04.3 LTS                                                                                    ┤
│2024-03-04 20:09:40 💻 Virtual machine: no                                                                                                       ┤
│2024-03-04 20:09:40 📦 Highest known block at #22150878                                                                                          ┤
│2024-03-04 20:09:40 〽️ Prometheus exporter started at 0.0.0.0:9615                                                                               ┤
│2024-03-04 20:09:40 Running JSON-RPC server: addr=127.0.0.1:9944, allowed origins=["http://localhost:*", "http://127.0.0.1:*", "https://localhost┤
│2024-03-04 20:09:40 👶 Starting BABE Authorship worker                                                                                           ┤
│2024-03-04 20:09:40 🥩 BEEFY gadget waiting for BEEFY pallet to become available...                                                              ┤
│2024-03-04 20:09:40 🚨 Some security issues have been detected.                                                                                  ┤
│Running validation of malicious PVF code has a higher risk of compromising this machine.                                                         ┤
│  - Optional: Cannot unshare user namespace and change root, which are Linux-specific kernel security features: not available: unshare user and m┤
│  - Optional: Cannot call clone with all sandboxing flags, a Linux-specific kernel security features: not available: could not clone, errno: EPER┤
│2024-03-04 20:09:40 👮‍♀️ Running in Secure Validator Mode. It is highly recommended that you operate according to our security guidelines.        ┤
│More information: https://wiki.polkadot.network/docs/maintain-guides-secure-validator#secure-validator-mode                                      ┤
│2024-03-04 20:09:40 Sending fatal alert BadCertificate                                                                                           ┤
│2024-03-04 20:09:40 🔍 Discovered new external address for our node: /ip4/195.144.22.130/tcp/30333/p2p/12D3KooWPVSft536jRzKSGdiYLx4FQ2XekXrjkNjaS┤
│2024-03-04 20:09:41 ✨ Imported #22150879 (0xa7de…1dc5)

@tunkas commented Mar 12, 2024

I remember we had issues with Docker in the past. Are other affected validators also running on Docker?

I'm not running on Docker 🤷🏻‍♂️

@eskimor (Member) commented Mar 12, 2024

One interesting metric would be: "polkadot_parachain_peer_count" for the validation peerset. To be queried like this:

polkadot_parachain_peer_count{protocol=~"validation/.*"}

How do these numbers behave with regards to era points? Once you become a para validator this should be around 1k.
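One quick way to check this on the validator itself (a sketch; it assumes the node's Prometheus exporter is reachable on localhost at the default port 9615, as in the startup logs above, so adjust host/port to your setup):

```
# Scrape the node's own metrics endpoint and filter the validation peerset peer count.
curl -s http://localhost:9615/metrics | grep polkadot_parachain_peer_count | grep validation
```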

@xxbbxb commented Mar 12, 2024

Yes, I can confirm we do experience a significant drop in connected peers after updating/restarting the node, until the moment we rotate session keys.
image

@eskimor (Member) commented Mar 12, 2024

We found the cause and are working on a fix. What operators can do right away: Make sure the node can persist all data, otherwise it will generate a new PeerId on each restart and this is causing issues.
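For reference, a minimal sketch of pinning the PeerId (the key path below is just an example; `polkadot key generate-node-key` is shown later in this thread, and `--node-key-file` is the standard flag for pointing the node at a persisted key):

```
# Generate the network key once and keep the file across restarts/redeployments.
polkadot key generate-node-key --file /var/lib/polkadot/node-key
# Always start the validator with the same key file so the PeerId stays stable.
polkadot --validator --chain kusama --node-key-file /var/lib/polkadot/node-key
```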

@Polkadot-Forum

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/async-backing-development-updates/6176/5

@alexggh (Contributor) commented Mar 13, 2024

Conclusions

There are 3 failure modes going on on Kusama; all result in validators with 0 backing rewards:

  1. Validators not updated to versions supporting async-backing; they will get 0 points until they upgrade -> nothing to be done here.

  2. Validators that change their Peer ID during restart because it is not persisted; the fix should come with Make polkadot behave correctly if node changes peer-id at restart #3673. Until then there are ways to work around this problem by persisting the PeerID. They will get 0 points after a restart until the workaround is applied.

  3. Validators reporting that in the first session after they enter the active set they get 0 backing points, reported several times in this ticket and in Node massively misses paravalidation votes. #3613. This is not as problematic as the others, because validators recover the next session, but a problem nonetheless.

Also thank you @xxbbxb for providing logs and helping me root cause item number 2.

Next priorities in order:

  • Root-cause what is wrong with validators entering the active set.
  • Provide fixes for validators changing their PeerID during restart.

fellowship-merge-bot bot added a commit to polkadot-fellows/runtimes that referenced this issue Mar 13, 2024
This will enable async-backing subsystems on polkadot; the relaychain
would work in legacy mode where only candidates built on top of the
previous relay chain are activated.

## Notes
- Had to bring in an unrelated change as well, `minimum_backing_votes`,
because that was v6 and we can't skip versions. The value of that
configuration on polkadot is 2, so that's what is going to be used after
this runtime update. Changes have been running on kusama.
- `disabled_validators`, `node_features`, `approval_voting_params` are
not related to async-backing, but given they are low risk, we decided
to include them as well, see comments.

## Known risks

Async backing subsystems are a major change in the way the collator-protocol
and the backing work, so there are still some unknowns about whether this is
completely bug free. It has been running on kusama for a month already,
but not without issues:
- Validators that did not upgrade to compatible versions will not be
able to participate in backing, so if enough of those are randomly
picked, that group won't be able to back anything. With
backing_group_size = 5, minimum_backing_votes = 2 and 10% of validators
not upgraded, that chance is about 2.5%.

- Additionally, same un-upgraded groups won't be able to include the
backing votes on chain when they author blocks, so 10% of the blocks
won't back any candidates.

- We are still not sure if item 2) from here
paritytech/polkadot-sdk#3314 (comment)
is caused by async backing, the observable issue is sometimes after
restart/upgrade some validators are getting 0 rewards and apparently
they are not backing anything.



---------

Signed-off-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: fellowship-merge-bot[bot] <151052383+fellowship-merge-bot[bot]@users.noreply.github.com>
@alexggh (Contributor) commented Mar 17, 2024

Root-cause what is wrong with validators entering the active set.

Ok, I think I figured out why this happens for new nodes entering the active set. Long story short, the node will start advertising its AuthorityId on the DHT only after it becomes active, so the other nodes won't be able to detect it at the beginning of the session. More details about this in: https://github.com/paritytech/polkadot-sdk/pull/3722/files

github-merge-queue bot pushed a commit that referenced this issue Apr 15, 2024
As discovered during the investigation of
#3314 and
#3673, there are active
validators which might accidentally change their network key during
restart. That's not a safe operation when you are in the active set:
because of the distributed nature of the DHT, the old records would still
exist in the network until they expire after 36h. So unless they have a good
reason, validators should avoid changing their key when they restart
their nodes.

There is an effort in parallel to improve this situation in
#3786, but those changes
are way more intrusive and will need more rigorous testing. Additionally,
they will reduce the time to less than 36h, but the propagation won't be
instant anyway, so not changing your network key during restart should be
the safest way to run your node, unless you have a really good reason to
change it.

## Proposal
1. Do not auto-generate the network key if the key file does not exist
in the provided path. Nodes where the key file does not exist will get
the following error:
```
Error: 
   0: Starting an authorithy without network key in /home/alexggh/.local/share/polkadot/chains/ksmcc3/network/secret_ed25519.
      
       This is not a safe operation because the old identity still lives in the dht for 36 hours.
      
       Because of it your node might suffer from not being properly connected to other nodes for validation purposes.
      
       If it is the first time running your node you could use one of the following methods.
      
       1. Pass --unsafe-force-node-key-generation and make sure you remove it for subsequent node restarts
      
       2. Separetly generate the key with: polkadot key generate-node-key --file <YOUR_PATH_TO_NODE_KEY>
```

2. Add an explicit parameter, `--unsafe-force-node-key-generation`, for
nodes that do want to change their network key despite the warnings, or
that run the node for the first time.

3. For `polkadot key generate-node-key` add two new mutually exclusive
parameters `base_path` and `default_base_path` to help with the key
generation in the same path the polkadot main command would expect it.
 
4. Modify the installation scripts to auto-generate a key in the default
path if one is not already present there; this should help with making
the executable work out of the box after an installation.

## Notes

Nodes that do not already have the key persisted will fail to start
after this change; however, I consider that better than the current
situation where they start but silently hide that they might not be
properly connected to their peers.

## TODO
- [x] Make sure only nodes that are authorities on production chains
will be affected by these restrictions.
- [x] Proper PRDOC, to make sure node operators are aware this is
coming.

---------

Signed-off-by: Alexandru Gheorghe <[email protected]>
Co-authored-by: Dmitry Markin <[email protected]>
Co-authored-by: s0me0ne-unkn0wn <[email protected]>
Co-authored-by: Bastian Köcher <[email protected]>
@alexggh (Contributor) commented Apr 29, 2024

Closing: this is not happening anymore, and the remaining open action item is being tracked in #3673.
