Parachain block time degradation since v0.9.16 on Kusama #4911
Comments
I've done some tests, and I suspect the nodes are getting overloaded with connections at critical times. Using an unoptimized binary also makes the results worse. That would explain why the Kusama network is doing worse than Polkadot at the moment.
It might be related to some optimizations @eskimor has done around collation networking.
https://github.com/paritytech/polkadot/pull/4640/files#r805223564 Looks suspicious. I pushed a branch reverting this fix to https://github.com/paritytech/polkadot/tree/rh-revert-connectivity-fix if you'd like to try it out.
@rphmeier Ok, I'm trying it now. I'll see if this can be deployed now or will have to wait until tomorrow morning.
@rphmeier Can it be applied only to validators, or do the collators also need it?
@crystalin Specifically, the collators need it. This should only affect collator logic. That might be an argument against this being the underlying cause, because AFAIK it began even before the collators were upgraded.
@rphmeier Yes, it began while we were still on 0.9.13 collators. It also began precisely when Kusama validators started to use v0.9.16.
We noticed increased block times since Kusama nodes upgraded to 0.9.16. We released Karura with 0.9.16 two days ago and saw block times increase further, but we're unsure whether that is related to our upgrade or to more Kusama nodes having upgraded.
Thank you @xlc. We also observed a small increase on some of our networks when the collators upgraded to support v0.9.16, but it was very small compared to the increase in block time since Kusama upgraded to v0.9.16.
@rphmeier I looked at the charts some more. I confirm the connection spikes happen when the validators get into the parachain core set. I initially thought it could have been triggered by other parachains migrating to v0.9.16 on Kusama, but I don't think that is the case anymore.
We can confirm that we observe a similar issue in Calamari.
paritytech/substrate@polkadot-v0.9.15...polkadot-v0.9.16 — these are all Substrate changes from that period. It doesn't look like libp2p changed at all between both versions. These are all Polkadot changes. It seems that we're looking for stuff that is relevant to the validator nodes and excluding the runtime. That narrows things down a decent bit.
@rphmeier I've double-checked again today: on our Alphanet the degradation didn't start with the relay client upgrade to v0.9.16, but the day after, when we upgraded our collators to v0.9.16. I'm not sure what could explain those two scenarios; perhaps some networking condition that was triggered on Kusama earlier due to its size, and reinforced when parachain clients upgraded to v0.9.16 (which would have triggered it on Alphanet). @rphmeier, is it safe to revert validators to v0.9.15 on Alphanet with runtime 9160? (I know there is a new host, but I'm not sure if/when it is actually used)
I don't think so.
@rphmeier Also, are there any logs on our validators (we could enable debug for all modules for a short amount of time) that would help identify the issue?
Can we get a graph on Grafana indicating the number of reorgs?
@rphmeier I realize that the re-org message actually only applies in some cases; the problematic case, where multiple blocks of the same height (best) are received, does not trigger a reorg:
As you can see here, there are 2. EDIT:
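For illustration, here is a minimal sketch of the behaviour being described, assuming a plain longest-chain rule (this is not Substrate's actual fork-choice code; the types are made up): a competing block at the same height never replaces the current best, so no reorg is ever reported for it.

```rust
/// Hypothetical, simplified fork-choice: only a strictly higher block number
/// replaces the current best, so a competing block at the same height is
/// imported as a side-chain fork without any reorg being reported.
#[derive(Debug, Clone)]
struct BlockInfo {
    #[allow(dead_code)] // kept only to distinguish the forks in this sketch
    hash: [u8; 32],
    number: u64,
}

fn should_become_best(current_best: &BlockInfo, incoming: &BlockInfo) -> bool {
    // Strictly greater: an equal-height block never triggers a best-chain switch.
    incoming.number > current_best.number
}

fn main() {
    let best = BlockInfo { hash: [1; 32], number: 1_690_751 };
    let same_height_fork = BlockInfo { hash: [2; 32], number: 1_690_751 };
    let longer_chain = BlockInfo { hash: [3; 32], number: 1_690_752 };

    // No reorg (and no reorg log line) for the equal-height fork...
    assert!(!should_become_best(&best, &same_height_fork));
    // ...but a strictly higher block does move the best chain.
    assert!(should_become_best(&best, &longer_chain));
    println!("an equal-height fork does not change the best block");
}
```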
We'd like to include #4914 in the next release (and in collators as well), even though it doesn't explain all of the issues we're observing here.
Reducing the number of selected collators on Alphanet from 100 to 30 didn't change the block time degradation.
We've tweaked one of our internal networks (stagenet) to have 50 collators (with v0.9.16 support) but with only 8 selected to produce blocks (always the same ones). The block time increased. That means that even collators that are not producing blocks impact the block time (@rphmeier).
@crystalin Just to make absolutely sure we're measuring the same thing, can you please share the Grafana query you're using to produce that "Block Time [60m]" graph?
From the data on Rococo, we can see a degradation in the collator protocol: with
Thank you @Mitchell-Grimes.
Based on the data and experiments so far, we have two further tests planned to confirm the fix:
I dug into Statemine's performance. There are sharp differences observable in the behavior of the collators.

### Collator Metrics

The green line is the number of advertisements made by collators. The yellow line is the number of advertised collations requested by validators. And the blue line is the number of collations generated.

### Collator Versions

### Initial Comments
### Rate of Progression of Statemine

This graph shows the rate of progression of the Statemine chain. The artifacts around Feb 8 and Feb 10 represent places where the metric format changed and I had to stitch the graphs together with a few queries. We can see that prior to 0.9.16 on Feb 4, Statemine produced around 0.074 blocks/s, or about 1 block every 13.5 seconds. This is close to the ideal of 1 block every 12 seconds, with a few issues. From that point until around Feb 11, Statemine gradually degraded to around 0.05 blocks/s and then continued to degrade over time to around 0.045 blocks per second on Feb 28. After Feb 28, block production rates have recovered to around 0.048. Overall, we're looking at a reduction in block production rate of around 35%-40%.

### Proportion of Advertisements Requested

We can see a sharp jump up to ~90% after Statemine collators were upgraded to 0.9.16 on Feb 10.

### Normalized to Validator Group Sizes

Here is a normalized version of the first graph where the advertisements and requests graphs are normalized by a factor of 5, to account for the fact that Kusama validator groups have 5 validators in them. So this represents "advertisements/requests per validator in group" as compared to collations, as opposed to "total advertisements/requests". Again, we see the same drop-off where prior to 0.9.16 on Feb 10, only half of the advertisements were requested, and this is because only half of the advertisements were to validators that would act on them. Ignoring the extra advertisements to validators in an irrelevant group, the number of advertisements per validator per collation has degraded from around 2/3 to 1/2, and the number of advertisements per validator has degraded by 1/2 in absolute terms. This might indicate issues for collators in either discovering or connecting to validators.

### Advertisements per Validator per Collation

The green line assumes 5 validators per core and the yellow assumes 10, which accounts for the advertisements to 2 groups. Comparing the left side of the yellow line to the right side of the green line, we can see that the absolute number of advertisements per validator per collation has dropped from around 0.65 to 0.55, or a ~15% drop-off (this is a more accurate claim than the 2/3 -> 1/2 claim made above). Also, this is worryingly far from the expected '1' in either case. This has dropped off further after the upgrade of Statemine collators on Feb 28.

### Conclusions
Open questions: What proportion of the reduction in collation production is explained by more time spent in availability distribution?
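As a quick worked check of the rates quoted above (the numbers are taken from the comment; the code is only illustrative):

```rust
fn main() {
    // Block production rates quoted above (blocks per second).
    let before = 0.074_f64;    // prior to 0.9.16 on Feb 4
    let worst = 0.045_f64;     // around Feb 28
    let recovered = 0.048_f64; // after Feb 28

    // Equivalent average block times.
    println!("before:    {:.1}s per block", 1.0 / before);    // ~13.5s
    println!("worst:     {:.1}s per block", 1.0 / worst);     // ~22.2s
    println!("recovered: {:.1}s per block", 1.0 / recovered); // ~20.8s

    // Reduction in block production rate.
    println!("degradation at worst: {:.0}%", (1.0 - worst / before) * 100.0);          // ~39%
    println!("degradation after recovery: {:.0}%", (1.0 - recovered / before) * 100.0); // ~35%
}
```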
I had a bit of time, so I took a sample of the amount of time spent in availability (this was on Moonriver) on Kusama before 0.9.16 and from earlier today, and found that it changed from 1.0569 blocks to 1.8742 blocks, a 77.3% increase. This explains 6 * 0.773 = 4.64s of reduced block time. Block times have changed from around 1 block per 13.5s to 1 block per 21s, a difference of 7.5s, so additional time spent in availability explains about 62% of the performance degradation. Availability cores being occupied for 77.3% more blocks is also possibly a reasonable explanation for a reduction in the number of collations produced, as collators don't produce collations on top of relay parents where the core is occupied.
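The arithmetic above, spelled out, assuming 6-second relay-chain blocks as stated:

```rust
fn main() {
    // Average number of relay-chain blocks a core stayed occupied (availability).
    let blocks_before = 1.0569_f64;
    let blocks_after = 1.8742_f64;
    let increase = blocks_after / blocks_before - 1.0;
    println!("availability increase: {:.1}%", increase * 100.0); // ~77.3%

    // With 6s relay-chain blocks, the extra occupancy translates into lost time.
    let extra_seconds = 6.0 * increase;
    println!("extra time per parachain block: {:.2}s", extra_seconds); // ~4.64s

    // Observed parachain block times went from ~13.5s to ~21s.
    let observed_slowdown = 21.0 - 13.5;
    println!(
        "share of the slowdown explained: {:.0}%",
        extra_seconds / observed_slowdown * 100.0
    ); // ~62%
}
```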
UPDATE: Versi collators seem to be down: https://grafana.parity-mgmt.parity.io/goto/LdoXPYL7k?orgId=1. We are producing blocks, so I imagine that the logs are hooked up to the wrong collators. This might be exacerbating the issue we are observing or even causing a completely different issue.

After aligning with the Purestake team, we were able to reproduce lapses in block times on Versi. Originally, we added 200ms latency to the chaos network for Versi and immediately experienced large lapses in parachain block times. In addition, relay chain block times were regularly 12s, spiking to 18s at times. The largest lapse in parachain blocks was approximately 3 minutes. To test our hypothesis further, we removed the latency entirely and observed the relay chain block times. To our surprise, we still observed similar relay chain block times (though more infrequently than previously). Specifically, we can observe large lapses in bitfield-signing, as can be confirmed here: https://grafana.parity-mgmt.parity.io/goto/eAnc-LYnz?orgId=1 (flip results order to observe large time-spans of

Looking into the composition of the bitfield array, the computation of the availability boolean for a particular core evaluates to
Example (flip results order):

We observe large spurts of the

On the other hand, if we verify the

Therefore, we must conclude that the core is not Occupied (it is either Scheduled or Free) during the large lapses in parachain blocks, as we would otherwise expect to see a log line for the

More importantly, we observe no candidate backing or candidate inclusion throughout this outage. Looking closely, we observe that candidate

This indicates that there may be a problem in
If the core is not occupied, it stands to reason that the collators are in some way misinterpreting this information and not producing collations, even though the core is being reported as free for all validators. During this time, we observe large outages in approval-distribution, noting that an assignment from a peer is out of view: https://grafana.parity-mgmt.parity.io/goto/st34YLLnz?orgId=1.

An important observation from this analysis is that the long relay-chain block times were not fixed by removing the chaos network latency. A noteworthy mention is that one of the validators seems to have crashed at some point (but is now running fine): https://grafana.parity-mgmt.parity.io/goto/-yrMYYL7k?orgId=1
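For readers following along, a simplified sketch of the bitfield composition being discussed (hypothetical types; not the actual bitfield-signing subsystem code): a core's bit can only be set when the core is Occupied and the validator holds its availability chunk, which is why a Scheduled or Free core contributes a zero bit and produces none of the corresponding log lines.

```rust
/// Hypothetical, simplified model of per-core availability bits.
#[derive(Debug)]
enum CoreState {
    Free,
    Scheduled,
    /// Occupied by a candidate identified by this (made-up) hash.
    Occupied([u8; 32]),
}

/// One bit per availability core: `true` only if the core is occupied
/// *and* we locally hold our availability chunk for that candidate.
fn sign_availability_bitfield(
    cores: &[CoreState],
    have_chunk: impl Fn(&[u8; 32]) -> bool,
) -> Vec<bool> {
    cores
        .iter()
        .map(|core| match core {
            // A Free or Scheduled core can never contribute a `true` bit,
            // so it also never produces an "occupied core" log line.
            CoreState::Free | CoreState::Scheduled => false,
            CoreState::Occupied(candidate) => have_chunk(candidate),
        })
        .collect()
}

fn main() {
    let cores = [
        CoreState::Scheduled,
        CoreState::Occupied([7; 32]),
        CoreState::Free,
    ];
    // Pretend we hold chunks for every candidate.
    let bitfield = sign_availability_bitfield(&cores, |_| true);
    assert_eq!(bitfield, vec![false, true, false]);
    println!("{:?}", bitfield);
}
```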
Observations from the metrics graphs:

- The average time-in-flight for intra-node messages being sent to the approval-distribution subsystem. This appears to cause a cascading effect that slows down other network subsystems.
- Nodes are receiving massive amounts of intra-node messages on the approval-distribution subsystem, far beyond the expected amounts. These probably originate in the network, but might originate in the approval-voting subsystem. Nodes are sending tons of messages to the network-bridge.
- The number of notifications received on the network corresponds very closely to the approval-distribution subsystem, indicating that the messages originate in the network. The gap represents messages from other subsystems.

Optimizing approval-distribution will probably help a lot.
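As a rough illustration of what "time-in-flight" measures here, a self-contained sketch (plain std channels, not the actual subsystem/overseer machinery): the metric is the delay between a message being queued for a subsystem and the subsystem pulling it off its channel, and it grows as the queue backs up under load.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// A message stamped with the moment it was handed to the channel.
struct Stamped<T> {
    sent_at: Instant,
    msg: T,
}

fn main() {
    let (tx, rx) = mpsc::channel::<Stamped<u32>>();

    // Producer: floods the "subsystem" with messages.
    let producer = thread::spawn(move || {
        for i in 0..1_000u32 {
            let _ = tx.send(Stamped { sent_at: Instant::now(), msg: i });
        }
    });

    // Consumer: a slow subsystem; time-in-flight grows as the queue backs up.
    let mut total = Duration::ZERO;
    let mut count = 0u32;
    for stamped in rx.iter() {
        total += stamped.sent_at.elapsed();
        count += 1;
        thread::sleep(Duration::from_micros(50)); // simulated processing cost
        let _ = stamped.msg;
    }
    producer.join().unwrap();

    println!("average time-in-flight: {:?}", total / count);
}
```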
Can we close this issue now?
Well, I don't think the issue is fully resolved. @rphmeier, currently on Moonriver the average block time is around 14s (as opposed to 12.5s on Moonbeam), so I think there is still room for improvement. What do you think?
Going to close this; the next round of improvements will come with asynchronous backing.
We are observing a pretty impactful degradation on Moonriver since the deployment of client v0.9.16 on Kusama.
Additionally, we upgraded Moonriver a few days later to support v0.9.16, but it had no impact.
We have been able to reproduce this on our public Alphanet (20 controlled validators, on v0.9.16).
Here are the logs for a given block (#1690751) not being included:
alphanet-block-degradation-block-1690751.log
Here is the graph showing the average block time over the last 7 days:
(The spike is only the restart of our nodes; ignore it.)
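For context, the "average block time" in such a graph is just the mean gap between consecutive block timestamps over a window; a minimal sketch of that calculation (illustrative only, not the actual Grafana/Prometheus query used for the chart):

```rust
/// Average block time over a window, given block timestamps in milliseconds.
/// Illustrative only; the real chart is produced from Prometheus metrics.
fn average_block_time_ms(timestamps_ms: &[u64]) -> Option<f64> {
    if timestamps_ms.len() < 2 {
        return None;
    }
    let total: u64 = timestamps_ms
        .windows(2)
        .map(|pair| pair[1].saturating_sub(pair[0]))
        .sum();
    Some(total as f64 / (timestamps_ms.len() - 1) as f64)
}

fn main() {
    // Example: blocks arriving every 12 to 21 seconds.
    let timestamps = [0u64, 12_000, 24_500, 45_000, 57_500];
    println!(
        "average block time: {:.1}s",
        average_block_time_ms(&timestamps).unwrap() / 1000.0
    );
}
```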