Cache request protocol version in availability-recovery #3127

Closed
alindima opened this issue Jan 30, 2024 · 1 comment
Labels
I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task. T0-node This PR/Issue is related to the topic “node”. T8-polkadot This PR/Issue is related to/affects the Polkadot network.

Comments

@alindima (Contributor)

Prerequisite: #1644

See #1644 (comment) for details of the improvements

@alindima alindima added T0-node This PR/Issue is related to the topic “node”. I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task. T8-polkadot This PR/Issue is related to/affects the Polkadot network. labels Jan 30, 2024
@alindima (Contributor, Author)

There are a couple of ways to implement this:

  1. Record the protocol used for receiving the response from our peer and cache it in the subsystem. This gets quite messy to implement: since we run recoveries in parallel for multiple candidates, we'd need a shared cache that all recovery tasks can read and mutate (see the sketch after this list).
  2. Expose the responses of the Identify protocol and record them in the subsystem. These contain the list of protocols our peer supports and are fetched on each new connection. This needs some extra support in Substrate.
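
To make option 1 concrete, here is a minimal sketch of what the shared cache could look like. `ValidatorIndex`, `ChunkProtocolVersion` and `ProtocolVersionCache` are hypothetical stand-ins for illustration, not the real subsystem types:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Stand-in for the real validator index type.
type ValidatorIndex = u32;

/// The two request/response protocol versions a peer may speak.
#[derive(Clone, Copy, PartialEq, Eq)]
enum ChunkProtocolVersion {
    V1,
    V2,
}

/// Per-validator record of which protocol version last answered us,
/// wrapped in `Arc<Mutex<..>>` so the recovery tasks running in parallel
/// for different candidates can share and mutate it.
#[derive(Clone, Default)]
struct ProtocolVersionCache {
    inner: Arc<Mutex<HashMap<ValidatorIndex, ChunkProtocolVersion>>>,
}

impl ProtocolVersionCache {
    /// Remember which protocol a peer responded on.
    fn record(&self, validator: ValidatorIndex, version: ChunkProtocolVersion) {
        self.inner.lock().unwrap().insert(validator, version);
    }

    /// Which protocol to try first for this validator; default to the
    /// newest one when we have no information yet.
    fn preferred(&self, validator: ValidatorIndex) -> ChunkProtocolVersion {
        self.inner
            .lock()
            .unwrap()
            .get(&validator)
            .copied()
            .unwrap_or(ChunkProtocolVersion::V2)
    }
}
```

The `Arc<Mutex<..>>` here is exactly the shared mutable state between parallel recovery tasks that makes option 1 awkward.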

Now, the caveat of both approaches is that they are an optimisation that's only effective while not all validators are upgraded. Once they're all upgraded, the code becomes redundant and may send/record unnecessary events.
Moreover, production networks rarely do chunk recovery for now. Most of the time they simply fetch the full data from backers (since most PoVs are less than 128 KiB in compressed size).

In the worst case, with a mixed validator set (half upgraded, half unupgraded), the upgraded nodes will make an extra round trip when fetching chunks from unupgraded nodes.
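
As a rough sketch of where that extra round trip comes from, building on the hypothetical cache above (the request functions and error type below are illustrative stand-ins, not the real network API):

```rust
/// Illustrative stand-ins for the availability chunk and request error types.
struct Chunk;

enum RequestError {
    UnsupportedProtocol,
    Other,
}

/// Placeholder for issuing a v2 chunk request over the network.
async fn request_chunk_v2(_validator: ValidatorIndex) -> Result<Chunk, RequestError> {
    Err(RequestError::UnsupportedProtocol)
}

/// Placeholder for issuing a v1 chunk request over the network.
async fn request_chunk_v1(_validator: ValidatorIndex) -> Result<Chunk, RequestError> {
    Ok(Chunk)
}

/// Fetch a chunk, preferring the newer protocol and paying the extra
/// round trip at most once per unupgraded peer.
async fn fetch_chunk(
    validator: ValidatorIndex,
    cache: &ProtocolVersionCache,
) -> Result<Chunk, RequestError> {
    match cache.preferred(validator) {
        ChunkProtocolVersion::V2 => match request_chunk_v2(validator).await {
            Ok(chunk) => Ok(chunk),
            // The peer does not speak v2: remember that so subsequent
            // requests to this validator go straight to v1.
            Err(RequestError::UnsupportedProtocol) => {
                cache.record(validator, ChunkProtocolVersion::V1);
                request_chunk_v1(validator).await
            }
            Err(e) => Err(e),
        },
        ChunkProtocolVersion::V1 => request_chunk_v1(validator).await,
    }
}
```

With such a cache, the extra round trip would only be paid on the first request to each unupgraded peer; without it, every chunk request to such a peer pays it.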

I measured this in practice and the cost is negligible compared to the total PoV recovery time.

Measuring this with subsystem-bench (with an extra latency of 100ms for the second request):

[Screenshot: subsystem-bench PoV recovery time comparison, 2024-05-22]

The first half simulates all nodes making 2 round trips for all chunk requests.

I also measured this on Versi, with 50 validators, 9 glutton parachains and PoVs of 2.5 MiB.

The average PoV recovery time with all unupgraded nodes is 528 ms.
The average PoV recovery time with half upgraded and half unupgraded nodes is 674 ms, i.e. an overhead of roughly 146 ms (about 28%) per recovery.

As you can see, the largest consumer of recovery time is the Reed-Solomon reconstruction.

Considering all of the above, I'll close this issue and conclude that this small optimisation is not worth implementing.
