Reduce network bandwidth, improve parablock times: optimize approval-distribution #5164
Conversation
We should clarify which random source to use. Other than that, LGTM
@drahnr What do you mean by this? I added a …
Sorry for the brevity, I meant this comment in particular: https://github.com/paritytech/polkadot/pull/5164/files#r835130703 - where do we take the entropy from to initialize the CPRNG? After all, this defines the gossip topology and should be consistent across nodes (or a sufficiently large subset) to work as anticipated, IIUC. So updating it later has to be done with more diligence, from what I understand. (I did not do the math for this yet)
Availability demands all validators have network connections with all other validators anyways. We thus do not care if different messages have different topologies. We do not want a bad validator to be able to spam by sending the same message with multiple topologies, so whatever randomness we use needs to be verified from the chain state. It depends upon the validator set anyways, so anything we do has resolution no coarser than BABE two epochs ago, but it's just fine to use, say, the BABE randomness from two epochs ago hashed with the sender's public key.
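For concreteness, a minimal sketch of the derivation described above, assuming 32-byte inputs, the `rand`/`rand_chacha`/`sha2` crates, and a hypothetical helper name; which randomness source actually feeds `babe_randomness` is exactly the open question in this thread:

```rust
use rand::{seq::SliceRandom, SeedableRng};
use rand_chacha::ChaCha20Rng;
use sha2::{Digest, Sha256};

/// Hypothetical derivation: seed a CPRNG from the on-chain BABE randomness
/// of two epochs ago, hashed together with the sender's public key. Both
/// inputs are verifiable from chain state, so every honest node derives the
/// same topology for a given sender.
fn topology_rng(babe_randomness: &[u8; 32], sender_public_key: &[u8; 32]) -> ChaCha20Rng {
    let mut hasher = Sha256::new();
    hasher.update(babe_randomness);
    hasher.update(sender_public_key);
    let seed: [u8; 32] = hasher.finalize().into();
    ChaCha20Rng::from_seed(seed)
}

fn main() {
    let mut rng = topology_rng(&[1u8; 32], &[2u8; 32]);

    // Deterministically shuffle validator indices to lay out the grid.
    let mut validator_indices: Vec<u32> = (0..100).collect();
    validator_indices.shuffle(&mut rng);
}
```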
@burdges the comment was about whether we should draw the (same) randomness value from the BABE runtime API or through the `SessionInfo` interface exposed from the parachains runtime API. In both cases it's the BABE randomness from 2 epochs ago, but there is a potential off-by-one edge case when switching from one to the other. To be addressed in a follow-up PR.
Looks good overall.
It'd be good to have a high-level overview and the goals of the aggression levels mentioned in the guide as well.
```rust
(false, false) => RequiredRouting::None,
(true, false) => RequiredRouting::GridY, // messages from X go to Y
(false, true) => RequiredRouting::GridX, // messages from Y go to X
(true, true) => RequiredRouting::GridXY, // if the grid works as expected, this shouldn't happen.
```
I have a slight preference for using `always_assert::never!`, as we do in the PVF subsystem.
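For reference, a rough sketch of what that could look like on the arm flagged above, assuming the `always_assert` crate's `never!` macro (which only logs/panics in builds where assertions are enabled or forced, e.g. in CI) and a local stand-in for `RequiredRouting`; the exact integration is left to the follow-up:

```rust
use always_assert::never;

// Stand-in for the subsystem's routing enum, for a self-contained example.
#[derive(Debug, Clone, Copy)]
enum RequiredRouting {
    None,
    GridX,
    GridY,
    GridXY,
}

fn required_routing(received_via_x: bool, received_via_y: bool) -> RequiredRouting {
    match (received_via_x, received_via_y) {
        (false, false) => RequiredRouting::None,
        (true, false) => RequiredRouting::GridY, // messages from X go to Y
        (false, true) => RequiredRouting::GridX, // messages from Y go to X
        (true, true) => {
            // If the grid works as expected, this arm is unreachable;
            // `never!` panics only when assertions are enabled/forced,
            // and the node keeps running otherwise.
            never!(received_via_x && received_via_y, "grid message received via both dimensions");
            RequiredRouting::GridXY
        },
    }
}
```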
Panics are for unrecoverable errors. This might be unexplainable, but it's obviously recoverable. Panics in our use-case are playing with fire. We've written bugs before and will do it again.
This thing is like a `debug_assert!` in a nice wrapping. So this will panic only in CI/testnets, which is better than not noticing it at all.
I see. Well, we could do a follow-up for that. I'm not really opinionated about debug-asserts.
Only approving the runtime lib.rs changes.
Closes #5162
Pre-reqs: Gossip-Support changes
This comes with a number of changes to gossip-support to make this possible.
The grid topology is identified by the session, and peers' `ValidatorIndex` values are given alongside their `AuthorityDiscoveryId` and `PeerId`.

Connection requests are now made against past/present/future `AuthorityDiscoveryId`s, but the grid is based only on validator IDs. Previously, the grid was based on past/present/future `AuthorityDiscoveryId`s, but this behavior was undesirable for the obvious reason that if validators rotate their keys often, the grid might be 5/6 populated by dead or duplicate entries. These issues in the topology could lead to failures to propagate or to excessive propagation, respectively.

The reason we connect to past/present/future `AuthorityDiscoveryId`s even though we discount them from the session's grid topology is that we still need to communicate with those old/future authorities, and being generally connected guarantees good req/res behavior. And gossip subsystems like approval-distribution will want to gossip with different peers based on the session.

For Statement Distribution and Bitfield Distribution, adjusting the grid topology doesn't matter much because we only care about the most recent session, and failures there only impact parachain liveness. For approval-distribution, this runs some risk that assignments/approvals don't propagate between validator sets of different sessions. However, assignments and approvals only need to reach the validators of the session of the block, as those are the GRANDPA voters assigned to finalize those blocks. Strictly speaking, because sessions are delayed by one block in the parachains protocol, the last block of any session S is finalized by the validators of S+1, but since validator sets shouldn't change much and we incorporate random gossip, this isn't likely to cause issues. Still, we have to be more careful in this subsystem because it's important for parachain safety and relay-chain liveness.
Actual Subsystem Changes
This PR alters the approval-distribution subsystem to use the deterministic grid topology of validators more effectively, while implementing fallbacks in the case that the grid topology is compromised.
Every session, validators are organized into a (close to square) grid, where each validator has row-neighbors and column-neighbors.
The basic operation of the 2D grid topology is that:

- a validator producing a message sends it to both its row-neighbors and its column-neighbors
- a validator receiving a message from one of its row-neighbors (X) forwards it to its column-neighbors (Y)
- a validator receiving a message from one of its column-neighbors (Y) forwards it to its row-neighbors (X)
This grid approach defines 2 unique paths for every validator to reach every other validator in at most 2 hops.
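As an illustration of the layout, a minimal sketch of computing row- and column-neighbors for a row-major, close-to-square grid; the `ceil(sqrt(n))` layout and names are assumptions, not the subsystem's actual code:

```rust
/// Compute the row- and column-neighbors of `our_index` when `n_validators`
/// are laid out row-major in a grid with ceil(sqrt(n)) columns.
fn grid_neighbors(our_index: usize, n_validators: usize) -> (Vec<usize>, Vec<usize>) {
    let columns = (n_validators as f64).sqrt().ceil() as usize;
    let (our_row, our_col) = (our_index / columns, our_index % columns);

    let row_neighbors = (0..n_validators)
        .filter(|&i| i != our_index && i / columns == our_row)
        .collect();
    let col_neighbors = (0..n_validators)
        .filter(|&i| i != our_index && i % columns == our_col)
        .collect();

    (row_neighbors, col_neighbors)
}

fn main() {
    // 10 validators in a 4-column grid: validator 5 sits at row 1, column 1.
    let (rows, cols) = grid_neighbors(5, 10);
    assert_eq!(rows, vec![4, 6, 7]);
    assert_eq!(cols, vec![1, 9]);
}
```

Any two validators then share a row-then-column path and a column-then-row path, which is where the two distinct 2-hop routes come from.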
However, we also supplement this with some degree of random propagation: every validator, upon seeing a message for the first time, propagates it to 8 random peers; this rule applies to the originator of the message as well. This adds redundancy in case the grid topology isn't working or is being attacked - an adversary doesn't know which peers a validator will send to. This is combined with the property that the adversary doesn't know which validators will elect to check a block.
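A minimal sketch of that rule, assuming a generic peer list and the `rand` crate; the real subsystem's peer types and bookkeeping differ:

```rust
use rand::seq::SliceRandom;

/// Hypothetical helper: on first receipt of a message, sample 8 peers
/// uniformly at random to forward it to, on top of the grid routing.
/// An adversary cannot predict which 8 peers will be chosen.
fn random_propagation_targets<P: Copy>(peers: &[P], rng: &mut impl rand::Rng) -> Vec<P> {
    peers.choose_multiple(rng, 8).copied().collect()
}
```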
But, in case these mechanisms don't work on their own, we need to trade bandwidth for protocol liveness by introducing aggression.
Aggression has 3 levels:

- Aggression Level 0: the basic grid and random-propagation behaviors described above.
- Aggression Level 1: the originator of a message sends it to all peers. Other peers follow the normal rules.
- Aggression Level 2: all peers send all messages to all of their row- and column-neighbors, so each validator receives each message roughly 2*sqrt(n_validators) times on average.
The aggression level of messages pertaining to a block increases when that block is unfinalized and is a child of the finalized block. This means that only one block at a time has its messages propagated with aggression > 0.
Also, we re-send messages every few blocks to all peers required by our aggression level.
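A hedged sketch of how the escalation might be expressed, mirroring the level semantics above; the trigger thresholds and names are illustrative, not the subsystem's actual constants:

```rust
/// Illustrative aggression levels, mirroring the description above.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Aggression {
    /// L0: grid routing plus limited random propagation only.
    L0,
    /// L1: the originator of a message sends it to all peers.
    L1,
    /// L2: all peers send all messages to all row- and column-neighbors.
    L2,
}

/// Hypothetical escalation rule: aggression only ever applies to the direct
/// child of the last finalized block, and rises the longer finality stalls.
/// The thresholds (10 and 25 blocks of finality lag) are made up.
fn aggression_level(is_child_of_finalized: bool, finality_lag: u32) -> Aggression {
    if !is_child_of_finalized || finality_lag < 10 {
        Aggression::L0
    } else if finality_lag < 25 {
        Aggression::L1
    } else {
        Aggression::L2
    }
}
```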
Lastly, there is also redundancy in the form of the 'no-shows' feature of the core approval-voting code. If, for whatever reason, some approval messages aren't propagating through the grid or random gossip, more and more validators will elect to check the para-block, covering prior no-shows. If some assignment messages aren't getting through, it'll also cause more validators to self-select.