add fallback request for req-response protocols #2771
Conversation
Previously, it was only possible to retry the same request on a different protocol name that had the exact same binary protocol. Introduce a way of trying a different request on a different protocol if the first one fails with Unsupported protocol. This helps with adding new req-response versions in polkadot.
I suggest reviewing commit-by-commit. The commit with hash …
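The retry flow this PR introduces can be sketched as follows. This is an illustrative model only, not sc-network's actual types: `OutgoingRequest`, `send_with_fallback`, and the error enum are hypothetical names.

```rust
// Sketch of the fallback flow: try the new protocol first; if the peer
// reports the protocol as unsupported, retry with a (possibly different)
// payload on the older protocol. All names here are illustrative.

#[derive(Debug, PartialEq)]
enum RequestError {
    UnsupportedProtocols,
    Timeout,
}

/// A request paired with an optional fallback request that may carry a
/// *different* payload on a *different* (older) protocol.
struct OutgoingRequest {
    protocol: &'static str,
    payload: Vec<u8>,
    fallback: Option<(&'static str, Vec<u8>)>,
}

fn send_with_fallback<F>(req: &OutgoingRequest, mut dial: F) -> Result<Vec<u8>, RequestError>
where
    F: FnMut(&str, &[u8]) -> Result<Vec<u8>, RequestError>,
{
    match dial(req.protocol, &req.payload) {
        // Only an unsupported-protocol failure triggers the fallback;
        // any other error (timeout, etc.) is returned unchanged.
        Err(RequestError::UnsupportedProtocols) => match &req.fallback {
            Some((proto, payload)) => dial(proto, payload),
            None => Err(RequestError::UnsupportedProtocols),
        },
        other => other,
    }
}
```

Note that a timeout deliberately does not trigger the fallback: only the explicit unsupported-protocol signal does, mirroring the `OutboundFailure::UnsupportedProtocols` check discussed below.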
Left you some comments, overall the approach looks sane to me, I'll let the networking people tear it apart :D.
```rust
}) => {
    // Try using the fallback request if the protocol was not
    // supported.
    if let OutboundFailure::UnsupportedProtocols = error {
```
I've got a few questions about the flow here:
- Does this imply that for every request we have to always go to the other side twice, or would it just fail early in libp2p before reaching the other side?
- Is sending an unsupported protocol request side-effect free? Would the node risk being disconnected because it is sending unsupported protocols?
> Does this imply that for every request we have to always go to the other side twice, or would it just fail early in libp2p before reaching the other side?
It means making two requests to the other peer. I see there's a way of requesting the `SupportedProtocols` in libp2p from a peer, but I don't see it being used anywhere.
I think it's fine because:
- we can cache in the interested subsystem the protocol version that a validator uses, to minimise requests.
- since it's encouraged that validators upgrade to the latest version, it shouldn't be long before the v2 request will succeed in most cases (hopefully).
- this is what is already happening for the `fallback_names` in `ProtocolConfig` (but at a lower level, in libp2p)
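The caching idea from the first bullet could look roughly like this. It is a hypothetical sketch: `VersionCache`, `ReqVersion`, and the bare integer peer id are illustrative stand-ins, not subsystem code.

```rust
// Per-peer cache of which request-response version last worked, so that
// follow-up requests can skip a v2 attempt that is known to fail.
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq)]
enum ReqVersion {
    V1,
    V2,
}

#[derive(Default)]
struct VersionCache(HashMap<u64 /* peer id */, ReqVersion>);

impl VersionCache {
    /// Start optimistic: try v2 for peers we know nothing about.
    fn version_for(&self, peer: u64) -> ReqVersion {
        *self.0.get(&peer).unwrap_or(&ReqVersion::V2)
    }

    /// Record what actually worked (e.g. after an UnsupportedProtocols
    /// failure forced a fallback to v1).
    fn record(&mut self, peer: u64, version: ReqVersion) {
        self.0.insert(peer, version);
    }
}
```

A real implementation would also want to retry v2 occasionally (or clear the cache on reconnect), since the peer may upgrade at any time.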
> Is sending an unsupported protocol request side-effect free? Would the node risk being disconnected because it is sending unsupported protocols?
AFAIK yes, to quote a comment from rust-libp2p:

```rust
// The remote merely doesn't support the protocol(s) we requested.
// This is no reason to close the connection, which may
// successfully communicate with other protocols already.
// An event is reported to permit user code to react to the fact that
// the remote peer does not support the requested protocol(s).
```
> this is what is already happening for the `fallback_names` in `ProtocolConfig` (but at a lower level, in libp2p)
I think in this case no extra round-trip happens: when opening a new substream for a request, all protocol names (we just call them fallback, but they are equivalent to the main protocol name from libp2p perspective) are sent on the wire and compared to the list of supported protocols on the remote.
As for point 2, looking at the code, the substrate code is not made aware of the request attempts on the unsupported protocol, so can't reduce the peer's reputation — it's all handled inside libp2p.
> I think in this case no extra round-trip happens: when opening a new substream for a request, all protocol names (we just call them fallback, but they are equivalent to the main protocol name from libp2p perspective) are sent on the wire and compared to the list of supported protocols on the remote.
According to the protocol negotiation spec (https://github.com/libp2p/specs/blob/master/connections/README.md#protocol-negotiation) and the code in rust-libp2p (https://github.com/libp2p/rust-libp2p/blob/b6bb02b9305b56ed2a4e2ff44b510fa84d8d7401/misc/multistream-select/src/dialer_select.rs#L192), this isn't true. Each protocol in the list is tried one by one.
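To make the cost concrete, here is a toy model of that dialer behaviour: each candidate protocol name is proposed in turn, so every rejection costs a round-trip before the next name is tried. This is not rust-libp2p code; `negotiate` and its counter are illustrative.

```rust
// Toy model of multistream-select V1: the dialer proposes candidate
// protocol names one by one; each proposal/response is one round-trip.
fn negotiate<'a>(
    candidates: &[&'a str],
    remote_supported: &[&str],
    round_trips: &mut u32,
) -> Option<&'a str> {
    for &proto in candidates {
        *round_trips += 1; // one propose/reject exchange per candidate
        if remote_supported.contains(&proto) {
            return Some(proto);
        }
    }
    None
}
```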
Yeah, you are right. Disregard my comment please 🙈
Looks good to me, but I would wait also for review from @altonen as I'm not super familiar with this part of the codebase.
I think it would be cleaner to store the remote protocols received in an Identify response in the `Peerstore`, and then query those protocols when sending a request to see what the peer supports. That way the request-response code wouldn't require any modifications.
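A minimal sketch of that alternative, assuming a peerstore keyed by peer id that remembers the protocols a peer advertised via Identify. `ProtocolStore` and `pick` are hypothetical names, not sc-network's API.

```rust
// Pick the request protocol version up front from cached Identify data,
// avoiding a doomed attempt on an unsupported protocol.
use std::collections::{HashMap, HashSet};

struct ProtocolStore {
    /// peer id -> protocols the peer advertised in its Identify response
    protocols: HashMap<u64, HashSet<String>>,
}

impl ProtocolStore {
    /// Return the first of our preferred protocol names (most preferred
    /// first) that the peer is known to support; None if the peer is
    /// unknown or supports none of them.
    fn pick<'a>(&self, peer: u64, preferred: &[&'a str]) -> Option<&'a str> {
        let known = self.protocols.get(&peer)?;
        preferred.iter().copied().find(|p| known.contains(*p))
    }
}
```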
Nice, this was indeed needed and the approach seems reasonable. The way this was done before was indeed coupled to the notification protocols versioning, which is in a way hacky.

I also want to suggest a better solution for handling protocol versioning in general, one that relies on capabilities being exchanged between nodes during the handshake. The capabilities should represent all the notification and request protocols the node can speak. We'd make this information available in `NotificationEvent::NotificationStreamOpened`, and this makes the higher-level code (subsystems) more reasonable with respect to which protocol to use when talking to specific nodes.
@altonen @dmitry-markin WDYT ?
These capabilities are exchanged in libp2p's Identify messages and I think we should use that information instead of having a fallback system for the request-response implementation.

I don't like the idea of supplying capabilities in … I think what could work quite well would be introducing …

Now that we have custom handshakes and substream validation available in each protocol, we could try and take advantage of that system and advertise the relevant capabilities in the handshake. This is not backwards-compatible though, but for any future breaking changes in the networking protocols, it's good to keep in mind that the handshake system is also available now.
Thanks for the suggestions, I'll explore them a bit
Agreed that it would be cleaner; this would also spare a second request in case the newer protocol version is not supported.

I do see a problem with this, however. The req-response behaviour does not require that the …

We can be conservative on the polkadot side and always use the older protocol version if we don't have access to the supported protocols, but that defeats the purpose of the feature IMO if it happens often enough.
This is a good point, I had forgotten we use this behavior in Polkadot. Manually instructing …
I think we all agree now that what this PR does is a reasonable approach, right?

However, I did hit a roadblock when testing it with a real request in polkadot. In sc-network, we override the substream upgrade protocol (multistream-select) to be … The problem is that libp2p does not provide a way of setting the substream upgrade protocol version on a per-request basis (not even on a per-behaviour basis, for that matter). It's a single config (which IMO is obviously not fine), so we'd have to use V1 instead for this PR to work (which would mean an extra round-trip for every single protocol negotiation).

I could open an issue and maybe a PR on rust-libp2p for adding support for this, but seeing the super tedious process of updating libp2p in substrate (#1631 as an example), I'm wondering whether this is a good idea at all. Maybe we can get the maintainers to backport it as a patch to the version we're currently using (although I'm not sure that's possible, since it'll add a new API).

@altonen let me know if you have a suggestion
The alternative would be to perform the fallback request at a higher level, in one of the polkadot subsystems, instead of in sc-network.
I don't think … If you're against using …
Thanks, didn't know that!
I'm not against using V1 for now. Considering that it'll be needed for your linked PR, I think it's a reasonable compromise until we get support for handling it in a better way in libp2p. I opened this issue, where I also posted an alternative fix in libp2p.
It's not doable without a change in libp2p AFAICT. The …

I suggest we move forward with this PR. This PR will not switch to using V1 just yet, because there are no requests that need this fallback yet. It'll be needed in #1644 (until it's merged, maybe we'll have a fix in libp2p or your upgrade PR will switch to using V1 anyway).
Previously, it was only possible to retry the same request on a different protocol name that had the exact same binary payloads. Introduce a way of trying a different request on a different protocol if the first one fails with Unsupported protocol. This helps with adding new req-response versions in polkadot while preserving compatibility with unupgraded nodes.

The way req-response protocols were bumped previously was that they were bundled with some other notifications protocol upgrade, like for async backing (but that is more complicated, especially if the feature does not require any changes to a notifications protocol).

Will be needed for implementing polkadot-fellows/RFCs#47

TODO:
- [x] add tests
- [x] add guidance docs in polkadot about req-response protocol versioning
**Don't look at the commit history, it's confusing, as this branch is based on another branch that was merged**

Fixes #598
Also implements [RFC #47](polkadot-fellows/RFCs#47)

## Description

- Availability-recovery now first attempts to request the systematic chunks for large POVs (which are the first ~n/3 chunks, which can recover the full data without doing the costly reed-solomon decoding process). This has a fallback of recovering from all chunks, if for some reason the process fails. Additionally, backers are also used as a backup for requesting the systematic chunks if the assigned validator is not offering the chunk (each backer is only used for one systematic chunk, to not overload them).
- Quite obviously, recovering from systematic chunks is much faster than recovering from regular chunks (4000% faster as measured on my apple M2 Pro).
- Introduces a `ValidatorIndex` -> `ChunkIndex` mapping which is different for every core, in order to avoid only querying the first n/3 validators over and over again in the same session. The mapping is the one described in RFC 47.
- The mapping is feature-gated by the [NodeFeatures runtime API](#2177) so that it can only be enabled via a governance call once a sufficient majority of validators have upgraded their client. If the feature is not enabled, the mapping will be the identity mapping and backwards-compatibility will be preserved.
- Adds a new chunk request protocol version (v2), which adds the ChunkIndex to the response. This may or may not be checked against the expected chunk index. For av-distribution and systematic recovery, this will be checked, but for regular recovery it will not. This is backwards compatible: first, a v2 request is attempted, and if that fails during protocol negotiation, v1 is used.
- Systematic recovery is only attempted during approval-voting, where we have easy access to the core_index. For disputes and collator pov_recovery, regular chunk requests are used, just as before.

## Performance results

Some results from subsystem-bench:

- with regular chunk recovery: CPU usage per block 39.82s
- with recovery from backers: CPU usage per block 16.03s
- with systematic recovery: CPU usage per block 19.07s

End-to-end results here: #598 (comment)

#### TODO:

- [x] [RFC #47](polkadot-fellows/RFCs#47)
- [x] merge #2177 and rebase on top of those changes
- [x] merge #2771 and rebase
- [x] add tests
- [x] preliminary performance measure on Versi: see #598 (comment)
- [x] Rewrite the implementer's guide documentation
- [x] #3065
- [x] paritytech/zombienet#1705 and fix zombienet tests
- [x] security audit
- [x] final versi test and performance measure

---------

Signed-off-by: alindima <[email protected]>
Co-authored-by: Javier Viola <[email protected]>
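As a side note on why systematic recovery is so much cheaper: with a systematic erasure code, the first `threshold` chunks are the original data itself, so recovering from them is plain concatenation with no Reed-Solomon decoding. The sketch below is illustrative, not the polkadot-erasure-coding crate; the threshold formula is an assumption (roughly n/3, as the description says), and `reconstruct_systematic` ignores the padding a real codec must strip.

```rust
// Assumed threshold: the smallest chunk count strictly greater than
// n_validators / 3 (hedged; check the erasure-coding crate for the
// exact rule used in polkadot).
fn recovery_threshold(n_validators: usize) -> usize {
    n_validators / 3 + 1
}

/// Recover the data from systematic chunks by concatenation. Real code
/// must also trim the padding added when the data was split into
/// equal-sized chunks.
fn reconstruct_systematic(chunks: &[Vec<u8>], threshold: usize) -> Option<Vec<u8>> {
    if chunks.len() < threshold {
        return None;
    }
    Some(chunks[..threshold].concat())
}
```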