This repository has been archived by the owner on Jan 13, 2025. It is now read-only.

Add nonce to repair request/response #9903

Closed
wants to merge 21 commits into master from FixRepair

Conversation

carllin
Contributor

@carllin carllin commented May 6, 2020

Problem

People could send arbitrary shreds for ingestion via the repair port, because repair responses were not tied to any outstanding request.

Summary of Changes

  1. Add a nonce to repair requests/responses that must match for a response to be accepted (a minimal sketch of the idea follows this list).
  2. Reduce the shred size by the 4 bytes needed for the nonce: https://github.com/solana-labs/solana/pull/9903/files#diff-dc91d3a3b6051569c3c1d9979c3e4df1R46. The new shred size will be: https://github.com/solana-labs/solana/pull/9903/files#diff-dc91d3a3b6051569c3c1d9979c3e4df1R50
    Because we still have to support the old shred size as well, changes like https://github.com/solana-labs/solana/pull/9903/files#diff-dc91d3a3b6051569c3c1d9979c3e4df1R896 are necessary.
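
A minimal sketch of the nonce handshake described in item 1, assuming the nonce is a u32 appended after the shred payload (the function and constant names here are illustrative, not the PR's actual code):

```rust
/// Nonce carried in a repair request and echoed back in the response (illustrative).
type Nonce = u32;
const NONCE_SIZE: usize = 4;

/// Server side: append the requester's nonce after the shred payload.
fn build_repair_response(shred_payload: &[u8], nonce: Nonce) -> Vec<u8> {
    let mut packet = Vec::with_capacity(shred_payload.len() + NONCE_SIZE);
    packet.extend_from_slice(shred_payload);
    packet.extend_from_slice(&nonce.to_le_bytes());
    packet
}

/// Requester side: accept the shred bytes only if the trailing nonce matches
/// the nonce we generated for this outstanding request.
fn verify_repair_response(packet: &[u8], expected_nonce: Nonce) -> Option<&[u8]> {
    if packet.len() < NONCE_SIZE {
        return None;
    }
    let (shred_bytes, nonce_bytes) = packet.split_at(packet.len() - NONCE_SIZE);
    let mut buf = [0u8; NONCE_SIZE];
    buf.copy_from_slice(nonce_bytes);
    if Nonce::from_le_bytes(buf) == expected_nonce {
        Some(shred_bytes)
    } else {
        None
    }
}
```

The point is that a shred arriving on the repair port can only be ingested if it echoes a nonce the requester actually handed out.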

Backwards Compatibility:

(Painful, but good exercise in what we want to do for any future changes to shreds :P)

  1. The window service has been modified to accept both the old and the new shred sizes, where previously it was assumed that the shred filled the entire packet.data (see the sketch after this list).
  2. Due to the new shred size, changes like https://github.com/solana-labs/solana/pull/9903/files#diff-dc91d3a3b6051569c3c1d9979c3e4df1R750 and https://github.com/solana-labs/solana/pull/9903/files#diff-dc91d3a3b6051569c3c1d9979c3e4df1R843 are needed to guarantee that erasure recovery and de-shredding still work with old shred sizes.

  3. The repair request structure has changed to include the nonce: https://github.com/solana-labs/solana/pull/9903/files#diff-dc108ad584300fa57f99d1675ccff3b9R93-R95. TODO: support the old version as well.
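
A rough sketch of the dual-size handling from items 1) and 2), under the assumption that only nonce-carrying repair responses shrink the usable shred region by 4 bytes (the names and the flag below are illustrative, not the PR's):

```rust
const NONCE_SIZE: usize = std::mem::size_of::<u32>();

/// Return the portion of packet.data that holds the shred, accepting both
/// layouts: the old one, where the shred fills the whole packet, and the new
/// one, where a repair response ends with a 4-byte nonce.
fn shred_slice(packet_data: &[u8], from_repair_with_nonce: bool) -> &[u8] {
    if from_repair_with_nonce && packet_data.len() > NONCE_SIZE {
        // New layout: strip the trailing nonce before handing bytes to erasure/deshred.
        &packet_data[..packet_data.len() - NONCE_SIZE]
    } else {
        // Old layout: the shred occupies the entire packet payload.
        packet_data
    }
}
```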

Proposed Upgrade Path:

  1. Land Backwards Compatibility items 1) and 2) above without changing the shred size, so that everyone has the code to parse, perform erasure recovery on, and de-shred the smaller shreds. That way, once we upgrade, nodes will understand turbine shreds regardless of whether they have upgraded yet.

  2. This step depends on knowing what version the people in gossip are on; @mvines, is there a way to do that? If not, I was thinking maybe we could set the storage_addr in ContactInfo as a flag.

Perform the upgrade and modify repair so that we (see the sketch after these bullets):

  • only make nonce repair requests to peers on the new version;
  • only serve nonce repair responses to peers on the new version.
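
A sketch of that gating, assuming a peer's software version can be read out of gossip somehow; the Version/Peer types and the cut-over version below are placeholders for illustration, not the actual ContactInfo API:

```rust
/// Illustrative semver; deriving PartialOrd gives field-by-field ordering
/// (major, then minor, then patch), which is the comparison we want.
#[derive(Clone, Copy, PartialEq, PartialOrd)]
struct Version {
    major: u16,
    minor: u16,
    patch: u16,
}

/// Stand-in for a gossip peer entry; `version` is None when the peer has not
/// advertised one (i.e. it is running an old release).
struct Peer {
    version: Option<Version>,
}

/// Hypothetical first release that understands nonce repair.
const FIRST_NONCE_VERSION: Version = Version { major: 1, minor: 2, patch: 0 };

/// Only make nonce repair requests to (and serve nonce responses for) peers
/// that are known to run a new-enough version.
fn peer_supports_repair_nonce(peer: &Peer) -> bool {
    matches!(peer.version, Some(v) if v >= FIRST_NONCE_VERSION)
}
```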

Fixes #

@carllin carllin requested review from sakridge and mvines May 6, 2020 21:08
@codecov

codecov bot commented May 6, 2020

Codecov Report

Merging #9903 into master will decrease coverage by 0.0%.
The diff coverage is 79.4%.

@@           Coverage Diff            @@
##           master   #9903     +/-   ##
========================================
- Coverage    80.4%   80.4%   -0.1%     
========================================
  Files         287     289      +2     
  Lines       66539   67016    +477     
========================================
+ Hits        53555   53922    +367     
- Misses      12984   13094    +110     

@mvines
Contributor

mvines commented May 8, 2020

Darn, all those nice links you set up in the description don't work @carllin :(

@mvines
Contributor

mvines commented May 8, 2020

This step depends on knowing what version the people in gossip are on; @mvines, is there a way to do that? If not, I was thinking maybe we could set the storage_addr in ContactInfo as a flag.

I've been wanting a way to determine Solana version from gossip actually, now that everybody is locking down RPC.

Stealing bits from the storage_addr in ContactInfo sounds great for this! I'd do that as a completely independent PR that I'd actually love to backport all the way to 1.0.

It would be amazing if solana-gossip spy also included software version info for all the nodes.
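
For illustration only, one way the "steal bits from storage_addr" idea could look, packing a version into an otherwise-unused SocketAddr so old nodes still deserialize ContactInfo unchanged (the encoding and helper names below are made up, not anything the code actually does):

```rust
use std::net::{IpAddr, Ipv4Addr, SocketAddr};

/// Pack major.minor.patch into a SocketAddr: the version rides in the last
/// two IPv4 octets plus the port.
fn encode_version(major: u8, minor: u8, patch: u16) -> SocketAddr {
    SocketAddr::new(IpAddr::V4(Ipv4Addr::new(0, 0, major, minor)), patch)
}

/// Recover the version, treating anything that looks like a real address
/// (non-zero leading octets, or IPv6) as "no version advertised".
fn decode_version(addr: &SocketAddr) -> Option<(u8, u8, u16)> {
    match addr.ip() {
        IpAddr::V4(v4) => {
            let [a, b, major, minor] = v4.octets();
            if a == 0 && b == 0 {
                Some((major, minor, addr.port()))
            } else {
                None
            }
        }
        IpAddr::V6(_) => None,
    }
}
```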

@carllin
Contributor Author

carllin commented May 8, 2020

Darn, all those nice links you set up in the description don't work @carllin :(

Ah shucks, updated all of them; hopefully they work now?

Contributor

@mvines mvines left a comment


This is looking really good overall. I think you should hack this PR into little bits though, so we can incrementally roll this out to the clusters.

For example we know we'll need that shred nonce, so we can carve that bit out ASAP and get mainnet-beta/testnet using it. That looks like the stickiest part of enabling a rolling update.

ledger/src/repair_response.rs (outdated review thread, resolved)
@@ -0,0 +1,271 @@
use crate::request_response::RequestResponse;
Contributor


nit: outstanding_requests feels too generic to me for core; maybe outstanding_repair_requests? If we had that repair crate, then calling this module outstanding_requests would be fine 😵

Contributor Author


@mvines, so the way it's currently set up, outstanding_requests::OutstandingRequests<T, S> is designed to be more general than just repair: it can track any request type T that implements RequestResponse<Response = S>.

For instance, gossip could potentially initialize an OutstandingRequests<T, S> for gossip requests/responses as well.
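
Roughly, the shape being described is the following (a simplified sketch: the real PR parameterizes OutstandingRequests<T, S> and also tracks request expiration, while this version collapses S into an associated type and keys everything by a u32 nonce):

```rust
use std::collections::HashMap;

/// A request type knows how to check its own responses.
trait RequestResponse {
    type Response;
    fn verify_response(&self, response: &Self::Response) -> bool;
}

/// Generic tracker of in-flight requests, keyed by the nonce sent with them.
struct OutstandingRequests<T: RequestResponse> {
    requests: HashMap<u32, T>,
}

impl<T: RequestResponse> OutstandingRequests<T> {
    fn new() -> Self {
        Self { requests: HashMap::new() }
    }

    fn add_request(&mut self, nonce: u32, request: T) {
        self.requests.insert(nonce, request);
    }

    /// A response is accepted only if a request with this nonce is still
    /// outstanding and that request says the response matches it.
    fn register_response(&mut self, nonce: u32, response: &T::Response) -> bool {
        let verified = self
            .requests
            .get(&nonce)
            .map(|request| request.verify_response(response))
            .unwrap_or(false);
        if verified {
            // Nonces are single-use: drop the request once a matching response arrives.
            self.requests.remove(&nonce);
        }
        verified
    }
}
```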

Contributor


Is that a thing we need though? Building generic interfaces in the hopes that somebody in the future will use them when there's only a single consumer today usually doesn't end too well.

Contributor Author


@mvines yeah, I would agree in most cases, but there was a lot of talk of a similar solution to solve the gossip replay issue here: #9491.

It was also a nice way to logically bundle the verification of responses into the actual request object, like this: https://github.com/solana-labs/solana/pull/9903/files#diff-dc108ad584300fa57f99d1675ccff3b9R66, so that they fit together rather than living in disparate functions.
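
Building on the trait sketched a few comments up, here is a hypothetical repair request type that carries its own response check (the field names are invented; the real request and response types live in core/src/serve_repair.rs):

```rust
/// Hypothetical repair request: "send me shred `shred_index` of `slot`".
struct WindowIndexRequest {
    slot: u64,
    shred_index: u64,
}

/// Hypothetical view of the response: the slot/index parsed from the shred.
struct ShredResponse {
    slot: u64,
    index: u64,
}

impl RequestResponse for WindowIndexRequest {
    type Response = ShredResponse;

    /// The verification logic lives on the request itself: only the exact
    /// shred we asked for is accepted.
    fn verify_response(&self, response: &ShredResponse) -> bool {
        response.slot == self.slot && response.index == self.shred_index
    }
}
```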

core/src/serve_repair.rs (outdated review thread, resolved)
ledger/src/sigverify_shreds.rs (outdated review thread, resolved)
@mvines
Contributor

mvines commented May 9, 2020

  1. Separate thread for repair inserts so that grabbing the lock/verifying the nonce doesn't block turbine shreds. Also greedily send shreds for insert: https://github.com/solana-labs/solana/pull/9903/files#diff-65dbf75f4753cd8c1a130e96e72cd1b4R242-R246 instead of blocking while waiting for the entire batch.

This feels like its own PR too.

@carllin carllin force-pushed the FixRepair branch 9 times, most recently from 6977ef6 to 9eb8c57 on May 12, 2020 22:08
@carllin carllin force-pushed the FixRepair branch 5 times, most recently from 56f7e3f to 86cd4c5 on May 13, 2020 07:28
@carllin
Contributor Author

carllin commented May 13, 2020

Tested by setting the "crossover" slot (the slot at which the nonces are turned on) to 1200 on a testnet, and by inducing 3-minute-long partitions every 10 minutes so that the cluster was partitioned at the "crossover" point and at many points afterwards, exercising the repair path.

After fixing a couple of edge cases, the result: no panics after an hour:
(Screenshot: Screen Shot 2020-05-13 at 2 52 43 AM)

Time to split it up and it should be ready to go!

@carllin
Contributor Author

carllin commented May 14, 2020

  1. Separate thread for repair inserts so that grabbing the lock/verifying the nonce doesn't block turbine shreds. Also greedily send shreds for insert: https://github.com/solana-labs/solana/pull/9903/files#diff-65dbf75f4753cd8c1a130e96e72cd1b4R242-R246 instead of blocking while waiting for the entire batch.

This feels like its own PR too.

Removed that change, as it didn't yield the expected perf benefits.

slots_iter = slots.into_iter();
slots_iter_ref = &mut slots_iter;
}
for (batch, slots) in batches.iter().zip(slots_iter_ref) {
Contributor


Is it easier to just read the slot out of the shred data here?
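
A hedged sketch of that suggestion; the offsets below are an assumption about the serialized shred layout (a 64-byte signature followed by a one-byte shred type, then a little-endian u64 slot), not values taken from the real shred definition:

```rust
const SIGNATURE_SIZE: usize = 64; // assumed size of the leading signature
const SHRED_TYPE_SIZE: usize = 1; // assumed size of the shred-type tag
const SLOT_OFFSET: usize = SIGNATURE_SIZE + SHRED_TYPE_SIZE;

/// Read the slot straight out of a serialized shred instead of carrying a
/// parallel iterator of slots alongside the packet batches.
fn slot_from_shred_bytes(shred: &[u8]) -> Option<u64> {
    let bytes = shred.get(SLOT_OFFSET..SLOT_OFFSET + 8)?;
    let mut buf = [0u8; 8];
    buf.copy_from_slice(bytes);
    Some(u64::from_le_bytes(buf))
}
```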

@mvines
Copy link
Contributor

mvines commented May 18, 2020

@carllin - thanks for landing the v1.1 version of shred nonce. Rolling out the v1.0 version and master quickly would be very nice too. We're coming up on branch day for v1.2 (next Tuesday!)

@stale

stale bot commented May 25, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

The stale bot added the stale label ("[bot only] Added to stale content; results in auto-close after a week.") on May 25, 2020
@carllin carllin closed this May 27, 2020