Tracking: sync correctness #884

hdevalence · 2020-08-11T23:45:23Z

Tracking issue for issues around sync correctness, getting confused downloading blocks, etc.

First Alpha

Hangs and Panics:

Fix network panic (Panic: preselected index must be valid in peer_set #1183)
Fix state panic (Panic: hashes must be unique in memory_state #1182)
Diagnose and fix mainnet sync hangs (Sync hang for 22 minutes on mainnet after checkpoint error #1181)
Stop deadlocks between the sync service and checkpoint verifier by adding a timeout on block tasks
Stop hangs in ObtainTips and ExtendTips by adding a timeout to tip requests

Logging:

Fix or hide Failed to write to hedge histogram: ValueOutOfRangeResizeDisabled logs (issue: hedge middleware has fixed size for timing histogram tower-rs/tower#475, fix: hedge: use auto-resizing histograms tower-rs/tower#484)

Performance:

Reduce Zebra's memory usage
- tuning RocksDB (Tune RocksDB memory usage #1486)
- limiting queue lengths (Security: Handle gossiped blocks that are a long way ahead of the current tip #1389)

Duplicate Downloads:

Fix Duplicate/AlreadyVerified errors in the checkpoint verifier (Repeated Duplicate errors in the CheckpointVerifier #1259)
- Reduce the AlreadyVerified log level to trace
Cancel download and verify tasks if the sync service has dropped the channel (Cancel download and verify tasks if the sync service has dropped the channel #1037)
Deduplicate download and verify tasks, by using a separate service to handle those tasks to completion - Add a download set. #1041, see also fix: Ignore sync errors when the block is already verified #980 (comment)
Make tip deduplication order-independent

Cleanup:

Remove redundant ObtainTips retry Remove redundant ObtainTips retry #1033
Remove the out of order block debug logs
- Depends on out of order block handling in the state (Re-enable the BlockVerifier continuous_blockchain_test #1011) and inbound network service (Tracking: inbound message handling. #889, inv messages)
- For details, see Remove the zebra-consensus AddBlock order workarounds #1012 (comment)

Reduce Sync Restarts:

Hedge slow downloads by starting a duplicate request after a delay WIP: use hedge middleware #1089
Skip the final hash in each set of BlockHashes, due to a zebrad bug
Increase default EWMA RTT time, so peers have to prove they are fast (this makes a huge difference) Increase sync timeouts #993
Ignore errors for blocks which are already in the state
Tweak and document sync timeouts Increase sync timeouts #993

Work out if we need to remove duplicate downloads using:

the list of hashes from earlier queries in the current ObtainTips or ExtendTips (download_set) [Yes, as done in the current code]
the list of hashes that are being downloaded and verified in spawned futures [No, because we deduplicate in each extension and we check that each extension is an extension]
the list of hashes that have failed download or verify recently [No, because failures cause us to restart with clean sync state]
the hashes of verified blocks in the state [Only in obtain_tips, as done in the current code].

Design Questions:

Work out how to get each new block as it is mined, when we have synced to the tip
- In a79ce97 and Fix sync algorithm. #887, we skip responses that only extend by 1 block:
  - ObtainTips: https://github.com/ZcashFoundation/zebra/blob/main/zebrad/src/commands/start/sync.rs#L219
  - ExtendTips: https://github.com/ZcashFoundation/zebra/blob/main/zebrad/src/commands/start/sync.rs#L296
- See Tracking: inbound message handling. #889, where we get new blocks through the gossip protocol.
Decide if the BLOCK_TIMEOUT is too low for slow or high-latency networks
- Work out if we can run Zebra over Tor [No, it's bad for Tor, and slow for Zebra]

Database:

Implement the sled tree changes in the design RFC RFC: state updates #902, section Notes on Sled Trees

Future Releases

Documentation:

Document minimum network requirements for Zebra
- a minimum bandwidth of 10 Mbps
- a maximum latency of 1 second
- we don't support running Zebra over Tor
Document the peerset target size config, as a useful option for bandwidth-constrained nodes
- test the minimum supported peerset target size, and add it to the docs (it's probably 4-12)
Check Zebra's memory usage during sync, and document the minimum requirement
Update the sync RFC to document the new sync algorithm Retcon new sync logic into RFC1 #899
- See previous changes in a79ce97 and Fix sync algorithm. #887

Performance improvements:

Work out how to improve genesis and post-restart sync latency, particularly on Testnet
Deploy extra zcashd instances on Testnet (Deploy more Zebra or zcashd instances on testnet #1222)

Possible Future Work:

Consider disconnecting from peers that return bad responses
Consider disconnecting from peers that are sending blocks that we no longer want
Churn peers, using a regular timer to disconnect a random peer
Analyse the bandwidth and latency of current Zcash mainnet and testnet peers
Create a peer reputation service
Refactor out the core sync algorithms, and write unit tests for them Write unit tests for ObtainTips and ExtendTips #730

The text was updated successfully, but these errors were encountered:

hdevalence · 2020-08-13T00:37:14Z

@teor2345 I think we don't need the sync to be able to go forward a block at a time, since as new blocks are generated we'll get them through gossip (to be implemented in #889).

hdevalence · 2020-08-13T00:52:02Z

On the BLOCK_TIMEOUT question, there are actually two timeouts -- the one applied in the timeout layer and the one used internally in the peer set's state machine (set to 10 seconds). So if we want to have timeouts greater than 10 seconds we'll need to update that timeout as well. With a 1-second latency allowance that would be 1.8 Mbps. Maybe that's too high, although I don't think that it's unreasonable for the server use case that Zebra aims for.

I think it would be fine to increase the timeout to a somewhat larger value, and increase the lookahead limit to compensate. What's a reasonable target speed?

teor2345 · 2020-08-13T01:12:37Z

@teor2345 I think we don't need the sync to be able to go forward a block at a time, since as new blocks are generated we'll get them through gossip (to be implemented in #889).

Sounds good, and I guess it's ok to lag slightly in our first release. I'll update that TODO.

teor2345 · 2020-08-13T01:22:28Z

On the BLOCK_TIMEOUT question, there are actually two timeouts -- the one applied in the timeout layer and the one used internally in the peer set's state machine (set to 10 seconds). So if we want to have timeouts greater than 10 seconds we'll need to update that timeout as well. With a 1-second latency allowance that would be 1.8 Mbps. Maybe that's too high, although I don't think that it's unreasonable for the server use case that Zebra aims for.

I think it would be fine to increase the timeout to a somewhat larger value, and increase the lookahead limit to compensate. What's a reasonable target speed?

I'm not sure if we need to make any changes here.

Requiring less than 10 Mbps is reasonable for servers. And a latency of 1 second should cover most servers, wifi, and even some mobile internet. But I don't know the exact answer, because download speed also depends on peer speeds and peer latency. We should do some testing 🤔

Do we expect people to be able to run Zebra over Tor?
Because that will decrease our bandwidth, and significantly increase our latency:

For Tor via exits, it seems like 6 seconds should be enough for most blocks:
https://metrics.torproject.org/torperf.html?start=2020-05-15&end=2020-08-13&server=public&filesize=1mb

For Tor via onion services, we'd need to allow 20 seconds per block:
https://metrics.torproject.org/torperf.html?start=2020-05-15&end=2020-08-13&server=onion&filesize=1mb

(Note that these are the 1 MB graphs, there are also 5 MB graphs, but we only need 2 MB.)

hdevalence · 2020-08-13T17:46:26Z

Sounds good, and I guess it's ok to lag slightly in our first release. I'll update that TODO.

To clarify, I don't think we'll be lagging here -- in initial block sync, we'll sync up to the current chain tip, and we'll stay in sync through block gossip.

hdevalence · 2020-08-13T17:50:25Z

I'm not sure if we need to make any changes here.

Okay, let's do nothing for now. I don't think that we can run Zebra over Tor anyways, since the Bitcoin protocol isn't really compatible with Tor.

hdevalence · 2020-08-13T17:54:13Z

Updated the list of checks with rationale about sync behavior.

teor2345 · 2020-08-31T10:16:55Z

I updated the list of outstanding issues, I'll check them off as I submit PRs.

Even after these changes, Zebra still doesn't limit the number of requests which are waiting for peer responses, or handle peer reputation. So I've added them to the "future work" section.

hdevalence · 2020-08-31T19:51:56Z

I don't think it makes sense for us to limit the number of requests waiting for peer responses, because I don't think that there's a way for us to work out in a principled way what an appropriate limit for the number of in-flight requests should be. Instead I think it would be better to use backpressure from the peer set to block new requests when the peer set is at capacity, which is what we already do. If this isn't working well, I would prefer to try to figure out why, and how we could improve it. Perhaps this could be addressed by setting the peer set's target size to a smaller parameter.

teor2345 · 2020-08-31T22:29:46Z

Yeah, that was basically the conclusion I came to after thinking about it a bit more.

It might also be something we want to document for users - if they are on a slower network, try setting the target peerset size lower.

teor2345 · 2020-11-12T23:00:46Z

I have split this tracking issue into "First Alpha" and "Future Releases".

mpguerra · 2020-11-23T11:13:36Z

What is left here for the Alpha release now?

I have removed #1183 from the Alpha milestone and #1182 is done.

Do we also want to implement the logging tickets?
How much of the testing (if any) has been done?

teor2345 · 2020-11-23T11:41:17Z

These bugs make testing really difficult:

network panic Panic: preselected index must be valid in peer_set #1183 - this panic still happens on the latest main branch
hedge middleware has fixed size for timing histogram tower-rs/tower#475 - the large number of redundant error logs make it difficult to see the Zebra log messages that are actually important

I still can't get Zebra to reliably sync to the tip. So we haven't done enough testing yet - we need to do a full sync after we've fixed all the common errors.

teor2345 · 2020-11-23T11:55:02Z

@hdevalence said on discord that tower-rs/tower#475 is fixed by tower-rs/tower#484.

For some reason the ticket is still open.

mpguerra · 2020-11-25T10:34:45Z

Can we move this out of Alpha Release milestone now that #1183 is closed?

teor2345 · 2020-11-30T11:04:01Z

Can we move this out of Alpha Release milestone now that #1183 is closed?

What should we do about these first alpha testing tasks?

Manually run Zebra on Mainnet to verify that the sync works after recent fixes
- Optional: also run Zebra on Testnet

We're still making changes that might affect sync behaviour.

mpguerra · 2020-11-30T11:32:31Z

These sound to me more like ongoing checks that should be done by developers to validate their changes whenever they make changes to the sync code (or changes that may affect it) rather than finite one-time only tasks.

If these checks can't be automated my suggestion would be to add these to the PR template as a reminder for developers. Otherwise, if we can automate, let's create a ticket to implement these checks in our CI after the alpha.

oxarbitrage · 2020-11-30T14:10:02Z

I think it will be good to document somewhere(maybe in the main README) machine specifications to sync the mainnet up to current tip. This is recommended RAM, disk space, CPU, etc. With those specs we can document an estimated time to be in sync, something like "at the moment of writing, with the recommended specs, sync up to block XXXXX will take around XX time" This way developers and users can rent appropriate services or try locally and know what to expect.

mpguerra · 2020-11-30T14:40:36Z

I think it will be good to document somewhere(maybe in the main README) machine specifications to sync the mainnet up to current tip. This is recommended RAM, disk space, CPU, etc. With those specs we can document an estimated time to be in sync, something like "at the moment of writing, with the recommended specs, sync up to block XXXXX will take around XX time" This way developers and users can rent appropriate services or try locally and know what to expect.

We have some similar details in #1374

oxarbitrage · 2020-11-30T15:24:14Z

I see, sorry i didn't saw that. Thanks.

teor2345 · 2020-12-01T02:23:21Z

There isn't anything specific to be done for "testing on mainnet", so I removed those items, and moved this ticket out of the first alpha release.

mpguerra · 2021-01-05T10:48:50Z

adding #862 to this epic

teor2345 · 2021-01-05T21:31:40Z

adding #862 to this epic

That ticket modifies the state service interface, which impacts sync and verification. So it doesn't really belong in sync correctness.

The state service cleanup in #1302 might be a better place for it.

mpguerra · 2022-01-21T13:46:07Z

I would like us to review this tracking issue with a view to prioritizing outstanding tasks and potentially creating issues for those we still want to do so that we can close this issue.

If we decide not to close it yet, I think it probably should not be an epic but should be a github tracking issue and use github's task list to track progress

conradoplg · 2022-01-26T22:02:08Z

teor2345 · 2022-01-26T23:25:17Z

* [x]  Document minimum network requirements for Zebra
  
  * [ ]  a maximum latency of 1 second [is it?]

It's effectively a 4 second RTT, I added this documentation task to #3101:

zebra/zebra-network/src/constants.rs

Line 54 in f6de7fa

pub const HANDSHAKE_TIMEOUT: Duration = Duration::from_secs(4);

* [ ]  Document the peerset target size config, as a useful option for bandwidth-constrained nodes [maybe part of [Document how to speed up full validation in the README #3101](https://github.com/ZcashFoundation/zebra/issues/3101) ]

I added this task to #3101

  * [ ]  test the minimum supported peerset target size, and add it to the docs (it's probably 4-12) [do we know now?]

We already have #704 for this, and the minimum size might change as we make fixes to Zebra. So we can do this task when Zebra is stable. (And we add tests for the minimum size.)

* [ ]  Update the sync RFC to document the new sync algorithm [Retcon new sync logic into RFC1 #899](https://github.com/ZcashFoundation/zebra/issues/899) [it seems this doesn't exist anymore? We should document the sync algorithm anyway, I'd like to create an issue for this, but it's not a priority]

We can do this documentation task whenever we have time.

Performance improvements:

* [ ]  Work out how to improve genesis and post-restart sync latency, particularly on Testnet [do we need this? Seems to not be an issue anymore]

It seems to be fixed, the post-restart delay is possibly too long, but restarts are very infrequent, so it doesn't matter. (And there's a tradeoff with restart loops on congested networks.)

Possible Future Work:

* [ ]  Consider disconnecting from peers that return bad responses [it seems we shouldn't do this? [Security: Stop disconnecting from nodes that send unexpected messages, to prevent disconnection attacks, Credit: Equilibrium #2107](https://github.com/ZcashFoundation/zebra/issues/2107)]

Yes, this is obsoleted and the security issue was fixed.

* [ ]  Consider disconnecting from peers that are sending blocks that we no longer want [[Security: Stop disconnecting from nodes that send unexpected messages, to prevent disconnection attacks, Credit: Equilibrium #2107](https://github.com/ZcashFoundation/zebra/issues/2107) applies too? i.e. we shouldn't do this]

Yes, this is obsoleted and the security issue was fixed.

* [ ]  Churn peers, using a regular timer to disconnect a random peer [don't know if we still need this]

I added this to:

Tracking: Questions for Network Audit #3247

* [ ]  Analyse the bandwidth and latency of current Zcash mainnet and testnet peers [don't know if we still need this]

If we need this, we should get someone else to do it for us. But it doesn't seem important right now.

* [ ]  Create a peer reputation service [don't know if we still need this]

I added this to:

Tracking: Questions for Network Audit #3247

* [ ]  Refactor out the core sync algorithms, and write unit tests for them [Write unit tests for ObtainTips and ExtendTips #730](https://github.com/ZcashFoundation/zebra/issues/730) [there's already an issue for it so no need to do anything]

After all those changes, I think we can close this issue.

hdevalence added E-med C-tracking-issue Category: This is a tracking issue for other tasks labels Aug 11, 2020

dconnolly mentioned this issue Aug 11, 2020

Tracking: sync and validate mainnet #619

Closed

6 tasks

teor2345 mentioned this issue Aug 13, 2020

Fix sync algorithm. #887

Merged

dconnolly added this to the Sync and validate zcash mainnet milestone Aug 13, 2020

This was referenced Sep 4, 2020

Tracking: Testing the Entire Chain #745

Closed

Remove the zebra-consensus AddBlock order workarounds #1012

Merged

teor2345 modified the milestones: Validate blocks, Sync and Network Sep 29, 2020

teor2345 modified the milestones: Sync and Network, First Alpha Release Oct 8, 2020

teor2345 mentioned this issue Oct 27, 2020

Pipelinable Block Syncing and Lookup RFC fixes #1218

Closed

teor2345 mentioned this issue Nov 12, 2020

Revert "Hedge every syncer block download request" #1284

Merged

mpguerra added the Epic Zenhub Label. Denotes a theme of work under which related issues will be grouped label Nov 17, 2020

teor2345 modified the milestones: First Alpha Release, First stable / major version release 📦 Dec 1, 2020

mpguerra removed the E-med label Mar 23, 2021

mpguerra removed this from the First stable / major version release 📦 milestone May 28, 2021

mpguerra removed the Epic Zenhub Label. Denotes a theme of work under which related issues will be grouped label Jan 25, 2022

teor2345 closed this as completed Jan 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tracking: sync correctness #884

Tracking: sync correctness #884

hdevalence commented Aug 11, 2020 •

edited by mpguerra

Loading

hdevalence commented Aug 13, 2020

hdevalence commented Aug 13, 2020

teor2345 commented Aug 13, 2020

teor2345 commented Aug 13, 2020

hdevalence commented Aug 13, 2020

hdevalence commented Aug 13, 2020

hdevalence commented Aug 13, 2020

teor2345 commented Aug 31, 2020

hdevalence commented Aug 31, 2020

teor2345 commented Aug 31, 2020

teor2345 commented Nov 12, 2020

mpguerra commented Nov 23, 2020

teor2345 commented Nov 23, 2020 •

edited

Loading

teor2345 commented Nov 23, 2020

mpguerra commented Nov 25, 2020

teor2345 commented Nov 30, 2020

mpguerra commented Nov 30, 2020

oxarbitrage commented Nov 30, 2020

mpguerra commented Nov 30, 2020

oxarbitrage commented Nov 30, 2020

teor2345 commented Dec 1, 2020

mpguerra commented Jan 5, 2021

teor2345 commented Jan 5, 2021

mpguerra commented Jan 21, 2022

conradoplg commented Jan 26, 2022

teor2345 commented Jan 26, 2022

Tracking: sync correctness #884

Tracking: sync correctness #884

Comments

hdevalence commented Aug 11, 2020 • edited by mpguerra Loading

First Alpha

Future Releases

hdevalence commented Aug 13, 2020

hdevalence commented Aug 13, 2020

teor2345 commented Aug 13, 2020

teor2345 commented Aug 13, 2020

hdevalence commented Aug 13, 2020

hdevalence commented Aug 13, 2020

hdevalence commented Aug 13, 2020

teor2345 commented Aug 31, 2020

hdevalence commented Aug 31, 2020

teor2345 commented Aug 31, 2020

teor2345 commented Nov 12, 2020

mpguerra commented Nov 23, 2020

teor2345 commented Nov 23, 2020 • edited Loading

teor2345 commented Nov 23, 2020

mpguerra commented Nov 25, 2020

teor2345 commented Nov 30, 2020

mpguerra commented Nov 30, 2020

oxarbitrage commented Nov 30, 2020

mpguerra commented Nov 30, 2020

oxarbitrage commented Nov 30, 2020

teor2345 commented Dec 1, 2020

mpguerra commented Jan 5, 2021

teor2345 commented Jan 5, 2021

mpguerra commented Jan 21, 2022

conradoplg commented Jan 26, 2022

teor2345 commented Jan 26, 2022

hdevalence commented Aug 11, 2020 •

edited by mpguerra

Loading

teor2345 commented Nov 23, 2020 •

edited

Loading