Tracking: sync correctness #884
I think it would be fine to increase the timeout to a somewhat larger value, and increase the lookahead limit to compensate. What's a reasonable target speed?
I'm not sure if we need to make any changes here. Requiring less than 10 Mbps is reasonable for servers, and a latency of 1 second should cover most servers, wifi, and even some mobile internet. But I don't know the exact answer, because download speed also depends on peer speeds and peer latency. We should do some testing 🤔

Do we expect people to be able to run Zebra over Tor? For Tor via exits, it seems like 6 seconds should be enough for most blocks. For Tor via onion services, we'd need to allow 20 seconds per block. (Note that these are the 1 MB graphs; there are also 5 MB graphs, but we only need 2 MB.)
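For a rough sense of how the timeout and lookahead interact, here's a back-of-the-envelope sketch. The `min_lookahead` helper and formula are mine, not anything from Zebra's code; the 10 Mbps / 20 s / 2 MB figures are just the numbers from this thread:

```rust
/// Worst-case sustained sync speed with `lookahead` blocks in flight is
/// roughly `lookahead * block_size / timeout`, so the lookahead needed to
/// hit a target speed is the inverse of that.
fn min_lookahead(target_mbps: f64, timeout_secs: f64, block_mb: f64) -> u64 {
    // Convert megabits/s to megabytes/s, then ask how many blocks must be
    // in flight if every block takes the full timeout to arrive.
    ((target_mbps / 8.0) * timeout_secs / block_mb).ceil() as u64
}

fn main() {
    // 10 Mbps target, 20 s per-block timeout (Tor onion services),
    // 2 MB maximum block size => 1.25 MB/s * 20 s / 2 MB = 12.5 -> 13
    println!("{} blocks in flight", min_lookahead(10.0, 20.0, 2.0));
}
```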
To clarify, I don't think we'll be lagging here -- in initial block sync, we'll sync up to the current chain tip, and we'll stay in sync through block gossip.
Okay, let's do nothing for now. I don't think that we can run Zebra over Tor anyway, since the Bitcoin protocol isn't really compatible with Tor.
Updated the list of checks with rationale about sync behavior.
I updated the list of outstanding issues; I'll check them off as I submit PRs. Even after these changes, Zebra still doesn't limit the number of requests which are waiting for peer responses, or handle peer reputation. So I've added them to the "future work" section.
I don't think it makes sense for us to limit the number of requests waiting for peer responses, because I don't think there's a principled way to work out what an appropriate limit on in-flight requests should be. Instead, I think it would be better to use backpressure from the peer set to block new requests when the peer set is at capacity, which is what we already do (see the sketch below). If this isn't working well, I would prefer to figure out why, and how we could improve it. Perhaps this could be addressed by setting the peer set's target size to a smaller value.
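Roughly, the pattern in question looks like this — a minimal sketch assuming tower 0.4's `ServiceExt::ready`; the `Request` and `Response` types are placeholders, not Zebra's real ones:

```rust
use tower::{Service, ServiceExt};

struct Request;
struct Response;

/// Send a request only once the peer set reports it is ready, so callers
/// are blocked by backpressure instead of queueing unbounded work.
async fn send_with_backpressure<S>(peer_set: &mut S, request: Request) -> Result<Response, S::Error>
where
    S: Service<Request, Response = Response>,
{
    // `ready` resolves when `poll_ready` returns `Ready`, i.e. when the
    // peer set has capacity for one more in-flight request.
    peer_set.ready().await?.call(request).await
}
```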
Yeah, that was basically the conclusion I came to after thinking about it a bit more. It might also be something we want to document for users: if they are on a slower network, try setting the target peer set size lower.
I have split this tracking issue into "First Alpha" and "Future Releases".
These bugs make testing really difficult:
I still can't get Zebra to reliably sync to the tip. So we haven't done enough testing yet: we need to do a full sync after we've fixed all the common errors.
@hdevalence said on Discord that tower-rs/tower#475 is fixed by tower-rs/tower#484. For some reason the ticket is still open.
Can we move this out of the Alpha Release milestone now that #1183 is closed?
What should we do about these first alpha testing tasks?
We're still making changes that might affect sync behaviour.
These sound to me more like ongoing checks that developers should run to validate their changes whenever they touch the sync code (or code that may affect it), rather than finite, one-time tasks. If these checks can't be automated, my suggestion would be to add them to the PR template as a reminder for developers. Otherwise, if we can automate them, let's create a ticket to implement these checks in our CI after the alpha.
I think it would be good to document somewhere (maybe in the main README) the machine specifications needed to sync mainnet up to the current tip: recommended RAM, disk space, CPU, etc. With those specs we can document an estimated sync time, something like "at the time of writing, with the recommended specs, syncing up to block XXXXX takes around XX". This way developers and users can rent appropriate services, or try locally, and know what to expect.
We have some similar details in #1374.
I see, sorry I didn't see that. Thanks.
There isn't anything specific to be done for "testing on mainnet", so I removed those items, and moved this ticket out of the first alpha release.
Adding #862 to this epic.
I would like us to review this tracking issue with a view to prioritizing the outstanding tasks, and potentially creating issues for those we still want to do, so that we can close this issue. If we decide not to close it yet, I think it should not be an epic, but a GitHub tracking issue that uses GitHub's task list to track progress.
I've added some comments below (in brackets) about the unfinished items, but I'm not sure what to do about most of them. @teor2345 could you please take a look?
Performance improvements:
Possible Future Work:
It's effectively a 4-second RTT. I added this documentation task to #3101: `zebra/zebra-network/src/constants.rs` line 54 (at commit f6de7fa).
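For context, the constant being referenced has roughly this shape. The name `EWMA_DEFAULT_RTT` and the exact value are my guesses from memory, not a quote of that line; check `constants.rs` for the real definition:

```rust
use std::time::Duration;

// Illustrative only: the "effectively 4 second RTT" discussed above, as a
// zebra-network-style constant. Name and value are guesses, not the
// actual contents of constants.rs line 54.
pub const EWMA_DEFAULT_RTT: Duration = Duration::from_secs(4);
```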
I added this task to #3101.
We already have #704 for this, and the minimum size might change as we make fixes to Zebra. So we can do this task when Zebra is stable. (And we should add tests for the minimum size.)
We can do this documentation task whenever we have time.
It seems to be fixed. The post-restart delay is possibly too long, but restarts are very infrequent, so it doesn't matter. (And there's a tradeoff with restart loops on congested networks.)
Yes, this is obsoleted and the security issue was fixed.
Yes, this is obsoleted and the security issue was fixed.
I added this to:
If we need this, we should get someone else to do it for us. But it doesn't seem important right now.
I added this to:
After all those changes, I think we can close this issue.
Tracking issue for issues around sync correctness, getting confused downloading blocks, etc.
First Alpha
Hangs and Panics:
Logging:
- `Failed to write to hedge histogram: ValueOutOfRangeResizeDisabled` logs (issue: hedge middleware has fixed size for timing histogram tower-rs/tower#475; fix: hedge: use auto-resizing histograms tower-rs/tower#484)

Performance:
Duplicate Downloads:
- `Duplicate`/`AlreadyVerified` errors in the checkpoint verifier (Repeated `Duplicate` errors in the CheckpointVerifier #1259)
- `AlreadyVerified` log level to trace

Cleanup:
Reduce Sync Restarts:
- `BlockHashes`, due to a zebrad bug
- Work out if we need to remove duplicate downloads using `obtain_tips`, as done in the current code

Design Questions:
Database:
- Notes on Sled Trees
Future Releases
Documentation:
Performance improvements:
- Deploy more Zebra or `zcashd` instances on Testnet (Deploy more Zebra or zcashd instances on testnet #1222)

Possible Future Work:
- Consider disconnecting from peers that return bad responses
- Consider disconnecting from peers that are sending blocks that we no longer want
- Churn peers, using a regular timer to disconnect a random peer (see the sketch after this list)
- Analyse the bandwidth and latency of current Zcash mainnet and testnet peers
- Create a peer reputation service
- Refactor out the core sync algorithms, and write unit tests for them (Write unit tests for ObtainTips and ExtendTips #730)
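On the churn item, the mechanism would look something like this. This is a hypothetical sketch: the `PeerSet` type, its fields, and the 10-minute interval are all assumptions for illustration, not Zebra's actual API:

```rust
use rand::seq::SliceRandom;
use std::net::SocketAddr;
use std::time::Duration;

// Stand-in for Zebra's real peer set.
struct PeerSet {
    peers: Vec<SocketAddr>,
}

impl PeerSet {
    fn disconnect(&mut self, addr: SocketAddr) {
        self.peers.retain(|p| *p != addr);
    }
}

/// On a regular timer, disconnect one randomly chosen peer, so the peer
/// set keeps making fresh connections instead of ossifying.
async fn churn_peers(peer_set: &mut PeerSet) {
    let mut timer = tokio::time::interval(Duration::from_secs(600));
    loop {
        timer.tick().await;
        // Pick the victim first so the immutable borrow ends before we
        // mutate the peer list.
        let victim = peer_set.peers.choose(&mut rand::thread_rng()).copied();
        if let Some(addr) = victim {
            peer_set.disconnect(addr);
        }
    }
}
```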