Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix slowness when syncing near the tip #3375

Closed
conradoplg opened this issue Jan 20, 2022 · 9 comments · Fixed by #3854
Closed

Fix slowness when syncing near the tip #3375

conradoplg opened this issue Jan 20, 2022 · 9 comments · Fixed by #3854
Labels
C-enhancement Category: This is an improvement I-slow Problems with performance or responsiveness

Comments

@conradoplg
Copy link
Collaborator

conradoplg commented Jan 20, 2022

Motivation

It seems that sync becomes slow when it's near the tip. Investigate to find the cause and fix it.

There is some discussion about it in the #slow-sync thread in Discord.

Diagnostics

This is unexpected - the syncer hangs for a few minutes before this log, and there are two chains a long way from the tip:

Feb 25 03:40:26.085 INFO net="Main":sync: zebrad::components::sync: extending tips tips.len=2 in_flight=673 lookahead_limit=2000 state_tip=Some(Height(1577565))

https://github.com/ZcashFoundation/zebra/runs/5325705831?check_suite_focus=true#step:7:10962

Specifications

Designs

Related Work

Possible Causes:

Related Tickets:

@conradoplg conradoplg added C-enhancement Category: This is an improvement S-needs-triage Status: A bug report needs triage I-slow Problems with performance or responsiveness labels Jan 20, 2022
@mpguerra
Copy link
Contributor

@teor2345
Copy link
Contributor

@conradoplg how is this different from ticket #3042?

@conradoplg
Copy link
Collaborator Author

@conradoplg how is this different from ticket #3042?

IMO that's a bug report that may or may not be caused by this issue, or by another, or by a combination of those. I think it's clearer to have this ticket to track this specific issue. After we solve the main slowness tickets we can reach out for the user and check if they still get the same behaviour or not

@conradoplg
Copy link
Collaborator Author

What seems to happen is that it exhausts the prospective tip set and it has to restart the sync, over and over again. (Haven't investigated why)

Jan 28 14:15:49.592  INFO sync: zebrad::components::sync: starting sync, obtaining new tips
Jan 28 14:15:49.592  INFO sync:obtain_tips:state: zebra_state::util: created block locator tip_height=Height(1545612) min_locator_height=1545513 locators=[Height(1545612), Height(1545611), Height(1545610), Height(1545608), Height(1545604), Height(1545596), Height(1545580), Height(1545548), Height(1545513)]
Jan 28 14:15:53.442  INFO sync: zebrad::components::sync: extending tips tips.len=1 in_flight=397 lookahead_limit=2000
Jan 28 14:15:53.695  INFO sync: zebrad::components::sync: extending tips tips.len=1 in_flight=405 lookahead_limit=2000
Jan 28 14:15:54.363  INFO sync: zebrad::components::sync: exhausted prospective tip set
Jan 28 14:15:54.363  INFO sync: zebrad::components::sync: waiting to restart sync timeout=67s
Jan 28 14:17:01.364  INFO sync: zebrad::components::sync: starting sync, obtaining new tips
Jan 28 14:17:01.365  INFO sync:obtain_tips:state: zebra_state::util: created block locator tip_height=Height(1545745) min_locator_height=1545646 locators=[Height(1545745), Height(1545744), Height(1545743), Height(1545741), Height(1545737), Height(1545729), Height(1545713), Height(1545681), Height(1545646)]
Jan 28 14:17:04.126  INFO sync: zebrad::components::sync: extending tips tips.len=1 in_flight=305 lookahead_limit=2000
Jan 28 14:17:04.533  INFO sync: zebrad::components::sync: exhausted prospective tip set
Jan 28 14:17:04.533  INFO sync: zebrad::components::sync: waiting to restart sync timeout=67s

@teor2345
Copy link
Contributor

I think the notfound changes might fix this, so we should look into it after the following tasks are complete:

@teor2345
Copy link
Contributor

teor2345 commented Mar 9, 2022

This ticket might be made redundant by:

Let's remove it from this sprint, and check again after #3745 has been used for a while.

If it's still a problem, we can do this ticket to make the full sync CI faster.

@dconnolly
Copy link
Contributor

I think I have seen this issue reoccur on #main as of commit 859b67d:

image

image

Mar 23 00:24:37.492 DEBUG zebra_network::peer::client: client struct dropped original_error=Err(AlreadyErrored { original_error: SharedPeerError(ConnectionDropped) }) latest_error=SharedPeerError(ClientDropped)
Mar 23 00:24:37.492 DEBUG {zebrad="859b67d" net="Main"}:sync:try_to_sync:extend_tips:download_and_verify{hash=00000000016211f4bc1eb8d321d8ae429f802f228a8ffb4c921cfcef2965449b}:block{height=Some(Height(1594388))}:tx{tx_id=Legacy(transaction::Hash("57e18cd2c5039f81e603e063339bb510ce59ec2ab07ada7c158c1768852fe98f"))}: tower::buffer::worker: service.ready=true processing request
Mar 23 00:24:37.492 DEBUG zebra_network::peer::client: client struct dropped original_error=Err(AlreadyErrored { original_error: SharedPeerError(ConnectionDropped) }) latest_error=SharedPeerError(ClientDropped)
Mar 23 00:24:37.492 DEBUG zebra_network::peer::client: client struct dropped original_error=Err(AlreadyErrored { original_error: SharedPeerError(ConnectionDropped) }) latest_error=SharedPeerError(ClientDropped)
Mar 23 00:24:37.492 DEBUG zebra_network::peer::client: client struct dropped original_error=Err(AlreadyErrored { original_error: SharedPeerError(ConnectionDropped) }) latest_error=SharedPeerError(ClientDropped)
Mar 23 00:24:37.492 DEBUG zebra_network::peer::client: client struct dropped original_error=Err(AlreadyErrored { original_error: SharedPeerError(ConnectionDropped) }) latest_error=SharedPeerError(ClientDropped)
Mar 23 00:24:37.492 DEBUG {zebrad="859b67d" net="Main"}:sync:try_to_sync:extend_tips:download_and_verify{hash=000000000042b02d79244044020fb60d871b10226d89973b41e38a9b41723d92}:block{height=Some(Height(1594482))}:tx{tx_id=Legacy(transaction::Hash("4ff9f7cc563c280e7b01349d83ce5d2629048d58e88a0b3cec8021ae6e08c464"))}: tower::buffer::worker: service.ready=true processing request
Mar 23 00:24:37.492 DEBUG zebra_network::peer::client: client struct dropped original_error=Err(AlreadyErrored { original_error: SharedPeerError(ConnectionDropped) }) latest_error=SharedPeerError(ClientDropped)
Mar 23 00:24:37.492 DEBUG {zebrad="859b67d" net="Main"}:sync:try_to_sync:extend_tips:download_and_verify{hash=000000000042b02d79244044020fb60d871b10226d89973b41e38a9b41723d92}:block{height=Some(Height(1594482))}:tx{tx_id=Legacy(transaction::Hash("0eb65b550d636ec2a1e3e8f14307d6b9ab5f02abcca968097ab30d5166d76059"))}: tower::buffer::worker: service.ready=true processing request
Mar 23 00:24:37.492 DEBUG zebra_network::peer::client: client struct dropped original_error=Err(AlreadyErrored { original_error: SharedPeerError(ConnectionDropped) }) latest_error=SharedPeerError(ClientDropped)
Mar 23 00:24:37.492 DEBUG zebra_network::peer::client: client struct dropped original_error=Err(AlreadyErrored { original_error: SharedPeerError(ConnectionDropped) }) latest_error=SharedPeerError(ClientDropped)
Mar 23 00:24:37.492 DEBUG zebra_network::peer::client: client struct dropped original_error=Err(AlreadyErrored { original_error: SharedPeerError(ConnectionDropped) }) latest_error=SharedPeerError(ClientDropped)
Mar 23 00:24:37.492 DEBUG {zebrad="859b67d" net="Main"}:sync:try_to_sync:extend_tips:download_and_verify{hash=00000000016211f4bc1eb8d321d8ae429f802f228a8ffb4c921cfcef2965449b}:block{height=Some(Height(1594388))}:tx{tx_id=Legacy(transaction::Hash("deac529ab4b9f3c8484ee588d5d7f14bd5c7ae17ffaa41dcc8a252e9cd828f22"))}: tower::buffer::worker: service.ready=true processing request
Mar 23 00:24:37.492 DEBUG zebra_network::peer::client: client struct dropped original_error=Err(AlreadyErrored { original_error: SharedPeerError(ConnectionDropped) }) latest_error=SharedPeerError(ClientDropped)
Mar 23 00:24:37.492 DEBUG zebra_network::peer::client: client struct dropped original_error=Err(AlreadyErrored { original_error: SharedPeerError(ConnectionDropped) }) latest_error=SharedPeerError(ClientDropped)
Mar 23 00:24:37.492 DEBUG {zebrad="859b67d" net="Main"}:sync:try_to_sync:extend_tips:download_and_verify{hash=000000000042b02d79244044020fb60d871b10226d89973b41e38a9b41723d92}:block{height=Some(Height(1594482))}:tx{tx_id=Legacy(transaction::Hash("4bb4fa812c94e36f1cdb18ff30ff3a3d599d0c80f8a343d487517c1a88b1d8a8"))}: tower::buffer::worker: service.ready=true processing request
Mar 23 00:24:37.492 DEBUG zebra_network::peer::client: client struct dropped original_error=Err(AlreadyErrored { original_error: SharedPeerError(ConnectionDropped) }) latest_error=SharedPeerError(ClientDropped)
Mar 23 00:24:37.492 DEBUG zebra_network::peer::client: client struct dropped original_error=Err(AlreadyErrored { original_error: SharedPeerError(ConnectionDropped) }) latest_error=SharedPeerError(ClientDropped)
Mar 23 00:24:37.492 DEBUG {zebrad="859b67d" net="Main"}:sync:try_to_sync:extend_tips:download_and_verify{hash=000000000042b02d79244044020fb60d871b10226d89973b41e38a9b41723d92}:block{height=Some(Height(1594482))}:tx{tx_id=Legacy(transaction::Hash("b15a04dcb910414040acb2a33868e171b5aaaa7244f5452cf96b0be6c764aa0b"))}: tower::buffer::worker: service.ready=true processing request
Mar 23 00:24:37.493 DEBUG zebra_network::peer::client: client struct dropped original_error=Err(AlreadyErrored { original_error: SharedPeerError(ConnectionDropped) }) latest_error=SharedPeerError(ClientDropped)
Mar 23 00:24:37.493 DEBUG zebra_network::peer::client: client struct dropped original_error=Err(AlreadyErrored { original_error: SharedPeerError(ConnectionDropped) }) latest_error=SharedPeerError(ClientDropped)

@teor2345
Copy link
Contributor

Can you post the info-level logs from before these logs?
It's hard to tell what the syncer is doing here, and how it got here.

zebra_network::peer::client: client struct dropped original_error=Err(AlreadyErrored { original_error: SharedPeerError(ConnectionDropped) }) latest_error=SharedPeerError(ClientDropped)

These repeated ClientDropped errors are slightly concerning.
Does you machine have a bad internet connection?
Or are these errors happening all the time?

What should happen is that the client should get removed by the peer set, then we should stop sending requests to its service. If we're not removing the clients quickly enough, then they could fill up our whole peer set, causing hangs.

This is a completely different error to the original error, so maybe we should open another ticket?

@dconnolly
Copy link
Contributor

I'm having trouble reproducing, I may have to do another full sync to reproduce :/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Category: This is an improvement I-slow Problems with performance or responsiveness
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants