
Intermittent long delays to inbound peer connection handshakes, Credit: Ziggurat Team #6763

Closed
teor2345 opened this issue May 24, 2023 · 22 comments · Fixed by #7103
Labels: A-concurrency, C-bug, C-enhancement, C-security, I-hang, S-needs-triage

teor2345 (Contributor) commented May 24, 2023

Motivation

The Ziggurat team have reported some long delays (25-40 seconds) in Zebra's inbound connection handler.

The most likely cause is running blocking code inside async futures on the same thread or task as the connection handler.

The amount of code between accepting an inbound connection and sending a version message is quite small. It's possible that the async mutex in the nonce set or the read of the state tip from the watch channel is causing delays, but it's unlikely.
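
As a rough illustration of the suspected failure mode (a minimal sketch with made-up names, not Zebra's code, assuming tokio with its full feature set), blocking inside an async task stalls every other future scheduled on the same executor thread, and tokio's spawn_blocking is the usual escape hatch:

```rust
use std::time::Duration;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    // Anti-pattern: std::thread::sleep() blocks the executor thread itself,
    // so the "handshake" task below typically can't be polled until it finishes.
    let blocker = tokio::spawn(async {
        std::thread::sleep(Duration::from_secs(5));
    });

    // Ready immediately, but on a single-threaded runtime it is delayed by
    // the blocking task above.
    let handshake = tokio::spawn(async {
        println!("inbound handshake handled");
    });

    // Workaround: run blocking work on the dedicated blocking thread pool,
    // which keeps the async executor free to poll other futures.
    let moved_off = tokio::task::spawn_blocking(|| {
        std::thread::sleep(Duration::from_secs(5));
    });

    let _ = tokio::join!(blocker, handshake, moved_off);
}
```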

Next Diagnostic Steps

  • Run Zebra under strace or dtrace to work out what it's waiting on during those 40 seconds.
  • Run Zebra with tokio-console to see what the blocking task is (see the sketch below).
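
For the tokio-console step, a hedged sketch of the wiring it usually needs (the console-subscriber crate plus a tokio_unstable build flag; this is not something the zebrad binary necessarily includes by default):

```rust
fn main() {
    // Requires the `console-subscriber` crate and building the binary with
    // RUSTFLAGS="--cfg tokio_unstable" so tokio emits task instrumentation.
    console_subscriber::init();

    // ... start the tokio runtime and the node as usual, then run the
    // `tokio-console` CLI in another terminal to watch task poll times.
}
```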

Complex Code or Requirements

This seems like a concurrency bug.

Testing

Manually start up Zebra and check how long it takes to accept inbound connections.

Related Work

This issue is possibly related to:

Discussions of this bug

teor2345 added the C-bug, C-enhancement, S-needs-triage, P-Medium, C-security, I-hang, and A-concurrency labels on May 24, 2023
mpguerra added this to the Zebra project on May 24, 2023
arya2 self-assigned this on May 31, 2023
teor2345 (Contributor, Author) commented Jun 4, 2023

The Ziggurat team have done some more testing:

So, one more small related finding: we've set up a synth node which sends GetAddr requests every few seconds.

We have also observed that once zebrad enters this stuck state (where it can't handle new inbound connections for a while), it also stops replying to our GetAddr messages on the other active connection we have established.

This suggests that either the inbound service is blocked, or the entire async thread or executor is blocked.

@teor2345 teor2345 changed the title Intermittent long delays to inbound peer connection handshakes Intermittent long delays to inbound peer connection handshakes, Credit: Ziggurat Team Jun 6, 2023
arya2 (Contributor) commented Jun 9, 2023

I think it's the codec in some cases.

Tasks spawned with task::block_in_place() are queued for execution if there are already max_blocking_threads on the runtime even if every spawned blocking thread is idle.

teor2345 (Contributor, Author) commented Jun 9, 2023

I think it's the codec in some cases.

Tasks spawned with task::block_in_place() are queued for execution if there are already max_blocking_threads on the runtime even if every spawned blocking thread is idle.

Do you mean "blocked", rather than "idle"?

Blocking threads are designed to be blocked on network, filesystem, or other blocking operations, so they should never use much CPU. (That's why we use rayon.)

Have you checked the queue depth? About how long is it?
https://docs.rs/tokio/latest/tokio/runtime/struct.RuntimeMetrics.html#method.blocking_queue_depth
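
A minimal sketch of polling that metric from inside the runtime (illustrative only; depending on the tokio version, blocking_queue_depth() may be gated behind the tokio_unstable cfg):

```rust
use std::time::Duration;

// Hypothetical monitoring task: log how many blocking tasks are waiting for
// a free thread in the blocking pool.
// blocking_queue_depth() may need --cfg tokio_unstable on some tokio versions.
async fn log_blocking_queue_depth() {
    let handle = tokio::runtime::Handle::current();
    loop {
        let depth = handle.metrics().blocking_queue_depth();
        tracing::info!(depth, "blocking pool queue depth");
        tokio::time::sleep(Duration::from_secs(10)).await;
    }
}
```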

teor2345 (Contributor, Author) commented Jun 9, 2023

Have you tried running tokio with more threads?
https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.max_blocking_threads
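
If we wanted to test that, a sketch of building the runtime by hand with a larger blocking pool (not Zebra's actual startup code; the numbers are illustrative):

```rust
use tokio::runtime::Builder;

fn main() {
    // Sketch only: raise the blocking thread pool cap to rule out queueing
    // behind max_blocking_threads (the tokio default is 512).
    let runtime = Builder::new_multi_thread()
        .max_blocking_threads(1024)
        .enable_all()
        .build()
        .expect("failed to build tokio runtime");

    runtime.block_on(async {
        // start the node's services here
    });
}
```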

arya2 (Contributor) commented Jun 9, 2023

Do you mean "blocked", rather than "idle"?

Yep. 🤦 That wouldn't cause such a long delay in any event.

Have you checked the queue depth? About how long is it?

Ah, thank you, it's always 0 except at startup.

Have you tried running tokio with more threads?

That would've worked.

teor2345 (Contributor, Author):

I'm seeing some unusual logs on our 1.0.0-rc.9 testnet node:

2023-06-11T22:09:25.763933Z WARN {net="Test"}:sync:try_to_sync: zebrad::components::sync: error downloading and verifying block e=ValidationRequestError { error: Elapsed(()), height: Height(2268403), hash: block::Hash("000d1acb9eb88a30b1b79d97dc37562f31242ca67b9639549b2ee011fd53515d") }
2023-06-11T22:16:23.709438Z INFO {net="Test"}:sync:try_to_sync:extend_tips: zebra_network::peer_set::set: all ready peers are missing inventory, failing request hash=Block(block::Hash("000396228eeb43850b30b2441ca9391fb78dc622d6dd5649db1913efbe7dada0"))

Maybe we shouldn't consider block request timeouts to be "missing inventory". Or maybe this is caused by the hangs, and it will go away by itself when we fix them.

2023-06-11T22:18:21.656595Z INFO {net="Test"}:accept_inbound_connections{min_inbound_peer_connection_interval=1s listener_addr=Ok(0.0.0.0:18233)}:listen_accept{peer=In("v4redacted:41812")}:crawl: zebra_network::peer_set::candidate_set: timeout waiting for peer service readiness or peer responses
accept_inbound_connections{min_inbound_peer_connection_interval=1s listener_addr=Ok(0.0.0.0:18233)}:listen_accept{peer=In("v4redacted:47028")}: zebra_network::config: updated cached peer IP addresses cached_ip_count=11 peer_cache_file="/root/.cache/zebra/network/testnet.peers

These logs should come from separate tasks. This could just be a futures logging bug (using tracing::instrument incorrectly). But if they are running in the same task, this could be a cause of the hangs.

arya2 (Contributor) commented Jun 12, 2023

I'm seeing some unusual logs on our 1.0.0-rc.9 testnet node:

2023-06-11T22:09:25.763933Z WARN {net="Test"}:sync:try_to_sync: zebrad::components::sync: error downloading and verifying block e=ValidationRequestError { error: Elapsed(()), height: Height(2268403), hash: block::Hash("000d1acb9eb88a30b1b79d97dc37562f31242ca67b9639549b2ee011fd53515d") }
2023-06-11T22:16:23.709438Z INFO {net="Test"}:sync:try_to_sync:extend_tips: zebra_network::peer_set::set: all ready peers are missing inventory, failing request hash=Block(block::Hash("000396228eeb43850b30b2441ca9391fb78dc622d6dd5649db1913efbe7dada0"))

Maybe we shouldn't consider block request timeouts to be "missing inventory". Or maybe this is caused by the hangs, and it will go away by itself when we fix them.

The "missing inventory" log is for a different block hash, the first log is probably from a queued block being dropped in the block write task, I don't think they're related.

2023-06-11T22:18:21.656595Z INFO {net="Test"}:accept_inbound_connections{min_inbound_peer_connection_interval=1s listener_addr=Ok(0.0.0.0:18233)}:listen_accept{peer=In("v4redacted:41812")}:crawl: zebra_network::peer_set::candidate_set: timeout waiting for peer service readiness or peer responses
accept_inbound_connections{min_inbound_peer_connection_interval=1s listener_addr=Ok(0.0.0.0:18233)}:listen_accept{peer=In("v4redacted:47028")}: zebra_network::config: updated cached peer IP addresses cached_ip_count=11 peer_cache_file="/root/.cache/zebra/network/testnet.peers

These logs should come from separate tasks. This could just be a futures logging bug (using tracing::instrument incorrectly). But if they are running in the same task, this could be a cause of the hangs.

Probably https://github.com/ZcashFoundation/zebra/blob/main/zebra-network/src/peer_set/initialize.rs#L252?

teor2345 (Contributor, Author):

I'm seeing some unusual logs on our 1.0.0-rc.9 testnet node:

2023-06-11T22:09:25.763933Z WARN {net="Test"}:sync:try_to_sync: zebrad::components::sync: error downloading and verifying block e=ValidationRequestError { error: Elapsed(()), height: Height(2268403), hash: block::Hash("000d1acb9eb88a30b1b79d97dc37562f31242ca67b9639549b2ee011fd53515d") }
2023-06-11T22:16:23.709438Z INFO {net="Test"}:sync:try_to_sync:extend_tips: zebra_network::peer_set::set: all ready peers are missing inventory, failing request hash=Block(block::Hash("000396228eeb43850b30b2441ca9391fb78dc622d6dd5649db1913efbe7dada0"))

Maybe we shouldn't consider block request timeouts to be "missing inventory". Or maybe this is caused by the hangs, and it will go away by itself when we fix them.

The "missing inventory" log is for a different block hash,

Possibly I copied the wrong log line, but you're right, they're a thousand blocks apart.

the first log is probably from a queued block being dropped in the block write task, I don't think they're related.

The block write task returns a different error when it drops a block. So it's either a download or verification timeout. (Or maybe a verifier hang due to unsatisfied verification constraints.)

Both block hashes are valid, so neither of those errors should have happened:
https://blockexplorer.one/zcash/testnet/blockHash/000d1acb9eb88a30b1b79d97dc37562f31242ca67b9639549b2ee011fd53515d
https://blockexplorer.one/zcash/testnet/blockHash/000396228eeb43850b30b2441ca9391fb78dc622d6dd5649db1913efbe7dada0

arya2 (Contributor) commented Jun 12, 2023

So it's either a download or verification timeout. (Or maybe a verifier hang due to unsatisfied verification constraints.)

It's a verification timeout:
https://github.com/ZcashFoundation/zebra/blob/main/zebrad/src/components/sync/downloads.rs#L548

teor2345 (Contributor, Author):

2023-06-11T22:18:21.656595Z INFO {net="Test"}:accept_inbound_connections{min_inbound_peer_connection_interval=1s listener_addr=Ok(0.0.0.0:18233)}:listen_accept{peer=In("v4redacted:41812")}:crawl: zebra_network::peer_set::candidate_set: timeout waiting for peer service readiness or peer responses
accept_inbound_connections{min_inbound_peer_connection_interval=1s listener_addr=Ok(0.0.0.0:18233)}:listen_accept{peer=In("v4redacted:47028")}: zebra_network::config: updated cached peer IP addresses cached_ip_count=11 peer_cache_file="/root/.cache/zebra/network/testnet.peers

These logs should come from separate tasks. This could just be a futures logging bug (using tracing::instrument incorrectly). But if they are running in the same task, this could be a cause of the hangs.

Probably main/zebra-network/src/peer_set/initialize.rs#L252?

That's not what in_current_span() is documented to do:
https://docs.rs/tracing/latest/tracing/trait.Instrument.html#method.in_current_span

It looks like the inbound connection task span isn't being exited when its future stops being polled. Or we're doing outbound CandidateSet updates from within that task, which could be causing this bug.

But I can't see any code that would do that. Does using an async mutex on the CandidateSet cause issues with spans? Is an await calling into other tasks?

My first step would be spawning the initial CandidateSet task, so its spans and code are completely isolated.

My next step would be instrumenting all the awaits in the listener task with their own extra spans. That might also help us find out which await is blocking (if any).
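
A sketch of what instrumenting an individual await with its own span could look like (the names here are illustrative, not Zebra's actual listener code):

```rust
use tracing::{info_span, Instrument};

// Give one step of the listener loop its own span, so logs and any hang are
// attributed to that await rather than to the surrounding task span.
async fn accept_one(listener: &tokio::net::TcpListener) -> std::io::Result<()> {
    let (stream, addr) = listener
        .accept()
        .instrument(info_span!("accept_tcp"))
        .await?;

    tracing::info!(?addr, "accepted inbound TCP connection");
    drop(stream);
    Ok(())
}
```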

teor2345 (Contributor, Author):

So it's either a download or verification timeout. (Or maybe a verifier hang due to unsatisfied verification constraints.)

It's a verification timeout: main/zebrad/src/components/sync/downloads.rs#L548

Ok, so that either means:

  • we're downloading blocks in the wrong order (not strict height order) then hitting a concurrency limit so we can't download the rest before the timeout
  • our state isn't responding to all the UTXOs for the blocks it has (I think this is a known issue)
  • the verifier is blocked in some other way

arya2 (Contributor) commented Jun 12, 2023

But I can't see any code that would do that. Does using an async mutex on the CandidateSet causes issues with spans? Is an await calling into other tasks?

I think listen_fut and peer_cache_updater_fut are both using the current span, so when listen_fut enters listen_accept, it's doing so for the shared span.

The tokio::time::sleep(min_inbound_peer_connection_interval).await; at the end of the scope keeps the guard from being dropped; it's not the mutex.

Update: It's not a shared span, but tracing expects the Entered guard to be dropped before the next .await.
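
For reference, a minimal sketch of that pitfall and the usual fix (illustrative names, not the actual initialize.rs code):

```rust
use std::time::Duration;
use tracing::{info_span, Instrument};

// Anti-pattern: holding a span's Entered guard across an .await. When this
// future yields at the sleep, the guard leaves the thread "inside" the span,
// so logs from other tasks polled on that thread can pick it up.
async fn rate_limit_with_guard(interval: Duration) {
    let span = info_span!("listen_accept");
    let _entered = span.enter(); // guard is still alive across the await below
    tokio::time::sleep(interval).await;
}

// Fix: instrument the future instead, so the span is entered and exited
// around every poll and never leaks across await points.
async fn rate_limit_instrumented(interval: Duration) {
    tokio::time::sleep(interval)
        .instrument(info_span!("listen_accept"))
        .await;
}
```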

arya2 (Contributor) commented Jun 12, 2023

our state isn't responding to all the UTXOs for the blocks it has (I think this is a known issue)

There's #5125 but this block is far below the final testnet checkpoint.

the verifier is blocked in some other way

It's not in the state service; I'll review the checkpoint verifier.

arya2 (Contributor) commented Jun 12, 2023

2023-06-11T22:09:25.763933Z WARN {net="Test"}:sync:try_to_sync: zebrad::components::sync: error downloading and verifying block e=ValidationRequestError { error: Elapsed(()), height: Height(2268403), hash: block::Hash("000d1acb9eb88a30b1b79d97dc37562f31242ca67b9639549b2ee011fd53515d") }

we're downloading blocks in the wrong order (not strict height order) then hitting a concurrency limit so we can't download the rest before the timeout

Was the node behind the block height (2268403) when the request timed out?

I don't see anything suspicious in the checkpoint verifier, maybe it was left in QueuedBlocks if the chain tip was still below 2268403 at that point.

I'll double-check the downloader.

teor2345 (Contributor, Author):

our state isn't responding to all the UTXOs for the blocks it has (I think this is a known issue)

There's #5125 but this block is far below the final testnet checkpoint.

The repeated timeouts are tracked by bug #5709

teor2345 (Contributor, Author):

2023-06-11T22:09:25.763933Z WARN {net="Test"}:sync:try_to_sync: zebrad::components::sync: error downloading and verifying block e=ValidationRequestError { error: Elapsed(()), height: Height(2268403), hash: block::Hash("000d1acb9eb88a30b1b79d97dc37562f31242ca67b9639549b2ee011fd53515d") }

we're downloading blocks in the wrong order (not strict height order) then hitting a concurrency limit so we can't download the rest before the timeout

Was the node behind the block height (2268403) when the request timed out?

It was stuck just behind that block at a checkpoint:

2023-06-11T22:09:21.656378Z INFO {net="Test"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=95.806% current_height=Height(2268400) network_upgrade=Nu5 remaining_sync_blocks=99307 time_since_last_state_block=7m

The logs are all available on the Google Cloud instance.

I don't see anything suspicious in the checkpoint verifier, maybe it was left in QueuedBlocks if the chain tip was still below 2268403 at that point.

I'll double-check the downloader.

I would look at the checkpoint verifier or its downloader. We might need to add more info to the "stuck" log, or add a log for stuck checkpoints.

teor2345 (Contributor, Author):

In PR #6950 I put a 5 second timeout on inbound service requests. Then I did a local test with a 2 second timeout.

None of those requests timed out, so either:

  • I haven't been able to reproduce the bug locally,
  • the latest version of zebrad fixes it, or
  • the timeouts are not caused by slow inbound service requests.

We should keep the timeout for security reasons, but we can rule out the inbound service as a source of these hangs. (At least for now, until we get more evidence.)

My next step will be wrapping inbound and outbound handshake tasks with the same timeout as the inner handshake future, and logging any timeouts. That will help us work out if the handshakes are blocking somewhere in that small amount of code. (Every handshake should complete or time out within 3-4 seconds, so it's really unusual behaviour to have them hang for 25-40 seconds.)
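
A hedged sketch of that double-timeout idea (the names and durations here are illustrative, not Zebra's actual constants or API):

```rust
use std::{future::Future, time::Duration};
use tokio::time::timeout;

// Hypothetical handshake timeout, for illustration only.
const HANDSHAKE_TIMEOUT: Duration = Duration::from_secs(4);

// The handshake future carries its own timeout, and the spawned task that
// wraps it gets the same timeout plus a small margin. If the outer timeout
// ever fires, the inner one should already have fired, so the task itself
// was not being polled promptly.
async fn spawn_handshake_with_outer_timeout<F>(handshake: F)
where
    F: Future<Output = ()> + Send + 'static,
{
    let task = tokio::spawn(async move {
        if timeout(HANDSHAKE_TIMEOUT, handshake).await.is_err() {
            tracing::info!("handshake timed out (inner timeout)");
        }
    });

    let outer = HANDSHAKE_TIMEOUT + Duration::from_millis(500);
    if timeout(outer, task).await.is_err() {
        tracing::info!("timeout in spawned handshake task: inner timeout should have fired already");
    }
}
```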

teor2345 (Contributor, Author) commented Jun 15, 2023

My next step will be wrapping inbound and outbound handshake tasks with the same timeout as the inner handshake future, and logging any timeouts. That will help us work out if the handshakes are blocking somewhere in that small amount of code. (Every handshake should complete or time out within 3-4 seconds, so it's really unusual behaviour to have them hang for 25-40 seconds.)

This is done in PR #6969, where I discovered that outbound TCP connections had no timeout at all.
Edit: that was wrong, they do have a timeout; it's in the service timeout layer.

We don't have any evidence that outbound handshakes are hanging, so I didn't put an extra timeout on them yet.

arya2 removed their assignment on Jun 15, 2023
teor2345 (Contributor, Author):

After testing PR #6969 on a full sync and a synced instance for a day, I saw no inbound service timeouts, and one inbound handshake timeout:

2023-06-15T11:38:38.138980Z INFO {zebrad="b29c9db" net="Test"}:accept_inbound_connections{min_inbound_peer_connection_interval=1s listener_addr=Ok(0.0.0.0:38233)}: zebra_network::peer_set::initialize: timeout in spawned accept_inbound_handshake() task

This is acceptable and doesn't need any more changes; it looks like the bug is elsewhere.

teor2345 (Contributor, Author):

Now I'm seeing some inbound handshake timeouts, but in the outer timeout wrapper.

This means that a handshake task took 500ms longer than the timeout inside that task. This should never happen, as long as the cooperative task scheduler is operating efficiently. So that means that the scheduler or the inbound task thread is regularly getting blocked for 500ms or more.

There's about one timeout every few hours, which could explain the hangs in this ticket, because a connection needs to be made at the time of a hang to time out. It's also possible that the hang rate increases as the inbound connection load goes up.

Here are my logs, but they're probably not that useful by themselves; I'll go find context later:

2023-06-16T12:49:22.776673Z  INFO {zebrad="23cb29a" net="Main"}:accept_inbound_connections{min_inbound_peer_connection_interval=1s listener_addr=Ok(0.0.0.0:28233)}: zebra_network::peer_set::initialize: timeout in spawned accept_inbound_handshake() task: inner task should have timeout out already
2023-06-16T15:37:22.776987Z  INFO {zebrad="23cb29a" net="Main"}:accept_inbound_connections{min_inbound_peer_connection_interval=1s listener_addr=Ok(0.0.0.0:28233)}: zebra_network::peer_set::initialize: timeout in spawned accept_inbound_handshake() task: inner task should have timeout out already
2023-06-17T02:27:22.777531Z  INFO {zebrad="23cb29a" net="Main"}:accept_inbound_connections{min_inbound_peer_connection_interval=1s listener_addr=Ok(0.0.0.0:28233)}: zebra_network::peer_set::initialize: timeout in spawned accept_inbound_handshake() task: inner task should have timeout out already
2023-06-17T05:29:22.777631Z  INFO {zebrad="23cb29a" net="Main"}:accept_inbound_connections{min_inbound_peer_connection_interval=1s listener_addr=Ok(0.0.0.0:28233)}: zebra_network::peer_set::initialize: timeout in spawned accept_inbound_handshake() task: inner task should have timeout out already
2023-06-17T06:42:22.778207Z  INFO {zebrad="23cb29a" net="Main"}:accept_inbound_connections{min_inbound_peer_connection_interval=1s listener_addr=Ok(0.0.0.0:28233)}: zebra_network::peer_set::initialize: timeout in spawned accept_inbound_handshake() task: inner task should have timeout out already
2023-06-17T06:52:22.777703Z  INFO {zebrad="23cb29a" net="Main"}:accept_inbound_connections{min_inbound_peer_connection_interval=1s listener_addr=Ok(0.0.0.0:28233)}: zebra_network::peer_set::initialize: timeout in spawned accept_inbound_handshake() task: inner task should have timeout out already
2023-06-17T08:02:22.777531Z  INFO {zebrad="23cb29a" net="Main"}:accept_inbound_connections{min_inbound_peer_connection_interval=1s listener_addr=Ok(0.0.0.0:28233)}: zebra_network::peer_set::initialize: timeout in spawned accept_inbound_handshake() task: inner task should have timeout out already
2023-06-17T09:40:22.778041Z  INFO {zebrad="23cb29a" net="Main"}:accept_inbound_connections{min_inbound_peer_connection_interval=1s listener_addr=Ok(0.0.0.0:28233)}: zebra_network::peer_set::initialize: timeout in spawned accept_inbound_handshake() task: inner task should have timeout out already
2023-06-17T12:42:22.778240Z  INFO {zebrad="23cb29a" net="Main"}:accept_inbound_connections{min_inbound_peer_connection_interval=1s listener_addr=Ok(0.0.0.0:28233)}: zebra_network::peer_set::initialize: timeout in spawned accept_inbound_handshake() task: inner task should have timeout out already
2023-06-17T21:04:22.778398Z  INFO {zebrad="23cb29a" net="Main"}:accept_inbound_connections{min_inbound_peer_connection_interval=1s listener_addr=Ok(0.0.0.0:28233)}: zebra_network::peer_set::initialize: timeout in spawned accept_inbound_handshake() task: inner task should have timeout out already
2023-06-18T01:35:22.778959Z  INFO {zebrad="23cb29a" net="Main"}:accept_inbound_connections{min_inbound_peer_connection_interval=1s listener_addr=Ok(0.0.0.0:28233)}: zebra_network::peer_set::initialize: timeout in spawned accept_inbound_handshake() task: inner task should have timeout out already
2023-06-18T10:38:22.779105Z  INFO {zebrad="23cb29a" net="Main"}:accept_inbound_connections{min_inbound_peer_connection_interval=1s listener_addr=Ok(0.0.0.0:28233)}: zebra_network::peer_set::initialize: timeout in spawned accept_inbound_handshake() task: inner task should have timeout out already
2023-06-18T14:32:22.779325Z  INFO {zebrad="23cb29a" net="Main"}:accept_inbound_connections{min_inbound_peer_connection_interval=1s listener_addr=Ok(0.0.0.0:28233)}: zebra_network::peer_set::initialize: timeout in spawned accept_inbound_handshake() task: inner task should have timeout out already
2023-06-18T18:32:22.779451Z  INFO {zebrad="23cb29a" net="Main"}:accept_inbound_connections{min_inbound_peer_connection_interval=1s listener_addr=Ok(0.0.0.0:28233)}: zebra_network::peer_set::initialize: timeout in spawned accept_inbound_handshake() task: inner task should have timeout out already
2023-06-19T05:29:22.787021Z  INFO {zebrad="23cb29a" net="Main"}:accept_inbound_connections{min_inbound_peer_connection_interval=1s listener_addr=Ok(0.0.0.0:28233)}: zebra_network::peer_set::initialize: timeout in spawned accept_inbound_handshake() task: inner task should have timeout out already
2023-06-19T22:18:22.787581Z  INFO {zebrad="23cb29a" net="Main"}:accept_inbound_connections{min_inbound_peer_connection_interval=1s listener_addr=Ok(0.0.0.0:28233)}: zebra_network::peer_set::initialize: timeout in spawned accept_inbound_handshake() task: inner task should have timeout out already
2023-06-19T23:28:22.787793Z  INFO {zebrad="23cb29a" net="Main"}:accept_inbound_connections{min_inbound_peer_connection_interval=1s listener_addr=Ok(0.0.0.0:28233)}: zebra_network::peer_set::initialize: timeout in spawned accept_inbound_handshake() task: inner task should have timeout out already

Next step is to find the source of the scheduler or thread hangs. It might be enough to look at the log context, I'll try that when I get time.

We could also use tokio-console, dtrace, or some similar profiling tool. Ideally we could activate or inspect it slightly before the hang, and then see what Zebra is doing to start and end the hang.

mpguerra (Contributor):

I haven't moved this to the current sprint (2023 Sprint 13)... is anyone actively looking into this? Otherwise, I feel like we should not schedule it back in until 2023 Sprint 14 at the earliest.

teor2345 (Contributor, Author):

I need to focus on the RPC work, and I'd like other people to be able to focus as well.

I think it's ok to wait for the next release, which includes all the recent fixes to network bugs, and then ask Ziggurat to re-test.

mergify bot closed this as completed in #7103 on Jun 30, 2023
mergify bot pushed a commit that referenced this issue on Jun 30, 2023
…a CPU busy-loop, Credit: Ziggurat Team (#6763), james_katz (#7000) (#7103)

* Stop busy-waiting in a Future for 45 seconds every minute

* Use the correct elapsed time calculation

* Add some TODOs for making the structure of the loop and wait times clearer
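
As a rough illustration of the kind of fix the commit title describes (a sketch of the general busy-wait-versus-sleep pattern, not the actual #7103 diff):

```rust
use std::time::Duration;
use tokio::time::{sleep_until, Instant};

// Busy-waiting: this async fn never yields while it spins on elapsed time,
// so it monopolises an executor thread for the whole interval.
async fn wait_busy(interval: Duration) {
    let start = Instant::now();
    while start.elapsed() < interval {
        // spin
    }
}

// Non-blocking equivalent: sleep until the deadline, yielding to the
// scheduler so other futures (like inbound handshakes) can run.
async fn wait_async(interval: Duration) {
    sleep_until(Instant::now() + interval).await;
}
```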