Intermittent long delays to inbound peer connection handshakes, Credit: Ziggurat Team #6763
Comments
The Ziggurat team have done some more testing:
This suggests that either the inbound service is blocked, or the entire async thread or executor is blocked.
I think it's the codec in some cases. Tasks spawned with …
Do you mean "blocked", rather than "idle"? Blocking threads are designed to be blocked on network, filesystem, or other blocking operations, so they should never use much CPU. (That's why we use rayon.) Have you checked the queue depth? About how long is it?
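For context, the async-to-rayon pattern referenced above usually looks something like this sketch; the function and names here are hypothetical, not Zebra's actual code:

```rust
use tokio::sync::oneshot;

// Hypothetical sketch: CPU-heavy work runs on the rayon pool, and the async
// task only awaits a oneshot channel, so the tokio worker thread is never
// blocked while the computation runs.
async fn check_on_rayon(data: Vec<u8>) -> Result<bool, oneshot::error::RecvError> {
    let (tx, rx) = oneshot::channel();
    rayon::spawn(move || {
        // stand-in for real CPU-bound verification work
        let ok = data.iter().fold(0u8, |acc, b| acc.wrapping_add(*b)) != 0;
        let _ = tx.send(ok);
    });
    // the "queue depth" question above is about how many of these jobs are
    // waiting in the pool at once
    rx.await
}
```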
Have you tried running tokio with more threads?
Yep. 🤦 That wouldn't cause such a long delay in any event.
Ah, thank you, it's always 0 except at startup.
That would've worked.
I'm seeing some unusual logs on our 1.0.0-rc.9 testnet node:
Maybe we shouldn't consider block request timeouts to be "missing inventory". Or maybe this is caused by the hangs, and it will go away by itself when we fix them.
These logs should come from separate tasks. This could just be a futures logging bug (using …
The "missing inventory" log is for a different block hash, the first log is probably from a queued block being dropped in the block write task, I don't think they're related.
Possibly I copied the wrong log line, but you're right, they're a thousand blocks apart.
The block write task returns a different error when it drops a block. So it's either a download or verification timeout. (Or maybe a verifier hang due to unsatisfied verification constraints.) Both block hashes are valid, so neither of those errors should have happened:
It's a verification timeout:
That's not what in_current_span() is documented to do. It looks like the inbound connection task span isn't being exited when its future stops being polled. Or we're doing outbound CandidateSet updates from within that task, which could be causing this bug, but I can't see any code that would do that. Does using an async mutex on the CandidateSet cause issues with spans? Is an await calling into other tasks? My first step would be spawning the initial CandidateSet task, so its spans and code are completely isolated. My next step would be instrumenting all the awaits in the listener task with their own extra spans. That might also help us find out which await is blocking (if any).
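The "extra span per await" idea could look roughly like this hedged sketch (made-up span names, not the actual listener code); unlike in_current_span(), which attaches whatever span is current at the call site, .instrument(...) attaches a specific span to each future:

```rust
use std::time::Duration;
use tracing::Instrument;

// Sketch: give each await in the listener task its own child span, so the
// logs show which await a slow inbound handshake is sitting in.
async fn handle_inbound_connection() {
    async {
        // stand-in for accepting the connection and negotiating the version
        tokio::time::sleep(Duration::from_millis(10)).await;
    }
    .instrument(tracing::info_span!("accept_and_negotiate"))
    .await;

    async {
        // stand-in for the CandidateSet / address book update
        tokio::time::sleep(Duration::from_millis(10)).await;
    }
    .instrument(tracing::info_span!("candidate_set_update"))
    .await;
}
```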
Ok, so that either means:
I think
Update: It's not a shared span, but …
There's #5125 but this block is far below the final testnet checkpoint.
It's not in the state service; I'll review the checkpoint verifier.
Was the node behind the block height (2268403) when the request timed out? I don't see anything suspicious in the checkpoint verifier; maybe it was left in … I'll double-check the downloader.
It was stuck just behind that block at a checkpoint:
The logs are all available on the Google Cloud instance.
I would look at the checkpoint verifier or its downloader. We might need to add more info to the "stuck" log, or add a log for stuck checkpoints.
In PR #6950 I put a 5 second timeout on inbound service requests. Then I did a local test with a 2 second timeout. None of those requests timed out, so either:
We should keep the timeout for security reasons, but we can rule out the inbound service as a source of these hangs. (At least for now, until we get more evidence.) My next step will be wrapping inbound and outbound handshake tasks with the same timeout as the inner handshake future, and logging any timeouts. That will help us work out if the handshakes are blocking somewhere in that small amount of code. (Every handshake should complete or time out within 3-4 seconds, so it's really unusual behaviour to have them hang for 25-40 seconds.)
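The inner-plus-outer timeout idea described above could be sketched like this (hypothetical names and durations, not the PR's exact code):

```rust
use std::time::Duration;
use tokio::time::timeout;

// Sketch: the inner timeout bounds the handshake future itself, and the outer
// timeout bounds the whole spawned task with a small scheduling margin. If the
// outer timeout fires, the task wasn't polled promptly, which points at the
// executor or its threads rather than the handshake code.
const HANDSHAKE_TIMEOUT: Duration = Duration::from_secs(4);

async fn spawn_handshake_with_timeouts() {
    let task = tokio::spawn(async {
        // inner timeout around the (stand-in) handshake future
        timeout(HANDSHAKE_TIMEOUT, do_handshake()).await
    });

    // outer timeout: same limit plus a margin for task scheduling
    match timeout(HANDSHAKE_TIMEOUT + Duration::from_millis(500), task).await {
        Ok(Ok(Ok(()))) => tracing::debug!("handshake finished"),
        Ok(Ok(Err(_elapsed))) => tracing::info!("inner handshake timeout"),
        Ok(Err(join_error)) => tracing::warn!("handshake task failed: {}", join_error),
        Err(_elapsed) => tracing::warn!("outer timeout: the task wasn't polled in time"),
    }
}

async fn do_handshake() { /* stand-in for the real peer handshake */ }
```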
This is done in PR #6969, where I discovered that … We don't have any evidence that outbound handshakes are hanging, so I didn't put an extra timeout on them yet.
After testing PR #6969 on a full sync and a synced instance for a day, I saw no inbound service timeouts, and one inbound handshake timeout:
This is acceptable and doesn't need any more changes; it looks like the bug is elsewhere.
Now I'm seeing some inbound handshake timeouts, but in the outer timeout wrapper. This means that a handshake task took 500ms longer than the timeout inside that task. This should never happen as long as the cooperative task scheduler is operating efficiently. So that means that the scheduler or the inbound task thread is regularly getting blocked for 500ms or more. There's about one timeout every few hours, which could explain the hangs in this ticket, because a connection needs to be made at the time of a hang in order to time out. It's also possible that the hang rate increases as the inbound connection load goes up. Here are my logs, but they're probably not that useful by themselves; I'll go find context later:
Next step is to find the source of the scheduler or thread hangs. It might be enough to look at the log context; I'll try that when I get time. We could also use tokio-console, dtrace, or some similar profiling tool. Ideally we could activate or inspect it slightly before the hang, and then see what Zebra is doing to start and end the hang.
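Before reaching for tokio-console or dtrace, a cheap first signal could be a watchdog task along these lines (a rough sketch, not existing Zebra code): it sleeps for a short interval and logs whenever it wakes up far later than expected, which is what a blocked scheduler looks like from inside the runtime.

```rust
use std::time::{Duration, Instant};

// Sketch: a task that measures how late its own timer wakeups are. Large
// overshoots mean the runtime's worker threads were blocked around that time.
async fn scheduler_watchdog() {
    let interval = Duration::from_millis(100);
    let stall_threshold = Duration::from_millis(500);
    loop {
        let before = Instant::now();
        tokio::time::sleep(interval).await;
        let overshoot = before.elapsed().saturating_sub(interval);
        if overshoot >= stall_threshold {
            tracing::warn!("scheduler stalled for about {:?}", overshoot);
        }
    }
}

// spawn it alongside the other tasks: tokio::spawn(scheduler_watchdog());
```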
I haven't moved this to the current sprint (2023 Sprint 13)... is anyone actively looking into this? Otherwise I feel like we should not re-schedule it until 2023 Sprint 14 at the earliest.
I need to focus on the RPC work, and I'd like other people to be able to focus as well. I think it's ok to wait for the next release, which includes all the recent fixes to network bugs, and then ask Ziggurat to re-test. |
Motivation
The Ziggurat team have reported some long delays (25-40 seconds) in Zebra's inbound connection handler.
The most likely cause is running blocking code inside async futures on the same thread or task as the connection handler.
The amount of code between accepting an inbound connection and sending a version message is quite small. It's possible the async mutex in the nonce set or reading the state tip from the watch channel are causing delays, but it's unlikely.
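As an illustration of that failure mode: anything blocking that runs directly in the listener future delays every handshake queued behind it, and the usual fix is to push that work onto tokio's blocking pool. A minimal sketch, not Zebra's actual code:

```rust
use tokio::task::spawn_blocking;

// Sketch: the blocking filesystem read runs on tokio's dedicated blocking
// pool, so the async thread handling inbound connections keeps being polled.
async fn read_file_off_the_async_threads(path: String) -> std::io::Result<Vec<u8>> {
    spawn_blocking(move || std::fs::read(path))
        .await
        .expect("blocking task panicked")
}
```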
Next Diagnostic Steps
Run Zebra under strace or dtrace to work out what it's waiting on during those 40 seconds.
Run Zebra with tokio-console to see what the blocking task is (see the sketch below).
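A minimal tokio-console setup sketch, assuming the console-subscriber crate is added as a dependency and the binary is built with RUSTFLAGS="--cfg tokio_unstable"; the separate tokio-console CLI then connects to the running process:

```rust
#[tokio::main]
async fn main() {
    // registers the console subscriber so the tokio-console CLI can inspect
    // task poll times and find tasks that block their worker thread
    console_subscriber::init();

    // ... start the node / code under investigation here ...
}
```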
Complex Code or Requirements
This seems like a concurrency bug.
Testing
Manually start up Zebra and check how long it takes to accept inbound connections.
Related Work
This issue is possibly related to:
Discussions of this bug