-
Notifications
You must be signed in to change notification settings - Fork 61
Timeout on MacOS after making ~16384 RPC calls #67
Comments
That repo crate should hit the bug when running using |
Yup, that's right. I also tried it on an Ubuntu 21.10 server, and wasn't able to reproduce it there. |
For reference, I added a GitHub actions workflow that passes on Ubuntu and fails on MacOS. |
Good bug reporting effort bro, thanks! |
Between 0.11 and 0.12 we completely replaced hyper with a homebrew http transport. Probably we are leaking some resource. |
It's always around 16384 which seems suspicious. Possibly the resource being leaked is a socket or file handle? |
Yep -- those are my guesses. I'll try to poke at this later today, although I don't have a Mac so it might be a bit annoying. On your Mac could you try running your test in |
I actually tried to, but I'm on a recent mac OS with security features that prevent dtrace (the strace equivalent available on macos) from running, unless I reboot into recovery mode and run some scary commands. I have another mac that I can do this on, but I'd need to wipe it first. |
After reading the code (and using strace) I'm fairly confident there are no resource leaks. I suspect that on Mac you just aren't allowed to open 2^16 sockets even if you close the old ones (at least, you can't do this in very rapid succession). The previous code used Hyper, which just maintained a socket pool. We should probably do the same (well, we'd have a single socket rather than a pool .. but still we should cache it). The reason we didn't do this in the past is that TCP sockets suck -- they can go stale and you can't tell they're dead until you actually try sending and then you've gotta re-try, so you better not have handed ownership over any objects to the serialization code. None of this is a big deal but we were trying to go for maximal simplicity. I don't think this'll be too hard to fix, especially as we have the old hyper code lying around which tells us which particular io errors mean "broken pipe, just try again" and which ones are real errors. |
This sounds right. It might have to do with open ports, since each outgoing socket will allocate a local ephemeral port.
Another option to consider making the back-end pluggable. My application already uses hyper, so there aren't any additional security/bloat considerations if I use it in a dependency. All this makes me wish that Bitcoin Core had a simple binary RPC API over a local FIFO, so that if you didn't need remote access, you didn't have to think about HTTP, security, ports, sockets. |
More digging, and it looks like it's the ephemeral port range that's being exhausted. MacOS's ephemeral port range is 49152 to 65535, which is 16383 ports. If reusing the socket is hard and annoying, then reusing the port might be a lower impact option. |
The back-end is actually pluggable -- we have a But I think just reusing the socket would be easier/not too bad. It's on my list. |
I suggest trying to set SO_REUSEADDR on the sockets, and possibly also disable SO_LINGER -- this may help if the issue is you're trying to reuse ports too fast, while the old socket is still in TIME_WAIT (closed but not fully cleaned up yet.) I don't know why it's only seen on mac -- one possibility is that Linux only applies this to sockets bound to a specific port, and mac is also applying it to ephemeral ones. (Or maybe the mac just has a much smaller legal range for ephemeral ports.) (Obviously just reusing the socket is better, if you're making a ton of calls in rapid succession. But this might be a lot easier to try.) |
@casey are you able to try this locally? I don't have a Mac to play with. |
@apoelstra I should be able to fool around with this tomorrow . |
Narrator: “He didn’t have time to fool around with it tomorrow.” Definitely early next week though! |
We just tried setting SO_REUSEADDR, SO_REUSEPORT, and disabling SO_LINGER, and they don't seem to have helped. Reusing the socket might be the best option. |
Surprising! But ok, we'll reuse the socket (probably wise regardless). |
e867374 Reuse HTTP connection (raphjaph) Pull request description: This is a quick and dirty attempt at fixing #67 that reuses the socket for multiple requests. It has few issues: - no error handing, client becomes unusable after encountering a single error - `SimpleHttpTransport::request()` takes a `mut` reference to self, which cascades to all downstream users - `TcpStream` is not `Clone`, therefore I had to remove the derive `Clone` An alternate implementation based on `reqwest` (`hyper` seems too low-level because of the async stuff) wouldn't have these problems. So I think I'm going to start on that. Just wanted so share this draft PR if anyone has feedback or has an idea on how to do this without `reqwest`. ACKs for top commit: apoelstra: ACK e867374 Tree-SHA512: 470e740b7bff7e323bdce014e73e44228e88ed332b3cb49d01302a9d591e51a1ffb25bd3cbcb6413090d89a627e289a04685c6dea615b968190a36635450d486
Fixed by #72 |
I originally hit this while using
rust-bitcoincore-rpc
, but was able to narrow it down to this crate.The error is:
I uploaded a repo containing a self-contained reproduction here: https://github.com/casey/bitcoind-rpc-timeout/blob/master/src/main.rs
The issue was introduced some time between 0.11 and 0.12 and is still present on master. I tried to bisect it down, but there was a lot of flux in the crate, which made it impossible to track down a specific revision.
While debugging, I would sometimes get this error:
Which makes me think it's related to this issue: rust-bitcoin/rust-bitcoincore-rpc#211
And this issue: bitcoin/bitcoin#25724 (comment)
Although I wouldn't necessarily bet on it.
The text was updated successfully, but these errors were encountered: