Skip to content
This repository has been archived by the owner on Sep 4, 2024. It is now read-only.

Timeout on MacOS after making ~16384 RPC calls #67

Closed
casey opened this issue Aug 5, 2022 · 19 comments
Closed

Timeout on MacOS after making ~16384 RPC calls #67

casey opened this issue Aug 5, 2022 · 19 comments

Comments

@casey
Copy link
Contributor

casey commented Aug 5, 2022

I originally hit this while using rust-bitcoincore-rpc, but was able to narrow it down to this crate.

The error is:

Transport(SocketError(Error { kind: TimedOut, message: "connection timed out" }))

I uploaded a repo containing a self-contained reproduction here: https://github.com/casey/bitcoind-rpc-timeout/blob/master/src/main.rs

The issue was introduced some time between 0.11 and 0.12 and is still present on master. I tried to bisect it down, but there was a lot of flux in the crate, which made it impossible to track down a specific revision.

While debugging, I would sometimes get this error:

Transport(SocketError(Os { code: 11, kind: WouldBlock, message: "Resource temporarily unavailable" }))

Which makes me think it's related to this issue: rust-bitcoin/rust-bitcoincore-rpc#211
And this issue: bitcoin/bitcoin#25724 (comment)

Although I wouldn't necessarily bet on it.

@tcharding
Copy link
Collaborator

tcharding commented Aug 5, 2022

That repo crate should hit the bug when running using cargo run, right? Just posting here to confirm the MacOS in the title, on Ubuntu 20.04 the bug does not appear to be triggered.

@casey
Copy link
Contributor Author

casey commented Aug 5, 2022

That repo crate should hit the bug when running using cargo run, right? Just posting here to confirm the MacOS in the title, on Ubuntu 20.04 the bug does not appear to be triggered.

Yup, that's right. I also tried it on an Ubuntu 21.10 server, and wasn't able to reproduce it there.

@casey
Copy link
Contributor Author

casey commented Aug 5, 2022

For reference, I added a GitHub actions workflow that passes on Ubuntu and fails on MacOS.

@tcharding
Copy link
Collaborator

Good bug reporting effort bro, thanks!

@apoelstra
Copy link
Owner

Between 0.11 and 0.12 we completely replaced hyper with a homebrew http transport. Probably we are leaking some resource.

@casey
Copy link
Contributor Author

casey commented Aug 5, 2022

It's always around 16384 which seems suspicious. Possibly the resource being leaked is a socket or file handle?

@apoelstra
Copy link
Owner

Yep -- those are my guesses. I'll try to poke at this later today, although I don't have a Mac so it might be a bit annoying.

On your Mac could you try running your test in strace and see if there's some particular syscall (like open on a socket?) which is timing out?

@casey
Copy link
Contributor Author

casey commented Aug 5, 2022

On your Mac could you try running your test in strace and see if there's some particular syscall (like open on a socket?) which is timing out?

I actually tried to, but I'm on a recent mac OS with security features that prevent dtrace (the strace equivalent available on macos) from running, unless I reboot into recovery mode and run some scary commands. I have another mac that I can do this on, but I'd need to wipe it first.

@apoelstra
Copy link
Owner

After reading the code (and using strace) I'm fairly confident there are no resource leaks. I suspect that on Mac you just aren't allowed to open 2^16 sockets even if you close the old ones (at least, you can't do this in very rapid succession).

The previous code used Hyper, which just maintained a socket pool. We should probably do the same (well, we'd have a single socket rather than a pool .. but still we should cache it).

The reason we didn't do this in the past is that TCP sockets suck -- they can go stale and you can't tell they're dead until you actually try sending and then you've gotta re-try, so you better not have handed ownership over any objects to the serialization code. None of this is a big deal but we were trying to go for maximal simplicity.

I don't think this'll be too hard to fix, especially as we have the old hyper code lying around which tells us which particular io errors mean "broken pipe, just try again" and which ones are real errors.

@casey
Copy link
Contributor Author

casey commented Aug 7, 2022

After reading the code (and using strace) I'm fairly confident there are no resource leaks. I suspect that on Mac you just aren't allowed to open 2^16 sockets even if you close the old ones (at least, you can't do this in very rapid succession).

This sounds right. It might have to do with open ports, since each outgoing socket will allocate a local ephemeral port.

The previous code used Hyper, which just maintained a socket pool. We should probably do the same (well, we'd have a single socket rather than a pool .. but still we should cache it).

Another option to consider making the back-end pluggable. My application already uses hyper, so there aren't any additional security/bloat considerations if I use it in a dependency.

All this makes me wish that Bitcoin Core had a simple binary RPC API over a local FIFO, so that if you didn't need remote access, you didn't have to think about HTTP, security, ports, sockets.

@casey
Copy link
Contributor Author

casey commented Aug 8, 2022

More digging, and it looks like it's the ephemeral port range that's being exhausted. MacOS's ephemeral port range is 49152 to 65535, which is 16383 ports. If reusing the socket is hard and annoying, then reusing the port might be a lower impact option.

@apoelstra
Copy link
Owner

The back-end is actually pluggable -- we have a Transport trait that you can implement using Hyper if you want (and I think we'd accept a PR to use hyper if you wanted to upstream it here, though it would need to be feature gated).

But I think just reusing the socket would be easier/not too bad. It's on my list.

@gwillen
Copy link

gwillen commented Aug 9, 2022

I suggest trying to set SO_REUSEADDR on the sockets, and possibly also disable SO_LINGER -- this may help if the issue is you're trying to reuse ports too fast, while the old socket is still in TIME_WAIT (closed but not fully cleaned up yet.) I don't know why it's only seen on mac -- one possibility is that Linux only applies this to sockets bound to a specific port, and mac is also applying it to ephemeral ones. (Or maybe the mac just has a much smaller legal range for ephemeral ports.)

(Obviously just reusing the socket is better, if you're making a ton of calls in rapid succession. But this might be a lot easier to try.)

@apoelstra
Copy link
Owner

@casey are you able to try this locally? I don't have a Mac to play with.

@casey
Copy link
Contributor Author

casey commented Aug 12, 2022

@apoelstra I should be able to fool around with this tomorrow .

@casey
Copy link
Contributor Author

casey commented Aug 13, 2022

Narrator: “He didn’t have time to fool around with it tomorrow.” Definitely early next week though!

@casey
Copy link
Contributor Author

casey commented Aug 16, 2022

We just tried setting SO_REUSEADDR, SO_REUSEPORT, and disabling SO_LINGER, and they don't seem to have helped. Reusing the socket might be the best option.

@apoelstra
Copy link
Owner

Surprising! But ok, we'll reuse the socket (probably wise regardless).

@raphjaph raphjaph mentioned this issue Oct 26, 2022
apoelstra added a commit that referenced this issue Nov 17, 2022
e867374 Reuse HTTP connection (raphjaph)

Pull request description:

  This is a quick and dirty attempt at fixing #67 that reuses the socket for multiple requests. It has few issues:
  - no error handing, client becomes unusable after encountering a single error
  - `SimpleHttpTransport::request()` takes a `mut` reference to self, which cascades to all downstream users
  - `TcpStream` is not `Clone`, therefore I had to remove the derive `Clone`

  An alternate implementation based on `reqwest` (`hyper` seems too low-level because of the async stuff) wouldn't have these problems.  So I think I'm going to start on that. Just wanted so share this draft PR if anyone has feedback or has an idea on how to do this without `reqwest`.

ACKs for top commit:
  apoelstra:
    ACK e867374

Tree-SHA512: 470e740b7bff7e323bdce014e73e44228e88ed332b3cb49d01302a9d591e51a1ffb25bd3cbcb6413090d89a627e289a04685c6dea615b968190a36635450d486
@apoelstra
Copy link
Owner

Fixed by #72

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants