Whois tool dies frequently #21

Open

whym opened this issue Aug 13, 2021 · 14 comments

whym (Owner) commented Aug 13, 2021

https://whois.toolforge.org/ has been unstable for weeks. It dies a couple of times a day, although it usually gets restarted after a period of unavailability.

Here is what appears to be the relevant part of the error messages: errors.txt

2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 249 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 243 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 241 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 245 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 255 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 229 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 213 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 237 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 251 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 197 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 195 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 157 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 165 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 173 
2021-08-13 10:44:43: (mod_cgi.c.1049) CGI pid 8284 died with signal 9 
2021-08-13 10:44:44: (mod_cgi.c.1049) CGI pid 8308 died with signal 9 
2021-08-13 10:44:44: (mod_cgi.c.1049) CGI pid 8312 died with signal 9 
2021-08-13 10:44:44: (mod_cgi.c.1049) CGI pid 8322 died with signal 9 
2021-08-13 10:44:44: (mod_cgi.c.1049) CGI pid 8282 died with signal 9 
2021-08-13 10:44:44: (mod_cgi.c.1049) CGI pid 8297 died with signal 9 
2021-08-13 10:44:44: (mod_cgi.c.1049) CGI pid 8315 died with signal 9 
2021-08-13 10:44:45: (mod_cgi.c.1049) CGI pid 8283 died with signal 9 
2021-08-13 10:44:46: (gw_backend.c.335) child signalled: 9 
2021-08-13 10:44:47: (gw_backend.c.335) child signalled: 9 
2021-08-13 10:44:47: (gw_backend.c.476) unlink /var/run/lighttpd/php.socket.whois-1 after connect failed: Connection refused 
2021-08-13 10:44:48: (mod_cgi.c.1049) CGI pid 8301 died with signal 9 
2021-08-13 10:44:48: (mod_cgi.c.1049) CGI pid 8291 died with signal 9 
2021-08-13 10:44:48: (gw_backend.c.476) unlink /var/run/lighttpd/php.socket.whois-0 after connect failed: Connection refused 
2021-08-13 10:44:49: (server.c.1863) connection closed - keep-alive timeout: 211 
2021-08-13 10:44:49: (server.c.1863) connection closed - keep-alive timeout: 203 
2021-08-13 10:44:49: (server.c.1863) connection closed - keep-alive timeout: 155 
2021-08-13 10:44:49: (mod_cgi.c.1049) CGI pid 8268 died with signal 9 
2021-08-13 10:44:49: (mod_cgi.c.1049) CGI pid 8298 died with signal 9 
2021-08-13 10:44:49: (mod_cgi.c.1049) CGI pid 8307 died with signal 9 
2021-08-13 10:44:49: (mod_cgi.c.1049) CGI pid 8265 died with signal 9 
2021-08-13 10:44:50: (server.c.1863) connection closed - keep-alive timeout: 181 
2021-08-13 10:44:50: (server.c.1863) connection closed - keep-alive timeout: 231 
2021-08-13 10:44:50: (server.c.1863) connection closed - keep-alive timeout: 217 
2021-08-13 10:44:50: (server.c.1863) connection closed - keep-alive timeout: 153 
2021-08-13 10:44:51: (gw_backend.c.335) child signalled: 9 
2021-08-13 10:44:51: (gw_backend.c.476) unlink /var/run/lighttpd/php.socket.whois-1 after connect failed: Connection refused 
2021-08-13 10:44:51: (gw_backend.c.335) child signalled: 9 
2021-08-13 10:44:51: (gw_backend.c.476) unlink /var/run/lighttpd/php.socket.whois-0 after connect failed: Connection refused 

(The linked file contains a larger portion of the log file.)

Here is my tentative plan. Coincidentally, I have been preparing a Flask version of the tool, inspired by @wiki-ST47's fork, and it's ready to be tested (#20). If it serves automated traffic to the JSON endpoint well (and assuming that's the cause), the new version could be the solution.

I can't tell from the logs above what is happening, so the replacement may or may not solve it. If there is an identifiable cause, I can work on it (and pursue the switch independently, perhaps later).

whym pinned this issue Aug 13, 2021

whym (Owner) commented Aug 14, 2021

I have two main questions. In what context does the tool die? (I'm guessing that it's in the middle of a large number of automated requests, but I'm not entirely sure.) And is https://whois-dev.toolforge.org/ a good replacement?

rail01 commented Aug 17, 2021

Thanks for investigating this! Whois-dev's API appears to output JSON data with the correct MIME type, but all my requests to it are marked by Firefox as slower than expected. Otherwise, the dev version looks fine.

whym (Owner) commented Aug 18, 2021

Thanks for testing.

but all my requests to it are marked by Firefox as slower than expected.

Can you tell me how I can replicate that? When I open a JSON response with Firefox, I don't see anything like that.
[screenshot attached]

whym (Owner) commented Aug 25, 2021

whois-dev now gets a high number of requests (>20,000 per day), and it appears to die in proportion.

Tue Aug 24 18:33:53 2021 - *** uWSGI listen queue of socket ":8000" (fd: 4) full !!! (100/100) ***
Tue Aug 24 18:33:54 2021 - *** uWSGI listen queue of socket ":8000" (fd: 4) full !!! (100/100) ***
Tue Aug 24 18:33:55 2021 - *** uWSGI listen queue of socket ":8000" (fd: 4) full !!! (100/100) ***
Tue Aug 24 18:33:56 2021 - *** uWSGI listen queue of socket ":8000" (fd: 4) full !!! (100/100) ***
Tue Aug 24 18:33:57 2021 - *** uWSGI listen queue of socket ":8000" (fd: 4) full !!! (100/100) ***
Tue Aug 24 18:33:58 2021 - *** uWSGI listen queue of socket ":8000" (fd: 4) full !!! (100/100) ***

Rate-limiting seems like one way to mitigate this (assuming the number of users, as opposed to the number of requests, is not too large). Fortunately, it looks like there is a ready-made implementation of rate limiting for Flask; see the sketch below.
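
For illustration, a minimal sketch using the third-party Flask-Limiter package (one candidate for such a ready-made implementation; the constructor signature matches the 1.x/2.x releases current at the time — later versions moved key_func to the first positional argument — and the 60-per-minute figure is the one settled on later in this thread):

from flask import Flask, request
from flask_limiter import Limiter

app = Flask(__name__)

def client_key():
    # Toolforge's proxy strips client IPs (see the next comment), so the usual
    # remote-address key is useless; key on the User-Agent header instead
    return request.user_agent.string or "unknown"

limiter = Limiter(app, key_func=client_key, default_limits=["60 per minute"])

Requests over the limit get an automatic 429 Too Many Requests response, which matches the errors reported further down in this thread.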

An alternative might be to ask for more resources from the Toolforge infrastructure: https://phabricator.wikimedia.org/T245426 .
I'm not keen to allow too many requests per minute, however: https://phabricator.wikimedia.org/T265784

AntiCompositeNumber commented

Rate-limiting by IP is not possible on Toolforge, as IP information is intentionally stripped by the reverse proxy. It should be possible to collect user-agent information, though (Request.user_agent.string), which should actually give more insight into what's causing the high request rates.
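
For illustration, a minimal Flask sketch of collecting that information (the in-memory tally and the hook name are illustrative, not the tool's actual code):

from collections import Counter
from flask import Flask, request

app = Flask(__name__)
ua_counts = Counter()  # in-memory tally; logging to a file would survive restarts

@app.before_request
def record_user_agent():
    # request.user_agent.string is the raw User-Agent header (may be empty)
    ua_counts[request.user_agent.string or "<none>"] += 1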

whym (Owner) commented Aug 26, 2021

It should be possible to collect user-agent information, though

@AntiCompositeNumber Thank you for the suggestion. I'm trying that.

It turns out there were unusually slow queries.

[pid: 13|app: 0|req: 28765/112583] 192.168.247.73 () {54 vars in 943 bytes} [Thu Aug 26 09:06:59 2021] GET /w/197.98.140.170/lookup/json => generated 547 bytes in 347295 msecs (HTTP/1.1 200) 4 headers in 141 bytes (1 switches on core 0)

[pid: 12|app: 0|req: 29277/113890] 192.168.190.202 () {54 vars in 941 bytes} [Thu Aug 26 10:41:55 2021] GET /w/105.112.105.186/lookup/json => generated 640 bytes in 140608 msecs (HTTP/1.1 200) 4 headers in 141 bytes (1 switches on core 0)

I don't know the root cause, but I know they are rare, well below 1% of requests. I'm trying to add timeouts; a sketch of one approach follows.
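
A minimal sketch of one way to add such timeouts, assuming the Python ipwhois library the tool uses (see the Sep 27 comment below); the deadline mechanism here is an illustration, not the tool's actual code:

from concurrent.futures import ThreadPoolExecutor, TimeoutError
from ipwhois import IPWhois

# A shared pool lets a timed-out lookup return control to the request handler;
# the worker thread may keep running in the background until the query finishes.
_pool = ThreadPoolExecutor(max_workers=4)

def lookup_with_deadline(ip, socket_timeout=5, deadline=30):
    # timeout= is ipwhois's per-socket-connection timeout; a chain of registry
    # referrals can still exceed it, hence the overall deadline as well
    future = _pool.submit(
        lambda: IPWhois(ip, timeout=socket_timeout).lookup_whois(retry_count=1))
    try:
        return future.result(timeout=deadline)
    except TimeoutError:
        return None  # caller can return an error response or a cached result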

gstrauss commented Aug 30, 2021

I am not sure whether or not adjusting lighttpd config settings could help out here. You might try to limit the load by setting "listen-backlog" (1024 is the default value) to a lower value for the socket listen backlog in the lighttpd.conf scgi.server directive which spawns the uWSGI backend.

[Edit] You can also restrict the load using the attributes "max-procs" and "max-load-per-proc".

I am a lighttpd developer, so if you have other suggestions, please let me know. For example, the upcoming lighttpd 1.4.60 will include "connect-timeout", "write-timeout", and "read-timeout" settings for connecting to, writing to, and reading from the uWSGI backend.
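
For illustration, a hedged lighttpd.conf sketch of those attributes (the socket path, bin-path, and values are placeholders, not the tool's actual configuration):

scgi.protocol = "uwsgi"   # mod_scgi speaking the uwsgi protocol to the backend
scgi.server = ( "/" => ((
    "socket"            => "/var/run/lighttpd/uwsgi.whois.socket",  # hypothetical path
    "bin-path"          => "/usr/bin/uwsgi",  # lighttpd spawns the backend itself
    "listen-backlog"    => 256,  # below the 1024 default, to shed load earlier
    "max-procs"         => 2,    # number of backend processes to spawn
    "max-load-per-proc" => 16,   # per-process load cap, as suggested above
)))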

whym (Owner) commented Aug 31, 2021

@gstrauss Thank you for the advice. I was wondering why the timeout parameters mentioned in the official lighttpd documentation didn't work, without bothering to check the version I was dealing with. It turns out that it was lighttpd 1.4.53, which explains it.

https://wikitech.wikimedia.org/wiki/Help:Toolforge/Web/Lighttpd has server.max-connections = 300 (under "Default lighttpd configuration"). It sounds like a value below 300 is appropriate for listen-backlog, according to https://redmine.lighttpd.net/issues/2116 .

gstrauss commented Sep 1, 2021

server.max-connections is the maximum number of simultaneous connections that lighttpd will accept() and handle in parallel. Any additional connection attempts remain in the kernel's listen backlog queue(s) for lighttpd's listening sockets (e.g. on port 443), and the size of those queues can be configured with server.listen-backlog.

Separate from those two configuration options is the scgi.server = ( "listen-backlog" => 1024 ) option, which controls the listen backlog queue size for the listening sockets created for backend SCGI (or uWSGI) servers.
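
Side by side, the three knobs look like this in lighttpd.conf (illustrative values, not recommendations):

server.max-connections = 300  # connections lighttpd handles in parallel
server.listen-backlog  = 128  # kernel queue for lighttpd's own listening sockets
# and, separately, inside a scgi.server backend definition:
#   "listen-backlog" => 1024  # queue for the spawned backend's listening socket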

AntiCompositeNumber commented

Looking at the logs for ASNBlock, I am now getting 408 Request Timeout and 429 Too Many Requests errors from the whois-dev API. What rate limit should I use?

whym (Owner) commented Sep 16, 2021

@AntiCompositeNumber Currently no more than 60 requests per minute are allowed. The figure is somewhat arbitrary and can be relaxed depending on need, but I think there should be some limit.

AntiCompositeNumber added a commit to AntiCompositeNumber/AntiCompositeBot that referenced this issue Sep 18, 2021
Currently set to 1/sec, can tune if that still hits the ratelimit.
utils.Throttle is not concurrency-aware, but with WHOIS result caching
implemented only one of the processes is making the bulk of requests.

See whym/whois-gateway#21
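
For reference, a minimal client-side throttle along the lines that commit message describes (this is a sketch, not AntiCompositeBot's actual utils.Throttle):

import time

class Throttle:
    """Sleep as needed so calls happen at most once per `interval` seconds.
    Not concurrency-aware: each process/thread needs its own instance."""
    def __init__(self, interval=1.0):
        self.interval = interval
        self.last = 0.0
    def wait(self):
        now = time.monotonic()
        delay = self.interval - (now - self.last)
        if delay > 0:
            time.sleep(delay)
        self.last = time.monotonic()

throttle = Throttle(interval=1.0)  # 1 request/sec stays under the 60/min limit
# call throttle.wait() before each request to the whois-gateway JSON endpoint
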
whym (Owner) commented Sep 26, 2021

I tried some different values for listen-backlog (server.listen-backlog = 16, 128, 32), approximately for one week each, and I'm not sure it made any difference. The smaller values might result in less downtime, say twice a day instead of five times a day, but it varies from day to day and there is no clear-cut difference.

gstrauss commented Sep 26, 2021

What are you trying to rate limit? I don't think the configuration settings you're trying will help much.

Should the PHP app keep a database of requests and timestamps and limit the number of requests within a given minute, or else return 503 Service Unavailable with the response header Retry-After: 30 to ask clients to retry in 30 seconds?
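
For illustration, a minimal sketch of that second pattern in terms of the Flask version under discussion (the rolling in-memory counter and values are illustrative; the route matches the endpoint seen in the logs above):

import time
from collections import deque
from flask import Flask, jsonify

app = Flask(__name__)
_recent = deque()          # timestamps of recent requests
LIMIT, PERIOD = 60, 60.0   # 60 requests per minute

def over_limit():
    now = time.time()
    while _recent and now - _recent[0] > PERIOD:
        _recent.popleft()
    if len(_recent) >= LIMIT:
        return True
    _recent.append(now)
    return False

@app.route("/w/<ip>/lookup/json")
def lookup(ip):
    if over_limit():
        resp = jsonify(error="rate limit exceeded")
        resp.status_code = 503
        resp.headers["Retry-After"] = "30"  # ask clients to retry in 30 seconds
        return resp
    return jsonify(query=ip)  # placeholder for the real WHOIS lookup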

whym (Owner) commented Sep 27, 2021

Part of the reason must be the unusually slow queries mentioned in #21 (comment). The slowness doesn't happen with the raw whois command, but happens consistently with Python's ipwhois library, for both the CGI version and the Flask version. I added timeouts to the Flask version, and that seems to have made the service mostly downtime-free. However, this is ad hoc. Perhaps I can revisit RDAP for faster responses, but it had its own issues (#5).
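
(For reference, ipwhois exposes RDAP as a drop-in alternative to legacy port-43 WHOIS; a hedged one-liner, using an IP from the logs above:)

from ipwhois import IPWhois
result = IPWhois("197.98.140.170").lookup_rdap(depth=0)  # RDAP over HTTP instead of whois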
