Whois tool dies frequently #21

Open

whym opened this issue Aug 13, 2021 · 14 comments

whym (Owner) commented Aug 13, 2021

https://whois.toolforge.org/ has been unstable for weeks. It dies a couple of times a day, although it usually gets restarted after a period of unavailability.

Here is what appears to be the relevant part of the error messages: errors.txt

2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 249 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 243 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 241 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 245 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 255 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 229 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 213 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 237 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 251 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 197 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 195 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 157 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 165 
2021-08-13 10:44:43: (server.c.1863) connection closed - keep-alive timeout: 173 
2021-08-13 10:44:43: (mod_cgi.c.1049) CGI pid 8284 died with signal 9 
2021-08-13 10:44:44: (mod_cgi.c.1049) CGI pid 8308 died with signal 9 
2021-08-13 10:44:44: (mod_cgi.c.1049) CGI pid 8312 died with signal 9 
2021-08-13 10:44:44: (mod_cgi.c.1049) CGI pid 8322 died with signal 9 
2021-08-13 10:44:44: (mod_cgi.c.1049) CGI pid 8282 died with signal 9 
2021-08-13 10:44:44: (mod_cgi.c.1049) CGI pid 8297 died with signal 9 
2021-08-13 10:44:44: (mod_cgi.c.1049) CGI pid 8315 died with signal 9 
2021-08-13 10:44:45: (mod_cgi.c.1049) CGI pid 8283 died with signal 9 
2021-08-13 10:44:46: (gw_backend.c.335) child signalled: 9 
2021-08-13 10:44:47: (gw_backend.c.335) child signalled: 9 
2021-08-13 10:44:47: (gw_backend.c.476) unlink /var/run/lighttpd/php.socket.whois-1 after connect failed: Connection refused 
2021-08-13 10:44:48: (mod_cgi.c.1049) CGI pid 8301 died with signal 9 
2021-08-13 10:44:48: (mod_cgi.c.1049) CGI pid 8291 died with signal 9 
2021-08-13 10:44:48: (gw_backend.c.476) unlink /var/run/lighttpd/php.socket.whois-0 after connect failed: Connection refused 
2021-08-13 10:44:49: (server.c.1863) connection closed - keep-alive timeout: 211 
2021-08-13 10:44:49: (server.c.1863) connection closed - keep-alive timeout: 203 
2021-08-13 10:44:49: (server.c.1863) connection closed - keep-alive timeout: 155 
2021-08-13 10:44:49: (mod_cgi.c.1049) CGI pid 8268 died with signal 9 
2021-08-13 10:44:49: (mod_cgi.c.1049) CGI pid 8298 died with signal 9 
2021-08-13 10:44:49: (mod_cgi.c.1049) CGI pid 8307 died with signal 9 
2021-08-13 10:44:49: (mod_cgi.c.1049) CGI pid 8265 died with signal 9 
2021-08-13 10:44:50: (server.c.1863) connection closed - keep-alive timeout: 181 
2021-08-13 10:44:50: (server.c.1863) connection closed - keep-alive timeout: 231 
2021-08-13 10:44:50: (server.c.1863) connection closed - keep-alive timeout: 217 
2021-08-13 10:44:50: (server.c.1863) connection closed - keep-alive timeout: 153 
2021-08-13 10:44:51: (gw_backend.c.335) child signalled: 9 
2021-08-13 10:44:51: (gw_backend.c.476) unlink /var/run/lighttpd/php.socket.whois-1 after connect failed: Connection refused 
2021-08-13 10:44:51: (gw_backend.c.335) child signalled: 9 
2021-08-13 10:44:51: (gw_backend.c.476) unlink /var/run/lighttpd/php.socket.whois-0 after connect failed: Connection refused 

(The linked file contains a larger portion of the log file.)

Here is my tentative plan. Coincidentally, I have been preparing a Flask version of the tool, inspired by @wiki-ST47's fork, and it's ready to be tested (#20). If it serves automated traffic to the JSON endpoint well (and assuming that's the cause), the new version could be the solution.

I can't tell from the logs above what is happening, so the replacement may or may not solve it. If there is an identifiable cause, I can work on it (and pursue the switch independently, perhaps later).

whym pinned this issue Aug 13, 2021

whym (Owner) commented Aug 14, 2021

I have two main questions. In what context does the tool die? (I'm guessing that it's in the middle of a large number of automated requests, but I'm not entirely sure.) And is https://whois-dev.toolforge.org/ a good replacement?

rail01 commented Aug 17, 2021

Thanks for investigating this! Whois-dev's API appears to output JSON data with the correct MIME type, but all my requests to it are marked by Firefox as slower than expected. Otherwise, the dev version looks fine.

whym (Owner) commented Aug 18, 2021

Thanks for testing.

but all my requests to it are marked by Firefox as slower than expected.

Can you tell me how I can replicate that? When I open a JSON response with Firefox, I don't see anything like that.
[screenshot attached]

whym (Owner) commented Aug 25, 2021

whois-dev now gets a high number of requests (>20,000 per day), and it appears to die in proportion.

Tue Aug 24 18:33:53 2021 - *** uWSGI listen queue of socket ":8000" (fd: 4) full !!! (100/100) ***
Tue Aug 24 18:33:54 2021 - *** uWSGI listen queue of socket ":8000" (fd: 4) full !!! (100/100) ***
Tue Aug 24 18:33:55 2021 - *** uWSGI listen queue of socket ":8000" (fd: 4) full !!! (100/100) ***
Tue Aug 24 18:33:56 2021 - *** uWSGI listen queue of socket ":8000" (fd: 4) full !!! (100/100) ***
Tue Aug 24 18:33:57 2021 - *** uWSGI listen queue of socket ":8000" (fd: 4) full !!! (100/100) ***
Tue Aug 24 18:33:58 2021 - *** uWSGI listen queue of socket ":8000" (fd: 4) full !!! (100/100) ***

Rate-limiting seems like one way to mitigate this (assuming the number of users, as opposed to the number of requests, is not too large). Fortunately, it looks like there is a ready-made implementation of rate limiting for Flask; see the sketch below.
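
For illustration, a minimal sketch using the third-party Flask-Limiter package (one candidate for such a ready-made implementation; the constructor signature matches the 1.x/2.x releases current at the time — later versions moved key_func to the first positional argument — and the 60-per-minute figure is the one settled on later in this thread):

from flask import Flask, request
from flask_limiter import Limiter

app = Flask(__name__)

def client_key():
    # Toolforge's proxy strips client IPs (see the next comment), so the usual
    # remote-address key is useless; key on the User-Agent header instead
    return request.user_agent.string or "unknown"

limiter = Limiter(app, key_func=client_key, default_limits=["60 per minute"])

Requests over the limit get an automatic 429 Too Many Requests response, which matches the errors reported further down in this thread.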

An alternative might be to ask for more resources from the Toolforge infrastructure: https://phabricator.wikimedia.org/T245426 .
I'm not keen to allow too many requests per minute, however: https://phabricator.wikimedia.org/T265784

AntiCompositeNumber commented

Rate-limiting by IP is not possible on Toolforge, as IP information is intentionally stripped by the reverse proxy. It should be possible to collect user-agent information, though (Request.user_agent.string), which should actually give more insight into what's causing the high request rates.
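
For illustration, a minimal Flask sketch of collecting that information (the in-memory tally and the hook name are illustrative, not the tool's actual code):

from collections import Counter
from flask import Flask, request

app = Flask(__name__)
ua_counts = Counter()  # in-memory tally; logging to a file would survive restarts

@app.before_request
def record_user_agent():
    # request.user_agent.string is the raw User-Agent header (may be empty)
    ua_counts[request.user_agent.string or "<none>"] += 1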

whym (Owner) commented Aug 26, 2021

It should be possible to collect user-agent information, though

@AntiCompositeNumber Thank you for the suggestion. I'm trying that.

It turns out there were unusually slow queries.

[pid: 13|app: 0|req: 28765/112583] 192.168.247.73 () {54 vars in 943 bytes} [Thu Aug 26 09:06:59 2021] GET /w/197.98.140.170/lookup/json => generated 547 bytes in 347295 msecs (HTTP/1.1 200) 4 headers in 141 bytes (1 switches on core 0)

[pid: 12|app: 0|req: 29277/113890] 192.168.190.202 () {54 vars in 941 bytes} [Thu Aug 26 10:41:55 2021] GET /w/105.112.105.186/lookup/json => generated 640 bytes in 140608 msecs (HTTP/1.1 200) 4 headers in 141 bytes (1 switches on core 0)

I don't know the root cause, but I know they are rare, well below 1% of requests. I'm trying to add timeouts; a sketch of one approach follows.
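
A minimal sketch of one way to add such timeouts, assuming the Python ipwhois library the tool uses (see the Sep 27 comment below); the deadline mechanism here is an illustration, not the tool's actual code:

from concurrent.futures import ThreadPoolExecutor, TimeoutError
from ipwhois import IPWhois

# A shared pool lets a timed-out lookup return control to the request handler;
# the worker thread may keep running in the background until the query finishes.
_pool = ThreadPoolExecutor(max_workers=4)

def lookup_with_deadline(ip, socket_timeout=5, deadline=30):
    # timeout= is ipwhois's per-socket-connection timeout; a chain of registry
    # referrals can still exceed it, hence the overall deadline as well
    future = _pool.submit(
        lambda: IPWhois(ip, timeout=socket_timeout).lookup_whois(retry_count=1))
    try:
        return future.result(timeout=deadline)
    except TimeoutError:
        return None  # caller can return an error response or a cached result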

gstrauss commented Aug 30, 2021

I am not sure whether or not adjusting lighttpd config settings could help out here. You might try to limit the load by setting "listen-backlog" (1024 is the default value) to a lower value for the socket listen backlog in the lighttpd.conf scgi.server directive which spawns the uWSGI backend.

[Edit] You can also restrict the load using the attributes "max-procs" and "max-load-per-proc".

I am a lighttpd developer, so if you have other suggestions, please let me know. For example, the upcoming lighttpd 1.4.60 will include "connect-timeout", "write-timeout", and "read-timeout" settings for connecting to, writing to, and reading from the uWSGI backend.
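
For illustration, a hedged lighttpd.conf sketch of those attributes (the socket path, bin-path, and values are placeholders, not the tool's actual configuration):

scgi.protocol = "uwsgi"   # mod_scgi speaking the uwsgi protocol to the backend
scgi.server = ( "/" => ((
    "socket"            => "/var/run/lighttpd/uwsgi.whois.socket",  # hypothetical path
    "bin-path"          => "/usr/bin/uwsgi",  # lighttpd spawns the backend itself
    "listen-backlog"    => 256,  # below the 1024 default, to shed load earlier
    "max-procs"         => 2,    # number of backend processes to spawn
    "max-load-per-proc" => 16,   # per-process load cap, as suggested above
)))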

whym (Owner) commented Aug 31, 2021

@gstrauss Thank you for the advice. I was wondering why the timeout parameters mentioned in the official lighttpd documentation didn't work, without bothering to check the version I was dealing with. It turns out that it was lighttpd 1.4.53, which explains it.

https://wikitech.wikimedia.org/wiki/Help:Toolforge/Web/Lighttpd has server.max-connections = 300 (under "Default lighttpd configuration"). It sounds like a value below 300 is appropriate for listen-backlog, according to https://redmine.lighttpd.net/issues/2116 .

gstrauss commented Sep 1, 2021

server.max-connections is the maximum number of simultaneous connections that lighttpd will accept() and handle in parallel. Any additional connection attempts remain in the kernel's listen backlog queue(s) for lighttpd's listening sockets (e.g. on port 443), and the size of those queues can be configured with server.listen-backlog.

Separate from those two configuration options is the scgi.server = ( "listen-backlog" => 1024 ) option, which controls the listen backlog queue size for the listening sockets created for backend SCGI (or uWSGI) servers.
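
Side by side, the three knobs look like this in lighttpd.conf (illustrative values, not recommendations):

server.max-connections = 300  # connections lighttpd handles in parallel
server.listen-backlog  = 128  # kernel queue for lighttpd's own listening sockets
# and, separately, inside a scgi.server backend definition:
#   "listen-backlog" => 1024  # queue for the spawned backend's listening socket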

AntiCompositeNumber commented

Looking at the logs for ASNBlock, I am now getting 408 Request Timeout and 429 Too Many Requests errors from the whois-dev API. What rate limit should I use?

whym (Owner) commented Sep 16, 2021

@AntiCompositeNumber Currently no more than 60 requests per minute are allowed. The figure is somewhat arbitrary and can be relaxed depending on need, but I think there should be some limit.

AntiCompositeNumber added a commit to AntiCompositeNumber/AntiCompositeBot that referenced this issue Sep 18, 2021
Currently set to 1/sec, can tune if that still hits the ratelimit.
utils.Throttle is not concurrency-aware, but with WHOIS result caching
implemented only one of the processes is making the bulk of requests.

See whym/whois-gateway#21
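
For reference, a minimal client-side throttle along the lines that commit message describes (this is a sketch, not AntiCompositeBot's actual utils.Throttle):

import time

class Throttle:
    """Sleep as needed so calls happen at most once per `interval` seconds.
    Not concurrency-aware: each process/thread needs its own instance."""
    def __init__(self, interval=1.0):
        self.interval = interval
        self.last = 0.0
    def wait(self):
        now = time.monotonic()
        delay = self.interval - (now - self.last)
        if delay > 0:
            time.sleep(delay)
        self.last = time.monotonic()

throttle = Throttle(interval=1.0)  # 1 request/sec stays under the 60/min limit
# call throttle.wait() before each request to the whois-gateway JSON endpoint
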
whym (Owner) commented Sep 26, 2021

I tried some different values for listen-backlog (server.listen-backlog = 16, 128, 32), approximately for one week each, and I'm not sure it made any difference. The smaller values might result in less downtime, say twice a day instead of five times a day, but it varies from day to day and there is no clear-cut difference.

gstrauss commented Sep 26, 2021

What are you trying to rate limit? I don't think the configuration settings you're trying will help much.

Should the PHP app keep a database of requests and timestamps and limit the number of requests within a given minute, or else return 503 Service Unavailable with the response header Retry-After: 30 to ask clients to retry in 30 seconds?
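
For illustration, a minimal sketch of that second pattern in terms of the Flask version under discussion (the rolling in-memory counter and values are illustrative; the route matches the endpoint seen in the logs above):

import time
from collections import deque
from flask import Flask, jsonify

app = Flask(__name__)
_recent = deque()          # timestamps of recent requests
LIMIT, PERIOD = 60, 60.0   # 60 requests per minute

def over_limit():
    now = time.time()
    while _recent and now - _recent[0] > PERIOD:
        _recent.popleft()
    if len(_recent) >= LIMIT:
        return True
    _recent.append(now)
    return False

@app.route("/w/<ip>/lookup/json")
def lookup(ip):
    if over_limit():
        resp = jsonify(error="rate limit exceeded")
        resp.status_code = 503
        resp.headers["Retry-After"] = "30"  # ask clients to retry in 30 seconds
        return resp
    return jsonify(query=ip)  # placeholder for the real WHOIS lookup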

whym (Owner) commented Sep 27, 2021

Part of the reason must be the unusually slow queries mentioned in #21 (comment). The slowness doesn't happen with the raw whois command, but happens consistently with Python's ipwhois library, for both the CGI version and the Flask version. I added timeouts to the Flask version, and that seems to have made the service mostly downtime-free. However, this is ad hoc. Perhaps I can revisit RDAP for faster responses, but it had its own issues (#5).
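
(For reference, ipwhois exposes RDAP as a drop-in alternative to legacy port-43 WHOIS; a hedged one-liner, using an IP from the logs above:)

from ipwhois import IPWhois
result = IPWhois("197.98.140.170").lookup_rdap(depth=0)  # RDAP over HTTP instead of whois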
