Blocking connect on host down with kevent backend #442
Comments
Does this happen on Linux, or is this FreeBSD only?
Until now we don't run twemproxy on Linux in a production webserver environment, so it will take time to set up a Linux production machine. On Thursday, 10 December 2015, Manju Rajashekhar wrote:
After some deeper analysis, we discovered that the problem is related to the way twemproxy implements reconnection to a lost server: after "server_retry_timeout" milliseconds, twemproxy inserts the lost server back into the pool, considers it good, and redirects traffic to it, even though that server is not yet connected. On FreeBSD (similar behavior is expected on Linux as well, but not yet verified), the connect() can fail in two ways:
As long as the error count doesn't reach the error limit, this server stays in the pool and traffic is directed to it. In the second case (no MAC address), twemproxy needs a lot of time to reach the error limit, redirecting a lot of traffic to a server that doesn't exist (the outgoing traffic is placed in a queue, waiting for the MAC address from the ARP request, which is needed to build the Ethernet frame). Ideally, twemproxy should first connect to the server and, only if successful, put the connected server into the pool, but I see that it's not as easy as it looks.
Related to #608 - the heartbeat/failover patches planned for 0.6.0 (not merged yet) should greatly reduce the amount of time clients spend waiting for twemproxy to finish re-establishing connections to failing servers.
Overview
We encountered strange behavior when a memcached host in a ketama pool is taken down (by pulling the network cable).
Twemproxy returns timeouts directly after the server becomes unreachable. However, after the ARP cache is flushed (every 20 minutes on FreeBSD), twemproxy suddenly starts to return server errors, and the response time increases with every reconnection attempt.
Version and OS
Twemproxy 0.4.1
FreeBSD 10.1
Webserver load metrics
Twemproxy error metrics
Twemproxy config
24 memcached pools like this:
Example error log entry
Investigation
The error above is produced by nc_server.c:552 (the log entry is misleading), which is strange, as the socket is created as non-blocking (nc_server.c:516). For a non-blocking socket, a "Host is down" error should have been returned by select() or, in this case, kevent(), as in nc_kqueue.c:273 ff.
We actually see this happen before the ARP cache flush: when kevent() encounters a connection-related error, a timeout and a forward error are logged.
This is what we see in the first part of the twemproxy error graph.
The second part however is likely caused by connect() from nc_server.c as the corresponding error messages appear exactly when the server errors start appearing and the response times go up.
The conclusion we draw from this is that connect() is directly or indirectly blocking the twemproxy mainloop after the ARP cache is flushed:
With a blocking code path, timeouts are multiplied by the configured retry count, i.e. 3 × 300 ms before a server is removed. In contrast, an async code path can handle three or more events in parallel, i.e. there is only a total timeout of ~300 ms until the server is removed.
This explains why we see an increase in response times after the ARP flush.
The only problem here is that no timeout is ever set for connect(). So this call is probably not directly responsible for a blocking behavior.
We think that the failing connect() is indirectly responsible for triggering an early timeout on kevent() because of an empty queue (or something like that). In addition, connect() might actually return an error from a previous connection attempt that timed out (e.g. kevent() giving EINPROGRESS).
It would be great if someone with a little more knowledge on kevent and/or async sockets could have a look at this.