Blocking connect on host down with kevent backend #442
Comments
Does this happen on Linux, or is this FreeBSD only?
Until now we don't run twemproxy on Linux in a production webserver environment, so it will take time to set up a Linux production machine. On Thursday, 10 December 2015, Manju Rajashekhar wrote:
After some deeper analysis, we discovered that the problem is related to the way twemproxy implements reconnection to a lost server: after "server_retry_timeout" milliseconds, twemproxy inserts the lost server back into the pool, considers it good, and redirects traffic to it, even though that server is not yet connected. On FreeBSD (similar behavior is expected on Linux as well, but not yet verified), the connect() can fail in two ways:
As long as the error count doesn't reach the error limit, this server stays in the pool and traffic is directed to it. In the second case (no MAC address), twemproxy needs a lot of time to reach the error limit, redirecting a lot of traffic to a server that doesn't exist (the outgoing traffic is placed in a queue, waiting for the MAC address from the ARP request, which is needed to build the Ethernet frame). Ideally, twemproxy should first connect to the server and, only if successful, put the connected server into the pool, but I see that it's not as easy as it looks.
Related to #608 - the heartbeat/failover patches planned for 0.6.0 (not merged yet) should greatly reduce the amount of time clients spend waiting for twemproxy to finish re-establishing connections to failing servers.
Overview
We encountered strange behavior when a memcached host in a ketama pool is taken down (by pulling the network cable).
Twemproxy returns timeouts directly after the server becomes unreachable. However, after the ARP cache is flushed (every 20 minutes on FreeBSD), twemproxy suddenly starts to return server errors, and the response time increases with every reconnection attempt.
Version and OS
Twemproxy 0.4.1
FreeBSD 10.1
Webserver load metrics
Twemproxy error metrics
Twemproxy config
24 memcached pools like this:
Example error log entry
Investigation
The error above is produced by nc_server.c:552 (the log entry is misleading), which is strange, as the socket is created as non-blocking (nc_server.c:516). For a non-blocking socket, a "Host is down" error should have been returned by select() or, in this case, kevent(), as in nc_kqueue.c:273 ff.
We actually see this happen before the ARP cache flush: when kevent() encounters a connection-related error, a timeout and a forward error are logged.
This is what we see in the first part of the twemproxy error graph.
The second part however is likely caused by connect() from nc_server.c as the corresponding error messages appear exactly when the server errors start appearing and the response times go up.
The conclusion we draw from this is that connect() is directly or indirectly blocking the twemproxy mainloop after the ARP cache is flushed:
With a blocking code path, timeouts are multiplied by the configured retry count, i.e. 3 × 300 ms before a server is removed. In contrast, an async code path can handle three or more events in parallel, i.e. there is only a total timeout of ~300 ms until the server is removed.
This explains why we see an increase in response times after the ARP flush.
The only problem here is that no timeout is ever set for connect(). So this call is probably not directly responsible for a blocking behavior.
We think that the failing connect() is indirectly responsible for triggering an early timeout on kevent() because of an empty queue (or something like that). In addition, connect() might actually return an error from a previous connection attempt that timed out (e.g. kevent() giving EINPROGRESS).
It would be great if someone with a little more knowledge on kevent and/or async sockets could have a look at this.