Missing documentation for configuring TCP keepalives #45081

ywelsch · 2019-08-01T10:32:27Z

The docs currently mention that

Elasticsearch opens a number of long-lived TCP connections between each pair of nodes in the cluster, and some of these connections may be idle for an extended period of time. Nonetheless, Elasticsearch requires these connections to remain open, and it can disrupt the operation of the cluster if any inter-node connections are closed by an external influence such as a firewall. It is important to configure your network to preserve long-lived idle connections between Elasticsearch nodes, for instance by leaving tcp_keep_alive enabled and ensuring that the keepalive interval is shorter than any timeout that might cause idle connections to be closed, or by setting transport.ping_schedule if keepalives cannot be configured.

The docs don't mention how to configure this for any of the supported platforms, and they also don't give concrete values for the system-level keepalive parameters. This matters in particular because the docs state that configuring system-level keepalives is the preferred way.
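To make the "keepalive interval shorter than any timeout" requirement concrete, here is a minimal Python sketch of the check. The class and function names are hypothetical, and the firewall timeout used below is an illustrative assumption rather than a recommended value. The only facts taken from outside this issue are the stock Linux defaults (`net.ipv4.tcp_keepalive_time=7200`, `net.ipv4.tcp_keepalive_intvl=75`, `net.ipv4.tcp_keepalive_probes=9`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeepaliveSettings:
    """System-level TCP keepalive parameters (hypothetical helper type)."""
    idle_seconds: int      # idle time before the first probe (Linux: net.ipv4.tcp_keepalive_time)
    interval_seconds: int  # gap between unanswered probes (Linux: net.ipv4.tcp_keepalive_intvl)
    probes: int            # unanswered probes before giving up (Linux: net.ipv4.tcp_keepalive_probes)


# Stock Linux defaults: the first probe is sent only after two hours of idleness.
LINUX_DEFAULTS = KeepaliveSettings(idle_seconds=7200, interval_seconds=75, probes=9)


def survives_idle_timeout(settings, firewall_idle_timeout_seconds):
    """True if the first keepalive probe goes out before the firewall drops the idle connection."""
    return settings.idle_seconds < firewall_idle_timeout_seconds


def seconds_to_detect_dead_peer(settings):
    """Approximate time for keepalives alone to declare an unresponsive peer dead."""
    return settings.idle_seconds + settings.interval_seconds * settings.probes
```

With an assumed firewall idle timeout of one hour, `survives_idle_timeout(LINUX_DEFAULTS, 3600)` is `False`, whereas a hypothetical tuned configuration such as `KeepaliveSettings(300, 60, 6)` passes the check.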

elasticmachine · 2019-08-01T10:32:28Z

Pinging @elastic/es-docs

elasticmachine · 2019-08-01T10:32:30Z

Pinging @elastic/es-distributed

andrershov · 2019-08-15T13:15:31Z

The docs also mention that

It is preferable to correctly configure TCP keep-alives instead of using this feature, because TCP keep-alives apply to all kinds of long-lived connections and not just to transport connections.

What other types of connections are meant here? TCP keep-alives are an OS-specific feature, so it would be better not to rely on them if possible and to recommend using transport.ping_schedule instead.

andrershov · 2019-08-30T13:59:39Z

We discussed the issue at the team meeting. Although our custom pings could be better for transport connections (because they are platform-independent), there are other connection types for which we cannot implement custom pings. These include, for example, HTTP connections established to S3/GCS/Azure for snapshotting and HTTP connections established by Watcher.
If there is a long-running HTTP request (for example, Watcher notifies an external service that responds slowly) and the firewall configuration is extremely aggressive (closing all connections idle for 30 seconds), then without TCP keep-alives the connection will be closed. This is undesirable: although most of our HTTP requests are retried, generating the response at the remote endpoint could take longer than the firewall timeout, so a retry would hit the same timeout.
I believe the proper solution would be to tweak the firewall settings; however, this is not always under the user's control.
As for documenting OS-specific keep-alive settings, we think this is not something we want to do, because the information can easily be found online and we strive not to copy and paste documentation from elsewhere.
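As an illustration of the per-connection side of this (what a client has to do so that a slow HTTP response is not cut off by an idle timeout), here is a minimal Python sketch that enables keepalives on a single socket. This is not how Elasticsearch configures its connections; `enable_keepalive` and its default timings are hypothetical, chosen only so that the first probe would go out well after nothing but before an aggressive timeout. The per-socket timing options are platform-specific, so the sketch applies only those that the running platform exposes and otherwise falls back to the system-wide values.

```python
import socket


def enable_keepalive(sock, idle_seconds=20, interval_seconds=5, probes=3):
    """Enable TCP keepalive on one socket and tune its timings where supported."""
    # SO_KEEPALIVE is the portable switch; without it no probes are ever sent.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket timing overrides are not available everywhere, so look each
    # option up by name and skip the ones this platform does not define.
    overrides = (
        ("TCP_KEEPIDLE", idle_seconds),       # idle time before the first probe
        ("TCP_KEEPINTVL", interval_seconds),  # gap between unanswered probes
        ("TCP_KEEPCNT", probes),              # unanswered probes before giving up
    )
    for name, value in overrides:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

A caller would apply this to a socket before connecting, for example `enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))`. Where the per-socket options are missing, only the system-level settings determine when probes are sent, which is why the OS configuration still matters.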

andrershov · 2019-08-30T14:46:55Z

There is an issue about adding docs for firewall configuration, see #14848. We should probably document TCP keep-alive configuration along with firewall configuration.
For example: "Please configure the firewall properly; if that is not possible, configure TCP keep-alives instead."
With this in place, do we need custom pings at all?
Note that custom pings are not used for failure detection today.

wchrisdean · 2019-10-24T14:29:06Z

[doc issue triage]

ppf2 · 2020-02-03T18:47:32Z

Repeatedly dropped connections (unreliable network) will severely impact Elasticsearch's operations if long-lived idle connections are not preserved between nodes.

As part of improving the documentation in this area, I would also like to see this important piece somewhere at the installation/setup level of our documentation, given the number of transport disconnect issues we have seen in the field due to keepalives not being configured.

I propose that we cross-reference this topic under "Important System Configuration" (https://www.elastic.co/guide/en/elasticsearch/reference/current/system-config.html). The reason is that an admin is unlikely to end up here (https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-transport.html#_long_lived_idle_connections) unless they specifically read about the transport module.

DaveCTurner · 2020-07-28T09:01:59Z

This issue is mostly addressed by #59278 (and #60216) for Linux and macOS. I think we shouldn't go into depth about how to configure keepalives on Windows; the relevant docs are version-specific and have a habit of moving around so I'd prefer we leave it to the user to search them out.

@ppf2's most recent request will be addressed by #60268.

DaveCTurner · 2020-07-30T09:58:12Z

This issue is now resolved.

ywelsch added the >docs (General docs changes) and :Distributed/Network (Http and internode communication implementations) labels Aug 1, 2019

andrershov added the team-discuss label Aug 15, 2019

andrershov removed the team-discuss label Aug 26, 2019

andrershov mentioned this issue Aug 26, 2019

connection closing/timeout issue maybe related to client node? #21326

Closed

andrershov mentioned this issue Aug 30, 2019

[DOCS] Document firewall/network configuration requirements #14848

Closed

rjernst added the Team:Distributed (Meta label for distributed team) and Team:Docs (Meta label for docs team) labels May 4, 2020

DaveCTurner closed this as completed Jul 30, 2020
