Missing documentation for configuring TCP keepalives #45081

Closed
ywelsch opened this issue Aug 1, 2019 · 9 comments
Labels
:Distributed/Network Http and internode communication implementations >docs General docs changes Team:Distributed Meta label for distributed team Team:Docs Meta label for docs team

Comments

@ywelsch
Contributor

ywelsch commented Aug 1, 2019

The docs currently mention that

Elasticsearch opens a number of long-lived TCP connections between each pair of nodes in the cluster, and some of these connections may be idle for an extended period of time. Nonetheless, Elasticsearch requires these connections to remain open, and it can disrupt the operation of the cluster if any inter-node connections are closed by an external influence such as a firewall. It is important to configure your network to preserve long-lived idle connections between Elasticsearch nodes, for instance by leaving tcp_keep_alive enabled and ensuring that the keepalive interval is shorter than any timeout that might cause idle connections to be closed, or by setting transport.ping_schedule if keepalives cannot be configured.

The docs don't mention how to configure this for any of the supported platforms, and also do not mention concrete values for the system-level keepalive parameters. This is a notable gap, given that the docs state that configuring system-level keepalives is the preferred approach.
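
To make the "keepalive interval shorter than any timeout" condition concrete, here is a minimal Python sketch for Linux only. It reads the system-wide keepalive parameters from `/proc/sys/net/ipv4` and checks whether the first probe would be sent before an idle timeout. The function names and the 300-second firewall timeout are hypothetical and chosen for illustration; they do not come from the Elasticsearch docs.

```python
from pathlib import Path

# Linux exposes the system-wide keepalive defaults as files in this directory.
LINUX_PROC_DIR = Path("/proc/sys/net/ipv4")


def read_linux_keepalive(proc_dir=LINUX_PROC_DIR):
    """Return (idle_seconds, interval_seconds, probe_count) read from proc_dir."""
    names = ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes")
    return tuple(int((Path(proc_dir) / name).read_text().strip()) for name in names)


def first_probe_beats_timeout(idle_seconds, idle_timeout_seconds):
    """True if the first keepalive probe is sent before the idle timeout expires."""
    return idle_seconds < idle_timeout_seconds


if __name__ == "__main__":
    # Hypothetical value: replace with the idle timeout of your own firewall.
    assumed_firewall_timeout = 300
    idle, interval, probes = read_linux_keepalive()
    ok = first_probe_beats_timeout(idle, assumed_firewall_timeout)
    print(f"idle={idle}s interval={interval}s probes={probes} safe={ok}")
```

The check only compares the idle time before the first probe with the timeout. It says nothing about whether a particular firewall treats keepalive probes as activity, which would need to be verified separately.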

@ywelsch ywelsch added >docs General docs changes :Distributed/Network Http and internode communication implementations labels Aug 1, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-docs

@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@andrershov
Contributor

The docs also mention that

It is preferable to correctly configure TCP keep-alives instead of using this feature, because TCP keep-alives apply to all kinds of long-lived connections and not just to transport connections.

What other types of connections are meant here? TCP keep-alives are an OS-specific feature, so it would be better not to rely on them if possible and to recommend transport.ping_schedule instead.

@andrershov
Contributor

We discussed the issue at the team meeting. Although our custom pings could be better for transport connections (because they are platform-independent), there are other connection types for which we cannot implement custom pings. These include, for example, HTTP connections established to S3/GCS/Azure for snapshotting and HTTP connections established by watcher.
If there is a long-running HTTP request (for example, watcher notifies an external service that responds slowly) and the firewall configuration is extremely aggressive (closing all connections idle for 30 seconds), then without TCP keep-alives the connection will be closed. This is undesirable because, although most of our HTTP requests are retried, the remote endpoint could take longer than the firewall timeout to generate a response.
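
As a rough illustration of what TCP keep-alives mean for a single connection, the Python sketch below enables them on one socket and, where the platform exposes the options, tunes them to fire well inside a 30-second idle window. The helper name `enable_keepalive` and the values 20/5/3 are assumptions made for this example. This is not how Elasticsearch configures its own sockets, and it is not a substitute for the system-level settings discussed in this issue.

```python
import socket


def enable_keepalive(sock, idle_seconds=20, interval_seconds=5, probe_count=3):
    """Turn on TCP keepalive for sock and tune it where the platform allows."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning options below are platform-specific, so apply only those
    # that this Python build exposes (all three exist on Linux).
    options = (
        ("TCP_KEEPIDLE", idle_seconds),
        ("TCP_KEEPINTVL", interval_seconds),
        ("TCP_KEEPCNT", probe_count),
    )
    for name, value in options:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        enable_keepalive(s)
        # A non-zero value means keepalive is enabled on this socket.
        print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
```

With an idle time of 20 seconds, the first probe would be sent before a 30-second idle timeout expires, which is the relationship the quoted docs ask for.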
I believe the proper solution would be to tweak the firewall settings; however, this is not always under the user's control.
Regarding documenting OS-specific keep-alive settings, we think that this is not something we want to do, because this information can easily be found online and we strive not to copy and paste documentation from elsewhere.

@andrershov
Contributor

There is an issue about adding docs for firewall configuration, see #14848. We should probably document TCP keep-alive configuration along with firewall configuration.
For example: "Please configure the firewall properly; if that is not possible, configure TCP keep-alives instead."
With this in place, do we need custom pings at all?
Please note that custom pings are not used for failure detection today.

@wchrisdean
Contributor

[doc issue triage]

@ppf2
Member

ppf2 commented Feb 3, 2020

Repeatedly dropped connections (unreliable network) will severely impact Elasticsearch's operations if long-lived idle connections are not preserved between nodes.

As part of improving the documentation in this area, I would also like to see this important piece somewhere at the installation/setup level of our documentation, given the number of transport disconnect issues we have seen in the field caused by missing keepalives.

I propose that we cross-reference this topic under "Important System Configuration" (https://www.elastic.co/guide/en/elasticsearch/reference/current/system-config.html). The reason is that an admin is unlikely to end up here (https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-transport.html#_long_lived_idle_connections) unless they specifically read about the transport module.

@rjernst rjernst added Team:Distributed Meta label for distributed team Team:Docs Meta label for docs team labels May 4, 2020
@DaveCTurner
Contributor

This issue is mostly addressed by #59278 (and #60216) for Linux and macOS. I think we shouldn't go into depth about how to configure keepalives on Windows; the relevant docs are version-specific and have a habit of moving around so I'd prefer we leave it to the user to search them out.

@ppf2's most recent request will be addressed by #60268.

@DaveCTurner
Contributor

This issue is now resolved.
