
[fix][io] Update Elasticsearch sink idle cnx timeout to 30s #19377

Merged
@@ -214,10 +214,10 @@ public class ElasticSearchConfig implements Serializable {

     @FieldDoc(
         required = false,
-        defaultValue = "5",
-        help = "Idle connection timeout to prevent a read timeout."
+        defaultValue = "30000",
+        help = "Idle connection timeout to prevent a connection timeout due to inactivity."
     )
-    private int connectionIdleTimeoutInMs = 5;
+    private int connectionIdleTimeoutInMs = 30000;
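For context, a minimal sketch of how the new default would surface to a connector user. The Map-based `ElasticSearchConfig.load` helper and the `elasticSearchUrl`/`indexName` keys are assumed from the connector's usual config handling and are not part of this diff; the exact loader signature may differ between versions.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.pulsar.io.elasticsearch.ElasticSearchConfig;

public class IdleTimeoutDefaultSketch {
    public static void main(String[] args) throws Exception {
        Map<String, Object> props = new HashMap<>();
        props.put("elasticSearchUrl", "http://localhost:9200");
        props.put("indexName", "my_index");
        // connectionIdleTimeoutInMs is not set here, so the new 30000 ms default applies;
        // props.put("connectionIdleTimeoutInMs", 60000) would override it.
        ElasticSearchConfig config = ElasticSearchConfig.load(props); // assumed Map-based loader
        System.out.println(config.getConnectionIdleTimeoutInMs());    // expected: 30000
    }
}
```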
Contributor:

I wonder if this config should be validated against bulkFlushIntervalInMs when the bulk API is enabled, something like connectionIdleTimeoutInMs > 2 * bulkFlushIntervalInMs, because it seems the connection will sit idle by design in between flushes.
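A rough sketch of the check being suggested, purely illustrative and not part of this PR; the helper name and parameter list are invented here, while the config names match the ones mentioned in the thread.

```java
// Hypothetical guard illustrating the reviewer's suggestion; not part of this PR.
static void validateIdleTimeout(boolean bulkEnabled,
                                long bulkFlushIntervalInMs,
                                int connectionIdleTimeoutInMs) {
    if (bulkEnabled && connectionIdleTimeoutInMs <= 2 * bulkFlushIntervalInMs) {
        throw new IllegalArgumentException(
                "connectionIdleTimeoutInMs should be greater than 2 * bulkFlushIntervalInMs, "
                        + "otherwise the evictor may close the connection between bulk flushes");
    }
}
```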

Member Author:

I think it'd be fine to let the connection get closed in that scenario. My main goal here is to make sure we have working defaults.

Contributor (@aymkhalil, Jan 31, 2023):

Yeah, I understand. Both comments are probably outside the scope of this PR: the first step is to have working defaults, and later maybe make them foolproof.


     @FieldDoc(
         required = false,
@@ -83,7 +83,6 @@ public RestClient(ElasticSearchConfig elasticSearchConfig, BulkProcessor.Listener
         // idle+expired connection evictor thread
         this.executorService = Executors.newSingleThreadScheduledExecutor();
         this.executorService.scheduleAtFixedRate(() -> {
-            configCallback.connectionManager.closeExpiredConnections();
             configCallback.connectionManager.closeIdleConnections(
Contributor:

Q: Is it required at all to evict idle connections? I wonder what's wrong with a long-lived connection whose life cycle is coupled with that of the sink instance. If it is not required, we could drop connectionIdleTimeoutInMs for good, but I may be missing something.

Member Author:

Good question. The motivation of this PR assumes that closing these connections is necessary; however, I am not sure that it is. The fundamental risk is that something between the client and the server closes the connection. In my mind, the canonical example is a network load balancer with a 4- or 5-minute timeout.

Closing expired and idle connections is one solution to prevent such errors due to inactivity.

While troubleshooting the underlying behavior this PR aims to fix, I came across elastic/elasticsearch#65213, which indicates that an alternative solution is to enable socket keepalives and decrease net.ipv4.tcp_keepalive_time so that those keepalives are sent before any intermediate server closes the connection due to inactivity. Since that solution requires modifying OS settings, I think the approach in this PR (closing idle connections) might be easier to maintain, even though it'll be less efficient.
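For reference, a minimal sketch of the keep-alive alternative mentioned above, using the Elasticsearch low-level RestClient and the Apache HttpAsyncClient configuration it exposes. This is not the connector's code, just an illustration of where SO_KEEPALIVE would be switched on.

```java
import org.apache.http.HttpHost;
import org.apache.http.impl.nio.reactor.IOReactorConfig;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;

public class KeepAliveSketch {
    public static void main(String[] args) throws Exception {
        // SO_KEEPALIVE makes the OS send TCP keepalive probes on otherwise idle connections.
        // When the first probe goes out is still governed by the OS
        // (net.ipv4.tcp_keepalive_time on Linux), which is why the discussion
        // above mentions tuning that sysctl.
        IOReactorConfig ioReactorConfig = IOReactorConfig.custom()
                .setSoKeepAlive(true)
                .build();

        RestClientBuilder builder = RestClient
                .builder(new HttpHost("localhost", 9200, "http"))
                .setHttpClientConfigCallback(httpClientBuilder ->
                        httpClientBuilder.setDefaultIOReactorConfig(ioReactorConfig));

        try (RestClient lowLevelClient = builder.build()) {
            // the sink's higher-level client would wrap lowLevelClient here
        }
    }
}
```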

Member Author:

After re-reading that Elasticsearch issue, it could be reasonable to move in the direction of enabling TCP keep-alives. At the very least, I think we should merge this and fix the existing default values.

Member (@lhotari):

> Since that solution requires modifying OS settings

@michaeljmarshall On managed cloud k8s environments, the OS settings are already properly tuned. Related comment: #14841 (comment)

Member Author:

@lhotari - that was not the case in the AKS cluster that I was testing with as of yesterday. When I tried to override the settings using https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/#setting-sysctls-for-a-pod, I got an error because overriding net.ipv4.tcp_keepalive_time = 300 is considered "unsafe" by default.

                     config.getConnectionIdleTimeoutInMs(), TimeUnit.MILLISECONDS);
         },
@@ -92,7 +92,7 @@ public final void defaultValueTest() throws IOException {
         assertEquals(config.isCompressionEnabled(), false);
         assertEquals(config.getConnectTimeoutInMs(), 5000L);
         assertEquals(config.getConnectionRequestTimeoutInMs(), 1000L);
-        assertEquals(config.getConnectionIdleTimeoutInMs(), 5L);
+        assertEquals(config.getConnectionIdleTimeoutInMs(), 30000L);
         assertEquals(config.getSocketTimeoutInMs(), 60000);

         assertEquals(config.isStripNulls(), true);