Override keepAlive time to be lower than NLB idle time (350s) #130

Merged 9 commits on Nov 30, 2023

@patduin commented Nov 28, 2023

See, for instance, these blog posts: https://paramount.tech/blog/2021/07/26/mitigation-of-connection-reset-in-aws.html and https://medium.com/tenable-techblog/lessons-from-aws-nlb-timeouts-5028a8f65dda
We've seen fewer 10-minute connection timeouts reported by Waggle Dance since we applied these lower TCP keepalive settings: from 30-40 timed-out calls an hour down to single digits. Occurrences still happen; we suspect that is because the server side (HMS) should also set similar TCP keepalive settings.
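
For readers less familiar with TCP keepalive, here is a minimal Python sketch of the same idea applied to a single socket rather than through the container-wide sysctl. It is only an illustration: Waggle Dance and HMS are Java services, `enable_keepalive` and `NLB_IDLE_TIMEOUT_S` are hypothetical names, and the `TCP_KEEP*` socket options are Linux-specific, so they are guarded. The 200 s default mirrors the value in this PR.

```python
import socket

NLB_IDLE_TIMEOUT_S = 350  # NLB idle time quoted in the PR title


def enable_keepalive(sock, idle_s=200, interval_s=None, probes=None):
    """Enable TCP keepalive on one socket, probing before the NLB idle timeout."""
    if not 0 < idle_s < NLB_IDLE_TIMEOUT_S:
        raise ValueError(f"idle_s must be between 1 and {NLB_IDLE_TIMEOUT_S - 1}")
    # Kernel keepalive timers are only used by sockets that opt in.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket overrides below are Linux-specific, so guard each one.
    overrides = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before the connection is dropped
    )
    for name, value in overrides:
        option = getattr(socket, name, None)
        if option is not None and value is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

Calling `enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))` leaves the interval and probe count at the kernel defaults and only lowers the idle time. Note that the sysctl in this PR changes the kernel default timers; a socket still has to have `SO_KEEPALIVE` enabled for any probe to be sent.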

@patduin marked this pull request as ready for review November 30, 2023 09:49
"systemControls": [
  {
    "namespace": "net.ipv4.tcp_keepalive_time",
    "value": "200"
Contributor

It would be best to inject variables here so anyone can override the values to suit their infrastructure.
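
To make the suggestion concrete, here is a rough Python sketch of rendering the `systemControls` entries from overridable parameters instead of hard-coding `"200"`. The function name and its parameters are hypothetical and not part of this module; in the module itself this would presumably be done with Terraform variables. Only `net.ipv4.tcp_keepalive_time` and its value of 200 appear in the diff; the other two sysctl names are taken from the Kubernetes error quoted later in this thread.

```python
import json


def system_controls(tcp_keepalive_time=200, tcp_keepalive_intvl=None, tcp_keepalive_probes=None):
    """Build a systemControls list, skipping any sysctl left unset."""
    requested = {
        "net.ipv4.tcp_keepalive_time": tcp_keepalive_time,
        "net.ipv4.tcp_keepalive_intvl": tcp_keepalive_intvl,
        "net.ipv4.tcp_keepalive_probes": tcp_keepalive_probes,
    }
    # The diff stores the value as a string, so numbers are stringified here.
    return [
        {"namespace": name, "value": str(value)}
        for name, value in requested.items()
        if value is not None
    ]


print(json.dumps({"systemControls": system_controls()}, indent=2))
```

With no arguments this emits a single entry for `net.ipv4.tcp_keepalive_time`; unset parameters are omitted so the kernel defaults stay in effect for them.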

Contributor Author

You cannot do this easily in Kubernetes; setting these flags is not allowed.
You get errors like:

PodSecurityPolicy: unable to admit pod: [pod.spec.securityContext.sysctls[0]: Forbidden: unsafe sysctl "net.ipv4.tcp_keepalive_time" is not allowed
pod.spec.securityContext.sysctls[1]: Forbidden: unsafe sysctl "net.ipv4.tcp_keepalive_intvl" is not allowed
pod.spec.securityContext.sysctls[2]: Forbidden: unsafe sysctl "net.ipv4.tcp_keepalive_probes" is not allowed]

So we need to sort this out separately.
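
For context, the rejected field in the error above is `pod.spec.securityContext.sysctls`. The sketch below builds that fragment in Python purely to show its shape; `pod_sysctls` is a hypothetical helper, not something in this repository. As far as I know, Kubernetes only admits sysctls it classifies as unsafe when the cluster explicitly allows them (for example through the kubelet `--allowed-unsafe-sysctls` flag and, where PodSecurityPolicy is in use, its `allowedUnsafeSysctls` field), which is why this has to be handled outside this PR.

```python
import json


def pod_sysctls(settings):
    """Return the pod.spec.securityContext fragment for the given sysctls."""
    return {
        "securityContext": {
            "sysctls": [
                {"name": name, "value": str(value)}
                for name, value in settings.items()
            ]
        }
    }


# Only the 200 s keepalive time comes from this PR; any other sysctl
# values would be a choice for whoever operates the cluster.
fragment = pod_sysctls({"net.ipv4.tcp_keepalive_time": 200})
print(json.dumps(fragment, indent=2))
```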

@@ -359,3 +359,21 @@ variable "datadog_metrics_enabled" {
  type    = bool
  default = false
}

variable "tcp_keepalive_time" {
Member

The new variables need to be added to the README.

Contributor Author

Yup, added now.

@patduin merged commit da1353f into master Nov 30, 2023
@patduin deleted the fix/tcp_keep_alive branch November 30, 2023 14:45
@patduin restored the fix/tcp_keep_alive branch November 30, 2023 18:07
3 participants