Override keepAlive time to be lower than NLB idle time (350s) #130
templates/waggledance.json
Outdated
"systemControls": [
  {
    "namespace": "net.ipv4.tcp_keepalive_time",
    "value": "200"
It would be best to inject variables here so that anyone can override these values to suit their own infrastructure.
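To illustrate the suggestion, here is a minimal Python sketch of a parameterised `systemControls` fragment. The repository itself uses Terraform templates, so this is only an illustration of the idea: the function name `system_controls` and the range check against the 350s NLB idle timeout are assumptions, not code from this PR.

```python
import json

NLB_IDLE_TIMEOUT_S = 350  # NLB idle timeout quoted in the PR title


def system_controls(tcp_keepalive_time=200):
    """Build the systemControls fragment for a keepalive time in seconds.

    Hypothetical helper: the default of 200 matches the value in the diff,
    and the range check is an assumption added for illustration.
    """
    if not 0 < tcp_keepalive_time < NLB_IDLE_TIMEOUT_S:
        raise ValueError(
            f"tcp_keepalive_time must be between 1 and {NLB_IDLE_TIMEOUT_S - 1} seconds"
        )
    return [
        {
            "namespace": "net.ipv4.tcp_keepalive_time",
            "value": str(tcp_keepalive_time),  # the template stores the value as a string
        }
    ]


if __name__ == "__main__":
    # Override the default for a hypothetical environment.
    print(json.dumps({"systemControls": system_controls(120)}, indent=2))
```

Rejecting values at or above the idle timeout keeps an override from silently undoing the purpose of the change.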
Will there be a separate PR for K8s (https://github.com/ExpediaGroup/apiary-federation/blob/master/k8s.tf)?
You cannot do this easily in Kubernetes: setting these flags is not allowed.
You get errors like:
PodSecurityPolicy: unable to admit pod: [pod.spec.securityContext.sysctls[0]: Forbidden: unsafe sysctl "net.ipv4.tcp_keepalive_time" is not allowed
pod.spec.securityContext.sysctls[1]: Forbidden: unsafe sysctl "net.ipv4.tcp_keepalive_intvl" is not allowed
pod.spec.securityContext.sysctls[2]: Forbidden: unsafe sysctl "net.ipv4.tcp_keepalive_probes" is not allowed]
So we need to sort this out separately.
@@ -359,3 +359,21 @@ variable "datadog_metrics_enabled" {
  type = bool
  default = false
}

variable "tcp_keepalive_time" {
The new variables need to be added to the README.
Yup, added now.
See, for instance, these blog posts: https://paramount.tech/blog/2021/07/26/mitigation-of-connection-reset-in-aws.html and https://medium.com/tenable-techblog/lessons-from-aws-nlb-timeouts-5028a8f65dda
We've seen fewer 10-minute connection timeouts reported by Waggle Dance since we set these lower TCP keepalive settings, going from 30-40 timed-out calls an hour to single digits. Occurrences still happen, we suspect because the server side (HMS) should also set similar TCP keepalive settings.
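For readers unfamiliar with what these settings do, here is a rough Python sketch of the same idea applied per socket rather than through sysctls. It is a generic illustration, not how Waggle Dance or HMS configure their connections; the helper name `enable_keepalive` and the interval and probe-count defaults are assumptions. Only the 200s idle time and the 350s NLB idle timeout come from this PR.

```python
import socket

NLB_IDLE_TIMEOUT_S = 350  # NLB idle timeout quoted in the PR title


def enable_keepalive(sock, idle_s=200, interval_s=30, probes=5):
    """Enable TCP keepalive so the first probe is sent after idle_s seconds.

    idle_s mirrors net.ipv4.tcp_keepalive_time; interval_s and probes are
    illustrative values, not taken from the PR.
    """
    if idle_s >= NLB_IDLE_TIMEOUT_S:
        raise ValueError("keepalive idle time must be below the NLB idle timeout")
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning options are platform-specific (available on Linux),
    # so each one is applied only if this Python build exposes it.
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),
        ("TCP_KEEPINTVL", interval_s),
        ("TCP_KEEPCNT", probes),
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        enable_keepalive(s)
        enabled = s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
        print("keepalive enabled:", enabled)
```

Because the probes are sent before the load balancer's idle timer expires, an otherwise quiet connection keeps registering traffic and is not dropped as idle.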