
[v22.3.x] net: Explicitly set and reduce TCP keepalive on the kafka API #11775

Conversation

vbotbuildovich (Collaborator) commented:

Backport of PR #11496

We have seen RP "leak" client connections in different scenarios ([1] [2]).

One of those cases is running in cloudv2 on AWS. The AWS load balancer
in use, which distributes bootstrap server connections across all
brokers, "drops" connections after 350s. This means that when the
client eventually disconnects, the LB no longer forwards the RST/FIN to
the RP brokers, so RP thinks those connections are still alive.
Nodes/VMs that simply crash lead to similar situations.

Redpanda currently has no application-level "connection reaper" that
closes inactive connections.

However, the issue above eventually gets resolved by TCP keepalive. We
already enable TCP keepalive but don't set any of its parameters
explicitly, which means we use the Linux defaults (or whatever is
configured).

The Linux default (and what cloudv2 uses) is a keepalive idle timeout
of 7200 seconds, so it takes a bit more than two hours for those
connections to get cleaned up.
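
For reference, these kernel-wide defaults live under
/proc/sys/net/ipv4/ and apply whenever a socket enables SO_KEEPALIVE
without overriding the per-socket parameters. A minimal Python sketch
(the helper name is made up) to inspect them:

```python
# Sketch: read the system-wide TCP keepalive defaults that apply when a socket
# enables SO_KEEPALIVE without overriding the per-socket parameters (Linux).
def read_kernel_keepalive_defaults():
    paths = {
        "idle_seconds": "/proc/sys/net/ipv4/tcp_keepalive_time",      # typically 7200
        "interval_seconds": "/proc/sys/net/ipv4/tcp_keepalive_intvl", # typically 75
        "probes": "/proc/sys/net/ipv4/tcp_keepalive_probes",          # typically 9
    }
    return {name: int(open(path).read()) for name, path in paths.items()}

print(read_kernel_keepalive_defaults())
```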

This PR makes all three TCP keepalive parameters configurable and
explicitly sets them on Kafka connections. As part of that we also
lower the values so that keepalive triggers much earlier (the
socket-level mapping is sketched after the list of defaults below).

The new defaults (in RP) are:
 - Idle timeout: 120s (vs 7200s Linux default)
 - Interval: 60s (vs 75s Linux default)
 - Probes: 3 (vs 9 Linux default)
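
Under the hood these three values map onto the per-socket options
TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT. Redpanda sets them from
its C++/Seastar server code; the Python sketch below (function name
invented) only illustrates the equivalent socket calls with the new
defaults:

```python
import socket

def apply_kafka_keepalive(sock: socket.socket,
                          idle_s: int = 120,
                          interval_s: int = 60,
                          probes: int = 3) -> None:
    """Enable TCP keepalive and override the three per-socket parameters (Linux)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of idleness before the first keepalive probe is sent.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    # Seconds between unanswered probes.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    # Number of unanswered probes before the kernel drops the connection.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
```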

As a result, on idle connections we send a TCP keepalive (just a TCP
packet without data) every 2 minutes. For a very large set of idle
connections, say 30k, this works out to about 250 packets per second,
which shouldn't be an issue.

On dead connections we send the first TCP keepalive after 2 minutes,
then 2 more probes at one-minute intervals, and finally close the
connection after a total of 5 minutes of idle time.
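
A quick sanity check of those numbers (plain arithmetic, not Redpanda
code):

```python
idle_s, interval_s, probes = 120, 60, 3

# Dead peer: first probe after idle_s, then `probes` unanswered probes spaced
# interval_s apart before the kernel resets the connection.
dead_after_s = idle_s + probes * interval_s           # 120 + 3 * 60 = 300 s = 5 min

# Healthy but idle fleet: one keepalive packet per connection every idle_s seconds.
idle_connections = 30_000
probe_packets_per_second = idle_connections / idle_s  # 30000 / 120 = 250 pkt/s

print(dead_after_s, probe_packets_per_second)
```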

Testing keepalive is slightly tricky as we need to convince the client
to stop responding to the keepalive packets. Given this is done
implicitly by the kernel, there is no easy switch to turn that off.

We use an iptables rule that drops all outgoing packets from the
client, which means no TCP keepalive responses reach RP and RP
subsequently RSTs the connection. To make sure we don't drop any other
packets, and in case we leak the rule for any reason, we create a
random group and use the iptables owner module to apply the rule to
that group only.
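
A rough sketch of that setup (the group name, helper name and exact
commands are illustrative, not the actual test code):

```python
import secrets
import subprocess

def block_client_egress() -> str:
    """Create a throwaway group and drop all outgoing packets from processes
    running under it, so keepalive probes from the broker go unanswered and
    the broker eventually RSTs the connection. Scoping the rule to the group
    keeps other traffic unaffected even if the rule is leaked."""
    group = f"keepalive-test-{secrets.token_hex(4)}"
    subprocess.run(["groupadd", group], check=True)
    subprocess.run(
        ["iptables", "-A", "OUTPUT",
         "-m", "owner", "--gid-owner", group,
         "-j", "DROP"],
        check=True,
    )
    # Run the Kafka client under this group afterwards,
    # e.g. `sg <group> -c '<client command>'`.
    return group
```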

[1] Issue redpanda-data/cloudv2#6713
[2] Issue redpanda-data/core-internal#411

(cherry picked from commit a90cb32)
@vbotbuildovich vbotbuildovich added this to the v22.3.x-next milestone Jun 29, 2023
@vbotbuildovich vbotbuildovich added the kind/backport PRs targeting a stable branch label Jun 29, 2023
@StephanDollberg (Member) commented:

/ci-repeat 1

@StephanDollberg StephanDollberg marked this pull request as ready for review July 4, 2023 18:01
@StephanDollberg StephanDollberg merged commit cceefd9 into redpanda-data:v22.3.x Jul 7, 2023
@BenPope BenPope modified the milestones: v22.3.x-next, v22.3.23 Aug 17, 2023