Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

io: add connection backoff #3191

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

kabakaev
Copy link
Contributor

@kabakaev kabakaev commented Mar 9, 2021

In normal operation, fluent-bit reuses TCP connections, hence new messages are flushed without sending a TCP SYN.
But if an output connection cannot be established, then each flb_io_net_write() call will trigger connection setup and will send a series of TCP SYN packets (one per thread?).

The actual issue is described in #3103.

We observed this issue when hundreds of fluent-bit agents tried to send logs via forward to a set of receiving fluent-bits, which were all down due to config error. The receiving FLB was hosted behind an openstack load balancer, a Linux stateful firewall and a traefik ingress controller.

Apart from high load, the flood of SYN packets may exhaust the connection tracking table, impacting the whole network infrastructure.

Fixes #3103.

This PR is inspired by GRPC backoff implementation.

Backoff is disabled by default.

If enabled, backoff will limit the number of TCP SYN packets during an output destination outage (raw data):
image

This chart shows rate of TCP SYN packets. The data is collected by tcpdump as described in How to test section below.

Testing

Example of backoff configuration is given below.

Valgrind output is uploaded to my gist.

Documentation

Documentation for this feature is submitted as docs PR491.

How to test

Compile this version:

cd build
cmake -DFLB_RELEASE=On \
          -DFLB_TRACE=On \
          -DFLB_JEMALLOC=On \
          -DFLB_TLS=On \
          -DFLB_SHARED_LIB=Off \
          -DFLB_EXAMPLES=Off \
          -DFLB_HTTP_SERVER=On \
          -DFLB_IN_SYSTEMD=On ../ \
&& make -j$(getconf _NPROCESSORS_ONLN)

Collect SYN packets without backoff

Simulate connection timeout and run tcpdump on a separate console:

iptables -I INPUT -p tcp --dport 24224 -j DROP
timeout 6m tcpdump -lnn -i any dst host 127.0.0.1 and dst port 24224 and tcp[tcpflags] == tcp-syn | tee run5m_backoff0.tcpdump
# 1802 packets captured
iptables -D INPUT -p tcp --dport 24224 -j DROP

and start fluent-bit without backoff settings:

timeout -s SIGKILL 5m \
  bin/fluent-bit -vv \
    -i dummy -p 'rate=1000000' \
    -o forward://127.0.0.1:24224 -p 'retry_limit=1' \
  2>&1 | tee run5m_backoff0.log
# Fluent Bit v1.8.0
# ...
# [2021/03/08 18:27:32] [ warn] [engine] failed to flush chunk '869680-1615224452.505695198.flb', retry in 6 seconds: task_id=189, input=dummy.0 > output=forward.0 (out_id=0)
# Killed

Collect SYN packets with initial backoff of 1 second

Simulate connection timeout and run tcpdump on a separate console:

iptables -I INPUT -p tcp --dport 24224 -j DROP
timeout 6m tcpdump -lnn -i any dst host 127.0.0.1 and dst port 24224 and tcp[tcpflags] == tcp-syn | tee run5m_backoff1.tcpdump
# 360 packets captured
iptables -D INPUT -p tcp --dport 24224 -j DROP

and start fluent-bit with backoff settings:

timeout -s SIGKILL 5m \
  bin/fluent-bit -vv \
    -i dummy -p 'rate=1000000' \
    -o forward://127.0.0.1:24224 -p 'retry_limit=1' -p 'net.backoff_init=1' -p 'net.backoff_max=60' \
  2>&1 | tee run5m_backoff1.log
# Fluent Bit v1.8.0
# ...
# [2021/03/08 18:27:31] [debug] [upstream] skipping connection to 127.0.0.1:24224 because of connection backoff for another 28 seconds
# [2021/03/08 18:27:32] [ warn] [engine] failed to flush chunk '869680-1615224452.505695198.flb', retry in 6 seconds: task_id=189, input=dummy.0 > output=forward.0 (out_id=0)
# Killed

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Alexander Kabakaev [email protected], Daimler TSS GmbH, imprint

include/fluent-bit/flb_io.h Outdated Show resolved Hide resolved
src/flb_io.c Outdated Show resolved Hide resolved
src/flb_io.c Outdated Show resolved Hide resolved
src/flb_upstream.c Outdated Show resolved Hide resolved
@kabakaev kabakaev force-pushed the connection-backoff branch 2 times, most recently from 29b0523 to f449406 Compare March 15, 2021 08:48
@kabakaev kabakaev changed the title [io] add connection backoff io: add connection backoff Mar 15, 2021
src/flb_io.c Outdated
@@ -63,9 +63,56 @@
#include <fluent-bit/flb_coro.h>
#include <fluent-bit/flb_http_client.h>

/* Increase backoff time of an upstream */
void flb_io_backoff_upstream(struct flb_upstream *u)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe declare static instead of adding to header?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Same with flb_io_in_backoff)

@@ -342,6 +393,10 @@ static FLB_INLINE ssize_t net_io_read_async(struct flb_coro *co,
int flb_io_net_write(struct flb_upstream_conn *u_conn, const void *data,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also handle backoff in flb_io_net_read?

@kabakaev kabakaev force-pushed the connection-backoff branch from f449406 to 4358cea Compare March 16, 2021 14:14
Copy link
Member

@edsiper edsiper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for this contribution!, minor changes are requested.

@@ -65,6 +65,18 @@ struct flb_config_map upstream_net[] = {
"before it is retired."
},

{
FLB_CONFIG_MAP_TIME, "net.initial_backoff", "0s",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • can you rename the property to something like net.backoff_init ?

},

{
FLB_CONFIG_MAP_TIME, "net.max_backoff", "0s",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as before: net.backoff_max

@@ -83,6 +83,10 @@ struct flb_upstream {
#endif

struct mk_list _head;

/* Backoff state. */
time_t next_attempt_time;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please prefix the variable with backoff_...


/* Backoff state. */
time_t next_attempt_time;
int last_backoff_seconds;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do a prefix: backoff...

@edsiper edsiper self-assigned this Mar 20, 2021
@edsiper edsiper added the waiting-for-user Waiting for more information, tests or requested changes label Mar 20, 2021
@kabakaev kabakaev force-pushed the connection-backoff branch from 4358cea to 4a5e951 Compare March 22, 2021 09:44
Signed-off-by: Alexander Kabakaev <[email protected]>
@kabakaev kabakaev force-pushed the connection-backoff branch from 4a5e951 to b58cfc0 Compare March 22, 2021 16:14
@kabakaev
Copy link
Contributor Author

kabakaev commented Apr 7, 2021

minor changes are requested.

@edsiper, thanks for quick review! The suggested changes are implemented. PTAL

@edsiper
Copy link
Member

edsiper commented Dec 12, 2021

@kabakaev can you pls fix the conflicts so we can do final review/merge ?

@lecaros
Copy link
Contributor

lecaros commented Feb 11, 2022

Hi @kabakaev, can you please review the requested changes?
Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TCP SYN flood circuit breaker
4 participants