Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Synchronous Keepalive Networking Bug [All versions] #6821

Closed
PettitWesley opened this issue Feb 10, 2023 · 4 comments · Fixed by #6845
Closed

Synchronous Keepalive Networking Bug [All versions] #6821

PettitWesley opened this issue Feb 10, 2023 · 4 comments · Fixed by #6845
Assignees
Labels

Comments

@PettitWesley
Copy link
Contributor

PettitWesley commented Feb 10, 2023

Synchronous Keepalive Networking Bug

Credit

I want to be clear that I only wrote this explainer, @matthewfala discovered this issue.

Impact

Fluent Bit can crash when net.keepalive is enabled (the default) and the synchronous networking stack is used. Currently, the Amazon S3 plugin is the main output plugin that uses sync networking.

The stack trace in below was obtained from a customer case and shows what can occur when this bug surfaces.

(gdb) bt
#0  0x00007fd0bba0bca0 in raise () from /lib64/libc.so.6
#1  0x00007fd0bba0d148 in abort () from /lib64/libc.so.6
#2  0x000000000045599e in flb_signal_handler (signal=11) at /tmp/fluent-bit-1.9.10/src/fluent-bit.c:581
#3  <signal handler called>
#4  0x00000000004fd80e in __mk_list_del (prev=0x0, next=0x0) at /tmp/fluent-bit-1.9.10/lib/monkey/include/monkey/mk_core/mk_list.h:87
#5  0x00000000004fd846 in mk_list_del (entry=0x7fd0b4a42a60) at /tmp/fluent-bit-1.9.10/lib/monkey/include/monkey/mk_core/mk_list.h:93
#6  0x00000000004fe703 in prepare_destroy_conn (u_conn=0x7fd0b4a429c0) at /tmp/fluent-bit-1.9.10/src/flb_upstream.c:443
#7  0x00000000004fe786 in prepare_destroy_conn_safe (u_conn=0x7fd0b4a429c0) at /tmp/fluent-bit-1.9.10/src/flb_upstream.c:469
#8  0x00000000004ff04b in cb_upstream_conn_ka_dropped (data=0x7fd0b4a429c0) at /tmp/fluent-bit-1.9.10/src/flb_upstream.c:724
#9  0x00000000004e7cf5 in output_thread (data=0x7fd0b612e100) at /tmp/fluent-bit-1.9.10/src/flb_output_thread.c:298
#10 0x0000000000500712 in step_callback (data=0x7fd0b60f4ac0) at /tmp/fluent-bit-1.9.10/src/flb_worker.c:43
#11 0x00007fd0bd6cc44b in start_thread () from /lib64/libpthread.so.0
#12 0x00007fd0bbac752f in clone () from /lib64/libc.so.6

Root Cause

Relevant code in 2.0.9:

Relevant code in 1.9.10:

When net.keepalive is enabled (the default), Fluent Bit will try to keep connections open so that they can be re-used. If the connection is closed for any reason (by remote server, or some networking issue), then Fluent Bit can no longer re-use the connection. Therefore, it always inserts an event on the event loop to monitor for connection close:

/*
 * The socket at this point is not longer monitored, so if we want to be
 * notified if the 'available keepalive connection' gets disconnected by
 * the remote endpoint we need to add it again.
 */
conn->event.handler = cb_upstream_conn_ka_dropped;

ret = mk_event_add(conn->evl,
                   conn->fd,
                   FLB_ENGINE_EV_CUSTOM,
                   MK_EVENT_CLOSE,
                   &conn->event);

When the connection is freed or cleaned up for any reason, the event must be removed. Currently, in prepare_destroy_conn the event is only removed for the async case:

if (flb_stream_is_async(&u->base)) {
    mk_event_del(u_conn->evl, &u_conn->event);
}

This means that in the sync case, the event can remain on the event loop. If it was already triggered and is pending processing, or is triggered subsequently, it could run on the already freed connection leading to an invalid memory access and a SIGSEGV crash like the one shown above.

We suspect that the code is in this state possibly both because the keepalive code is newer, and also because the sync case covers connections created by filters and in output plugin init callbacks, both of which cannot use async networking.

Solution

Simply add an additional flag on the connection that tracks whether or not the keepalive close event was added, and check this to determine if we should remove the event. Since a mk_event is only added either for async networking and/or for keepalive connection close monitoring, this covers all cases.

if (flb_stream_is_async(&u->base) 
    || u_conn->ka_dropped_event_added == FLB_TRUE) {
    mk_event_del(u_conn->evl, &u_conn->event);
}
@PettitWesley
Copy link
Contributor Author

@leonardo-albertovich @edsiper What do you think? Am I correct that this is a bug?

@leonardo-albertovich
Copy link
Collaborator

I need to take a deeper look at the code paths used when the socket is in synchronous mode to have an opinion, however,

I always found that statement in the detroy_conn function to be very flimsy and unreliable (I had an issue where a stream was wrongfully marked).

What I thought at the moment was we shouldn't care about the socket mode at that point but rather that if it's registered we want to unregister it so I wanted to switch flb_stream_is_async with MK_EVENT_IS_REGISTERED but didn't because I didn't have the time to properly test it.

Anyway, I'll take a look and reply again when I have enough information.

@PettitWesley
Copy link
Contributor Author

@leonardo-albertovich Thanks! And I think checking u_conn->event.status != MK_EVENT_NONE could be a good solution too.

@leonardo-albertovich
Copy link
Collaborator

Manually checking the status field would work but would imply crossing a boundary that I would rather not, even if it was common practice in the past, I think we need to try to abstain from doing so in the future in favor of having self contained components / layers to improve the code quality.

To be clear, in this case we only have two possible values for the status field but it would suck if we broke that boundary to indirectly check the value because it could cause a side effect in the future if someone added a new status code or decided to turn that into a bit mask field.

@PettitWesley PettitWesley self-assigned this Feb 13, 2023
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this issue Feb 13, 2023
Resolves this issue: fluent#6821

Signed-off-by: Wesley Pettit <[email protected]>
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this issue Feb 13, 2023
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this issue Feb 13, 2023
Resolves this issue: fluent#6821

Signed-off-by: Wesley Pettit <[email protected]>
matthewfala pushed a commit to matthewfala/fluent-bit that referenced this issue Feb 23, 2023
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this issue Mar 13, 2023
Resolves this issue: fluent#6821

Signed-off-by: Wesley Pettit <[email protected]>
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this issue Apr 11, 2023
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this issue Apr 11, 2023
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this issue Apr 11, 2023
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this issue Apr 12, 2023
leonardo-albertovich pushed a commit that referenced this issue Apr 12, 2023
leonardo-albertovich pushed a commit that referenced this issue Apr 12, 2023
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this issue Apr 25, 2023
Resolves this issue: fluent#6821

Signed-off-by: Wesley Pettit <[email protected]>
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this issue May 2, 2023
Resolves this issue: fluent#6821

Signed-off-by: Wesley Pettit <[email protected]>
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this issue Jun 2, 2023
Resolves this issue: fluent#6821

Signed-off-by: Wesley Pettit <[email protected]>
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this issue Jun 8, 2023
Resolves this issue: fluent#6821

Signed-off-by: Wesley Pettit <[email protected]>
matthewfala pushed a commit to matthewfala/fluent-bit that referenced this issue Sep 23, 2023
PettitWesley added a commit to PettitWesley/fluent-bit that referenced this issue May 22, 2024
Resolves this issue: fluent#6821

Signed-off-by: Wesley Pettit <[email protected]>
zhihonl pushed a commit to zhihonl/fluent-bit that referenced this issue Aug 20, 2024
Resolves this issue: fluent#6821

Signed-off-by: Wesley Pettit <[email protected]>
swapneils pushed a commit to amazon-contributing/upstream-to-fluent-bit that referenced this issue Oct 3, 2024
Resolves this issue: fluent#6821

Signed-off-by: Wesley Pettit <[email protected]>
singholt pushed a commit to singholt/upstream-to-fluent-bit that referenced this issue Oct 4, 2024
swapneils pushed a commit to amazon-contributing/upstream-to-fluent-bit that referenced this issue Oct 22, 2024
swapneils pushed a commit to amazon-contributing/upstream-to-fluent-bit that referenced this issue Oct 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants