Synchronous Keepalive Networking Bug [All versions] #6821
Comments
@leonardo-albertovich @edsiper What do you think? Am I correct that this is a bug?
I need to take a deeper look at the code paths used when the socket is in synchronous mode before I can form an opinion. However, I always found that statement in the destroy_conn function to be very flimsy and unreliable (I had an issue where a stream was wrongfully marked). What I thought at the time was that we shouldn't care about the socket mode at that point; rather, if the event is registered, we want to unregister it, so I wanted to switch [...]. Anyway, I'll take a look and reply again when I have enough information.
@leonardo-albertovich Thanks! And I think checking
Manually checking the status field would work, but it would mean crossing a boundary that I would rather not cross. Even if that was common practice in the past, I think we should abstain from it in the future in favor of self-contained components / layers to improve code quality. To be clear, in this case we only have two possible values for the status field, but breaking that boundary to check the value indirectly could cause a side effect later if someone added a new status code or turned the field into a bit mask.
Resolves this issue: fluent#6821 Signed-off-by: Wesley Pettit <[email protected]>
Synchronous Keepalive Networking Bug
Credit
I want to be clear that I only wrote this explainer, @matthewfala discovered this issue.
Impact
Fluent Bit can crash when net.keepalive is enabled (the default) and the synchronous networking stack is used. Currently, the Amazon S3 plugin is the main output plugin that uses sync networking.
The stack trace below was obtained from a customer case and shows what can occur when this bug surfaces.
Root Cause
Relevant code in 2.0.9:
Relevant code in 1.9.10:
When net.keepalive is enabled (the default), Fluent Bit tries to keep connections open so that they can be re-used. If the connection is closed for any reason (by the remote server, or because of a networking issue), Fluent Bit can no longer re-use it. Therefore, it always inserts an event on the event loop to monitor for connection close.
When the connection is freed or cleaned up for any reason, the event must be removed. Currently, in prepare_destroy_conn, the event is only removed in the async case.
This means that in the sync case, the event can remain on the event loop. If it was already triggered and is pending processing, or is triggered subsequently, it could run against the already freed connection, leading to an invalid memory access and a SIGSEGV crash like the one shown above.
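To make the failure mode concrete, here is a minimal toy model of the behaviour described above. It is only a sketch: every name in it (toy_event_loop, toy_conn, toy_prepare_destroy_buggy, and so on) is hypothetical and none of it is Fluent Bit's actual code or API. It shows that when cleanup unregisters the event only for async connections, a sync keepalive connection leaves a stale entry on the loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_EVENTS 8

/* Hypothetical stand-ins for illustration; not Fluent Bit's real types. */
struct toy_event_loop {
    const void *owners[TOY_MAX_EVENTS];   /* connection that owns each registered event */
};

struct toy_conn {
    bool is_async;                        /* uses the async networking stack */
};

/* Register an event for `owner`; returns false if the loop is full. */
bool toy_loop_add(struct toy_event_loop *loop, const void *owner)
{
    assert(loop != NULL && owner != NULL);
    for (size_t i = 0; i < TOY_MAX_EVENTS; i++) {
        if (loop->owners[i] == NULL) {
            loop->owners[i] = owner;
            return true;
        }
    }
    return false;
}

/* Remove any event registered for `owner`. */
void toy_loop_del(struct toy_event_loop *loop, const void *owner)
{
    for (size_t i = 0; i < TOY_MAX_EVENTS; i++) {
        if (loop->owners[i] == owner) {
            loop->owners[i] = NULL;
        }
    }
}

/* True if the loop still holds an event that points at `owner`. */
bool toy_loop_has(const struct toy_event_loop *loop, const void *owner)
{
    for (size_t i = 0; i < TOY_MAX_EVENTS; i++) {
        if (loop->owners[i] == owner) {
            return true;
        }
    }
    return false;
}

/* Models the described behaviour: the event is removed only in the async case. */
void toy_prepare_destroy_buggy(struct toy_event_loop *loop, struct toy_conn *conn)
{
    if (conn->is_async) {
        toy_loop_del(loop, conn);
    }
    /* A sync connection falls through with its event still registered. */
}
```

In this model, registering a sync connection with toy_loop_add and then calling toy_prepare_destroy_buggy leaves toy_loop_has returning true. In the real program, that leftover entry is what would later be dispatched against freed memory.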
We suspect that the code is in this state possibly both because the keepalive code is newer, and also because the sync case covers connections created by filters and in output plugin init callbacks, both of which cannot use async networking.
Solution
Simply add an additional flag on the connection that tracks whether or not the keepalive close event was added, and check this flag to determine whether the event should be removed. Since a mk_event is only added for async networking and/or for keepalive connection close monitoring, this covers all cases.
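The proposed flag could look roughly like the sketch below. This is an illustration under stated assumptions, not the actual patch: the struct layout, the flag name ka_event_registered, and all sketch_* functions are hypothetical, and the event loop is reduced to a simple counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not Fluent Bit's real types. */
struct sketch_loop {
    int registered;              /* number of events currently on the loop */
};

struct sketch_conn {
    bool is_async;               /* connection uses async networking */
    bool ka_event_registered;    /* hypothetical flag: keepalive close event was added */
    bool on_loop;                /* whether this connection's event is on the loop */
};

/* Put the connection's single event on the loop (idempotent). */
void sketch_add_event(struct sketch_loop *loop, struct sketch_conn *conn)
{
    assert(loop != NULL && conn != NULL);
    if (!conn->on_loop) {
        conn->on_loop = true;
        loop->registered++;
    }
}

/* Take the connection's event off the loop (idempotent). */
void sketch_del_event(struct sketch_loop *loop, struct sketch_conn *conn)
{
    if (conn->on_loop) {
        conn->on_loop = false;
        loop->registered--;
    }
}

/* Start monitoring for connection close and remember that we did so. */
void sketch_monitor_close(struct sketch_loop *loop, struct sketch_conn *conn)
{
    sketch_add_event(loop, conn);
    conn->ka_event_registered = true;
}

/* Cleanup removes the event for async connections AND for any connection
 * whose keepalive close event was registered, sync or not. */
void sketch_prepare_destroy(struct sketch_loop *loop, struct sketch_conn *conn)
{
    if (conn->is_async || conn->ka_event_registered) {
        sketch_del_event(loop, conn);
        conn->ka_event_registered = false;
    }
}
```

With this shape, a sync connection that went through sketch_monitor_close is taken off the loop on destroy, so no event can outlive the connection it points at. Keeping the decision inside the cleanup function also means callers never have to inspect the socket mode themselves.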