upstream: fix deadlock when destroying connections #4362

edsiper · 2021-11-23T15:10:53Z

When workers are enabled and a connection times out, the active worker
deadlocks in most cases:

==1654992== Thread #4: Attempt to re-lock a non-recursive lock I already hold
==1654992== at 0x484BB44: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==1654992== by 0x197579: prepare_destroy_conn_safe (flb_upstream.c:435)
==1654992== by 0x197887: create_conn (flb_upstream.c:533)
==1654992== by 0x197DBB: flb_upstream_conn_get (flb_upstream.c:674)
==1654992== by 0x2396D3: http_post (http.c:86)
==1654992== by 0x23A5E5: cb_http_flush (http.c:338)
==1654992== by 0x17FE6B: output_pre_cb_flush (flb_output.h:511)
==1654992== by 0x503DAA: co_init (amd64.c:117)
==1654992== Lock was previously acquired
==1654992== at 0x484BC0F: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==1654992== by 0x19815F: flb_upstream_conn_timeouts (flb_upstream.c:780)
==1654992== by 0x17FEFC: cb_thread_sched_timer (flb_output_thread.c:58)
==1654992== by 0x193ED7: flb_sched_event_handler (flb_scheduler.c:422)
==1654992== by 0x180672: output_thread (flb_output_thread.c:265)
==1654992== by 0x199602: step_callback (flb_worker.c:44)
==1654992== by 0x484E8AA: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==1654992== by 0x4E3F926: start_thread (pthread_create.c:435)
==1654992== by 0x4ECF9E3: clone (clone.S:100)

The following patch fixes the behavior of prepare_destroy_conn_safe by trying to acquire
the mutex lock; if it fails to acquire it, it assumes the mutex is already locked and that
no new lock is required.

Signed-off-by: Eduardo Silva [email protected]
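
For readers following along, the "try to acquire" idea described above can be sketched roughly as follows. This is a minimal illustration and not the code in src/flb_upstream.c: `struct conn_queue`, `conn_queue_init`, `prepare_destroy_safe` and `timeouts_then_destroy` are hypothetical names, and the re-entry is modeled as a direct call rather than through a coroutine as in the trace. It assumes a default (non-recursive) POSIX mutex, for which `pthread_mutex_trylock` returns `EBUSY` when the mutex is already locked.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for an upstream connection queue (not Fluent Bit's struct). */
struct conn_queue {
    pthread_mutex_t lock;    /* default, non-recursive mutex */
    int pending_destroy;     /* connections queued for destruction */
};

static void conn_queue_init(struct conn_queue *q)
{
    assert(q != NULL);
    pthread_mutex_init(&q->lock, NULL);
    q->pending_destroy = 0;
}

/*
 * Queue one connection for destruction.
 * Returns 1 if this call took the lock itself, 0 if the lock was
 * already held (assumed to be held by our caller), -1 on other errors.
 */
static int prepare_destroy_safe(struct conn_queue *q)
{
    int ret;
    int locked_here = 0;

    assert(q != NULL);

    ret = pthread_mutex_trylock(&q->lock);
    if (ret == 0) {
        locked_here = 1;
    }
    else if (ret != EBUSY) {
        return -1;
    }

    q->pending_destroy++;

    /* Only release the lock if this call acquired it. */
    if (locked_here) {
        pthread_mutex_unlock(&q->lock);
    }
    return locked_here;
}

/* Mimics the timeout path: holds the lock, then reaches the destroy path. */
static int timeouts_then_destroy(struct conn_queue *q)
{
    int ret;

    pthread_mutex_lock(&q->lock);
    ret = prepare_destroy_safe(q);   /* would deadlock with a plain lock() */
    pthread_mutex_unlock(&q->lock);
    return ret;
}
```

One caveat of this sketch: a failed `pthread_mutex_trylock` cannot tell whether the current thread or a different thread holds the mutex, so the "already locked by us" assumption only holds when the queue is touched by a single worker thread at a time.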

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

src/flb_upstream.c

Signed-off-by: Eduardo Silva <[email protected]>

github-actions bot added the docs-required label Nov 23, 2021

edsiper removed the docs-required label Nov 23, 2021

edsiper added 4 commits November 23, 2021 11:04

config: add new 'is_shutting_down' field

0347e7a

Signed-off-by: Eduardo Silva <[email protected]>

engine: do not retry if the engine is shutting down

6f3afc2

Signed-off-by: Eduardo Silva <[email protected]>

network: detect if socket has been invalidated

8e1e565

Signed-off-by: Eduardo Silva <[email protected]>

upstream: detect shutdown and reduce log noise

fd1f909

Signed-off-by: Eduardo Silva <[email protected]>

github-actions bot added the docs-required label Nov 23, 2021

network: on tcp connect change exception from error to debug

7904226

Signed-off-by: Eduardo Silva <[email protected]>

jkschulz mentioned this pull request Nov 24, 2021

[Windows 2016] Fluent Bit service enters a flush loop when trying to stop the service #4300

Closed

leonardo-albertovich reviewed Nov 25, 2021

leonardo-albertovich previously approved these changes Nov 25, 2021

upstream: just compare against 'locked' flag

e251378

Signed-off-by: Eduardo Silva <[email protected]>

edsiper dismissed leonardo-albertovich’s stale review via e251378 November 27, 2021 01:03

edsiper removed the docs-required label Nov 27, 2021

github-actions bot added the docs-required label Nov 27, 2021

edsiper marked this pull request as ready for review November 27, 2021 01:04

edsiper added the backport to v1.8.x label and removed the docs-required label Nov 29, 2021

edsiper merged commit bf0f0d2 into master Nov 29, 2021

lecaros mentioned this pull request Jan 20, 2022

1.8.12: Track release and pendings -WIP- #4645

Closed

15 tasks

lecaros added this to the Fluent Bit v1.8.12 milestone Jan 21, 2022
