upstream: fix deadlock when destroying connections #4362

Merged 7 commits on Nov 29, 2021

@edsiper edsiper commented Nov 23, 2021

When workers are enabled and a connection times out, in most cases the
active worker deadlocks on a mutex it already holds:

==1654992== Thread #4: Attempt to re-lock a non-recursive lock I already hold
==1654992== at 0x484BB44: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==1654992== by 0x197579: prepare_destroy_conn_safe (flb_upstream.c:435)
==1654992== by 0x197887: create_conn (flb_upstream.c:533)
==1654992== by 0x197DBB: flb_upstream_conn_get (flb_upstream.c:674)
==1654992== by 0x2396D3: http_post (http.c:86)
==1654992== by 0x23A5E5: cb_http_flush (http.c:338)
==1654992== by 0x17FE6B: output_pre_cb_flush (flb_output.h:511)
==1654992== by 0x503DAA: co_init (amd64.c:117)
==1654992== Lock was previously acquired
==1654992== at 0x484BC0F: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==1654992== by 0x19815F: flb_upstream_conn_timeouts (flb_upstream.c:780)
==1654992== by 0x17FEFC: cb_thread_sched_timer (flb_output_thread.c:58)
==1654992== by 0x193ED7: flb_sched_event_handler (flb_scheduler.c:422)
==1654992== by 0x180672: output_thread (flb_output_thread.c:265)
==1654992== by 0x199602: step_callback (flb_worker.c:44)
==1654992== by 0x484E8AA: ??? (in /usr/libexec/valgrind/vgpreload_helgrind-amd64-linux.so)
==1654992== by 0x4E3F926: start_thread (pthread_create.c:435)
==1654992== by 0x4ECF9E3: clone (clone.S:100)

This patch fixes the behavior of prepare_destroy_conn_safe by trying to acquire
the mutex lock first. If the attempt fails, it assumes the mutex is already locked
and does not take a new lock.
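
The try-then-assume pattern described above can be sketched with plain pthreads. This is a minimal illustration, not the code from the patch: `struct upstream_sketch`, `prepare_destroy_sketch`, `timeout_path_sketch`, and the `pending_destroy` counter are hypothetical names. Only `pthread_mutex_trylock`, `pthread_mutex_lock`, and `pthread_mutex_unlock` are real API calls.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for an upstream that guards its connection lists. */
struct upstream_sketch {
    pthread_mutex_t lock;
    int pending_destroy;   /* connections queued for destruction */
};

/*
 * Queue a connection for destruction without re-locking a mutex that is
 * already held. Returns 1 if this call took the lock itself, 0 if the lock
 * was busy (assumed to be held by the caller), -1 on any other error.
 */
static int prepare_destroy_sketch(struct upstream_sketch *u)
{
    int took_lock = 0;
    int ret = pthread_mutex_trylock(&u->lock);

    if (ret == 0) {
        took_lock = 1;
    }
    else if (ret != EBUSY) {
        return -1;
    }

    u->pending_destroy++;

    /* Unlock only what this call locked. */
    if (took_lock) {
        pthread_mutex_unlock(&u->lock);
    }
    return took_lock;
}

/* Timeout handler path: it already holds the lock when it calls the helper. */
static int timeout_path_sketch(struct upstream_sketch *u)
{
    int took_lock;

    pthread_mutex_lock(&u->lock);
    took_lock = prepare_destroy_sketch(u);
    /* The helper must have seen the lock as busy instead of blocking on it. */
    assert(took_lock == 0);
    pthread_mutex_unlock(&u->lock);

    return took_lock;
}
```

Note that EBUSY only says the mutex is held, not which thread holds it, so the sketch relies on the same assumption the description states: a failed attempt means the lock is already taken on this code path.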

Signed-off-by: Eduardo Silva [email protected]


Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@edsiper edsiper marked this pull request as ready for review November 27, 2021 01:04
@edsiper edsiper added the "backport to v1.8.x" label and removed the "docs-required" label Nov 29, 2021
@edsiper edsiper merged commit bf0f0d2 into master Nov 29, 2021
@lecaros lecaros added this to the Fluent Bit v1.8.12 milestone Jan 21, 2022