Brick process crashed when global thread pool is enabled #3527

Closed
rafikc30 opened this issue May 15, 2022 · 2 comments · Fixed by #3540

Comments

@rafikc30
Member

Description of problem:
We have seen multiple crashes when the global thread pool is enabled. We enabled the thread pool on the brick side with 16 threads in the pool, and we have also disabled io-threads.

All the crashes point to the same backtrace:

#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007fc8245a9a01 in __GI_abort () at abort.c:79
#2 0x00007fc8245a0a1a in __assert_fail_base (fmt=0x7fc8246f3e38 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x7fc8262f059d "node->next == NULL",
file=file@entry=0x7fc8262f0970 "/usr/include/urcu/static/wfstack.h", line=line@entry=145, function=function@entry=0x7fc8262f0e78 <PRETTY_FUNCTION.4689> "_cds_wfs_push") at assert.c:92
#3 0x00007fc8245a0a92 in __GI___assert_fail (assertion=assertion@entry=0x7fc8262f059d "node->next == NULL", file=file@entry=0x7fc8262f0970 "/usr/include/urcu/static/wfstack.h", line=line@entry=145,
function=function@entry=0x7fc8262f0e78 <PRETTY_FUNCTION.4689> "_cds_wfs_push") at assert.c:101
#4 0x00007fc82622f453 in _cds_wfs_push (node=, u_stack=...) at /usr/include/urcu/static/wfstack.h:145
#5 0x00007fc8262b7f69 in _cds_wfs_push (node=, u_stack=...) at async.c:287
#6 gf_async_worker_create () at async.c:340
#7 0x00007fc8262b8022 in gf_async_worker_enable () at async.c:355
#8 gf_async_leader_run () at async.c:437
#9 gf_async_worker (arg=0x7fc8265250b0 <gf_async_workers+432>) at async.c:529
#10 0x00007fc824e184f9 in start_thread (arg=0x7fc7fe7fc700) at pthread_create.c:465
#11 0x00007fc82466aecf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The exact command to reproduce the issue:

The full output of the command that failed:

Expected results:

Mandatory info:
- The output of the gluster volume info command:

- The output of the gluster volume status command:

- The output of the gluster volume heal command:

- Provide logs present on following locations of client and server nodes:
/var/log/glusterfs/

- Is there any crash? Provide the backtrace and coredump

Additional info:

- The operating system / glusterfs version:
glusterfs-8.6

Note: Please hide any confidential data which you don't want to share in public like IP address, file name, hostname or any other configuration

@rafikc30
Member Author

rafikc30 commented May 15, 2022

@xhernandez
I went through the code and I have doubts regarding the function gf_async_worker_create. I assume it should only run in one thread, mainly because it uses the function __cds_wfs_pop_blocking.
But gf_async_sigbroadcast is sent before gf_async_worker_create is called, so isn't it theoretically possible for gf_async_worker_create to run from two or more threads, i.e. one being the thread that just issued the signal and the other the new leader thread (provided the latter starts executing quickly enough)? So I was thinking the unprotected pop is causing the problem. I must say I have not read the code deeply and I could be wrong.

If my assumption is wrong, do you see, in general, why cds_wfs_push asserts with node->next being non-NULL? I think it could only happen when somebody else has done the cds_wfs_push from a different thread, and that can only be the terminating thread. Is there any chance of a race between them?
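
To make the suspected interleaving concrete, here is a minimal standalone sketch (not GlusterFS code; the stack and thread roles are invented for illustration) of two "leader" threads using liburcu's unsynchronized __cds_wfs_pop_blocking() on the same stack. liburcu requires external mutual exclusion for the __-prefixed pop variants, so if both threads can reach this code at the same time they may obtain the same node and push it back twice, which is the node->next == NULL assertion seen in the backtrace (when assertions are enabled):

```c
#include <pthread.h>
#include <urcu/wfstack.h>

static struct cds_wfs_stack idle_workers;

/* Each "leader" pops an idle worker slot with the unsynchronized pop and
 * pushes it back, mimicking the pop/push done around worker creation. */
static void *leader(void *arg)
{
    (void)arg;

    /* Unsafe if another leader runs this concurrently. */
    struct cds_wfs_node *node = __cds_wfs_pop_blocking(&idle_workers);
    if (node != NULL) {
        cds_wfs_node_init(node);           /* reset before re-push */
        /* ... a worker thread would be spawned/enabled here ... */
        cds_wfs_push(&idle_workers, node); /* asserts node->next == NULL */
    }
    return NULL;
}

int main(void)
{
    static struct cds_wfs_node worker_slot;
    pthread_t t1, t2;

    cds_wfs_init(&idle_workers);
    cds_wfs_node_init(&worker_slot);
    cds_wfs_push(&idle_workers, &worker_slot);

    /* t1 plays the old leader, t2 the newly signalled leader. */
    pthread_create(&t1, NULL, leader, NULL);
    pthread_create(&t2, NULL, leader, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```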

@xhernandez
Contributor

I needed to refresh my memory before fully understanding the issue (this code was written long ago). I think you are right. The order in which wake and create are executed is not good, because there's a race that could cause two threads to try to create another thread at the same time, and both use a non-thread-safe function.

However, it was done this way to reduce the latency of the jobs pending in the queue: creating the thread before waking another one increases that latency. The mistake here was using a shared stack to keep already created threads.

I need to figure out how to solve this. However, until I find a better solution, I think your patch is OK. I'll approve it.

rafikc30 added a commit to rafikc30/glusterfs that referenced this issue May 30, 2022
We were sending the signal to transfer the leader before
we created a new leader, which means there is a potential
chance that the new thread creation will be executed twice,
which is not what we intended.

Fixes: gluster#3527
Change-Id: I8446ae3130894f351e6bd1197f26ce97d8cb1eef
Signed-off-by: Mohammed Rafi KC <[email protected]>
xhernandez pushed a commit that referenced this issue May 31, 2022
We were sending the signal to transfer the leader before
we created a new leader, which means there is a potential
chance that the new thread creation will be executed twice,
which is not what we intended.

Fixes: #3527
Change-Id: I8446ae3130894f351e6bd1197f26ce97d8cb1eef
Signed-off-by: Mohammed Rafi KC <[email protected]>
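
For readers skimming the thread, a hedged sketch of the reordering the commit message describes (the function names come from the backtrace; their signatures, bodies, and the surrounding structure are assumptions for illustration, not the actual async.c code):

```c
#include <stdio.h>

/* Hypothetical stand-ins for the real helpers in async.c. */
static void gf_async_worker_create(void) { puts("create replacement worker"); }
static void gf_async_sigbroadcast(void)  { puts("signal the next leader"); }

/* Ordering before the fix: the freshly woken leader may also reach
 * gf_async_worker_create(), so two threads can race on the unsynchronized
 * wfstack operations. */
static void leader_handoff_before(void)
{
    gf_async_sigbroadcast();
    gf_async_worker_create();
}

/* Ordering after the fix, per the commit message above: create the new
 * worker while this thread is still the only leader, then hand over
 * leadership. */
static void leader_handoff_after(void)
{
    gf_async_worker_create();
    gf_async_sigbroadcast();
}

int main(void)
{
    leader_handoff_before();
    leader_handoff_after();
    return 0;
}
```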