Brick process crashed when global thread pool is enabled #3527

Closed
rafikc30 opened this issue May 15, 2022 · 2 comments · Fixed by #3540

Comments

@rafikc30
Member

Description of problem:
We have seen multiple crashes when the global thread pool is enabled. We enabled the thread pool on the brick side with 16 threads in the pool, and we have also disabled io-threads.

All the crashes point to the same backtrace:

#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007fc8245a9a01 in __GI_abort () at abort.c:79
#2 0x00007fc8245a0a1a in __assert_fail_base (fmt=0x7fc8246f3e38 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x7fc8262f059d "node->next == NULL",
file=file@entry=0x7fc8262f0970 "/usr/include/urcu/static/wfstack.h", line=line@entry=145, function=function@entry=0x7fc8262f0e78 <PRETTY_FUNCTION.4689> "_cds_wfs_push") at assert.c:92
#3 0x00007fc8245a0a92 in __GI___assert_fail (assertion=assertion@entry=0x7fc8262f059d "node->next == NULL", file=file@entry=0x7fc8262f0970 "/usr/include/urcu/static/wfstack.h", line=line@entry=145,
function=function@entry=0x7fc8262f0e78 <PRETTY_FUNCTION.4689> "_cds_wfs_push") at assert.c:101
#4 0x00007fc82622f453 in _cds_wfs_push (node=, u_stack=...) at /usr/include/urcu/static/wfstack.h:145
#5 0x00007fc8262b7f69 in _cds_wfs_push (node=, u_stack=...) at async.c:287
#6 gf_async_worker_create () at async.c:340
#7 0x00007fc8262b8022 in gf_async_worker_enable () at async.c:355
#8 gf_async_leader_run () at async.c:437
#9 gf_async_worker (arg=0x7fc8265250b0 <gf_async_workers+432>) at async.c:529
#10 0x00007fc824e184f9 in start_thread (arg=0x7fc7fe7fc700) at pthread_create.c:465
#11 0x00007fc82466aecf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The exact command to reproduce the issue:

The full output of the command that failed:

Expected results:

Mandatory info:
- The output of the gluster volume info command:

- The output of the gluster volume status command:

- The output of the gluster volume heal command:

- Provide logs present on following locations of client and server nodes:
/var/log/glusterfs/

- Is there any crash? Provide the backtrace and coredump

Additional info:

- The operating system / glusterfs version:
glusterfs-8.6

Note: Please hide any confidential data which you don't want to share in public like IP address, file name, hostname or any other configuration

@rafikc30
Member Author

rafikc30 commented May 15, 2022

@xhernandez
I went through the code and I have doubts regarding the function gf_async_worker_create. I assume it should only run in one thread, mainly because it uses the function __cds_wfs_pop_blocking.
But gf_async_sigbroadcast is sent before gf_async_worker_create is called, so isn't it theoretically possible for gf_async_worker_create to run from two or more threads, i.e. one being the thread that just issued the signal and the other the new leader thread (provided the latter starts executing quickly enough)? So I was thinking the unprotected pop is causing the problem. I must say I have not read the code deeply and I could be wrong.

If my assumption is wrong, do you see, in general, why cds_wfs_push asserts with node->next being non-NULL? I think it could only happen when somebody else has done the cds_wfs_push from a different thread, and that can only be the terminating thread. Is there any chance of a race between them?
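
To make the suspected interleaving concrete, here is a minimal standalone sketch (not GlusterFS code; the stack and thread roles are invented for illustration) of two "leader" threads using liburcu's unsynchronized __cds_wfs_pop_blocking() on the same stack. liburcu requires external mutual exclusion for the __-prefixed pop variants, so if both threads can reach this code at the same time they may obtain the same node and push it back twice, which is the node->next == NULL assertion seen in the backtrace (when assertions are enabled):

```c
#include <pthread.h>
#include <urcu/wfstack.h>

static struct cds_wfs_stack idle_workers;

/* Each "leader" pops an idle worker slot with the unsynchronized pop and
 * pushes it back, mimicking the pop/push done around worker creation. */
static void *leader(void *arg)
{
    (void)arg;

    /* Unsafe if another leader runs this concurrently. */
    struct cds_wfs_node *node = __cds_wfs_pop_blocking(&idle_workers);
    if (node != NULL) {
        cds_wfs_node_init(node);           /* reset before re-push */
        /* ... a worker thread would be spawned/enabled here ... */
        cds_wfs_push(&idle_workers, node); /* asserts node->next == NULL */
    }
    return NULL;
}

int main(void)
{
    static struct cds_wfs_node worker_slot;
    pthread_t t1, t2;

    cds_wfs_init(&idle_workers);
    cds_wfs_node_init(&worker_slot);
    cds_wfs_push(&idle_workers, &worker_slot);

    /* t1 plays the old leader, t2 the newly signalled leader. */
    pthread_create(&t1, NULL, leader, NULL);
    pthread_create(&t2, NULL, leader, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```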

@xhernandez
Contributor

I needed to refresh my memory before fully understanding the issue (this code was written long ago). I think you are right. The order in which wake and create are executed is not good, because there's a race that could cause two threads to try to create another thread at the same time, and both use a non-thread-safe function.

However, it was done this way to reduce the latency of the jobs pending in the queue: creating the thread before waking another one increases that latency. The mistake here was using a shared stack to keep already created threads.

I need to figure out how to solve this. However, until I find a better solution, I think your patch is OK. I'll approve it.

rafikc30 added a commit to rafikc30/glusterfs that referenced this issue May 30, 2022
We were sending the signal to transfer the leader before
we created a new leader, which means there is a potential
chance that the new thread creation will be executed twice,
which is not what we intended.

Fixes: gluster#3527
Change-Id: I8446ae3130894f351e6bd1197f26ce97d8cb1eef
Signed-off-by: Mohammed Rafi KC <[email protected]>
xhernandez pushed a commit that referenced this issue May 31, 2022
We were sending the signal to transfer the leader before
we created a new leader, which means there is a potential
chance that the new thread creation will be executed twice,
which is not what we intended.

Fixes: #3527
Change-Id: I8446ae3130894f351e6bd1197f26ce97d8cb1eef
Signed-off-by: Mohammed Rafi KC <[email protected]>
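
For readers skimming the thread, a hedged sketch of the reordering the commit message describes (the function names come from the backtrace; their signatures, bodies, and the surrounding structure are assumptions for illustration, not the actual async.c code):

```c
#include <stdio.h>

/* Hypothetical stand-ins for the real helpers in async.c. */
static void gf_async_worker_create(void) { puts("create replacement worker"); }
static void gf_async_sigbroadcast(void)  { puts("signal the next leader"); }

/* Ordering before the fix: the freshly woken leader may also reach
 * gf_async_worker_create(), so two threads can race on the unsynchronized
 * wfstack operations. */
static void leader_handoff_before(void)
{
    gf_async_sigbroadcast();
    gf_async_worker_create();
}

/* Ordering after the fix, per the commit message above: create the new
 * worker while this thread is still the only leader, then hand over
 * leadership. */
static void leader_handoff_after(void)
{
    gf_async_worker_create();
    gf_async_sigbroadcast();
}

int main(void)
{
    leader_handoff_before();
    leader_handoff_after();
    return 0;
}
```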