
[bug:1802947] list about 550 files in replicated volume will cause glfs_iotwr thread crash #978

Closed
gluster-ant opened this issue Mar 12, 2020 · 7 comments
Labels: Migrated, Type:Bug, wontfix (Managed by stale[bot])

@gluster-ant

URL: https://bugzilla.redhat.com/1802947
Creator: liguang_li at 126
Time: 20200214T08:29:13

Description of problem:

With about 550 files in the replicated volume, running "ls" to list the files causes a stack overflow in the glfs_iotwr thread.

Version-Release number of selected component (if applicable):
v6.4

How reproducible:

Steps to Reproduce:

  1. Create a replicated volume
  2. Mount the replicated volume
  3. Touch about 550 files in the replicated volume
  4. Run the "ls" command

Actual results:

[ 296.815617] glfs_iotwr000[626]: bad frame in setup_rt_frame: 00003fff76d7a720 nip 00003fff80f5a1c4 lr 00003fff81019c74

Expected results:

List all the files in the replicated volume

Additional info:

Core was generated by `/usr/sbin/glusterfsd -s 128.224.95.141 --volfile-id gv0.128.224.95.141.tmp-bric'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00003fff7f89c1c4 in _int_free (av=0x3fff68000020, p=0x3fff68025820, have_lock=0) at malloc.c:3846
3846 {
[Current thread is 1 (Thread 0x3fff7970c440 (LWP 1265))]
#0 0x00003fff7f89c1c4 in _int_free (av=0x3fff68000020, p=0x3fff68025820, have_lock=0) at malloc.c:3846
#1 0x00003fff7f95bc74 in x_inline (xdrs=, len=) at xdr_sizeof.c:88
#2 0x00003fff7fa394e8 in .xdr_gfx_iattx () from /usr/lib64/libgfxdr.so.0
#3 0x00003fff7fa39ee4 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#4 0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff680e8a80, size=, proc=) at xdr_ref.c:84
#5 0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff680e8a80, obj_size=, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>) at xdr_ref.c:135
#6 0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#7 0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff680e8900, size=, proc=) at xdr_ref.c:84
#8 0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff680e8900, obj_size=, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>) at xdr_ref.c:135
#9 0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#10 0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff680e8780, size=, proc=) at xdr_ref.c:84
#11 0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff680e8780, obj_size=, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>) at xdr_ref.c:135
#12 0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#13 0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff680e8600, size=, proc=) at xdr_ref.c:84
#14 0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff680e8600, obj_size=, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>) at xdr_ref.c:135
#15 0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#16 0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff680e8480, size=, proc=) at xdr_ref.c:84
#17 0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff680e8480, obj_size=, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>) at xdr_ref.c:135
#18 0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#19 0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff680e8300, size=, proc=) at xdr_ref.c:84
#20 0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff680e8300, obj_size=, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>) at xdr_ref.c:135
#21 0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#22 0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff680e8180, size=, proc=) at xdr_ref.c:84
#23 0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff680e8180, obj_size=, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>) at xdr_ref.c:135
#24 0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
...
#1611 0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#1612 0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff680b6e00, size=, proc=) at xdr_ref.c:84
#1613 0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff680b6e00, obj_size=, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>)
at xdr_ref.c:135
#1614 0x00003fff7fa39f20 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#1615 0x00003fff7f95b8d8 in __GI_xdr_reference (xdrs=0x3fff7970a040, pp=0x3fff7970a300, size=, proc=) at xdr_ref.c:84
#1616 0x00003fff7f95bab4 in __GI_xdr_pointer (xdrs=0x3fff7970a040, objpp=0x3fff7970a300, obj_size=, xdr_obj=@0x3fff7fa56670: 0x3fff7fa39e20 <.xdr_gfx_dirplist>)
at xdr_ref.c:135
#1617 0x00003fff7fa3e4d8 in .xdr_gfx_readdirp_rsp () from /usr/lib64/libgfxdr.so.0
#1618 0x00003fff7f95bdd0 in __GI_xdr_sizeof (func=, data=) at xdr_sizeof.c:157
#1619 0x00003fff7a1d391c in gfs_serialize_reply () from /usr/lib64/glusterfs/6.4/xlator/protocol/server.so
#1620 0x00003fff7a1d3b78 in server_submit_reply () from /usr/lib64/glusterfs/6.4/xlator/protocol/server.so
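
For readers unfamiliar with the protocol code: the unusual depth of this backtrace comes from the way the readdirp reply is XDR-encoded. Each directory entry is a node in a linked list, and the generated encoder follows the next-entry pointer via xdr_pointer(), which calls the same routine again, so sizing the reply with xdr_sizeof() costs a few nested stack frames per entry. Below is a minimal sketch of that pattern, using hypothetical, abbreviated field names rather than the real gfx_dirplist definition:

/* Minimal sketch (not the actual GlusterFS code) of the recursive XDR
 * encoding pattern visible in the backtrace above: encoding or sizing a
 * list of N entries recurses roughly N levels deep. */
#include <rpc/xdr.h>

struct dirplist {
    char            *name;       /* entry name */
    struct dirplist *nextentry;  /* next directory entry, or NULL */
};

bool_t
xdr_dirplist(XDR *xdrs, struct dirplist *objp)
{
    if (!xdr_string(xdrs, &objp->name, ~0u))
        return FALSE;
    /* xdr_pointer() invokes xdr_dirplist() again for the next node,
     * so every additional directory entry adds another stack frame. */
    if (!xdr_pointer(xdrs, (char **)&objp->nextentry,
                     sizeof(struct dirplist), (xdrproc_t)xdr_dirplist))
        return FALSE;
    return TRUE;
}

With roughly 550 entries and two libc frames (__GI_xdr_reference and __GI_xdr_pointer) per entry, a backtrace of about 1650 frames is consistent with this pattern.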

@gluster-ant

Time: 20200217T10:44:55
ravishankar at redhat commented:
Hi Liguang Li, I'm not able to reproduce this issue on v6.4. Here is what I did:

[root@vm2 glusterfs]# gluster --version
glusterfs 6.4

[root@vm2 glusterfs]# gluster vol create testvol replica 3 127.0.0.2:/bricks/brick{1..3} force
volume create: testvol: success: please start the volume to access data
[root@vm2 glusterfs]# gluster v start testvol
volume start: testvol: success
[root@vm2 glusterfs]# mount -t glusterfs 127.0.0.2:testvol /mnt/fuse_mnt
[root@vm2 glusterfs]# cd /mnt/fuse_mnt/
[root@vm2 fuse_mnt]# touch file{1..550}
[root@vm2 fuse_mnt]# ls|wc
550 550 4292

Am I missing something in the steps?
Are all your clients and servers on glusterfs 6.4? Was this a fresh install or did you upgrade from an earlier version?

@gluster-ant

Time: 20200219T03:17:26
liguang_li at 126 commented:
This issue reproduces easily on v6.4 following your steps.

root@128:/# gluster --version
glusterfs 6.4

root@128:/# gdb /usr/sbin/glusterfsd ./core.638
...
Core was generated by `/usr/sbin/glusterfsd -s 128.224.95.141 --volfile-id gv0.128.224.95.141.tmp-bric'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00003fff9f5201c4 in _int_free (av=0x3fff88000020, p=0x3fff880092f0, have_lock=0) at malloc.c:3846
3846 {
[Current thread is 1 (Thread 0x3fff99390440 (LWP 648))]
(gdb) bt
#0 0x00003fff9f5201c4 in _int_free (av=0x3fff88000020, p=0x3fff880092f0, have_lock=0) at malloc.c:3846
#1 0x00003fff9f5dfc74 in x_inline (xdrs=, len=) at xdr_sizeof.c:88
#2 0x00003fff9f6bd4e8 in .xdr_gfx_iattx () from /usr/lib64/libgfxdr.so.0
#3 0x00003fff9f6bdee4 in .xdr_gfx_dirplist () from /usr/lib64/libgfxdr.so.0
#4 0x00003fff9f5df8d8 in __GI_xdr_reference (xdrs=0x3fff9938e040, pp=0x3fff880eacf0, size=, proc=) at xdr_ref.c:84
#5 0x00003fff9f5dfab4 in __GI_xdr_pointer (xdrs=0x3fff9938e040, objpp=0x3fff880eacf0, obj_size=,
...
#1642 0x00003fff9f79a3d4 in .call_resume () from /usr/lib64/libglusterfs.so.0
#1643 0x00003fff9a07e948 in ?? () from /usr/lib64/glusterfs/6.4/xlator/performance/io-threads.so
#1644 0x00003fff9f654b30 in start_thread (arg=0x3fff99390440) at pthread_create.c:462
(gdb) frame 1644
#1644 0x00003fff9f654b30 in start_thread (arg=0x3fff99390440) at pthread_create.c:462
462 THREAD_SETMEM (pd, result, pd->start_routine (pd->arg));
(gdb) p/x $r1
$1 = 0x3fff9938fa20
(gdb) frame 0
#0 0x00003fff9f5201c4 in _int_free (av=0x3fff88000020, p=0x3fff880092f0, have_lock=0) at malloc.c:3846
3846 {
(gdb) p/x $r1
$2 = 0x3fff99353080
(gdb) p $1 - $2
$3 = 248224
(gdb) disassemble
Dump of assembler code for function _int_free:
0x00003fff903f0160 <+0>: mflr r0
0x00003fff903f0164 <+4>: std r30,-16(r1)
0x00003fff903f0168 <+8>: std r0,16(r1)
0x00003fff903f016c <+12>: mfcr r12
0x00003fff903f0170 <+16>: std r29,-24(r1)
0x00003fff903f0174 <+20>: mr r29,r3
0x00003fff903f0178 <+24>: std r31,-8(r1)
0x00003fff903f017c <+28>: mr r31,r4
0x00003fff903f0180 <+32>: ld r10,8(r4)
0x00003fff903f0184 <+36>: std r17,-120(r1)
0x00003fff903f0188 <+40>: std r18,-112(r1)
0x00003fff903f018c <+44>: rldicr r30,r10,0,60
0x00003fff903f0190 <+48>: std r19,-104(r1)
0x00003fff903f0194 <+52>: neg r9,r30
0x00003fff903f0198 <+56>: std r20,-96(r1)
0x00003fff903f019c <+60>: cmpld cr7,r4,r9
0x00003fff903f01a0 <+64>: std r21,-88(r1)
0x00003fff903f01a4 <+68>: std r22,-80(r1)
0x00003fff903f01a8 <+72>: std r23,-72(r1)
0x00003fff903f01ac <+76>: std r24,-64(r1)
0x00003fff903f01b0 <+80>: std r25,-56(r1)
0x00003fff903f01b4 <+84>: std r26,-48(r1)
0x00003fff903f01b8 <+88>: std r27,-40(r1)
0x00003fff903f01bc <+92>: std r28,-32(r1)
0x00003fff903f01c0 <+96>: stw r12,8(r1)
=> 0x00003fff903f01c4 <+100>: stdu r1,-256(r1)

Please note that we are using a PowerPC machine. From the stack pointer register (r1) in frames 1644 and 0, we can see that 248224 bytes of the thread's stack have already been used.

From the assembly, the crash happens at the "stdu r1,-256(r1)" instruction, which allocates the next 256-byte stack frame, so I suspect a stack overflow: the thread stack is 256 KiB (262144 bytes), and glibc also places the thread descriptor and TLS at the top of that allocation, so there is very little headroom left at this point.

We know from the source code that the stack size of the thread is 256 KiB. Can I fix this crash by increasing the stack size?
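
For reference, a thread's stack size is fixed when the thread is created; the sketch below (standalone, not GlusterFS code, with a hypothetical worker function) shows how a larger stack would be requested through pthread_attr_setstacksize(). GlusterFS's io-threads workers are created with an attribute set up along these lines, using the IOT_THREAD_STACK_SIZE constant mentioned in the next comment.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define WORKER_STACK_SIZE ((size_t)(512 * 1024))  /* doubled from a 256 KiB default */

static void *worker(void *arg)
{
    (void)arg;
    /* the deeply recursive work (e.g. XDR sizing of a readdirp reply) would run on this stack */
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    pthread_t tid;

    pthread_attr_init(&attr);
    if (pthread_attr_setstacksize(&attr, WORKER_STACK_SIZE) != 0) {
        fprintf(stderr, "failed to set stack size\n");
        return EXIT_FAILURE;
    }
    if (pthread_create(&tid, &attr, worker, NULL) != 0) {
        fprintf(stderr, "failed to create thread\n");
        return EXIT_FAILURE;
    }
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return EXIT_SUCCESS;
}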

@gluster-ant

Time: 20200224T12:49:46
ravishankar at redhat commented:
(In reply to Liguang Li from comment #2)

We know the stack size of the thread is 256K from the source code, can i fix
this crash by increasing the stack size.

IOT_THREAD_STACK_SIZE (256) must be enough ideally. What is surprising is that you are having such a long call stack with 1644 frames, almost as if there is some bug due to looping. Can you attach the entire backtrace of the thread that crashed?

Please note, we are using a powerpc machine.

Hmm, I tried setting up a Fedora 31 ppc64le virtual machine using qemu on my x86_64 Fedora laptop, but I'm having trouble getting it to boot :(.

@gluster-ant

Time: 20200228T08:47:53
liguang_li at 126 commented:
Created attachment 1666340
full backtrace of glusterfsd

@gluster-ant

Time: 20200228T08:49:55
liguang_li at 126 commented:

IOT_THREAD_STACK_SIZE (256) must be enough ideally. What is surprising is that you are having such a long call stack with 1644 frames, almost as if there is some bug due to looping. Can you attach the entire backtrace of the thread that crashed?

After changing IOT_THREAD_STACK_SIZE from 256 KiB to 512 KiB, the test works. Even after increasing the number of files in the gluster volume, no crash occurs anymore.
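
For context, the constant discussed here lives in the io-threads xlator sources; the change described above would look roughly like this (illustrative only; the exact file and surrounding code in a given tree may differ):

/* xlators/performance/io-threads/src/io-threads.h (illustrative) */

/* before */
#define IOT_THREAD_STACK_SIZE ((size_t)(256 * 1024))

/* after: doubled, as reported to make the test pass */
#define IOT_THREAD_STACK_SIZE ((size_t)(512 * 1024))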

@stale

stale bot commented Oct 8, 2020

Thank you for your contributions.
We noticed that this issue has not had any activity in the last ~6 months, so we are marking it as stale.
It will be closed in 2 weeks if no one responds with a comment here.

@stale stale bot added the wontfix (Managed by stale[bot]) label on Oct 8, 2020
@stale

stale bot commented Oct 23, 2020

Closing this issue as there has been no update since my last comment. If this issue is still valid, feel free to reopen it.

@stale stale bot closed this as completed Oct 23, 2020