Core dumps on Gluster 9 - 3 replicas #2443
Comments
#0 list_del_init (old=0x7f81000dff68) at ../../../../libglusterfs/src/glusterfs/list.h:82
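For context, list_del_init() is the kernel-style doubly linked list helper from libglusterfs/src/glusterfs/list.h; it looks roughly like the following (a simplified sketch, not a verbatim copy of the header). A lock entry whose pointer is NULL, already freed, or never linked makes the very first dereference fault, which matches the SIGSEGV at list.h:82.

struct list_head {
        struct list_head *next;
        struct list_head *prev;
};

/* Unlink `old` from whatever list it is on, then point it at itself
 * so a repeated delete is harmless. */
static inline void
list_del_init(struct list_head *old)
{
        old->prev->next = old->next;   /* faults here if old is NULL or its  */
        old->next->prev = old->prev;   /* neighbours were already freed      */

        old->next = old;
        old->prev = old;
}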
@AlexNinaber Let me try to recreate this issue and get back to you.
@AlexNinaber What is the application you are running on the mount when you hit the issue? The locking pattern is important for recreating it. It would be great to have this info too before I try to recreate the issue.
@pranithk it's not easy to give a 1-2-3 step plan; there are multiple services running from Gluster: mongo, dhcp, slurm. I've added a longer wait before failover (i.e. until the node can't ping anymore), and it's not core dumping. However, what remains ever since using version 9 is the time it takes for files to be repaired, if they are repaired at all. Sometimes I have to restart the services, as otherwise it just doesn't seem to heal. The volume is relatively small, 100M or so. Starting from a healthy 3-replica volume, after rebooting one replica I get this:

[root@master02 ~]# gluster volume heal local info
Brick 10.141.255.253:/gluster/local
Brick 10.141.11.1:/glusterssd/local

It doesn't go down. And I see a lot of this on all 3 bricks:

[2021-05-19 13:29:18.080718 +0000] I [MSGID: 108026] [afr-self-heal-entry.c:1053:afr_selfheal_entry_do] 2-local-replicate-0: performing entry selfheal on 872ebcb8-5b86-4dc5-aac6-7bdd016a186f

I've added stricter quorum settings and other options, but that doesn't seem to help:

Volume Name: local
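The exact quorum options used aren't listed in the report; on a 1x3 replica volume, "stricter quorum settings" would typically be something along these lines (an assumption about the configuration, not taken from the reporter's setup):

# client-side quorum: allow writes only while a majority of the 3 bricks is up
gluster volume set local cluster.quorum-type auto
# server-side quorum: stop bricks when the trusted pool loses quorum
gluster volume set local cluster.server-quorum-type server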
Dear @pranithk, again a core dump, but now without rebooting anything in the 3x replica; after 2 hours the fuse mount was entirely gone.

#0 list_del_init (old=0x7fb38418af58) at ../../../../libglusterfs/src/glusterfs/list.h:82
@AlexNinaber Could you do
@AlexNinaber I found one place where it could be NULL. Will it be possible for you to test the patch to see if this is the only place where the issue is present? |
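Without claiming this is the exact patch, the shape of such a fix is the usual defensive check before the list operation; a minimal sketch, reusing the list_head/list_del_init definitions from the earlier snippet (the lock type and cleanup step are placeholders, not the client xlator's real names):

struct client_lock {                   /* stand-in for the client xlator's lock record */
        struct list_head list;
        /* owner, fd, offset/length of the locked range, ... */
};

/* If the lookup that produced `lock` can return NULL (for example when the
 * fd was already cleaned up after a disconnect), bail out instead of
 * letting list_del_init() dereference a NULL pointer. */
static void
remove_lock_if_present(struct client_lock *lock)
{
        if (!lock)
                return;                /* nothing to unlink */

        list_del_init(&lock->list);
        /* release/free the lock record here */
}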
fixes: gluster#2443
Change-Id: I86ef0270d41d6fb924db97fde3196d7c98c8b564
Signed-off-by: Pranith Kumar K <[email protected]>
@AlexNinaber Could you test with the RPMs generated for #2457 and let us know if this fixes the issue?
@pranithk Happy to try the rpm, where can I find it? |
@AlexNinaber Which frame did you try it in? Could you do this in frame 1?
@pranithk this is gdb on the core; lock is optimized out, so it's not immediately clear to me whether putting a break in would really help.
You don't need to put a breakpoint. In gdb, do:

If you are not on Slack, can you join using the Slack invite at https://www.gluster.org/community/?
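From the surrounding replies (frame 1, print lock), the gdb step being suggested amounts to something like the following; a sketch, not the original comment's exact commands:

gdb /usr/sbin/glusterfs -c core.3549
(gdb) frame 1        # move from list_del_init up to the caller (__insert_and_merge)
(gdb) print lock     # a value of 0x0 here means the lock pointer was NULL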
(gdb) p lock

I'll get Slack.
Cool, this confirms the theory.
Check under
> Upstream patch: gluster@00761df
> fixes: gluster#2443
> Change-Id: I86ef0270d41d6fb924db97fde3196d7c98c8b564
> Signed-off-by: Pranith Kumar K <[email protected]>

BUG: 1689375
Change-Id: I86ef0270d41d6fb924db97fde3196d7c98c8b564
Signed-off-by: karthik-us <[email protected]>
Reviewed-on: https://code.engineering.redhat.com/gerrit/c/rhs-glusterfs/+/245613
Tested-by: RHGS Build Bot <[email protected]>
Reviewed-by: Ravishankar Narayanankutty <[email protected]>
Reviewed-by: Sunil Kumar Heggodu Gopala Acharya <[email protected]>
Description of problem:
A number of services use a 3-replica gluster mount; rebooting one of the replicas always results in a core dump on the machine taking over the services. Once it has core dumped, the mount directory shows "Socket not connected".
How to trigger: reboot one of the replicas. Recreating the volume from scratch shows the same problem. The services in HA might hold the mounted gluster volume for a reasonably long time, so a clean umount might not occur; however, with 3 replicas this shouldn't matter.
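A rough outline of the reproduction described here, with placeholder paths and a generic lock-holding workload standing in for the real services (mongo/dhcp/slurm):

# fuse-mount the 3-replica volume on the client that will take over the services
mount -t glusterfs 10.141.255.254:/local /mnt/local

# keep lock-holding I/O running on the mount; any workload that takes posix
# locks stands in for the HA services here
touch /mnt/local/.lockfile
flock /mnt/local/.lockfile -c 'sleep 3600' &

# reboot one of the replica bricks
ssh 10.141.255.253 reboot

# after the fuse client crashes, the mount point returns "not connected" errors
ls /mnt/local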
gdb /usr/sbin/glusterfs -c core.3549
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-120.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/...
Reading symbols from /usr/sbin/glusterfsd...Reading symbols from /usr/sbin/glusterfsd...(no debugging symbols found)...done.
(no debugging symbols found)...done.
warning: core file may not match specified executable file.
[New LWP 3605]
[New LWP 3551]
[New LWP 3553]
[New LWP 3549]
[New LWP 3557]
[New LWP 3556]
[New LWP 3558]
[New LWP 3604]
[New LWP 3648]
[New LWP 3559]
[New LWP 3647]
[New LWP 3646]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs --process-name fuse --volfile-server=10.141.255.254 --volfi'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007f27bb5a985b in __insert_and_merge () from /usr/lib64/glusterfs/9.2/xlator/protocol/client.so
Missing separate debuginfos, use: debuginfo-install glusterfs-fuse-9.2-1.el7.x86_64
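To get a symbolized backtrace from this core, the matching debug symbols have to be installed first; a sketch of the usual RHEL/CentOS 7 workflow, assuming a debuginfo repository for the gluster 9.2 packages is reachable:

# install matching debug symbols (debuginfo-install comes from yum-utils and
# needs a debuginfo repository for the gluster 9.2 packages to be enabled)
debuginfo-install -y glusterfs-fuse-9.2-1.el7.x86_64

# re-open the core and dump all thread backtraces non-interactively
gdb --batch -ex 'thread apply all bt full' /usr/sbin/glusterfs core.3549 > bt.txt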
Mandatory info:
- The output of the gluster volume info command:

gluster volume info local
Volume Name: local
Type: Replicate
Volume ID: 04e9d8b5-2225-46c2-bcd2-78356e0581f1
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.141.255.254:/gluster/local
Brick2: 10.141.255.253:/gluster/local
Brick3: 10.141.11.1:/glusterssd/local
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
storage.fips-mode-rchecksum: on
performance.write-behind: off
performance.flush-behind: off
cluster.granular-entry-heal: enable
(the core dump also happens without granular-entry-heal)
- The output of the gluster volume status command:

gluster volume status local
Status of volume: local
Gluster process TCP Port RDMA Port Online Pid
Brick 10.141.255.254:/gluster/local 49154 0 Y 2680
Brick 10.141.255.253:/gluster/local 49154 0 Y 2757
Brick 10.141.11.1:/glusterssd/local 49154 0 Y 6150
Self-heal Daemon on localhost N/A N/A Y 2693
Self-heal Daemon on 10.141.11.1 N/A N/A Y 6196
Self-heal Daemon on 10.141.255.253 N/A N/A Y 2770
Task Status of Volume local
There are no active volume tasks
- The output of the gluster volume heal command:

gluster volume heal local
Launching heal operation to perform index self heal on volume local has been successful
Use heal info commands to check status.
It does not solve it:
gluster volume heal local info
Brick 10.141.255.254:/gluster/local
Status: Connected
Number of entries: 0
Brick 10.141.255.253:/gluster/local
Status: Connected
Number of entries: 0
Brick 10.141.11.1:/glusterssd/local
Status: Connected
Number of entries: 0
Socket still disconnected
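Once the fuse client process has crashed, the mount point keeps returning ENOTCONN regardless of heals; getting it back requires remounting on the client. A generic recovery sketch (the mount point path is an assumption):

# lazily unmount the dead fuse mount, then mount it again
umount -l /mnt/local
mount -t glusterfs 10.141.255.254:/local /mnt/local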
- Provide logs present on the following locations of client and server nodes: /var/log/glusterfs/
The message "W [MSGID: 114061] [client-common.c:2895:client_pre_lk_v2] 0-local-client-1: remote_fd is -1. EBADFD [{gfid=c8963045-a4b5-4dd6-b794-7ea4acb6614d}, {errno=77}, {error=File descriptor in
bad state}]" repeated 37 times between [2021-05-17 16:57:50.016590 +0000] and [2021-05-17 16:57:50.106029 +0000]
pending frames:
frame : type(1) op(LOOKUP)
frame : type(1) op(LK)
frame : type(1) op(FSYNC)
frame : type(1) op(FSYNC)
frame : type(1) op(FSYNC)
frame : type(1) op(FSYNC)
frame : type(1) op(FSYNC)
frame : type(1) op(FSYNC)
frame : type(1) op(FSYNC)
frame : type(1) op(FSYNC)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2021-05-17 16:57:50 +0000
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 9.2
/lib64/libglusterfs.so.0(+0x28d2f)[0x7f9b7b6dcd2f]
/lib64/libglusterfs.so.0(gf_print_trace+0x36a)[0x7f9b7b6e7dba]
/lib64/libc.so.6(+0x36400)[0x7f9b79912400]
/usr/lib64/glusterfs/9.2/xlator/protocol/client.so(+0x3b85b)[0x7f9b6c52d85b]
/usr/lib64/glusterfs/9.2/xlator/protocol/client.so(+0x3cc00)[0x7f9b6c52ec00]
/usr/lib64/glusterfs/9.2/xlator/protocol/client.so(+0x5903d)[0x7f9b6c54b03d]
/lib64/libgfrpc.so.0(+0xf7f1)[0x7f9b7b4867f1]
/lib64/libgfrpc.so.0(+0xfb65)[0x7f9b7b486b65]
/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7f9b7b483133]
/usr/lib64/glusterfs/9.2/rpc-transport/socket.so(+0x4418)[0x7f9b6f23b418]
/usr/lib64/glusterfs/9.2/rpc-transport/socket.so(+0x9d21)[0x7f9b6f240d21]
/lib64/libglusterfs.so.0(+0x8e13c)[0x7f9b7b74213c]
/lib64/libpthread.so.0(+0x7ea5)[0x7f9b7a114ea5]
/lib64/libc.so.6(clone+0x6d)[0x7f9b799da9fd]
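The frames above are raw offsets into the shared objects; with the matching debuginfo installed they can be mapped back to functions and source lines, for example (offsets taken from the trace above):

# resolve the two protocol/client frames from the crash signature
addr2line -f -e /usr/lib64/glusterfs/9.2/xlator/protocol/client.so 0x3b85b 0x3cc00
# prints "??" unless glusterfs-debuginfo for exactly this build is installed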
- Is there any crash? Provide the backtrace and coredump.
I don't see the debug rpm in the repo?
Additional info:
- The operating system / glusterfs version:
9.2 from the rpm repo