
Segmentation fault in gluster client #4271

Open
SowjanyaKotha opened this issue Nov 27, 2023 · 10 comments
SowjanyaKotha commented Nov 27, 2023

Description of problem:
Setup: 2-node mirrored volumes with clients installed on both nodes. When one of the nodes becomes faulty, it is removed and replaced with a new node with the same name/IP. While adding the brick, the active client crashes. The issue occurs randomly when SSL is enabled for I/O; it is not seen in non-SSL setups.

The exact command to reproduce the issue:
gluster volume add-brick efa_logs replica 2 10.18.120.135:/apps/opt/efa/logs force

The full output of the command that failed:

Expected results:
add-brick should be successful

Mandatory info:
- The output of the gluster volume status command:

Status of volume: efa_certs
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.18.120.136:/apps/opt/efa/certs     52847     0          Y       34686
Brick 10.18.120.135:/apps/opt/efa/certs     54321     0          Y       33999
Self-heal Daemon on localhost               N/A       N/A        Y       150192
Self-heal Daemon on 10.18.120.135           N/A       N/A        Y       34015

Task Status of Volume efa_certs
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: efa_logs
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.18.120.136:/apps/opt/efa/logs      56910     0          Y       34750
Brick 10.18.120.135:/apps/opt/efa/logs      56796     0          Y       34064
Self-heal Daemon on localhost               N/A       N/A        Y       150192
Self-heal Daemon on 10.18.120.135           N/A       N/A        Y       34015

Task Status of Volume efa_logs
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: efa_misc
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.18.120.136:/apps/opt/efa/misc      55691     0          Y       34799
Brick 10.18.120.135:/apps/opt/efa/misc      58871     0          Y       34167
Self-heal Daemon on localhost               N/A       N/A        Y       150192
Self-heal Daemon on 10.18.120.135           N/A       N/A        Y       34015

Task Status of Volume efa_misc
------------------------------------------------------------------------------
There are no active volume tasks

- The output of the gluster volume info command:

Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.18.120.135:/apps/opt/efa/logs
Brick2: 10.18.120.136:/apps/opt/efa/logs
Options Reconfigured:
ssl.ca-list: /apps/efadata/glusterfs/glusterfs.extreme-ca-chain.pem
ssl.own-cert: /apps/efadata/glusterfs/glusterfs.pem
ssl.private-key: /apps/efadata/glusterfs/glusterfs.key.pem
ssl.cipher-list: HIGH:!SSLv2:!SSLv3:!TLSv1:!TLSv1.1:TLSv1.2:!3DES:!RC4:!aNULL:!ADH
auth.ssl-allow: 10.18.120.135,10.18.120.136
server.ssl: on
client.ssl: on
ssl.certificate-depth: 3
network.ping-timeout: 2
performance.open-behind: on
cluster.favorite-child-policy: mtime
storage.owner-gid: 1001
storage.owner-uid: 0
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

- The output of the gluster volume heal command:

- Provide the logs present in the following location on client and server nodes:
/var/log/glusterfs/

- Is there any crash? Provide the backtrace and coredump:

(gdb) bt
#0  0x00007fa6f731bbad in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#1  0x00007fa6f731fe1e in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#2  0x00007fa6f731d6d0 in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#3  0x00007fa6f7324c45 in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#4  0x00007fa6f732fa3f in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#5  0x00007fa6f732fb47 in SSL_read () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#6  0x00007fa6f739dc94 in ssl_do (buf=<optimized out>, len=<optimized out>, func=<optimized out>, priv=<optimized out>, priv=<optimized out>) at socket.c:246
#7  0x00007fa6f739de36 in __socket_ssl_readv (opvector=opvector@entry=0x7fa6f6abedd0, opcount=opcount@entry=1, this=<optimized out>, this=<optimized out>) at socket.c:552
#8  0x00007fa6f739e35b in __socket_ssl_read (count=<optimized out>, buf=<optimized out>, this=0x555685ba1b98) at socket.c:572
#9  __socket_cached_read (opcount=1, opvector=0x555685699338, this=0x555685ba1b98) at socket.c:610
#10 __socket_rwv (this=this@entry=0x555685ba1b98, vector=<optimized out>, count=count@entry=1, pending_vector=pending_vector@entry=0x5556856993a8, pending_count=pending_count@entry=0x5556856993b4, bytes=bytes@entry=0x7fa6f6abeea0,
    write=0) at socket.c:721
#11 0x00007fa6f73a0438 in __socket_readv (bytes=0x7fa6f6abeea0, pending_count=0x5556856993b4, pending_vector=0x5556856993a8, count=1, vector=<optimized out>, this=0x555685ba1b98) at socket.c:2102
#12 __socket_read_frag (this=0x555685ba1b98) at socket.c:2102
#13 socket_proto_state_machine (pollin=<synthetic pointer>, this=0x555685ba1b98) at socket.c:2262
#14 socket_event_poll_in (notify_handled=true, this=0x555685ba1b98) at socket.c:2384
#15 socket_event_handler (event_thread_died=0, poll_err=0, poll_out=<optimized out>, poll_in=<optimized out>, data=0x555685ba1b98, gen=13, idx=2, fd=<optimized out>) at socket.c:2790
#16 socket_event_handler (fd=fd@entry=6, idx=idx@entry=2, gen=gen@entry=13, data=data@entry=0x555685ba1b98, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=0, event_thread_died=0) at socket.c:2710
#17 0x00007fa6fbade119 in event_dispatch_epoll_handler (event=0x7fa6f6abf054, event_pool=0x555685006018) at event-epoll.c:614
#18 event_dispatch_epoll_worker (data=0x555685036828) at event-epoll.c:725
#19 0x00007fa6fb9fa609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#20 0x00007fa6fb74b133 in clone () from /lib/x86_64-linux-gnu/libc.so.6
 
(gdb) f 5
#5  0x00007fa6f732fb47 in SSL_read () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
(gdb) info locals
No symbol table info available.
(gdb) f 9
#9  __socket_cached_read (opcount=1, opvector=0x555685699338, this=0x555685ba1b98) at socket.c:610
610     socket.c: No such file or directory.
(gdb) info locals
ret = -1
priv = 0x555685699218
in = 0x555685699318
req_len = 8
priv = <optimized out>
in = <optimized out>
req_len = <optimized out>
ret = <optimized out>
(gdb) l
605     in socket.c
(gdb) f 7
#7  0x00007fa6f739de36 in __socket_ssl_readv (opvector=opvector@entry=0x7fa6f6abedd0, opcount=opcount@entry=1, this=<optimized out>, this=<optimized out>) at socket.c:552
552     in socket.c
(gdb) info locals
priv = 0x555685699218
sock = <optimized out>
ret = -1
__FUNCTION__ = "__socket_ssl_readv"
(gdb) f 15
#15 socket_event_handler (event_thread_died=0, poll_err=0, poll_out=<optimized out>, poll_in=<optimized out>, data=0x555685ba1b98, gen=13, idx=2, fd=<optimized out>) at socket.c:2790
2790    in socket.c
(gdb) l
2785    in socket.c
(gdb) info locals
this = <optimized out>
ret = <optimized out>
ctx = <optimized out>
notify_handled = <optimized out>
priv = 0x555685699218
socket_closed = <optimized out>
this = <optimized out>
priv = <optimized out>
ret = <optimized out>
ctx = <optimized out>
socket_closed = <optimized out>
notify_handled = <optimized out>
__FUNCTION__ = "socket_event_handler"
sock_type = <optimized out>
sa = <optimized out>
(gdb)

Additional info:

- The operating system / glusterfs version:
It is reproducible with gluster versions 9.6 and 11.0 on Ubuntu, installed from Debian packages.

Note: Please hide any confidential data which you don't want to share in public like IP address, file name, hostname or any other configuration


samirsss commented Feb 7, 2024

Bump on this one to see if there is a solution


samirsss commented Feb 7, 2024

@amarts @avati - can you please point us in the right direction so that we can proceed? Segfaults are not typical, hence we're wondering why this is being ignored.

@aravindavk (Member)

I will look into this and update.

  • Is the add-brick command failing, or does the mount fail after the add-brick command?
  • Why not use the reset-brick command if the same brick is replaced?
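For context, the reset-brick flow referred to above looks roughly like this. This is a CLI sketch only, reusing the volume name and brick path from this issue; it requires a live cluster and is not runnable standalone:

```sh
# Sketch only: reset-brick replaces a brick in place, keeping the same path.
# Volume and brick names are taken from this issue's setup.
gluster volume reset-brick efa_logs 10.18.120.135:/apps/opt/efa/logs start
# ...re-provision the node / brick directory here...
gluster volume reset-brick efa_logs 10.18.120.135:/apps/opt/efa/logs \
    10.18.120.135:/apps/opt/efa/logs commit force
```

Unlike remove-brick followed by add-brick, reset-brick keeps the brick in the volume definition while it is being replaced.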


samirsss commented Feb 7, 2024

Thanks @aravindavk - @SowjanyaKotha will reply on this. Really appreciate the quick response here 👍


SowjanyaKotha commented Feb 8, 2024

@aravindavk The fault on the existing node's volume happens at different times. In most cases it is during add-brick, but it can happen during remove-brick as well.
When the node is replaced, the new node is clean and the gluster packages are freshly installed. The node is offline before the remove-brick is done, so we didn't use reset-brick.

@samirsss

@aravindavk any updates on this? We're hitting this issue consistently after a few attempts and hence pushing for a solution

@samirsss

@amarts @avati it seems like support for the project is lacking now. Can someone help, please?

@aravindavk (Member)

From the backtrace, I can see that SSL_read crashed.

What steps were used to set up the new node and the existing nodes (clients and servers)?

Was a new SSL key generated on the new node (the one used in the add-brick command), or was the SSL key file reused from the replaced node?

If cleanup was not done on the /usr/lib/ssl/glusterfs.ca file, then delete this file, or find the old node's certificate in it and replace it with the new node's details.
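As an illustration of that cleanup, here is a minimal sketch of rebuilding the combined CA bundle from per-node certificates. The node names (node1, node2) and the temp directory are stand-ins, not paths from this issue:

```shell
# Hypothetical sketch: rebuild a combined glusterfs.ca from per-node certs.
# node1/node2 and the temp directory are stand-ins for the real nodes/paths.
set -e
workdir=$(mktemp -d)
cd "$workdir"

# Generate a self-signed certificate per node (stand-ins for the real certs).
for node in node1 node2; do
  openssl req -x509 -newkey rsa:2048 -nodes \
    -keyout "$node.key" -out "$node.pem" \
    -subj "/CN=$node" -days 1
done

# glusterfs.ca is simply a concatenation of all trusted node certificates;
# rebuilding it from scratch drops any stale certificate of the replaced node.
cat node1.pem node2.pem > glusterfs.ca

# Sanity check: the bundle should contain exactly one cert per node.
grep -c 'BEGIN CERTIFICATE' glusterfs.ca
# prints: 2
```

Rebuilding the bundle this way avoids hand-editing PEM blocks out of the old file.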

@aravindavk (Member)

I tested this in our lab and couldn't reproduce the crash. The steps I did were:

  • Generate an SSL key and certificate on two nodes (server1.gluster and server2.gluster)
  • Create a two-node cluster: peer probe server2.gluster from server1.gluster
  • Create a replica 2 volume
  • Mount the volume on server1.gluster
  • Simulate a node failure (server2.gluster)
  • Set up a new node with the same hostname, server2.gluster
  • Create an SSL key and certificate on the new node
  • Regenerate the glusterfs.ca file and copy it to all nodes and clients
  • Start glusterd on the new node
  • Run peer probe again
  • Run the add-brick command to re-add the brick
  • Verify the existing mount by creating a new file

The details of the tests are available here:

https://github.com/aravindavk/gluster-tests?tab=readme-ov-file#gluster-tls-with-node-replacement-test
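The per-node key/certificate step in the list above can be sketched as follows. The glusterfs.key/glusterfs.pem names follow gluster's conventional /etc/ssl defaults; this issue's setup uses custom paths via ssl.own-cert and ssl.private-key, and the temp directory here is a stand-in:

```shell
# Hedged sketch: generate the new node's private key and self-signed cert.
# $outdir stands in for /etc/ssl (or whatever ssl.own-cert points at).
set -e
outdir=$(mktemp -d)

# Private key, then a self-signed certificate with the node's hostname as CN.
openssl genrsa -out "$outdir/glusterfs.key" 2048
openssl req -new -x509 -key "$outdir/glusterfs.key" \
  -subj "/CN=server2.gluster" -days 365 \
  -out "$outdir/glusterfs.pem"

# Verify the CN matches the hostname peers will connect to.
openssl x509 -in "$outdir/glusterfs.pem" -noout -subject
# prints something like: subject=CN = server2.gluster
```

The CN must match the name other peers use to reach the node, and the resulting certificate must end up in every node's glusterfs.ca bundle.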

SowjanyaKotha commented Feb 16, 2024

@aravindavk A new certificate is created for the node, but the issue happens randomly. If the certificate were not correct, it should always fail. Would it matter that the cert location is not the default one?
