
Segmentation fault in gluster client #4271

Open
SowjanyaKotha opened this issue Nov 27, 2023 · 10 comments
SowjanyaKotha commented Nov 27, 2023

Description of problem:
Setup: 2-node mirrored volumes with clients installed on both nodes. When one of the nodes becomes faulty, it is removed and replaced with a new node with the same name/IP. While adding the brick, the active client crashes. The issue occurs randomly when SSL is enabled for I/O; it is not seen in non-SSL setups.

The exact command to reproduce the issue:
gluster volume add-brick efa_logs replica 2 10.18.120.135:/apps/opt/efa/logs force

The full output of the command that failed:

Expected results:
add-brick should be successful

Mandatory info:
- The output of the gluster volume status command:

Status of volume: efa_certs
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.18.120.136:/apps/opt/efa/certs     52847     0          Y       34686
Brick 10.18.120.135:/apps/opt/efa/certs     54321     0          Y       33999
Self-heal Daemon on localhost               N/A       N/A        Y       150192
Self-heal Daemon on 10.18.120.135           N/A       N/A        Y       34015

Task Status of Volume efa_certs
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: efa_logs
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.18.120.136:/apps/opt/efa/logs      56910     0          Y       34750
Brick 10.18.120.135:/apps/opt/efa/logs      56796     0          Y       34064
Self-heal Daemon on localhost               N/A       N/A        Y       150192
Self-heal Daemon on 10.18.120.135           N/A       N/A        Y       34015

Task Status of Volume efa_logs
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: efa_misc
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.18.120.136:/apps/opt/efa/misc      55691     0          Y       34799
Brick 10.18.120.135:/apps/opt/efa/misc      58871     0          Y       34167
Self-heal Daemon on localhost               N/A       N/A        Y       150192
Self-heal Daemon on 10.18.120.135           N/A       N/A        Y       34015

Task Status of Volume efa_misc
------------------------------------------------------------------------------
There are no active volume tasks

- The output of the gluster volume info command:

Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: 10.18.120.135:/apps/opt/efa/logs
Brick2: 10.18.120.136:/apps/opt/efa/logs
Options Reconfigured:
ssl.ca-list: /apps/efadata/glusterfs/glusterfs.extreme-ca-chain.pem
ssl.own-cert: /apps/efadata/glusterfs/glusterfs.pem
ssl.private-key: /apps/efadata/glusterfs/glusterfs.key.pem
ssl.cipher-list: HIGH:!SSLv2:!SSLv3:!TLSv1:!TLSv1.1:TLSv1.2:!3DES:!RC4:!aNULL:!ADH
auth.ssl-allow: 10.18.120.135,10.18.120.136
server.ssl: on
client.ssl: on
ssl.certificate-depth: 3
network.ping-timeout: 2
performance.open-behind: on
cluster.favorite-child-policy: mtime
storage.owner-gid: 1001
storage.owner-uid: 0
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

- The output of the gluster volume heal command:

- Provide the logs present in the following location on client and server nodes:
/var/log/glusterfs/

- Is there any crash? Provide the backtrace and coredump:

(gdb) bt
#0  0x00007fa6f731bbad in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#1  0x00007fa6f731fe1e in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#2  0x00007fa6f731d6d0 in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#3  0x00007fa6f7324c45 in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#4  0x00007fa6f732fa3f in ?? () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#5  0x00007fa6f732fb47 in SSL_read () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
#6  0x00007fa6f739dc94 in ssl_do (buf=<optimized out>, len=<optimized out>, func=<optimized out>, priv=<optimized out>, priv=<optimized out>) at socket.c:246
#7  0x00007fa6f739de36 in __socket_ssl_readv (opvector=opvector@entry=0x7fa6f6abedd0, opcount=opcount@entry=1, this=<optimized out>, this=<optimized out>) at socket.c:552
#8  0x00007fa6f739e35b in __socket_ssl_read (count=<optimized out>, buf=<optimized out>, this=0x555685ba1b98) at socket.c:572
#9  __socket_cached_read (opcount=1, opvector=0x555685699338, this=0x555685ba1b98) at socket.c:610
#10 __socket_rwv (this=this@entry=0x555685ba1b98, vector=<optimized out>, count=count@entry=1, pending_vector=pending_vector@entry=0x5556856993a8, pending_count=pending_count@entry=0x5556856993b4, bytes=bytes@entry=0x7fa6f6abeea0,
    write=0) at socket.c:721
#11 0x00007fa6f73a0438 in __socket_readv (bytes=0x7fa6f6abeea0, pending_count=0x5556856993b4, pending_vector=0x5556856993a8, count=1, vector=<optimized out>, this=0x555685ba1b98) at socket.c:2102
#12 __socket_read_frag (this=0x555685ba1b98) at socket.c:2102
#13 socket_proto_state_machine (pollin=<synthetic pointer>, this=0x555685ba1b98) at socket.c:2262
#14 socket_event_poll_in (notify_handled=true, this=0x555685ba1b98) at socket.c:2384
#15 socket_event_handler (event_thread_died=0, poll_err=0, poll_out=<optimized out>, poll_in=<optimized out>, data=0x555685ba1b98, gen=13, idx=2, fd=<optimized out>) at socket.c:2790
#16 socket_event_handler (fd=fd@entry=6, idx=idx@entry=2, gen=gen@entry=13, data=data@entry=0x555685ba1b98, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=0, event_thread_died=0) at socket.c:2710
#17 0x00007fa6fbade119 in event_dispatch_epoll_handler (event=0x7fa6f6abf054, event_pool=0x555685006018) at event-epoll.c:614
#18 event_dispatch_epoll_worker (data=0x555685036828) at event-epoll.c:725
#19 0x00007fa6fb9fa609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#20 0x00007fa6fb74b133 in clone () from /lib/x86_64-linux-gnu/libc.so.6
 
(gdb) f 5
#5  0x00007fa6f732fb47 in SSL_read () from /usr/lib/x86_64-linux-gnu/libssl.so.1.1
(gdb) info locals
No symbol table info available.
(gdb) f 9
#9  __socket_cached_read (opcount=1, opvector=0x555685699338, this=0x555685ba1b98) at socket.c:610
610     socket.c: No such file or directory.
(gdb) info locals
ret = -1
priv = 0x555685699218
in = 0x555685699318
req_len = 8
priv = <optimized out>
in = <optimized out>
req_len = <optimized out>
ret = <optimized out>
(gdb) l
605     in socket.c
(gdb) f 7
#7  0x00007fa6f739de36 in __socket_ssl_readv (opvector=opvector@entry=0x7fa6f6abedd0, opcount=opcount@entry=1, this=<optimized out>, this=<optimized out>) at socket.c:552
552     in socket.c
(gdb) info locals
priv = 0x555685699218
sock = <optimized out>
ret = -1
__FUNCTION__ = "__socket_ssl_readv"
(gdb) f 15
#15 socket_event_handler (event_thread_died=0, poll_err=0, poll_out=<optimized out>, poll_in=<optimized out>, data=0x555685ba1b98, gen=13, idx=2, fd=<optimized out>) at socket.c:2790
2790    in socket.c
(gdb) l
2785    in socket.c
(gdb) info locals
this = <optimized out>
ret = <optimized out>
ctx = <optimized out>
notify_handled = <optimized out>
priv = 0x555685699218
socket_closed = <optimized out>
this = <optimized out>
priv = <optimized out>
ret = <optimized out>
ctx = <optimized out>
socket_closed = <optimized out>
notify_handled = <optimized out>
__FUNCTION__ = "socket_event_handler"
sock_type = <optimized out>
sa = <optimized out>
(gdb)

Additional info:

- The operating system / glusterfs version:
It is reproducible with gluster versions 9.6 and 11.0 on Ubuntu, installed from Debian packages.

Note: Please hide any confidential data which you don't want to share in public like IP address, file name, hostname or any other configuration


samirsss commented Feb 7, 2024

Bump on this one to see if there is a solution


samirsss commented Feb 7, 2024

@amarts @avati - can you please point us in the right direction so that we can proceed? Segfaults are not typical, hence we're wondering why this is being ignored.

@aravindavk (Member)

I will look into this and update.

  • Is the add-brick command failing, or does the mount fail after the add-brick command?
  • Why not use the reset-brick command if the same brick is replaced?
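For context, the reset-brick flow referred to above looks roughly like this. This is a CLI sketch only, reusing the volume name and brick path from this issue; it requires a live cluster and is not runnable standalone:

```sh
# Sketch only: reset-brick replaces a brick in place, keeping the same path.
# Volume and brick names are taken from this issue's setup.
gluster volume reset-brick efa_logs 10.18.120.135:/apps/opt/efa/logs start
# ...re-provision the node / brick directory here...
gluster volume reset-brick efa_logs 10.18.120.135:/apps/opt/efa/logs \
    10.18.120.135:/apps/opt/efa/logs commit force
```

Unlike remove-brick followed by add-brick, reset-brick keeps the brick in the volume definition while it is being replaced.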


samirsss commented Feb 7, 2024

Thanks @aravindavk - @SowjanyaKotha will reply on this. Really appreciate the quick response here 👍


SowjanyaKotha commented Feb 8, 2024

@aravindavk The fault on the existing node's volume happens at different times. In most cases it is during add-brick, but it can happen during remove-brick as well.
When the node is replaced, the new node is clean and the gluster packages are freshly installed. The node is offline before the remove-brick is done, so we didn't use reset-brick.

@samirsss

@aravindavk any updates on this? We're hitting this issue consistently after a few attempts and hence pushing for a solution

@samirsss

@amarts @avati it seems like support for the project is lacking now. Can someone help, please?

@aravindavk (Member)

From the backtrace, I can see that SSL_read crashed.

What steps were used to set up the new node and the existing nodes (clients and servers)?

Was a new SSL key generated on the new node (the one used in the add-brick command), or was the SSL key file reused from the replaced node?

If cleanup was not done on the /usr/lib/ssl/glusterfs.ca file, then delete this file, or find the old node's certificate in it and replace it with the new node's details.
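As an illustration of that cleanup, here is a minimal sketch of rebuilding the combined CA bundle from per-node certificates. The node names (node1, node2) and the temp directory are stand-ins, not paths from this issue:

```shell
# Hypothetical sketch: rebuild a combined glusterfs.ca from per-node certs.
# node1/node2 and the temp directory are stand-ins for the real nodes/paths.
set -e
workdir=$(mktemp -d)
cd "$workdir"

# Generate a self-signed certificate per node (stand-ins for the real certs).
for node in node1 node2; do
  openssl req -x509 -newkey rsa:2048 -nodes \
    -keyout "$node.key" -out "$node.pem" \
    -subj "/CN=$node" -days 1
done

# glusterfs.ca is simply a concatenation of all trusted node certificates;
# rebuilding it from scratch drops any stale certificate of the replaced node.
cat node1.pem node2.pem > glusterfs.ca

# Sanity check: the bundle should contain exactly one cert per node.
grep -c 'BEGIN CERTIFICATE' glusterfs.ca
# prints: 2
```

Rebuilding the bundle this way avoids hand-editing PEM blocks out of the old file.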

@aravindavk (Member)

I tested this in our lab and couldn't reproduce the crash. The steps I did were:

  • Generate an SSL key and certificate on two nodes (server1.gluster and server2.gluster)
  • Create a two-node cluster: peer probe server2.gluster from server1.gluster
  • Create a replica 2 volume
  • Mount the volume on server1.gluster
  • Simulate a node failure (server2.gluster)
  • Set up a new node with the same hostname, server2.gluster
  • Create an SSL key and certificate on the new node
  • Regenerate the glusterfs.ca file and copy it to all nodes and clients
  • Start glusterd on the new node
  • Run peer probe again
  • Run the add-brick command to re-add the brick
  • Verify the existing mount by creating a new file

The details of the tests are available here:

https://github.com/aravindavk/gluster-tests?tab=readme-ov-file#gluster-tls-with-node-replacement-test
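The per-node key/certificate step in the list above can be sketched as follows. The glusterfs.key/glusterfs.pem names follow gluster's conventional /etc/ssl defaults; this issue's setup uses custom paths via ssl.own-cert and ssl.private-key, and the temp directory here is a stand-in:

```shell
# Hedged sketch: generate the new node's private key and self-signed cert.
# $outdir stands in for /etc/ssl (or whatever ssl.own-cert points at).
set -e
outdir=$(mktemp -d)

# Private key, then a self-signed certificate with the node's hostname as CN.
openssl genrsa -out "$outdir/glusterfs.key" 2048
openssl req -new -x509 -key "$outdir/glusterfs.key" \
  -subj "/CN=server2.gluster" -days 365 \
  -out "$outdir/glusterfs.pem"

# Verify the CN matches the hostname peers will connect to.
openssl x509 -in "$outdir/glusterfs.pem" -noout -subject
# prints something like: subject=CN = server2.gluster
```

The CN must match the name other peers use to reach the node, and the resulting certificate must end up in every node's glusterfs.ca bundle.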

SowjanyaKotha commented Feb 16, 2024

@aravindavk A new certificate is created for the node, but the issue happens randomly. If the certificate were not correct, it should always fail. Would it matter that the cert location is not the default one?
