Router crash on sys_mutex_unlock #1134

Closed
ganeshmurthy opened this issue Jun 21, 2023 · 1 comment · Fixed by #1143
ganeshmurthy commented Jun 21, 2023

On our Longevity machine, we have had an iperf test running for the last 4 days, and we got 11 router restarts.
The test runs in 3 namespaces (public1, public2, private1), created in a single CRC cluster.
Each namespace has an iperf server exposed, but the client runs in the private1 namespace.
We send requests to all 3 services, forever, using these commands:

     while :; do
       iperf3 -c iperf3-server-a -n 1G -P 5
       iperf3 -c iperf3-server-b -n 1G -P 5
       iperf3 -c iperf3-server-c -n 1G -P 5
     done

These are the images we are using in this test:

    skupper version -n private1
      client version                 1.4.0.rc3-rh-2
      transport version              xxx.redhat.com/rh-osbs/service-interconnect-skupper-router-rhel9:2.4.0-8 (sha256:1235980cd823)
      controller version             xxx.redhat.com/rh-osbs/service-interconnect-service-controller-rhel9:1.4.0-3 (sha256:63116dbceb05)
      config-sync version            xxx.redhat.com/rh-osbs/service-interconnect-config-sync-rhel9:1.4.0-2 (sha256:030901042633)
      flow-collector version         not-found

This is the status of the pods:

Namespace: public1
  skupper-router-779c9dff6-k7nqj                2/2     Running   0          4d20h
  skupper-service-controller-69fb6bbf7d-tgkl7   1/1     Running   0          4d20h

Namespace: public2
  skupper-router-54978cd876-56fd2               2/2     Running   0          4d20h
  skupper-service-controller-6ff6576c69-wbkxl   1/1     Running   0          4d20h

Namespace: private1
  skupper-router-7c8fb87d75-fmnnf               2/2     Running   11 (8h ago)   4d20h
  skupper-service-controller-5679f4cf86-s57r8   1/1     Running   0             4d20h

From the pod logs:

  *** SKUPPER-ROUTER FATAL ERROR ***
  Version: 2.4.0-rc3-rh-8
  Signal: 11 SIGSEGV
  Process ID: 1 (skrouterd)
  Thread ID: 15 (wrkr_0)

and also:

  2023-06-20 05:09:31.720307 +0000 TCP_ADAPTOR (info) [C47525] PN_RAW_CONNECTION_DISCONNECTED listener, drained_buffers=0
  skrouterd: /remote-source/skupper-router/app/src/posix/threading.c:61: sys_mutex_unlock: Assertion `result == 0' failed.
  2023-06-20 05:09:31.720612 +0000 TCP_ADAPTOR (info) [C47523] PN_RAW_CONNECTION_DISCONNECTED listener, drained_buffers=0

  *** SKUPPER-ROUTER FATAL ERROR ***
  Version: 2.4.0-rc3-rh-8
  Signal: 6 SIGABRT
  Process ID: 1 (skrouterd)
  Thread ID: 16 (wrkr_1)

kgiusti self-assigned this Jun 21, 2023

kgiusti commented Jun 21, 2023

This appears to be a race involving the Q2 unblock handler. What I believe is happening is that the upstream (ingress) TCP connection is being force-closed while in the Q2-blocked state. That TCP connection is freed, but the downstream connection (AMQP inter-router) has just drained the ingress message buffers to relieve the Q2 block. This causes the downstream connection to invoke the Q2 unblock handler that the upstream TCP connection registered with the message. Since that connection has just been freed, the handler faults when it attempts to manipulate the upstream TCP connection.

Here is a traceback of the crash:

(gdb) bt
#0  __GI_abort () at abort.c:107
#1  0x00007f7f6a9c271b in __assert_fail_base (fmt=<optimized out>, assertion=<optimized out>, file=<optimized out>, line=<optimized out>, 
    function=<optimized out>) at assert.c:92
#2  0x00007f7f6a9e7ce6 in __assert_fail (assertion=assertion@entry=0x55c831e951d9 "result == 0", 
    file=file@entry=0x55c831e95a18 "/remote-source/skupper-router/app/src/posix/threading.c", line=line@entry=61, 
    function=0x55c831e96c40 <__PRETTY_FUNCTION__.14.lto_priv.0> "sys_mutex_unlock") at assert.c:101
#3  0x000055c831e0a8f7 in sys_mutex_unlock (mutex=<optimized out>) at /remote-source/skupper-router/app/src/posix/threading.c:61
#4  0x000055c831ddaca6 in sys_mutex_unlock (mutex=0x7f7f584bd478) at /remote-source/skupper-router/app/src/adaptors/tcp/tcp_adaptor.c:300
#5  qdr_tcp_q2_unblocked_handler (context=...) at /remote-source/skupper-router/app/src/adaptors/tcp/tcp_adaptor.c:306
#6  0x000055c831e078cb in qd_message_send (in_msg=<optimized out>, link=<optimized out>, ra_flags=<optimized out>, q3_stalled=<optimized out>)
    at /remote-source/skupper-router/app/src/message.c:1926
#7  0x000055c831e4e4b5 in CORE_link_deliver (context=0x55c833b074b0, link=0x7f7f5027fc48, dlv=0x7f7f4427a888, settled=<optimized out>)
    at /remote-source/skupper-router/app/src/router_node.c:1973
#8  0x000055c831e3b592 in qdr_link_process_deliveries (core=<optimized out>, link=0x7f7f5027fc48, credit=<optimized out>)
    at /remote-source/skupper-router/app/src/router_core/transfer.c:180
#9  0x000055c831e22549 in qdr_connection_process (conn=0x7f7f4411fa48) at /remote-source/skupper-router/app/src/router_core/connections.c:438
#10 0x000055c831e89018 in writable_handler.constprop.0 (container=0x55c833aa0240, qd_conn=0x7f7f5800a288, conn=<optimized out>)
    at /remote-source/skupper-router/app/src/container.c:388
#11 0x000055c831e5992d in thread_run (arg=0x55c833aac0c0) at /remote-source/skupper-router/app/src/server.c:1134
#12 0x00007f7f6aa39802 in start_thread (arg=<optimized out>) at pthread_create.c:443
#13 0x00007f7f6a9d9450 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
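
To make the ordering concrete, here is a minimal, self-contained sketch of that use-after-free pattern. The type and function names (upstream_conn_t, q2_unblocked_handler, downstream_send, etc.) are illustrative only, not the actual skupper-router API; the real code paths are qdr_tcp_q2_unblocked_handler and qd_message_send shown in the traceback above. The point is that freeing the upstream connection does not clear the handler/context still registered with the in-flight message, so a later Q2-unblock callback dereferences freed memory (here, its mutex).

    /* Hypothetical illustration of the suspected race; compile with -lpthread. */
    #include <pthread.h>
    #include <stdlib.h>

    typedef struct upstream_conn_t {
        pthread_mutex_t lock;        /* stands in for the TCP adaptor's activation lock */
        int             q2_blocked;
    } upstream_conn_t;

    typedef struct message_t {
        void (*q2_unblock_handler)(void *context); /* registered by the upstream conn */
        void  *q2_unblock_context;                 /* points at the upstream conn */
    } message_t;

    /* Registered by the ingress TCP connection when it hits the Q2 limit. */
    static void q2_unblocked_handler(void *context)
    {
        upstream_conn_t *conn = (upstream_conn_t *) context;
        pthread_mutex_lock(&conn->lock);   /* touches freed memory if conn is gone */
        conn->q2_blocked = 0;
        pthread_mutex_unlock(&conn->lock); /* corresponds to the failing sys_mutex_unlock */
    }

    /* "Thread A": the ingress TCP connection is force-closed and freed while
     * still Q2-blocked; the handler registered with the message is not cleared. */
    static void close_upstream(upstream_conn_t *conn)
    {
        pthread_mutex_destroy(&conn->lock);
        free(conn);
    }

    /* "Thread B": the downstream (inter-router) link drains the message buffers
     * below the Q2 threshold and invokes the registered handler. */
    static void downstream_send(message_t *msg)
    {
        if (msg->q2_unblock_handler)
            msg->q2_unblock_handler(msg->q2_unblock_context); /* use-after-free */
    }

    int main(void)
    {
        upstream_conn_t *conn = calloc(1, sizeof *conn);
        pthread_mutex_init(&conn->lock, NULL);
        conn->q2_blocked = 1;

        message_t msg = { q2_unblocked_handler, conn };

        close_upstream(conn);   /* upstream force-close frees the connection ...   */
        downstream_send(&msg);  /* ... then the unblock callback runs on freed memory */
        return 0;
    }

In the sketch the two steps run sequentially for clarity; in the router they run on different worker threads, which is why the failure only shows up intermittently under sustained load like the iperf loop above.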
