Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

queue-test: a message queue gets closed while still being used #1164

Open
Tracked by #1161
stanislaw opened this issue Sep 22, 2021 · 0 comments
Open
Tracked by #1161

queue-test: a message queue gets closed while still being used #1164

stanislaw opened this issue Sep 22, 2021 · 0 comments
Labels
bug unit-test Tickets related to the OSAL unit testing (functional and/or coverage)

Comments

@stanislaw
Copy link
Contributor

Is your feature request related to a problem? Please describe.

This report is similar to #1160 because a resource, in this case a message queue, gets closed while it's still being used by another thread.

This issue silently works on Linux because of its mqueue implementation details. The macOS implementation of mqueue in #1161 maintains a mutex in each of its methods including the mq_close method.

The deadlock results from the following collision:

  1. Main thread attempts to OS_QueueDelete that calls into mq_close that wants to acquire the queue's mutex.
  2. The task1 that has been deleted just before was still waiting on mq_timedreceive holding the queue's mutex.
    status = OS_TimerDelete(timer_id);
    UtAssert_True(status == OS_SUCCESS, "Timer delete Rc=%d", (int)status);

    // When a task is deleted below, its executing thread gets cancelled and destroyed but at that moment 
    // the task1 is still waiting on the message queue.
    status = OS_TaskDelete(task_1_id);
    UtAssert_True(status == OS_SUCCESS, "Task 1 delete Rc=%d", (int)status);
    status = OS_QueueDelete(msgq_id); // deadlock here
    UtAssert_True(status == OS_SUCCESS, "Queue 1 delete Rc=%d", (int)status);

Describe the solution you'd like

One could argue that the queue-test works on Linux and therefore the macOS implementation of mqueue should accommodate. At the same time, similar to the #1160, it looks like macOS implementation actually highlights the fact that the queue-test relied on undefined behavior.

Describe alternatives you've considered

For now, there is a custom hack to trylock on the mqueue's mutex before actually trying to close the mqueue.

    // TODO: without this trylock, queue-test deadlocks on macOS
    if ((n = pthread_mutex_trylock(&mqhdr->mqh_lock)) == EBUSY)
    {
        (void)pthread_mutex_unlock(&mqhdr->mqh_lock);
    }
    if ((n = pthread_mutex_lock(&mqhdr->mqh_lock)) != 0) {
        errno = n;
        return (-1);
    }

Additional context

See the stacktrace of the blocked thread below. Note that I cannot provide the second thread that holds the lock because that thread is pthread_cancelled by that point (and the mutex has leaked).

Requester Info

Stanislav Pankevich (Personal contribution)


Stacktrace:

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007fff203b64ca libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff203e72ab libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 76
    frame #2: 0x00007fff203e5192 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 204
    frame #3: 0x00000001076e20af queue-test`mq_notify(mqd=0x00007fdcc96040e0, notification=0x0000000000000000) at mq_notify.c:26:14
    frame #4: 0x00000001076e366e queue-test`mq_close(mqd=0x00007fdcc96040e0) at mq_close.c:25:9
    frame #5: 0x00000001076e00b7 queue-test`OS_QueueDelete_Impl(token=0x00007ffee85328a8) at os-impl-queues.c:175:9
    frame #6: 0x00000001076dae67 queue-test`OS_QueueDelete(queue_id=131073) at osapi-queue.c:142:23
    frame #7: 0x00000001076d12e5 queue-test`QueueTimeoutCheck at queue-test.c:122:14
    frame #8: 0x00000001076d4309 queue-test`UtTest_Run at uttest.c:174:17
    frame #9: 0x00000001076d38f9 queue-test`OS_Application_Run at utbsp.c:232:5
    frame #10: 0x00000001076e6dfa queue-test`main(argc=1, argv=0x00007ffee8532a08) at bsp_start.c:247:5
    frame #11: 0x00007fff20404f3d libdyld.dylib`start + 1
    frame #12: 0x00007fff20404f3d libdyld.dylib`start + 1

Full test log, note that the mqueue's mutex is locked 19 times but unlocked only 18 times.

/Users/stanislaw/workspace/projects/code/osal/cmake-build-debug/tests/queue-test

[BEGIN] UNIT TEST

[BEGIN] 01 SETUP
[  END] No test cases

[BEGIN] 01 QueueTimeoutTest
[ PASS] 01.001 queue-test.c:154 - MsgQ create Id=20001 Rc=0
[ PASS] 01.002 queue-test.c:161 - Task 1 create Id=10001 Rc=0
[ PASS] 01.003 queue-test.c:167 - Timer 1 create Id=90001 Rc=0
[ INFO] queue-test.c:168:Timer Accuracy = 10000 microseconds 
Starting task 1
Delay for 1 second before starting
[ PASS] 01.004 queue-test.c:174 - Timer 1 set Rc=0
mq_timedreceive: lock
mq_timedreceive: unlock
mq_timedreceive: lock
TASK 1: Timeout on Queue! Timer counter = 10
mq_timedreceive: unlock
mq_timedreceive: lock
TASK 1: Timeout on Queue! Timer counter = 15
mq_timedreceive: unlock
mq_timedreceive: lock
TASK 1: Timeout on Queue! Timer counter = 20
mq_timedreceive: unlock
mq_timedreceive: lock
TASK 1: Timeout on Queue! Timer counter = 25
mq_timedreceive: unlock
mq_timedreceive: lock
TASK 1: Timeout on Queue! Timer counter = 30
mq_timedreceive: unlock
mq_timedreceive: lock
TASK 1: Timeout on Queue! Timer counter = 35
mq_timedreceive: unlock
mq_timedreceive: lock
TASK 1: Timeout on Queue! Timer counter = 40
mq_timedreceive: unlock
mq_timedreceive: lock
TASK 1: Timeout on Queue! Timer counter = 45
mq_timedreceive: unlock
mq_timedreceive: lock
TASK 1: Timeout on Queue! Timer counter = 50
mq_timedreceive: unlock
mq_timedreceive: lock
TASK 1: Timeout on Queue! Timer counter = 55
mq_timedreceive: unlock
mq_timedreceive: lock
TASK 1: Timeout on Queue! Timer counter = 60
mq_timedreceive: unlock
mq_timedreceive: lock
TASK 1: Timeout on Queue! Timer counter = 66
mq_timedreceive: unlock
mq_timedreceive: lock
TASK 1: Timeout on Queue! Timer counter = 70
mq_timedreceive: unlock
mq_timedreceive: lock
TASK 1: Timeout on Queue! Timer counter = 75
mq_timedreceive: unlock
mq_timedreceive: lock
TASK 1: Timeout on Queue! Timer counter = 80
mq_timedreceive: unlock
mq_timedreceive: lock
TASK 1: Timeout on Queue! Timer counter = 85
mq_timedreceive: unlock
mq_timedreceive: lock
TASK 1: Timeout on Queue! Timer counter = 90
mq_timedreceive: unlock
mq_timedreceive: lock
TASK 1: Timeout on Queue! Timer counter = 95
[ PASS] 01.005 queue-test.c:119 - Timer delete Rc=0
[ PASS] 01.006 queue-test.c:121 - Task 1 delete Rc=0
@skliper skliper added bug unit-test Tickets related to the OSAL unit testing (functional and/or coverage) labels Sep 28, 2021
jphickey added a commit to jphickey/osal that referenced this issue Aug 10, 2022
For commands containing file names, replace the call to
CFE_SB_MessageStringGet() - which is just a basic copy - to
the new filename-aware function CFE_FS_ParseInputFileName().

This means that the default pathname/extension logic is applied
here too and only a "basename" is strictly necessary, although
if a full/absolute path is given, it will be used as is.
jphickey pushed a commit to jphickey/osal that referenced this issue Aug 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug unit-test Tickets related to the OSAL unit testing (functional and/or coverage)
Projects
None yet
Development

No branches or pull requests

2 participants