"Crash" on TestInetEndPoint in CI on Darwin #10025

bzbarsky-apple · 2021-09-28T16:36:42Z

Problem

Sometimes the TestInetEndPoint test "crashes" in CI on Darwin like so:

../../third_party/pigweed/repo/targets/host/run_test: line 18: 17937 Abort trap: 6           "$@"
INF Test 1/1: [FAIL] TestInetEndPoint

We added diagnostic log upload, but it turns out this is not actually a crash; it's an abort() call, so it looks like that does not generate a diagnostic log...

But the logging we added for VerifyOrDie does catch this in https://github.com/project-chip/connectedhomeip/runs/3733865170?check_suite_focus=true:

[1632840981785] [17937:91664] CHIP: [IN] Async DNS worker thread woke up.
[1632840981785] [17937:91664] CHIP: [IN] Posting DNS completion event to CHIP thread.
[1632840981785] [17937:91640] CHIP: [SPT] VerifyOrDie failure at ../../src/system/SystemTimer.cpp:152: lTimer->mNextTimer != add

Proposed Solution

Sort out why that VerifyOrDie is getting hit and fix it.

@mspang @kpschoedel @andy31415 @woody-apple

bzbarsky-apple · 2021-09-28T19:41:47Z

So the abort is because something is adding a timer that is already in the timer list.
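To illustrate why adding an already-linked timer is fatal, here is a minimal sketch of a sorted intrusive list. It is not the actual SystemTimer.cpp code: `SketchTimer`, `SketchTimerList` and their members are hypothetical names, and this version reports the duplicate instead of aborting the way VerifyOrDie does.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for chip::System::Timer; only the fields needed here.
struct SketchTimer
{
    uint64_t awakenTime = 0;
    SketchTimer * next  = nullptr;
};

// Hypothetical sorted, singly linked, intrusive list (earliest timer first).
class SketchTimerList
{
public:
    // Returns false if `add` is already linked in. Linking the same node a
    // second time would overwrite its `next` pointer and can create a cycle,
    // which is the kind of state the VerifyOrDie check guards against.
    bool Add(SketchTimer * add)
    {
        if (Contains(add))
        {
            return false;
        }
        SketchTimer ** link = &mHead;
        while (*link != nullptr && (*link)->awakenTime <= add->awakenTime)
        {
            link = &(*link)->next;
        }
        add->next = *link;
        *link     = add;
        return true;
    }

    bool Contains(const SketchTimer * timer) const
    {
        for (const SketchTimer * cur = mHead; cur != nullptr; cur = cur->next)
        {
            if (cur == timer)
            {
                return true;
            }
        }
        return false;
    }

    std::size_t Size() const
    {
        std::size_t count = 0;
        for (const SketchTimer * cur = mHead; cur != nullptr; cur = cur->next)
        {
            ++count;
        }
        return count;
    }

private:
    SketchTimer * mHead = nullptr;
};
```

In this sketch a second `Add` of the same pointer is rejected and the list is left unchanged, so the list itself stays consistent as long as all callers run on one thread.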

I did some test runs with the log line above replaced with Mac-specific backtrace dumping, and the stack trace (sorry, no line numbers, but I expect them to be pretty obvious for the most part) looks like this:

chip::System::Timer::List::Add(chip::System::Timer*)
chip::System::Timer::MutexedList::Add(chip::System::Timer*)
chip::System::LayerImplSelect::StartTimer(unsigned int, void (*)(chip::System::Layer*, void*), void*)
ServiceEvents(unsigned int)
ServiceNetwork(unsigned int)
TestResolveHostAddress(_nlTestSuite*, void*)

The only real line-number ambiguity is which of the ServiceNetwork calls in TestResolveHostAddress is involved. The test log has:

[1632856161377] [67621:187410] CHIP: [IN] Async DNS worker thread woke up.
[1632856161381] [67621:187410] CHIP: [IN] Posting DNS completion event to CHIP thread.
    DNS name resolution complete: 142.250.69.196
    DNS name resolution complete: 127.0.0.1
[1632856161381] [67621:187411] CHIP: [IN] Async DNS worker thread woke up.
[1632856161381] [67621:187411] CHIP: [IN] Posting DNS completion event to CHIP thread.
../../third_party/pigweed/repo/targets/host/run_test: line 18: 67621 Abort trap: 6           "$@"

This annoyingly mixes ChipLog output with printf output, which makes it hard to tell what the actual ordering was, because I am not sure I can assume it matches the output order. Will see if I can fix things here to use ChipLog consistently and reproduce again.

bzbarsky-apple · 2021-09-29T01:54:25Z

OK, so I tried running the unit tests under TSan and then everything becomes clear:

WARNING: ThreadSanitizer: data race (pid=21940)
  Read of size 8 at 0x0001021ac828 by thread T1:
    #0 chip::System::Timer::List::Remove(void (*)(chip::System::Layer*, void*), void*) SystemTimer.cpp:201 (TestInetEndPoint:x86_64+0x10004a71c)
    #1 chip::System::Timer::MutexedList::Remove(void (*)(chip::System::Layer*, void*), void*) SystemTimer.h:156 (TestInetEndPoint:x86_64+0x100046b0a)
    #2 chip::System::LayerImplSelect::CancelTimer(void (*)(chip::System::Layer*, void*), void*) SystemLayerImplSelect.cpp:166 (TestInetEndPoint:x86_64+0x100046a06)
    #3 chip::System::LayerImplSelect::ScheduleWork(void (*)(chip::System::Layer*, void*), void*) SystemLayerImplSelect.cpp:187 (TestInetEndPoint:x86_64+0x100046bf0)
    #4 chip::Inet::AsyncDNSResolverSockets::NotifyChipThread(chip::Inet::DNSResolver*) AsyncDNSResolverSockets.cpp:316 (TestInetEndPoint:x86_64+0x100043437)
    #5 chip::Inet::AsyncDNSResolverSockets::AsyncDNSThreadRun(void*) AsyncDNSResolverSockets.cpp:342 (TestInetEndPoint:x86_64+0x1000428be)

  Previous write of size 8 at 0x0001021ac828 by main thread:
    #0 chip::System::Timer::List::Remove(void (*)(chip::System::Layer*, void*), void*) SystemTimer.cpp:207 (TestInetEndPoint:x86_64+0x10004a7a9)
    #1 chip::System::Timer::MutexedList::Remove(void (*)(chip::System::Layer*, void*), void*) SystemTimer.h:156 (TestInetEndPoint:x86_64+0x100046b0a)
    #2 chip::System::LayerImplSelect::CancelTimer(void (*)(chip::System::Layer*, void*), void*) SystemLayerImplSelect.cpp:166 (TestInetEndPoint:x86_64+0x100046a06)
    #3 chip::System::LayerImplSelect::StartTimer(unsigned int, void (*)(chip::System::Layer*, void*), void*) SystemLayerImplSelect.cpp:123 (TestInetEndPoint:x86_64+0x1000465ae)
    #4 ServiceEvents(unsigned int) TestInetCommonPosix.cpp:464 (TestInetEndPoint:x86_64+0x100031238)

How could we have a data race under MutexedList::Remove, you ask?

src/platform/Darwin/SystemPlatformConfig.h:#define CHIP_SYSTEM_CONFIG_NO_LOCKING 1

and from src/system/SystemMutex.h:

#if CHIP_SYSTEM_CONFIG_NO_LOCKING
inline CHIP_ERROR Init(Mutex & aMutex)
{
    return CHIP_NO_ERROR;
}
inline void Mutex::Lock() {}
inline void Mutex::Unlock() {}
#endif // CHIP_SYSTEM_CONFIG_NO_LOCKING

So the timer handling ends up completely unsynchronized: we have data races and sometimes end up in a bad state, which is not surprising at that point.
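As a rough sketch of the effect, consider a list whose mutex type is a template parameter. All names here (`NoOpMutex`, `GuardedList`, `AddFromTwoThreads`) are hypothetical and not part of the CHIP sources. With `NoOpMutex`, the lock guard compiles to nothing, which is what the CHIP_SYSTEM_CONFIG_NO_LOCKING definitions above amount to; that is only safe if a single thread ever touches the list.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical no-op mutex: same interface as std::mutex, but no exclusion.
struct NoOpMutex
{
    void lock() {}
    void unlock() {}
};

// Hypothetical list that guards every access with whatever mutex it is given.
template <typename MutexT>
class GuardedList
{
public:
    void Add(int value)
    {
        std::lock_guard<MutexT> guard(mMutex);
        mItems.push_back(value);
    }

    std::size_t Size()
    {
        std::lock_guard<MutexT> guard(mMutex);
        return mItems.size();
    }

private:
    MutexT mMutex;
    std::vector<int> mItems;
};

// Adds `perThread` items from each of two threads using a real mutex and
// returns the final size. Doing the same with GuardedList<NoOpMutex> would
// be a data race (undefined behavior), so it is deliberately not run here.
inline std::size_t AddFromTwoThreads(std::size_t perThread)
{
    GuardedList<std::mutex> list;
    auto worker = [&list, perThread] {
        for (std::size_t i = 0; i < perThread; ++i)
        {
            list.Add(static_cast<int>(i));
        }
    };
    std::thread first(worker);
    std::thread second(worker);
    first.join();
    second.join();
    return list.Size();
}
```

The call sites look identical for both mutex types, which is why the race is easy to miss when reading MutexedList alone: only the platform config reveals that the locking is a no-op.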

@vivien-apple @kpschoedel @sagar-apple We need to figure out what the actual contract here is going to be when CHIP_SYSTEM_CONFIG_USE_DISPATCH is set. Two obvious problems:

1. If GetDispatchQueue() returns null (as it does in this test), we fall back to the non-dispatch codepaths, but those most definitely don't work if CHIP_SYSTEM_CONFIG_NO_LOCKING is set.
2. Even if GetDispatchQueue() is non-null, we store things in mTimerList, which ends up racy, because CHIP_SYSTEM_CONFIG_NO_LOCKING is set.

bzbarsky-apple · 2021-09-29T02:54:11Z

Maybe the answer is we should not be using AsyncDNSResolverSockets on Darwin if we're not supposed to be spinning up extra threads? And then we need a different impl of InetLayer::ResolveHostAddress...

In the use_dispatch setup the Matter stack is expected to not spin up threads manually, but our async DNS implementation does just that via pthreads. This is leading to random failures in TestInetEndpoint on Darwin due to thread data races. project-chip#10025

woody-apple linked a pull request Sep 29, 2021 that will close this issue

Add DeviceController::PairDevice(NodeId remoteId, const char * setupCode) API that looks for the device over supported networks resolves #9343 #9847

Merged

woody-apple removed a link to a pull request Sep 29, 2021

Add DeviceController::PairDevice(NodeId remoteId, const char * setupCode) API that looks for the device over supported networks resolves #9343 #9847

Merged

bzbarsky-apple mentioned this issue Sep 29, 2021

Disable async DNS when use_dispatch is enabled. #10060

Merged

andy31415 closed this as completed in #10060 Sep 29, 2021
