Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TSAN failures in system layer #7803

Closed
mspang opened this issue Jun 21, 2021 · 4 comments
Closed

TSAN failures in system layer #7803

mspang opened this issue Jun 21, 2021 · 4 comments
Assignees

Comments

@mspang
Copy link
Contributor

mspang commented Jun 21, 2021

Currently chip::System::Object results in multiple TSAN failures but as it is built using the legacy (non-memory-model-aware) gcc __sync builtins it's possible these errors are false positives.

Changing to an implementation built with C++11 primitives will fix this issue, allow enabling TSAN on the unit test suite.

e.g.

./gn_build.sh is_tsan=true is_clang=true

WARNING: ThreadSanitizer: data race (pid=402928)
  Read of size 8 at 0x557080810418 by main thread:
    #0 chip::System::Timer::IsEarlierEpoch(unsigned long const&, unsigned long const&) /ssd/src/connectedhomeip4/out/debug/../../src/system/SystemTimer.cpp:111:25 (TestInetLayerDNS+0x127c36)
    #1 chip::System::Layer::GetTimeout(timeval&) /ssd/src/connectedhomeip4/out/debug/../../src/system/SystemLayer.cpp:621:18 (TestInetLayerDNS+0x126e13)
    #2 chip::System::WatchableEventManager::PrepareEventsWithTimeout(timeval&) /ssd/src/connectedhomeip4/out/debug/../../src/system/WatchableSocketSelect.cpp:164:19 (TestInetLayerDNS+0x128047)
    #3 ServiceEvents(timeval&) /ssd/src/connectedhomeip4/out/debug/../../src/inet/tests/TestInetCommonPosix.cpp:460:36 (TestInetLayerDNS+0x1239c3)
    #4 ServiceNetwork(timeval&) /ssd/src/connectedhomeip4/out/debug/../../src/inet/tests/TestInetCommon.h:73:5 (TestInetLayerDNS+0x123026)
    #5 ServiceNetworkUntilDone(unsigned int) /ssd/src/connectedhomeip4/out/debug/../../src/inet/tests/TestInetLayerDNS.cpp:664:9 (TestInetLayerDNS+0x123026)
    #6 RunTestCase(_nlTestSuite*, DNSResolutionTestCase const&) /ssd/src/connectedhomeip4/out/debug/../../src/inet/tests/TestInetLayerDNS.cpp:515:5 (TestInetLayerDNS+0x123026)
    #7 TestDNSResolution_RestrictedResults(_nlTestSuite*, void*) /ssd/src/connectedhomeip4/out/debug/../../src/inet/tests/TestInetLayerDNS.cpp:215:5 (TestInetLayerDNS+0x1229cf)
    #8 nlTestRunner /ssd/src/connectedhomeip4/out/debug/../../third_party/nlunit-test/repo/src/nlunit-test.c:213:9 (TestInetLayerDNS+0x150065)
    #9 TestInetLayerDNSInternal() /ssd/src/connectedhomeip4/out/debug/../../src/inet/tests/TestInetLayerDNS.cpp:711:5 (TestInetLayerDNS+0x12291d)
    #10 main /ssd/src/connectedhomeip4/out/debug/linux_x64_clang/gen/TestInetLayerDNS.driver.cpp:32:20 (TestInetLayerDNS+0x1225a9)

  Previous write of size 8 at 0x557080810418 by thread T1:
    #0 chip::System::Timer::ScheduleWork(void (*)(chip::System::Layer*, void*, int), void*) /ssd/src/connectedhomeip4/out/debug/../../src/system/SystemTimer.cpp:212:24 (TestInetLayerDNS+0x127d73)
    #1 chip::System::Layer::ScheduleWork(void (*)(chip::System::Layer*, void*, int), void*) /ssd/src/connectedhomeip4/out/debug/../../src/system/SystemLayer.cpp:396:23 (TestInetLayerDNS+0x126b3a)
    #2 chip::Inet::AsyncDNSResolverSockets::NotifyChipThread(chip::Inet::DNSResolver*) /ssd/src/connectedhomeip4/out/debug/../../src/inet/AsyncDNSResolverSockets.cpp:316:18 (TestInetLayerDNS+0x125383)
    #3 chip::Inet::AsyncDNSResolverSockets::AsyncDNSThreadRun(void*) /ssd/src/connectedhomeip4/out/debug/../../src/inet/AsyncDNSResolverSockets.cpp:342:9 (TestInetLayerDNS+0x125383)

  Location is global 'chip::System::Timer::sPool' of size 648 at 0x5570808103f8 (TestInetLayerDNS+0x000000c4a418)

  Thread T1 (tid=402957, running) created by main thread at:
    #0 pthread_create ../staging/llvm_build/tools/clang/stage2-bins/runtimes/runtimes-x86_64-unknown-linux-gnu-bins/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:973:3 (TestInetLayerDNS+0xb11ab)
    #1 chip::Inet::AsyncDNSResolverSockets::Init(chip::Inet::InetLayer*) /ssd/src/connectedhomeip4/out/debug/../../src/inet/AsyncDNSResolverSockets.cpp:71:22 (TestInetLayerDNS+0x1252f0)
    #2 chip::Inet::InetLayer::Init(chip::System::Layer&, void*) /ssd/src/connectedhomeip4/out/debug/../../src/inet/InetLayer.cpp:287:29 (TestInetLayerDNS+0x123f25)
    #3 InitNetwork() /ssd/src/connectedhomeip4/out/debug/../../src/inet/tests/TestInetCommonPosix.cpp:440:11 (TestInetLayerDNS+0x12392f)
    #4 TestInetLayerDNSInternal() /ssd/src/connectedhomeip4/out/debug/../../src/inet/tests/TestInetLayerDNS.cpp:707:5 (TestInetLayerDNS+0x122913)
    #5 main /ssd/src/connectedhomeip4/out/debug/linux_x64_clang/gen/TestInetLayerDNS.driver.cpp:32:20 (TestInetLayerDNS+0x1225a9)

SUMMARY: ThreadSanitizer: data race /ssd/src/connectedhomeip4/out/debug/../../src/system/SystemTimer.cpp:111:25 in chip::System::Timer::IsEarlierEpoch(unsigned long const&, unsigned long const&)
@mspang
Copy link
Contributor Author

mspang commented Jun 21, 2021

@mrjerryjohns FYI

@mrjerryjohns
Copy link
Contributor

I'd think the right fix in the short-term is to fix Platform::Layer::PostEvent on POSIX to be thread-safe...

This likely isn't a problem on Darwin.

@kpschoedel
Copy link
Contributor

The headline issue (for timers) no longer exists after changes in #8800. Arguably that code is locking too much in light of issue #5556, but the cost is isolated. It's required (as in the headline case) due to the SDK-internal threading in AsyncDNSResolverSockets.

System::Object has a similar problem with pool allocation in general, which shows up in TestSystemObject and I don't currently see a simple C++ solution; making mSystemLayer atomic would be expensive since it's read frequently. Again, arguably #5556 says that pool allocation doesn't need to be thread-safe, but again, AsyncDNSResolverSockets will cause Timer-pool allocations in the current implementation.

Other current TSAN failures on Linux are:

  1. CHIP_WITH_GIO — plausibly false positives due to untsanified libraries, and avoidable by running TSAN tests without it.
  2. Linux/MdnsImpl reporting lock inversion in use of avahi. This might be addressed by my planned removal of the select()-loop ‘back door’ (which is what led me to this issue) but that's blocked by other problems (MdnsAvahi destructor segfault; require separate shutdown #8001).

@kpschoedel
Copy link
Contributor

Filed #14890 for the CHIP_WITH_GIO issue.

Closing this issue since there are no known failures remaining in system layer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants