Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[docdb] race condition on tablet shutdown with accessing rocksdb instance #6960

Closed
bmatican opened this issue Jan 21, 2021 · 0 comments
Closed
Assignees
Labels
area/build-framework Building and packaging third-party dependencies (repo: yugabyte/yugabyte-db-thirdparty) area/docdb YugabyteDB core features kind/bug This issue is a bug

Comments

@bmatican
Copy link
Contributor

TSAN report

WARNING: ThreadSanitizer: data race (pid=8205)
  Write of size 8 at 0x7b7400271368 by thread T188 (mutexes: write M1109428483929191856):
    #0 std::__1::unique_ptr<rocksdb::DB, std::__1::default_delete<rocksdb::DB> >::reset(rocksdb::DB*) /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20201207195011-bbf28cf4f1-centos7-linuxbrew/installed/tsan/libcxx/include/c++/v1/memory:2632:20 (libtablet.so+0x2228ac)
    #1 yb::tablet::ResetRocksDB(bool, rocksdb::Options const&, std::__1::unique_ptr<rocksdb::DB, std::__1::default_delete<rocksdb::DB> >*) /nfusr/centos-gcp-cloud/jenkins-worker-3a1/jenkins/jenkins-github-yugabyte-db-centos-master-clang-tsan-1685/build/tsan-clang-dynamic-ninja/../../src/yb/tablet/tablet.cc:1000 (libtablet.so+0x2228ac)
    #2 yb::tablet::Tablet::ResetRocksDBs(yb::StronglyTypedBool<yb::tablet::Destroy_Tag>, yb::StronglyTypedBool<yb::tablet::DisableFlushOnShutdown_Tag>) /nfusr/centos-gcp-cloud/jenkins-worker-3a1/jenkins/jenkins-github-yugabyte-db-centos-master-clang-tsan-1685/build/tsan-clang-dynamic-ninja/../../src/yb/tablet/tablet.cc:1016:27 (libtablet.so+0x222725)
    #3 yb::tablet::Tablet::Truncate(yb::tablet::TruncateOperationState*) /nfusr/centos-gcp-cloud/jenkins-worker-3a1/jenkins/jenkins-github-yugabyte-db-centos-master-clang-tsan-1685/build/tsan-clang-dynamic-ninja/../../src/yb/tablet/tablet.cc:2576:12 (libtablet.so+0x23165a)
...
  Previous read of size 8 at 0x7b7400271368 by thread T130:
    #0 std::__1::unique_ptr<rocksdb::DB, std::__1::default_delete<rocksdb::DB> >::get() const /opt/yb-build/thirdparty/yugabyte-db-thirdparty-v20201207195011-bbf28cf4f1-centos7-linuxbrew/installed/tsan/libcxx/include/c++/v1/memory:2607:19 (libtserver.so+0x22eaa7)
    #1 yb::tablet::Tablet::doc_db() const /nfusr/centos-gcp-cloud/jenkins-worker-3a1/jenkins/jenkins-github-yugabyte-db-centos-master-clang-tsan-1685/build/tsan-clang-dynamic-ninja/../../src/yb/tablet/tablet.h:534 (libtserver.so+0x22eaa7)
    #2 yb::tablet::Tablet::StillHasParentDataAfterSplit() /nfusr/centos-gcp-cloud/jenkins-worker-3a1/jenkins/jenkins-github-yugabyte-db-centos-master-clang-tsan-1685/build/tsan-clang-dynamic-ninja/../../src/yb/tablet/tablet.cc:3051:10 (libtablet.so+0x233040)

Sample tests

/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-tsan/builds/1657/archive/build/tsan-clang-dynamic-ninja/yb-test-logs/tests-redisserver__redisserver-test/TestRedisService_TestTruncate.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-tsan/builds/1667/archive/build/tsan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__create-table-stress-test/CreateTableStressTest_TestHeartbeatDeadline.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-tsan/builds/1668/archive/build/tsan-clang-dynamic-ninja/yb-test-logs/tests-client__snapshot-txn-test/SnapshotTxnTest_TruncateDuringShutdown.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-tsan/builds/1668/archive/build/tsan-clang-dynamic-ninja/yb-test-logs/tests-redisserver__redisserver-test/TestRedisService_TestTruncate.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-tsan/builds/1675/archive/build/tsan-clang-dynamic-ninja/yb-test-logs/tests-client__snapshot-txn-test/SnapshotTxnTest_TruncateDuringShutdown.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-tsan/builds/1676/archive/build/tsan-clang-dynamic-ninja/yb-test-logs/tests-client__snapshot-txn-test/SnapshotTxnTest_TruncateDuringShutdown.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-tsan/builds/1676/archive/build/tsan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__create-table-stress-test/CreateTableStressTest_TestHeartbeatDeadline.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-tsan/builds/1677/archive/build/tsan-clang-dynamic-ninja/yb-test-logs/tests-redisserver__redisserver-test/TestRedisService_TestTruncate.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-tsan/builds/1678/archive/build/tsan-clang-dynamic-ninja/yb-test-logs/tests-redisserver__redisserver-test/TestRedisService_TestTruncate.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-tsan/builds/1680/archive/build/tsan-clang-dynamic-ninja/yb-test-logs/tests-client__snapshot-txn-test/SnapshotTxnTest_TruncateDuringShutdown.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-tsan/builds/1680/archive/build/tsan-clang-dynamic-ninja/yb-test-logs/tests-integration-tests__create-table-stress-test/CreateTableStressTest_TestHeartbeatDeadline.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-tsan/builds/1681/archive/build/tsan-clang-dynamic-ninja/yb-test-logs/tests-client__snapshot-txn-test/SnapshotTxnTest_TruncateDuringShutdown.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-tsan/builds/1685/archive/build/tsan-clang-dynamic-ninja/yb-test-logs/tests-client__snapshot-txn-test/SnapshotTxnTest_TruncateDuringShutdown.log
/var/lib/jenkins/jobs/github-yugabyte-db-centos-master-clang-tsan/builds/1686/archive/build/tsan-clang-dynamic-ninja/yb-test-logs/tests-client__backup-txn-test/BackupTxnTest_Simple.log

Note: This is a product issue, as this is a race during truncate (likely same for drop table).

@bmatican bmatican added the area/docdb YugabyteDB core features label Jan 21, 2021
robertsami added a commit that referenced this issue Feb 9, 2021
… in StillHasParentDataAfterSplit

Summary:
We noticed tsan failures caused by concurrent access of RocksDB pointers on a tablet by the heartbeater while
said RocksDB pointers were being invalidated (by a TRUNCATE, DROP, or other shutdown which might invoke
ResetRocksDBs). This diff synchronizes the Tablet::StillHasParentDataAfterSplit with the existing ScopedRWOperationPause invoked by
TRUNCATE/DROP/ALTER operations by registering a ScopedRWOperation in the method call.

In order to make this change, the signature of StillHasParentDataAfterSplit was also changed to return a Result<bool>, which then triggered some slightly more complex changes. Specifically:
1. We expose a new DefinitelyHasNoParentDataAfterSplit which returns false if StillHasParentDataAfterSplit fails or is true
2. ts_tablet_manager.cc uses DefinitelyHasNoParentDataAfterSplit to maintain a list of tablets which "maybe" should not be moved by the load balancer, and reports this list to the LB conservatively including any tablets for which DefinitelyHasNoParentDataAfterSplit returned false, which may include cases where StillHasParentDataAfterSplit simply failed
3. TabletSplitHeartbeatDataProvider will skip reporting tablets for which DefinitelyHasNoParentDataAfterSplit is false

Test Plan:
Try running tests which were triggering TSAN failures before:
`ybd tsan --cxx-test client_snapshot-txn-test --gtest_filter SnapshotTxnTest.TruncateDuringShutdown -n 100`
`ybd tsan --cxx-test client_backup-txn-test --gtest_filter BackupTxnTest.Simple -n 100`

Reviewers: timur, nicolas

Reviewed By: timur, nicolas

Subscribers: mpolitov, sergei, nicolas, ybase, timur, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D10457
robertsami added a commit that referenced this issue Feb 23, 2021
…ing doc_db data in StillHasParentDataAfterSplit

Summary:
We noticed tsan failures caused by concurrent access of RocksDB pointers on a tablet by the heartbeater while
said RocksDB pointers were being invalidated (by a TRUNCATE, DROP, or other shutdown which might invoke
ResetRocksDBs). This diff synchronizes the Tablet::StillHasParentDataAfterSplit with the existing ScopedRWOperationPause invoked by
TRUNCATE/DROP/ALTER operations by registering a ScopedRWOperation in the method call.

In order to make this change, the signature of StillHasParentDataAfterSplit was also changed to return a Result<bool>, which then triggered some slightly more complex changes. Specifically:
1. We expose a new DefinitelyHasNoParentDataAfterSplit which returns false if StillHasParentDataAfterSplit fails or is true
2. ts_tablet_manager.cc uses DefinitelyHasNoParentDataAfterSplit to maintain a list of tablets which "maybe" should not be moved by the load balancer, and reports this list to the LB conservatively including any tablets for which DefinitelyHasNoParentDataAfterSplit returned false, which may include cases where StillHasParentDataAfterSplit simply failed
3. TabletSplitHeartbeatDataProvider will skip reporting tablets for which DefinitelyHasNoParentDataAfterSplit is false

Original commit:  D10457 / e6c3c57

Test Plan:
Jenkins: rebase: 2.4

Try running tests which were triggering TSAN failures before:
`ybd tsan --cxx-test client_snapshot-txn-test --gtest_filter SnapshotTxnTest.TruncateDuringShutdown -n 100`
`ybd tsan --cxx-test client_backup-txn-test --gtest_filter BackupTxnTest.Simple -n 100`

Reviewers: timur, nicolas, bogdan

Reviewed By: bogdan

Subscribers: mpolitov, ybase, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D10676
@bmatican bmatican closed this as completed Mar 4, 2021
polarweasel pushed a commit to lizayugabyte/yugabyte-db that referenced this issue Mar 9, 2021
…_db data in StillHasParentDataAfterSplit

Summary:
We noticed tsan failures caused by concurrent access of RocksDB pointers on a tablet by the heartbeater while
said RocksDB pointers were being invalidated (by a TRUNCATE, DROP, or other shutdown which might invoke
ResetRocksDBs). This diff synchronizes the Tablet::StillHasParentDataAfterSplit with the existing ScopedRWOperationPause invoked by
TRUNCATE/DROP/ALTER operations by registering a ScopedRWOperation in the method call.

In order to make this change, the signature of StillHasParentDataAfterSplit was also changed to return a Result<bool>, which then triggered some slightly more complex changes. Specifically:
1. We expose a new DefinitelyHasNoParentDataAfterSplit which returns false if StillHasParentDataAfterSplit fails or is true
2. ts_tablet_manager.cc uses DefinitelyHasNoParentDataAfterSplit to maintain a list of tablets which "maybe" should not be moved by the load balancer, and reports this list to the LB conservatively including any tablets for which DefinitelyHasNoParentDataAfterSplit returned false, which may include cases where StillHasParentDataAfterSplit simply failed
3. TabletSplitHeartbeatDataProvider will skip reporting tablets for which DefinitelyHasNoParentDataAfterSplit is false

Test Plan:
Try running tests which were triggering TSAN failures before:
`ybd tsan --cxx-test client_snapshot-txn-test --gtest_filter SnapshotTxnTest.TruncateDuringShutdown -n 100`
`ybd tsan --cxx-test client_backup-txn-test --gtest_filter BackupTxnTest.Simple -n 100`

Reviewers: timur, nicolas

Reviewed By: timur, nicolas

Subscribers: mpolitov, sergei, nicolas, ybase, timur, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D10457
@ttyusupov ttyusupov added area/build-framework Building and packaging third-party dependencies (repo: yugabyte/yugabyte-db-thirdparty) kind/bug This issue is a bug labels Jul 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/build-framework Building and packaging third-party dependencies (repo: yugabyte/yugabyte-db-thirdparty) area/docdb YugabyteDB core features kind/bug This issue is a bug
Projects
None yet
Development

No branches or pull requests

4 participants