Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deflake unit test DBErrorHandlingFSTest.AtomicFlushNoSpaceError #13234

Closed
wants to merge 1 commit into from

Conversation

cbi42
Copy link
Member

@cbi42 cbi42 commented Dec 19, 2024

Summary: DBErrorHandlingFSTest.AtomicFlushNoSpaceError is flaky due to seg fault during error recovery:

...
frame #5: 0x00007f0b3ea0a9d6 librocksdb.so.9.10`rocksdb::VersionSet::GetObsoleteFiles(std::vector<rocksdb::ObsoleteFileInfo, std::allocator<rocksdb::ObsoleteFileInfo>>*, std::vector<rocksdb::ObsoleteBlobFileInfo, std::allocator<rocksdb::ObsoleteBlobFileInfo>>*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>>*, unsigned long) [inlined] std::vector<rocksdb::ObsoleteFileInfo, std::allocator<rocksdb::ObsoleteFileInfo>>::begin(this=<unavailable>) at stl_vector.h:812:16
frame #6: 0x00007f0b3ea0a9d6 librocksdb.so.9.10`rocksdb::VersionSet::GetObsoleteFiles(this=0x0000000000000000, files=size=0, blob_files=size=0, manifest_filenames=size=0, min_pending_output=18446744073709551615) at version_set.cc:7258:18
frame #7: 0x00007f0b3e8ccbc0 librocksdb.so.9.10`rocksdb::DBImpl::FindObsoleteFiles(this=<unavailable>, job_context=<unavailable>, force=<unavailable>, no_full_scan=<unavailable>) at db_impl_files.cc:162:30
frame #8: 0x00007f0b3e85e698 librocksdb.so.9.10`rocksdb::DBImpl::ResumeImpl(this=<unavailable>, context=<unavailable>) at db_impl.cc:434:20
frame #9: 0x00007f0b3e921516 librocksdb.so.9.10`rocksdb::ErrorHandler::RecoverFromBGError(this=<unavailable>, is_manual=<unavailable>) at error_handler.cc:632:46

I suspect that this is due to DB being destructed and reopened during recovery. Specifically, the ClearBGError() call can release and re-acquire mutex, and DB can be closed during this time. So it's not safe to access DB states after ClearBGError(). There was a similar story in #9496. Moving the obsolete files logic after ClearBGError() probably makes the seg fault more easily triggered.

This PR updates ClearBGError() to guarantee that db close cannot finish until the method returns and the mutex is released. So that we can safely access DB states after calling it.

Test plan: I could not trigger the seg fault locally, will just monitor future test failures.

@facebook-github-bot
Copy link
Contributor

@cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cbi42 cbi42 requested review from hx235 and jowlyzhang December 19, 2024 21:52
Copy link
Contributor

@jowlyzhang jowlyzhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix. Nice debugging!

@facebook-github-bot
Copy link
Contributor

@cbi42 merged this pull request in cc30226.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants