CreateColumnFamily API incompatible with latest versions #3609
Comments
@vpallipadi can you identify which virtual method was called, or share the call stack? |
#0 /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39) [0x7ff9e4c20cc9] ?? ??:0 |
Can you try whether this patch can fix it?
|
Yes. That works. |
I'm sending a PR: #3610 |
To work it around, you can keep your ColumnFamilyOptions all the time. |
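For readers following along, here is a minimal sketch of why keeping the options object alive helps. It is a toy model, not RocksDB code: `ToyTableFactory`, `ToyColumnFamilyOptions`, `ToyColumnFamily` and `factory_survives_drop` are hypothetical stand-ins, and the only assumption carried over from the thread is that the options hold the table factory through a `std::shared_ptr`.

```cpp
#include <memory>

// Toy stand-in for a table factory; not the RocksDB class.
struct ToyTableFactory {};

// Toy options object: default construction creates a fresh factory,
// held only through a shared_ptr.
struct ToyColumnFamilyOptions {
  std::shared_ptr<ToyTableFactory> table_factory =
      std::make_shared<ToyTableFactory>();
};

// Toy column family: keeps its own copy of the options (and so a
// second shared_ptr to the same factory).
struct ToyColumnFamily {
  explicit ToyColumnFamily(const ToyColumnFamilyOptions& o) : options(o) {}
  ToyColumnFamilyOptions options;
};

// Returns true if the factory is still alive after the column family
// has been destroyed.
bool factory_survives_drop(bool keep_user_options) {
  std::unique_ptr<ToyColumnFamilyOptions> user_options(
      new ToyColumnFamilyOptions());
  std::weak_ptr<ToyTableFactory> observer = user_options->table_factory;

  std::unique_ptr<ToyColumnFamily> cf(new ToyColumnFamily(*user_options));

  if (!keep_user_options) {
    // Models passing a temporary ColumnFamilyOptions() to the create call:
    // the caller's copy is gone right after creation.
    user_options.reset();
  }

  cf.reset();  // Models dropping the column family and freeing its data.
  return !observer.expired();
}
```

In this model, `factory_survives_drop(true)` keeps the factory alive because the caller still owns one reference, while `factory_survives_drop(false)` lets the last reference go when the column family is destroyed.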
Do we need a similar fix for db_options? I get the following crash in a test while calling delete on the db pointer. PC: @ 0x7fbd6bddb13f __GI_raise |
This is the column_family_test that fails with above trace that Sachin posted. TEST_F(ColumnFamilyTest, CreateDropAndDestroy) { Again, this test works fine on 5.3.4 |
@vpallipadi what's the difference between this one and the one you pasted previously? |
@vpallipadi it's very likely that the DB destructor has a bug too. The destruction dependencies are more complicated there. But I can't reproduce it with this simple test:
If you can find a unit test that reproduces it, I'll dig more and fix it. |
This simple test added to column_family_test reproduces the problem on 5.10.4. TEST_F(ColumnFamilyTest, CreateDropAndDestroy) { The only difference between this one and the previous one is the use of DisableFileDeletions(). |
Is there a RocksDB release where this issue is resolved? |
It's not fixed. Looking now. |
I believe the problem is that table readers from dropped column families stick around in table cache since they're not deletable. Meanwhile, their ColumnFamilyOptions has already been removed along with its references to objects like BlockBasedTableFactory, which owns a BlockBasedTableOptions. Let me see if forcing eviction during DropColumnFamily can help. |
Is there anything we can do to help resolve this one? We are seeing this pretty consistently in our usage. The latest stack trace looks something like this.
|
Looking through the logs, this is the sequence of events that causes the reference to the freed object.
// eptr error was hit when trying to purge these files after a background compaction.
{file_name = "/2281658.sst", path_id = 0}
// From the RocksDB log, two of these files (2281658 and 2184018) are from a CF that is still alive. The third one is from cfh 285_123, which had the following recent changes.
// Flush
{"time_micros": 1526337543266257, "job": 10634, "event": "flush_started", "num_memtables": 1, "num_entries": 25123, "num_deletes": 10040, "memory_usage": 5209192}
// CFM Drop
{"time_micros": 1526337544999371, "job": 0, "event": "table_file_deletion", "file_number": 2281204}
{"time_micros": 1526337544999689, "job": 0, "event": "table_file_deletion", "file_number": 2280640}
{"time_micros": 1526337544999924, "job": 0, "event": "table_file_deletion", "file_number": 2280028}
{"time_micros": 1526337545000205, "job": 0, "event": "table_file_deletion", "file_number": 2279379}
{"time_micros": 1526337545000480, "job": 0, "event": "table_file_deletion", "file_number": 2278782}
{"time_micros": 1526337545000528, "job": 0, "event": "table_file_deletion", "file_number": 2278268}
{"time_micros": 1526337545007613, "job": 0, "event": "table_file_deletion", "file_number": 2278260}
{"time_micros": 1526337545053185, "job": 0, "event": "table_file_deletion", "file_number": 2241786}
// However, file 2281663 did not get deleted along with the drop. It tried to delete it immediately afterwards and that caused the crash. |
This commit provides a workaround for the issue, and I cannot reproduce the failure on current master. @ajkr @vpallipadi Update: |
The problem is that there is no release with this commit. |
PR 3888 should fix both cases, and we will discuss how to make this change available to you. Feel free to provide suggestions and comments. Thanks! |
Summary: Please refer to the earlier discussion in [issue 3609](#3609). There was also an alternative fix in [PR 3888](#3888), but the proposed solution requires a complex change.

To summarize the cause of the problem: upon creation of a column family, a `BlockBasedTableFactory` object is `new`ed and encapsulated by a `std::shared_ptr`. Since there is no other `std::shared_ptr` pointing to this `BlockBasedTableFactory`, when the column family is dropped, the `ColumnFamilyData` is `delete`d, which runs the destructor of the `std::shared_ptr` and frees the underlying memory. Later, when the db exits, it releases all the table readers, including the table readers that have been operating on the dropped column family. This needs to access the `table_options` owned by the `BlockBasedTableFactory` that has already been deleted. Therefore, a segfault is raised.

The previous workaround is to purge all obsolete files upon `ColumnFamilyData` destruction, which leads to a forced release of the table readers of the dropped column family. However, this does not work when the user disables file deletion.

Our solution in this PR is making a copy of `table_options` in `BlockBasedTable::Rep`. This solution increases memory copy and usage, but is much simpler.

Test plan
```
$ make -j16
$ ./column_family_test --gtest_filter=ColumnFamilyTest.CreateDropAndDestroy:ColumnFamilyTest.CreateDropAndDestroyWithoutFileDeletion
```
Expected behavior: All tests should pass.

Closes #3898
Differential Revision: D8149421
Pulled By: riversand963
fbshipit-source-id: eaecc2e064057ef607fbdd4cc275874f866c3438
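The copy-based fix described in the summary above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the actual patch: `ToyTableOptions`, `ToyFactory`, `ToyTableReader` and `block_size_after_factory_freed` are hypothetical names, and the `block_size` field is just an example setting.

```cpp
#include <cstddef>
#include <memory>

// Toy stand-in for the table options; not the RocksDB struct.
struct ToyTableOptions {
  std::size_t block_size = 4096;
};

// Toy factory that owns the options, as in the summary above.
struct ToyFactory {
  ToyTableOptions table_options;
};

// Toy table reader that copies the options when it is opened instead of
// keeping a reference into the factory.
class ToyTableReader {
 public:
  explicit ToyTableReader(const ToyFactory& factory)
      : table_options_(factory.table_options) {}

  std::size_t block_size() const { return table_options_.block_size; }

 private:
  const ToyTableOptions table_options_;  // Owned copy, independent lifetime.
};

// Opens a reader, frees the factory, then reads a setting from the reader.
std::size_t block_size_after_factory_freed() {
  auto factory = std::make_shared<ToyFactory>();
  factory->table_options.block_size = 16384;

  ToyTableReader reader(*factory);

  // Models the column family being dropped: the last shared_ptr goes away
  // and the factory (with its options) is destroyed.
  factory.reset();

  // Safe: the reader only touches its own copy.
  return reader.block_size();
}
```

The trade-off is the one the summary names: each reader carries its own copy of the options, in exchange for no longer depending on the factory's lifetime.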
The PR fixing this issue has now been merged back to 5.13.fb. You can try it out and feel free to let us know how it goes. Thanks! |
In RocksDBKVStore we declare the following member variables in the following order:
- std::unique_ptr<rocksdb::DB> rdb;
- rocksdb::ColumnFamilyOptions defaultCFOptions;
- rocksdb::ColumnFamilyOptions seqnoCFOptions;
- std::shared_ptr<rocksdb::Cache> blockCache.

The member declaration order leads to the inverse destruction sequence 'blockCache->seqnoCFOptions->defaultCFOptions->rdb'.

At first look, the issue seems to be the following:
- rdb (in some internal structure) keeps a rocksdb::Cache& to *blockCache (note that a Ref& does not increment the ref-count of a shared_ptr)
- given the destruction order, blockCache is released (ref-count=0)
- rdb still has a rocksdb::Cache&, which now refers to released memory

When a method of rocksdb::Cache& is called, we get the 'pure virtual' error.

A couple of open RocksDB issues have been identified as possible causes:
- facebook/rocksdb#3609
- facebook/rocksdb#3534

As a temporary workaround we destroy 'rdb' as the first step in the destruction sequence.

Change-Id: I1abd52ffc55c3d8ac41e072b3097541df1d37532
Reviewed-on: http://review.couchbase.org/91010
Reviewed-by: Dave Rigby <[email protected]>
Tested-by: Build Bot <[email protected]>
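The destruction order described in that commit message follows from the C++ rule that members are destroyed in reverse declaration order, after the destructor body has run. A minimal sketch, using a hypothetical `Tracker` type in place of the real RocksDB objects and hypothetical `StoreAsDeclared` / `StoreWithWorkaround` classes in place of RocksDBKVStore:

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Appends its name to a log when destroyed, so the order can be inspected.
struct Tracker {
  Tracker(std::string n, std::vector<std::string>* out)
      : name(std::move(n)), log(out) {}
  ~Tracker() { log->push_back(name); }
  std::string name;
  std::vector<std::string>* log;
};

// Same member order as in the commit message; no explicit destructor.
struct StoreAsDeclared {
  explicit StoreAsDeclared(std::vector<std::string>* out)
      : rdb(new Tracker("rdb", out)),
        defaultCFOptions("defaultCFOptions", out),
        seqnoCFOptions("seqnoCFOptions", out),
        blockCache(std::make_shared<Tracker>("blockCache", out)) {}
  std::unique_ptr<Tracker> rdb;
  Tracker defaultCFOptions;
  Tracker seqnoCFOptions;
  std::shared_ptr<Tracker> blockCache;
};

// Same members, but the destructor body releases rdb first.
struct StoreWithWorkaround {
  explicit StoreWithWorkaround(std::vector<std::string>* out)
      : rdb(new Tracker("rdb", out)),
        defaultCFOptions("defaultCFOptions", out),
        seqnoCFOptions("seqnoCFOptions", out),
        blockCache(std::make_shared<Tracker>("blockCache", out)) {}
  ~StoreWithWorkaround() { rdb.reset(); }  // Runs before member destruction.
  std::unique_ptr<Tracker> rdb;
  Tracker defaultCFOptions;
  Tracker seqnoCFOptions;
  std::shared_ptr<Tracker> blockCache;
};

std::vector<std::string> destruction_order_as_declared() {
  std::vector<std::string> log;
  { StoreAsDeclared store(&log); }  // Destroyed at the end of this scope.
  return log;
}

std::vector<std::string> destruction_order_with_workaround() {
  std::vector<std::string> log;
  { StoreWithWorkaround store(&log); }
  return log;
}
```

An alternative with the same effect would be to declare `rdb` last, so that it is destroyed first without an explicit destructor; the commit message does not say which form the actual change took.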
While trying to migrate from 5.3.4 (and older releases) to 5.10.4, we noticed an intermittent crash with
pure virtual method called
terminate called without an active exception
while dropping the column family and freeing the cfh.
Narrowing this down, we were able to reproduce this with a simple unit test below, which passes with 5.3.4 and fails with 5.10.4.
TEST_F(ColumnFamilyTest, CreateDropAndDestroy) {
  ColumnFamilyHandle* cfh;
  Open();
  ASSERT_OK(db_->CreateColumnFamily(ColumnFamilyOptions(), "yoyo", &cfh));
  ASSERT_OK(db_->Put(WriteOptions(), cfh, "foo", "bar"));
  ASSERT_OK(db_->Flush(FlushOptions(), cfh));
  ASSERT_OK(db_->DropColumnFamily(cfh));
  ASSERT_OK(db_->DestroyColumnFamilyHandle(cfh));
}
Expected behavior
Success.
Actual behavior
Crash with pure virtual method called.
Steps to reproduce the behavior
The above test does work with this change to the test.
@sachja