Dropping rocks column families instead of running compaction #16589
Comments
(oh sad. we need another upstream pr?)
@carllin hmm, I think the fix could actually be simpler: it seems this is kind of a memory leak bug in rust-rocksdb. Then, our code could just pass the to-be-dropped column family names to the background cf cleaning thread and do the actual drop cf process there?

```rust
impl DBWithThreadMode<MultiThreaded> {
    /// Creates column family with given name and options
    pub fn create_cf<N: AsRef<str>>(&self, name: N, opts: &Options) -> Result<(), Error> {
        let inner = self.create_inner_cf_handle(name.as_ref(), opts)?;
        self.cfs
            .cfs
            .write()
            .unwrap()
            .insert(name.as_ref().to_string(), ColumnFamily { inner });
        Ok(())
    }

    /// Drops the column family with the given name by internally locking the inner column
    /// family map. This avoids needing a `&mut self` reference.
    pub fn drop_cf(&self, name: &str) -> Result<(), Error> {
        let inner = self.inner;
        if let Some(cf) = self.cfs.cfs.write().unwrap().remove(name) {
            unsafe {
                ffi_try!(ffi::rocksdb_drop_column_family(inner, cf.inner));
                // proposed addition: destroy the handle so its memory is actually freed
                ffi::rocksdb_column_family_handle_destroy(cf.inner);
            }
            Ok(())
        } else {
            Err(Error::new(format!("Invalid column family: {}", name)))
        }
    }

    /// Returns the underlying column family handle
    pub fn cf_handle(&self, name: &str) -> Option<BoundColumnFamily> {
        self.cfs
            .cfs
            .read()
            .unwrap()
            .get(name)
            .map(|cf| BoundColumnFamily {
                inner: cf.inner,
                multi_threaded_cfs: PhantomData,
            })
    }
}
```
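As a rough sketch of that "hand the names to a background thread" idea, assuming the `&self` `drop_cf` shown above on a shared `Arc<DBWithThreadMode<MultiThreaded>>`; the function and channel names here are made up for illustration and are not existing code:

```rust
use std::sync::{mpsc, Arc};
use std::thread;

use rocksdb::{DBWithThreadMode, MultiThreaded};

/// Spawns a background thread that owns the slow work: callers just send the
/// name of a column family they want gone, and this thread performs the actual
/// `drop_cf` (and, with the proposed fix above, the handle destroy) off the
/// critical write path.
fn spawn_cf_drop_thread(
    db: Arc<DBWithThreadMode<MultiThreaded>>,
) -> (mpsc::Sender<String>, thread::JoinHandle<()>) {
    let (cf_drop_sender, cf_drop_receiver) = mpsc::channel::<String>();
    let handle = thread::spawn(move || {
        while let Ok(cf_name) = cf_drop_receiver.recv() {
            if let Err(e) = db.drop_cf(&cf_name) {
                eprintln!("failed to drop column family {}: {}", cf_name, e);
            }
        }
    });
    (cf_drop_sender, handle)
}
```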
@ryoqun I don't think that call to …
I've updated some thoughts on …
okay, I'm creating the 2nd upstream pr for this...
Why not go to this solution?: #16234
I was under the impression that @ryoqun wanted to implement the `drop_cf` approach. Work items 1 and 2 will still be useful for improving TransactionStatus column handling -> rocksdb perf for api nodes.
Yeah agreed, #16234 would be ideal, though I think it's just a trade-off on relative timelines for a stopgap. Here's my estimate: I think the RocksDb drop_cf work could take around two weeks. I think completely transitioning the data shreds to AccountsDb will take much longer. For one, AccountsDb is a much more complex piece of code, so it requires time to gain enough understanding to modify it and avoid the pitfalls/subtleties. There's also a large number of work items:
I think this work will all have to eventually be done, but I think it's orders of magnitude more work than the column family dropping work. I could see this taking a couple weeks of work and testing for one person, depending on the complexity of the interactions between the components. I know @ryoqun was looking for a stopgap in the meantime, hence this proposal to scope it out.
@sakridge thanks for bringing up the obvious question. I lacked context. As @CriesofCarrots and @carllin said, I just wanted the stalls to be fixed asap because partners are having this problem on a daily basis. So, this drop_cf idea came up as a stopgap alongside the other custom compaction_filter idea. But then, it seems the drop_cf idea isn't as easy to implement as originally hoped. I think transitioning to AccountsDb will take more time (@steviez how's the work for #16234 going?). As a last try, let me try the compaction_filter stopgap in a day.
No actual progress on this yet - I have been working on other smaller items as stepping stones. At this point, I don't think I can accurately gauge how long the work will take for either approach, or how "bad" the issue is for the partners (obviously any issue is non-ideal, but is this a small/medium/large/etc disruption?). As such, I was looking for opinions & guidance on which path forward seems better (factoring in all the risks / unknowns / the fact that people are hitting this in the wild right now).
@steviez thanks for answering. This will be a challenging and fun engineering task for you, so using stepping stones is a realistic strategy. I think this problem is adequately bad to assign you to this issue while there are many open issues we'd like to tackle. :) The significance of the problem is that the validator stops processing entirely for ~1 hour. So, we'd like to fix the stall in some way or another. Sooner is better. Currently, no one on the team knows an easy-to-implement and robust solution. There is some dilemma, unfortunately, so you can see some confusion. :) In other words, it's completely up to you to think about how to proceed to tackle this problem. btw, have you looked at our previous attempt for this problem? #9366 Although the pr is pretty old and reverted, it might be useful to gauge what needs to change.
@ryoqun Thanks and understood - it sounds like maybe this is a situation where it will be better to just start prototyping and fail quickly instead of spinning my wheels over analyzing which approach may be better.
I hadn't seen this - thanks for sending over.
That's if there are no more problems that we run into that we have to debug with rocks. It sounds like we ran into a very basic issue in that dropping columns is not freeing the memory. That kind of tells me this path is not well tested. I don't really see this solution as being that straightforward. I think #16602 might be a decent mitigation while we work on this. If we can get that merged, then the amount of data in rocksdb should be reduced, and thus there will be less data to compact.
@carllin I think this issue can be closed now? It seems most write stalls are fixed by now, or are there still some concerns?
Also, I think I've finished this part anyway. The relevant fix finally landed upstream and is available in our codebase starting from #18904. So, if we really wanted to resume this design, we could start, I guess. (not saying we should, to be sure. xD)
@ryoqun, agreed, closed!
Problem
Rocks compaction can stall writes when the dataset is sufficiently large
Proposed Solution
Drop column families and create new ones instead of running compaction
Pending work items:
Work Item 1: Upstream change to destroy column family handles
Follow-up to Ryo's upstream change here: rust-rocksdb/rust-rocksdb#506. The current `drop_cf()` implementation in the upstream crate doesn't free the memory. There's a difference between "dropping" the column family and "destroying" all the outstanding Rocksdb handles to the column family, explained here: https://github.com/facebook/rocksdb/blob/master/db/column_family.h#L292-L299. More details below.

Currently, the Rocks crate we're using exposes a `ColumnFamily` object that wraps a single reference to a Rocksdb handle/pointer to a C++ object (https://github.com/facebook/rocksdb/blob/master/db/c.cc#L973-L974, `rocksdb_column_family_handle_t* handle = new rocksdb_column_family_handle_t;`), which aliases the object whose destructor `~ColumnFamilyHandle()` is what decrements the refcount to the column family and performs cleanup of old files: https://github.com/facebook/rocksdb/blob/master/db/column_family.cc#L68-L84. Thus, until this destructor runs, we won't free the memory.

In the rust-rocksdb crate, however, the destructor is only invoked through the rocks `rocksdb_column_family_handle_destroy()` function (https://github.com/facebook/rocksdb/blob/master/db/c.cc#L997-L1000), which is not what's called by `drop_cf`: https://github.com/rust-rocksdb/rust-rocksdb/blob/master/src/db.rs#L687

Because we want the ability to destroy the column family without closing the entire database, this seems to imply that in the upstream rust-rocksdb crate we would have to both:

1. Wrap the column family handle in an `Arc<ColumnFamily>` and expose a `destroy_cf()` method.
2. Add a `Drop()` implementation for `ColumnFamily` that takes a custom handler supplied by the consumers of the crate (sketched below). See `Work Item 2` for how we use this.
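To make items 1 and 2 concrete, here is a minimal, hypothetical sketch of the shape this could take in the upstream crate. The names `DropHandler`, `ColumnFamilyInner`, and `on_last_drop` are illustrative assumptions, not the actual rust-rocksdb API, and the FFI plumbing is omitted:

```rust
use std::sync::Arc;

/// Hypothetical callback supplied by the consumer of the crate (us); invoked
/// when the last outstanding reference to a dropped column family goes away.
type DropHandler = Box<dyn Fn(&str) + Send + Sync>;

/// Illustrative stand-in for the crate's column family wrapper; the real type
/// would also hold the raw `*mut ffi::rocksdb_column_family_handle_t`.
struct ColumnFamilyInner {
    name: String,
    on_last_drop: Option<DropHandler>,
}

impl Drop for ColumnFamilyInner {
    fn drop(&mut self) {
        // This is where the crate would either call the FFI
        // `rocksdb_column_family_handle_destroy` itself, or defer that work to
        // the consumer-supplied handler (Work Item 2 uses the latter).
        if let Some(handler) = &self.on_last_drop {
            handler(&self.name);
        }
    }
}

fn main() {
    // `cf_handle()` would hand out clones of this Arc; the column family's
    // memory is only released once every clone has been dropped.
    let cf = Arc::new(ColumnFamilyInner {
        name: "data_shreds_0_999".to_string(),
        on_last_drop: Some(Box::new(|name| println!("last reference to {} dropped", name))),
    });

    let reader_view = Arc::clone(&cf); // e.g. a concurrent lookup in progress
    drop(cf);          // `drop_cf()` removed the map entry; not the last reference
    drop(reader_view); // last reference gone -> Drop fires -> handler runs
}
```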
Work Item 2: Safely consume work item 1 in our own services

The compaction/cleanup thread will call `drop_cf()` during cleanup. At this point, if `Work Item 1` is implemented, then the underlying rust-rocksdb crate will drop the reference to the `ColumnFamily` object from its hashmap of column family handles. This means the only remaining references to the `ColumnFamily` object are held by our services that are still performing a lookup that started before the call to `drop_cf`. Once the last reference to a column family is dropped, the handler we passed to the rust-rocksdb crate (described above in `Work Item 1`) will be called.

Now what should this handler do? First, we need to find a place in our services to safely call destroy on the dropped `ColumnFamily` object. One straightforward way to do this might be to call the `destroy()` method on the dropped column family. However, one thing to note here is that we don't want critical threads like ReplayStage or WindowService to eat the latency of dropping the column family if they were so unfortunate as to hold the last reference to that column family. For instance, let's say we run into a scenario like ...

So alternatively, the handler we pass could instead use a channel to pipe a signal to our cleanup thread. The cleanup thread will then call the newly defined `destroy_cf()` method (described above in `Work Item 1`) on the newly dropped column family.
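A rough sketch of that channel-based flow, assuming Work Item 1 lets us register a drop handler; the `DestroyRequest` type and the printed message stand in for the real `destroy_cf()` call and are not existing APIs:

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

/// Illustrative message sent from the drop handler to the cleanup thread.
struct DestroyRequest {
    cf_name: String,
}

fn spawn_cleanup_thread(receiver: Receiver<DestroyRequest>) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        while let Ok(request) = receiver.recv() {
            // In the real service this is where the hypothetical `destroy_cf()`
            // from Work Item 1 would be called, so the latency of freeing the
            // column family is paid here, not in ReplayStage or WindowService.
            println!("destroying dropped column family {}", request.cf_name);
        }
    })
}

fn main() {
    let (sender, receiver): (Sender<DestroyRequest>, Receiver<DestroyRequest>) = channel();
    let cleanup_thread = spawn_cleanup_thread(receiver);

    // The handler passed into rust-rocksdb (per Work Item 1) does nothing more
    // than signal the cleanup thread, keeping critical threads fast.
    let drop_handler = move |cf_name: &str| {
        let _ = sender.send(DestroyRequest {
            cf_name: cf_name.to_string(),
        });
    };

    // Simulate the last reference to a dropped column family going away.
    drop_handler("data_shreds_0_999");
    drop(drop_handler); // closes the channel so the cleanup thread exits
    cleanup_thread.join().unwrap();
}
```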
Work Item 3: Make `LedgerCleanupService` use `drop_cf()`

The tricky part here is figuring out how to bucket slots into different column families.

Unlike transaction statuses, which are written out sequentially right now (because ReplayStage replays the blocks in order, so the transaction statuses from replay are also in order), slots can arrive out of order. For instance, if a node is catching up to the tip of the network, it may be repairing/replaying blocks in the slot range `1-1000`, but receiving blocks in the range `5000+`.

We wouldn't want to group all these disjoint ranges of shreds together in some unpredictable combination in a column family, because ultimately the cleanup service purges a consecutive range of "outdated" slots once the cluster root has moved sufficiently far ahead, or once the size of the database has grown past a certain size. This means we need to be able to cleanly support APIs like "delete the range of slots `[N, M]`".

Another decision that needs to be made is whether we want to clean based on the number of slots (say, if the number of slots > some `X`), or based on whether the overall size of the outstanding column families has hit a certain limit (this is what's done with the `TransactionStatus` column family today). The latter seems harder because slots can vary in the number of shreds they contain, and thus the size of each slot can vary dramatically.

One naive way to break up multiple column families for something like the data shreds, for instance, is to allocate some `N` number of column families and assign a slot range to each of those column families. For instance, we could say the first column family contains slots in the range `[0-999]`, the second the range `[1000-1999]`, all the way up to `[(N-1) * 1000, N * 1000)`.
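For concreteness, a small sketch of the slot-range bucketing just described; `NUM_DATA_SHRED_CFS`, `SLOTS_PER_CF`, and the function name are hypothetical values/names chosen for illustration:

```rust
/// Hypothetical number of pre-allocated column families and slots per bucket.
const NUM_DATA_SHRED_CFS: u64 = 8;
const SLOTS_PER_CF: u64 = 1000;

/// Maps a slot to the column family bucket that owns its range, or `None` if
/// the slot is beyond the last pre-allocated range (the case refinement 1
/// below worries about).
fn column_family_index_for_slot(slot: u64) -> Option<u64> {
    let index = slot / SLOTS_PER_CF;
    if index < NUM_DATA_SHRED_CFS {
        Some(index)
    } else {
        None
    }
}

fn main() {
    assert_eq!(column_family_index_for_slot(0), Some(0));    // range [0, 999]
    assert_eq!(column_family_index_for_slot(1999), Some(1)); // range [1000, 1999]
    assert_eq!(column_family_index_for_slot(8000), None);    // past the last bucket
}
```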
A few things about this above approach that need refinement:

1. What if we are replaying slot `5`, but the tip of the network is at `(N + 1) * 1000`, which is outside of the range of the last column family? Maybe here we can write those very far-future slots into a separate file-backed store that doesn't load RocksDb and cause compaction. Then we can stream them into Rocks later when we start replaying that range of slots.

2. When do we drop and recreate each column family? A simple scheme would be to drop the column family at index `(M + N - 1) % N`, where `M` is the column family containing the current root/confirmed block (see the sketch after this list). Another option is to start dropping earlier column families if the combined size gets too large.

3. However, with the approach in 2, it's hard to drop column families based on size, as we currently do. If some slots in the first column family are prohibitively large, there's no good way to "spill over" into a later column family. Hopefully though, choosing the right interval size, i.e. something like `1000`, will give us enough flexibility so that the size of a single column family won't get too large, and we can drop ranges as described in `2` without losing too much recent history (i.e. if you're replaying slot `2000`, you probably don't want to drop slot `1999` because others in the cluster may still need access to it, e.g. for repair, RPC queries, etc.).
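And a tiny sketch of the drop rule from item 2, purely to show the index arithmetic; the function name is made up for illustration:

```rust
/// Hypothetical helper: with `num_cfs` column families assigned consecutive
/// slot ranges, return the index of the column family to drop and recreate,
/// given the index `m` of the column family containing the current
/// root/confirmed block. `(m + num_cfs - 1) % num_cfs` is simply `m - 1` with
/// wrap-around, i.e. the column family whose slot range sits just behind the
/// root's.
fn column_family_to_drop(m: u64, num_cfs: u64) -> u64 {
    (m + num_cfs - 1) % num_cfs
}

fn main() {
    assert_eq!(column_family_to_drop(0, 8), 7); // wraps around
    assert_eq!(column_family_to_drop(3, 8), 2);
}
```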
Tag @CriesofCarrots, @ryoqun, @sakridge

context: #14586