
Performance problem with get_slots_since in high forking #24878

Closed
sakridge opened this issue May 2, 2022 · 6 comments
Assignees
Labels
stale [bot only] Added to stale content; results in auto-close after a week.

Comments

@sakridge
Member

sakridge commented May 2, 2022

Problem

When there are a lot of non-rooted slots, get_slots_since starts to take longer (hundreds of ms):

(screenshot: get-slots-since timing metric)

Proposed Solution

Seems like we might be doing some extra work here, or could be more efficient by storing a delta: store what we captured last, then add new slots to a new set which we can query on the next call.
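A minimal sketch of that delta idea, using hypothetical names rather than actual Blockstore types: remember which children were reported on the previous call and hand back only what has been added since.

    use std::collections::{HashMap, HashSet};

    /// Hypothetical cache for the delta idea above; not an actual Blockstore type.
    #[derive(Default)]
    struct SlotsSinceCache {
        /// Children already reported for each parent slot.
        reported: HashMap<u64, HashSet<u64>>,
    }

    impl SlotsSinceCache {
        /// `latest` is the full parent -> children mapping read from the ledger.
        /// Returns only the children that are new since the previous call and
        /// records them so the next call can skip them.
        fn delta(&mut self, latest: &HashMap<u64, Vec<u64>>) -> HashMap<u64, Vec<u64>> {
            let mut new_children = HashMap::new();
            for (parent, children) in latest {
                let seen = self.reported.entry(*parent).or_default();
                let fresh: Vec<u64> = children
                    .iter()
                    .copied()
                    // insert() returns true only for children we have not seen before
                    .filter(|child| seen.insert(*child))
                    .collect();
                if !fresh.is_empty() {
                    new_children.insert(*parent, fresh);
                }
            }
            new_children
        }
    }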

@yhchiang-sol
Contributor

yhchiang-sol commented May 3, 2022

Thanks for raising the issue. I will take a deeper look.

I happen to have one node with RocksDB perf metrics enabled, which might provide some relevant information.

When get_slots_since gets slower, we spend more time reading table files (get_from_output_files_nanos) and creating new iterators (new_table_block_iter_nanos):
(screenshot: RocksDB perf metrics during the slowdown)

compared to the normal situation, where most of the time is spent getting data from the memtable (get_from_memtable_nanos) -- this also means that in the normal situation we are mostly reading recent data that is still in the in-memory buffer.

(screenshot: RocksDB perf metrics in the normal situation)

Column-family-wise, the index, slot-meta, and erasure-meta column families seem to contribute the most during the same period of time.
(screenshot: per-column-family read metrics over the same period)

@yhchiang-sol
Contributor

Looks like we are issuing multiple get operations sequentially in get_slots_since(). One possible way to improve it is to use RocksDB's batched MultiGet API, but RocksDB's C and Rust APIs only expose the unbatched version of MultiGet. Let me see if I can make the batched version available in the C and Rust APIs.
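For reference, a rough sketch of the two access patterns in rust-rocksdb; the "meta" column family name and big-endian slot keys are illustrative assumptions, not the actual Blockstore code.

    use rocksdb::{DB, Error};

    // One get() per slot, issued sequentially; each lookup that misses the
    // memtable can cost its own disk read, matching the perf metrics above.
    fn slot_metas_sequential(db: &DB, slots: &[u64]) -> Vec<Result<Option<Vec<u8>>, Error>> {
        let cf = db.cf_handle("meta").expect("missing column family");
        slots
            .iter()
            .map(|slot| db.get_cf(cf, slot.to_be_bytes()))
            .collect()
    }

    // The unbatched MultiGet that rust-rocksdb already exposes: a single call,
    // while the batched variant discussed below additionally lets RocksDB
    // process the whole key set together internally.
    fn slot_metas_multi_get(db: &DB, slots: &[u64]) -> Vec<Result<Option<Vec<u8>>, Error>> {
        let cf = db.cf_handle("meta").expect("missing column family");
        db.multi_get_cf(slots.iter().map(|slot| (cf, slot.to_be_bytes())))
    }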

@yhchiang-sol
Contributor

Created facebook/rocksdb#9952, which makes the batched MultiGet API available in RocksDB's C API.

@yhchiang-sol
Contributor

The PR for rust-rocksdb: rust-rocksdb/rust-rocksdb#633

@steviez
Contributor

steviez commented Sep 19, 2022

@yhchiang-sol - While the optimization to use multi_get() definitely showed improvements, I'm not sure it is exactly what Stephen had in mind.

In ReplayStage::generate_new_bank_forks(), the list of slots passed to Blockstore::get_slots_since() comes from the call below:

        let frozen_bank_slots: Vec<u64> = frozen_banks
            .keys()
            .cloned()
            .filter(|s| *s >= forks.root())
            .collect();

When the network/node is healthy, frozen banks are getting rooted and thus pruned from bank_forks at the same rate that new banks are getting frozen. So, the number of slots passed to the blockstore query is roughly constant.

When roots stop getting made, the pruning does not happen, and we begin to accumulate frozen banks in bank_forks. This has other repercussions as well (see #23061), but in the context of this issue, it means we are passing lots of repeated keys to Blockstore::get_slots_since().

So, the optimization of get_slots_since() is good, but we could potentially realize even larger savings by being smarter about avoiding repeated queries. That being said, after thinking about this for a minute, I think we would need a fairly drastic change to how we discover new children in order to avoid passing all frozen banks to the query on every execution.

For example, consider the following sequence where bankX refers to a bank at some slot X:
(a) bankX is not frozen, and as such, isn't yielded in the frozen_bank_slots list when generate_new_bank_forks() is called
(b) bankX is completed and frozen
(c) On the next iteration of generate_new_bank_forks(), X is included in frozen_bank_slots and as such, all of X's children will have banks created
(d) Some future slot Z is created whose parent is X
(e) generate_new_bank_forks() is called again which calls get_slots_since()

The initial idea was that since we had seen X's children in (c), we could potentially exclude X from the get_slots_since() call in (e). However, with our current discovery mechanism, we'd have to query slot X again to learn about Z.

We actually encounter this kind of scenario (which would break that exclusion) around epoch boundaries, where we might have several forks in the new epoch (we've seen one per leader for the first several leaders) all building off the last slot in the previous epoch.

Haven't given much thought to alternatives, but maybe it's also not worth spending additional cycles on this at the moment? Maybe we just take the gains we get with multi_get().

@yhchiang-sol
Contributor

So, the optimization of get_slots_since() is good, but we could potentially realize even larger savings by being smarter about avoiding repeated queries.

Looks like a good opportunity here! RocksDB's batched multi_get() also comes with an optional boolean indicating whether the input keys are already sorted. If so, then internally RocksDB will skip the sorting.

   virtual void MultiGet(const ReadOptions& options,
                         ColumnFamilyHandle* column_family,
                         const size_t num_keys, const Slice* keys,
                         PinnableSlice* values, Status* statuses,
                         const bool /*sorted_input*/ = false);

And, luckily, I did port this optional flag to rust-rocksdb, so we can use it without waiting for another release!

     /// Return the values associated with the given keys and the specified column family
     /// where internally the read requests are processed in batch if block-based table
     /// SST format is used.  It is a more optimized version of multi_get_cf.
     pub fn batched_multi_get_cf<K, I>(
         &self,
         cf: &impl AsColumnFamilyRef,
         keys: I,
         sorted_input: bool,
     ) -> Vec<Result<Option<DBPinnableSlice>, Error>>

If we also want to avoid repeated queries, then sorting (and de-duplicating) the keys before passing them to multi_get() makes a lot of sense. By doing this, we can solve two problems at once!
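A rough sketch of what that could look like with the batched_multi_get_cf signature above; the "meta" column family name and big-endian slot keys are illustrative assumptions, not the actual Blockstore code.

    use rocksdb::{DB, Error};
    use std::collections::BTreeSet;

    fn slot_metas_batched(db: &DB, slots: &[u64]) -> Vec<Result<Option<Vec<u8>>, Error>> {
        let cf = db.cf_handle("meta").expect("missing column family");
        // A BTreeSet both de-duplicates repeated slots and yields them in
        // ascending order. With big-endian keys and the default bytewise
        // comparator, ascending slot order is ascending key order, so we can
        // tell RocksDB the input is already sorted.
        let keys: Vec<[u8; 8]> = slots
            .iter()
            .copied()
            .collect::<BTreeSet<u64>>()
            .into_iter()
            .map(u64::to_be_bytes)
            .collect();
        db.batched_multi_get_cf(cf, &keys, /* sorted_input */ true)
            .into_iter()
            // Copy out of the pinned slices so the results outlive this call.
            .map(|result| result.map(|value| value.map(|slice| slice.to_vec())))
            .collect()
    }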

yhchiang-sol added a commit that referenced this issue Sep 20, 2022
…27686)

#### Problem
The current implementation of get_slots_since() invokes multiple rocksdb::get().
As a result, each get() operation may end up requiring one disk read.  This leads
to poor performance of get_slots_since described in #24878.

#### Summary of Changes
This PR makes get_slots_since() use the batched version of multi_get() instead,
which allows multiple get operations to be processed in batch so that they can
be answered with fewer disk reads.
@github-actions github-actions bot added the stale label Sep 20, 2023
@github-actions github-actions bot closed this as not planned (stale) Sep 27, 2023