-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
at startup, keep duplicates in in-memory index since they will be cleaned shortly #30736
Conversation
33e03b6
to
efef26d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good. A few questions below.
// merge this in, mark as duplicate | ||
duplicates.push((slot, k)); | ||
if current_slot_list.len() == 1 { | ||
Some((current_slot_list, ref_count)) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is actually for the None
arm, but I cannot comment on it (line 1068)
When inserting/updating the disk, I remember you mentioning that we don't need anything on disk with ref counts greater than 1; is that the case here with this logic (where new_ref_count
is always 1 on line 1071)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not following. The none arm is for when we are inserting pubkey A but A does not exist on disk yet. So we get a None. Then, we insert it directly to disk with a ref count of 1 (or 0 if entry is cached). This is correct and what we would expect. A second attempt at inserting will leave disk unchanged and add the duplicate to duplicates
. Later, when duplicates
is iterated, we will load the single entry (and 1 refcount) from disk and merge it with the contents of duplicates
, increasing the refcount on the in-memory entry to 2+. The eviction caching logic will prevent us from removing the 2+ refcount entries to disk depending on the tuning of that cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically I was trying to answer the question of "the refcount is always 1 here, right?". I didn't know/remember about the cached case, so refcount can be 0 too. Was seeing if an assert here made sense or not. I don't think anything needs to change; was endeavoring to understand more 😺
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Codecov Report
@@ Coverage Diff @@
## master #30736 +/- ##
=======================================
Coverage 81.4% 81.4%
=======================================
Files 723 723
Lines 203533 203537 +4
=======================================
+ Hits 165845 165876 +31
+ Misses 37688 37661 -27 |
…l be cleaned shortly (solana-labs#30736)" This reverts commit 9a1d5ea.
Problem
See #30711
At startup, scan storages to populate index. We can easily identify pubkeys that are in multiple slots (duplicates).
Today, these duplicates are inserted into the disk index. As soon as the first clean runs, the index entries for all these items will be loaded and the older ones will be cleaned (ie. removed) and then the entries will be rewritten. This write/read/rewrite is a waste of i/o bandwidth.
Summary of Changes
Hold all items with duplicates in in-memory index since we know clean will shortly read/modify/write all of them. This improves index generation performance and the performance of the first clean.
Fixes #