feat: resharding v3 read-only state split in test loop #12211
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #12211 +/- ##
==========================================
- Coverage 71.83% 71.83% -0.01%
==========================================
Files 827 827
Lines 166626 166675 +49
==========================================
+ Hits 119704 119736 +32
- Misses 41708 41722 +14
- Partials 5214 5217 +3
Force-pushed from 998bddd to 9eefdc9
🚀
@@ -129,6 +133,8 @@ impl ReshardingManager {
store_update.commit()?;
Would it make sense to do a single commit in this function? I'm thinking of scenarios where the first commit succeeds but the next one or chain_store_update.commit() does not. I know in the end we can just restart the node from an older DB snapshot.
Very nice suggestion (and below, too). ChainStoreUpdate actually has a helper for that already.
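The concern above is a partial-failure window between two commits. A minimal sketch of the "merge, then commit once" idea, using hypothetical mock types (nearcore's real StoreUpdate/ChainStoreUpdate APIs are richer; this only illustrates the atomicity argument):

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a store update accumulating writes.
#[derive(Default)]
struct StoreUpdate {
    ops: Vec<(String, String)>,
}

// Hypothetical stand-in for ChainStoreUpdate absorbing other updates.
#[derive(Default)]
struct ChainStoreUpdate {
    pending: StoreUpdate,
}

impl ChainStoreUpdate {
    // Absorb another update instead of committing it on its own.
    fn merge(&mut self, other: StoreUpdate) {
        self.pending.ops.extend(other.ops);
    }

    // Single commit point: all accumulated ops land together.
    fn commit(self, db: &mut HashMap<String, String>) {
        for (k, v) in self.pending.ops {
            db.insert(k, v);
        }
    }
}

fn merge_and_commit() -> HashMap<String, String> {
    let mut db = HashMap::new();
    let mut chain_update = ChainStoreUpdate::default();

    let mut trie_update = StoreUpdate::default();
    trie_update.ops.push(("chunk_extra".to_string(), "state_root".to_string()));
    chain_update.merge(trie_update); // no intermediate commit

    chain_update.commit(&mut db); // one atomic commit
    db
}

fn main() {
    let db = merge_and_commit();
    assert_eq!(db.get("chunk_extra").map(String::as_str), Some("state_root"));
}
```

With a single commit point, a crash either leaves the DB untouched or fully updated, never in between.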
core/store/src/trie/mem/mem_tries.rs (Outdated)
/// one.
pub fn freeze(
    self,
) -> (FrozenArena, HashMap<StateRoot, Vec<MemTrieNodeId>>, BTreeMap<BlockHeight, Vec<StateRoot>>)
nit: add a new struct composed of the arena, roots and heights; it can be reused in from_frozen_arena
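A sketch of what that struct could look like, replacing the three-element tuple. The field types are placeholder aliases standing in for nearcore's real types, so this only shows the shape, not the actual definitions:

```rust
use std::collections::{BTreeMap, HashMap};

// Placeholder aliases standing in for nearcore's real types.
type FrozenArena = Vec<u8>;
type StateRoot = [u8; 32];
type MemTrieNodeId = usize;
type BlockHeight = u64;

/// Hypothetical bundle replacing the tuple returned by `freeze`;
/// the same struct could be consumed by `from_frozen_arena`.
struct FrozenMemTries {
    arena: FrozenArena,
    roots: HashMap<StateRoot, Vec<MemTrieNodeId>>,
    heights: BTreeMap<BlockHeight, Vec<StateRoot>>,
}

impl FrozenMemTries {
    fn new(
        arena: FrozenArena,
        roots: HashMap<StateRoot, Vec<MemTrieNodeId>>,
        heights: BTreeMap<BlockHeight, Vec<StateRoot>>,
    ) -> Self {
        FrozenMemTries { arena, roots, heights }
    }
}

fn main() {
    let frozen = FrozenMemTries::new(Vec::new(), HashMap::new(), BTreeMap::new());
    assert!(frozen.arena.is_empty());
    assert!(frozen.roots.is_empty());
    assert!(frozen.heights.is_empty());
}
```

Naming the fields also documents what each tuple position meant at every call site.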
@@ -32,12 +33,22 @@ impl<'a, M: ArenaMemory> MemTrieUpdate<'a, M> {
/// responsibility to apply the changes.
pub fn retain_split_shard(
Do we plan to test the correctness of this function as part of more elaborate test loop tests?
For sure. We could already test that access keys, contract codes and storages correspond to accounts well, but that feels like separate work.
LGTM, sorry I'm late with the review
self.epoch_manager.will_shard_layout_change(prev_hash)?
    && self.epoch_manager.is_next_block_epoch_start(block.hash())?;
self.epoch_manager.is_next_block_epoch_start(block_hash)?
    && shard_layout != next_shard_layout;
if !next_block_has_new_shard_layout {
Can you add debug logs to the early returns? One day, one of them will help with debugging ;)
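A minimal sketch of the requested early-return logging. nearcore would use tracing::debug! rather than eprintln!, and the function name and prev_hash parameter are illustrative stand-ins, not the real signature:

```rust
// Hypothetical helper showing the pattern: log before every early return
// so the skipped path is visible when debugging later.
fn should_start_resharding(next_block_has_new_shard_layout: bool, prev_hash: &str) -> bool {
    if !next_block_has_new_shard_layout {
        // In nearcore this would be tracing::debug!(target: "resharding", ...).
        eprintln!("debug: skipping resharding for {prev_hash}: next block keeps the current shard layout");
        return false;
    }
    true
}

fn main() {
    assert!(!should_start_resharding(false, "block_a"));
    assert!(should_start_resharding(true, "block_a"));
}
```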
// TODO(#12019): what if node doesn't have memtrie? just pause
// processing?
// TODO(#12019): fork handling. if epoch is finalized on different
// blocks, the second finalization will crash.
nit: I usually do TODO(resharding) for generic todos. Yours is totally fine though, especially if you're planning on doing those yourself.
) -> Result<(), StorageError> {
    let mut outer_guard = self.0.mem_tries.write().unwrap();
    let Some(memtries) = outer_guard.remove(&source_shard_uid) else {
        return Err(StorageError::MemTrieLoadingError("Memtrie not loaded".to_string()));
nit: Can you add some context (freeze/resharding) and the source and targets to the error message?
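A sketch of the requested error message, with operation context and the shard ids included. The variant name mirrors the StorageError::MemTrieLoadingError in the diff above, but the shard ids are plain u64 stand-ins rather than real ShardUIds:

```rust
// Hypothetical single-variant enum mirroring the error in the diff.
#[derive(Debug)]
enum StorageError {
    MemTrieLoadingError(String),
}

// Build the error with enough context to act on: what was being done,
// which shard was missing, and which shards were the targets.
fn memtrie_not_loaded(source_shard: u64, target_shards: &[u64]) -> StorageError {
    StorageError::MemTrieLoadingError(format!(
        "Memtrie not loaded while freezing for resharding: source shard {source_shard}, target shards {target_shards:?}"
    ))
}

fn main() {
    let StorageError::MemTrieLoadingError(msg) = memtrie_not_loaded(1, &[2, 3]);
    assert!(msg.contains("source shard 1"));
    assert!(msg.contains("[2, 3]"));
}
```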
let shard_id = chunk_header.shard_id();
let shard_index = shard_layout.get_shard_index(shard_id);
I know it's probably my code, and it likely is correct because we only have new chunks here, but this pattern makes me uneasy anyway, if only because someone may copy-paste it somewhere where the assumption is not valid. I'll have another go at those and fix it; I suppose there is more than one such case here.
@@ -99,7 +99,6 @@ impl From<STArena> for HybridArena {

impl HybridArena {
    /// Function to create a new HybridArena from an existing instance of shared memory in FrozenArena.
    #[allow(dead_code)]
🎉
@@ -52,6 +62,15 @@ impl MemTries {
    }
}

pub fn from_frozen_memtries(shard_uid: ShardUId, frozen_memtries: FrozenMemTries) -> Self {
Can you add a comment? When dealing with resharding it's a good practice to indicate whether shard_uid is that of the parent or the child.
source_shard_uid: ShardUId,
target_shard_uids: Vec<ShardUId>,
nit: Are those the parent and children? If so, would it make sense to name them that?
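A hypothetical signature sketch of the suggested rename, with the arguments named by their resharding role instead of source/target. ShardUId is a u64 stand-in here, and the function body only illustrates why the parent is included alongside the children:

```rust
// Stand-in for nearcore's ShardUId.
type ShardUId = u64;

// Hypothetical renamed signature: parent/children instead of source/target.
fn shards_to_freeze(parent_shard_uid: ShardUId, children_shard_uids: Vec<ShardUId>) -> Vec<ShardUId> {
    // The parent is included because its frozen memtrie is shared with the
    // children and is kept around for fork handling.
    let mut all = vec![parent_shard_uid];
    all.extend(children_shard_uids);
    all
}

fn main() {
    assert_eq!(shards_to_freeze(1, vec![2, 3]), vec![1, 2, 3]);
}
```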
let memtries = std::mem::replace(&mut *guard, MemTries::new(source_shard_uid));
let frozen_memtries = memtries.freeze();

for shard_uid in [vec![source_shard_uid], target_shard_uids].concat() {
I'm curious, why do you need to also freeze the source shard uid? It makes sense by all means, I'm just trying to understand the logic.
It's not "also", I think. The source shard uid memtrie always has to be frozen to be shared with the children. Here we just put the frozen memtrie back into the ShardTries, so that it can be used for proper fork handling.
}

// If resharding happened, also check that each shard has non-empty state.
for shard_id in 0..3 {
nit: iterate shard_layout.shard_ids() or shard_uids() instead (replace shard_layout with the right one)
Done in #12221
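A self-contained sketch of the nit: iterating the layout's shard ids rather than hard-coding 0..3, so the check stays valid after a resharding changes the shard count. The ShardLayout here is a mock with a shard_ids() method, not nearcore's real type:

```rust
// Mock shard layout; the real one comes from the epoch manager.
struct ShardLayout {
    ids: Vec<u64>,
}

impl ShardLayout {
    fn shard_ids(&self) -> impl Iterator<Item = u64> + '_ {
        self.ids.iter().copied()
    }
}

// Check every shard in the layout has non-empty state, whatever the count.
fn all_shards_non_empty(layout: &ShardLayout, items_per_shard: &[usize]) -> bool {
    layout.shard_ids().all(|id| items_per_shard[id as usize] > 0)
}

fn main() {
    let layout = ShardLayout { ids: vec![0, 1, 2] };
    assert!(all_shards_non_empty(&layout, &[5, 1, 2]));
    assert!(!all_shards_non_empty(&layout, &[5, 0, 2]));
}
```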
assert!(items > 0);
}

return true;
};

test_loop.run_until(
Not for this PR but I would be tempted to pretty print the state (perhaps accounts only) before and after resharding.
Basically a response to #12211 (review).
Freeze memtries for source shard and create correct memtries for target shards after resharding, so that chain state can be properly read after resharding.
This is the major part of "early MVP", after which we can consider testing resharding on real nodes. While resharding isn't properly implemented for receipts, and state uses only frozen memtrie, the live test can already be useful for catching bugs.
A couple more changes: use ChainStoreUpdate for "chain resharding updates". This is because ChainStore has a cache for chunk extras which also needs to be updated.