
feat: resharding v3 read-only state split in test loop #12211

Merged
merged 14 commits into from
Oct 15, 2024

Conversation

Member

@Longarithm Longarithm commented Oct 11, 2024

Freeze memtries for the source shard and create correct memtries for the target shards after resharding, so that the chain state can be read properly afterwards.

This is the major part of the "early MVP", after which we can consider testing resharding on real nodes. Although resharding isn't yet implemented for receipts and the state uses only the frozen memtrie, the live test can already be useful for catching bugs.

A couple more changes:

  • Use ChainStoreUpdate for the "chain resharding updates", because ChainStore has a cache for chunk extras which also needs to be updated.
  • When validating chunk endorsements, get the shard id and index only after we know the chunk is new.
  • Use the correct boundary account and intervals for the state split.
  • Check that the state is non-empty for all shards.


codecov bot commented Oct 11, 2024

Codecov Report

Attention: Patch coverage is 84.70588% with 13 lines in your changes missing coverage. Please review.

Project coverage is 71.83%. Comparing base (04eb2c7) to head (ca3d138).
Report is 7 commits behind head on master.

Files with missing lines Patch % Lines
chain/chain/src/resharding/manager.rs 69.23% 4 Missing and 8 partials ⚠️
core/store/src/trie/shard_tries.rs 95.45% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #12211      +/-   ##
==========================================
- Coverage   71.83%   71.83%   -0.01%     
==========================================
  Files         827      827              
  Lines      166626   166675      +49     
==========================================
+ Hits       119704   119736      +32     
- Misses      41708    41722      +14     
- Partials     5214     5217       +3     
Flag Coverage Δ
backward-compatibility 0.17% <0.00%> (-0.01%) ⬇️
db-migration 0.17% <0.00%> (-0.01%) ⬇️
genesis-check 1.25% <0.00%> (-0.01%) ⬇️
integration-tests 38.93% <84.70%> (+0.16%) ⬆️
linux 71.50% <8.23%> (-0.03%) ⬇️
linux-nightly 71.43% <84.70%> (+0.01%) ⬆️
macos 52.99% <8.23%> (-1.55%) ⬇️
pytests 1.57% <0.00%> (-0.01%) ⬇️
sanity-checks 1.37% <0.00%> (-0.01%) ⬇️
unittests 65.61% <8.23%> (-0.05%) ⬇️
upgradability 0.21% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@Longarithm Longarithm marked this pull request as ready for review October 14, 2024 12:56
@Longarithm Longarithm requested a review from a team as a code owner October 14, 2024 12:56
@Longarithm Longarithm changed the title draft: resharding v3 read-only state split in test loop feat: resharding v3 read-only state split in test loop Oct 14, 2024
Contributor

@Trisfald Trisfald left a comment


🚀

@@ -129,6 +133,8 @@ impl ReshardingManager {
store_update.commit()?;
Contributor

Would it make sense to do a single commit in this function?
I'm thinking of scenarios where the first commit succeeds but the next one, or chain_store_update.commit(), does not.
I know that in the end we can just restart the node from an older DB snapshot.

Member Author

Very nice suggestion (and the one below, too). ChainStoreUpdate actually already has a helper for that.

/// one.
pub fn freeze(
self,
) -> (FrozenArena, HashMap<StateRoot, Vec<MemTrieNodeId>>, BTreeMap<BlockHeight, Vec<StateRoot>>)
Contributor

nit: add a new struct composed of the arena, roots and heights; it can be reused in from_frozen_arena

@@ -32,12 +33,22 @@ impl<'a, M: ArenaMemory> MemTrieUpdate<'a, M> {
/// responsibility to apply the changes.
pub fn retain_split_shard(
Contributor

Do we plan to test the correctness of this function as part of more elaborate test loop tests?

Member Author

For sure. We could already test that access keys, contract code and storage correspond well to accounts, but that feels like separate work.

@Longarithm Longarithm added this pull request to the merge queue Oct 15, 2024
Merged via the queue into near:master with commit 3221b86 Oct 15, 2024
28 of 30 checks passed
@Longarithm Longarithm deleted the rv3-state-roots branch October 15, 2024 09:18
Contributor

@wacban wacban left a comment


LGTM, sorry I'm late with the review

self.epoch_manager.will_shard_layout_change(prev_hash)?
&& self.epoch_manager.is_next_block_epoch_start(block.hash())?;
self.epoch_manager.is_next_block_epoch_start(block_hash)?
&& shard_layout != next_shard_layout;
if !next_block_has_new_shard_layout {
Contributor

Can you add debug logs to the early returns? One day, one of them will help with debugging ;)

Comment on lines +74 to +77
// TODO(#12019): what if node doesn't have memtrie? just pause
// processing?
// TODO(#12019): fork handling. if epoch is finalized on different
// blocks, the second finalization will crash.
Contributor

nit: I usually use TODO(resharding) for generic todos. Yours is totally fine though, especially if you're planning to do those yourself.

) -> Result<(), StorageError> {
let mut outer_guard = self.0.mem_tries.write().unwrap();
let Some(memtries) = outer_guard.remove(&source_shard_uid) else {
return Err(StorageError::MemTrieLoadingError("Memtrie not loaded".to_string()));
Contributor

nit: Can you add some context (freeze/resharding) and the source and targets to the error message?

Comment on lines +64 to +65
let shard_id = chunk_header.shard_id();
let shard_index = shard_layout.get_shard_index(shard_id);
Contributor

I know it's probably my code, and it's likely correct because we only have new chunks here, but this pattern makes me uneasy anyway, if only because someone may copy-paste it somewhere where the assumption is not valid. I'll have another go at those and fix it; I suppose there is more than one such case here.

@@ -99,7 +99,6 @@ impl From<STArena> for HybridArena {

impl HybridArena {
/// Function to create a new HybridArena from an existing instance of shared memory in FrozenArena.
#[allow(dead_code)]
Contributor

🎉

@@ -52,6 +62,15 @@ impl MemTries {
}
}

pub fn from_frozen_memtries(shard_uid: ShardUId, frozen_memtries: FrozenMemTries) -> Self {
Contributor

Can you add a comment? When dealing with resharding it's good practice to indicate whether shard_uid is that of the parent or the child.

Comment on lines +493 to +494
source_shard_uid: ShardUId,
target_shard_uids: Vec<ShardUId>,
Contributor

nit: Are those the parent and children? If so would it make sense to name those that?

let memtries = std::mem::replace(&mut *guard, MemTries::new(source_shard_uid));
let frozen_memtries = memtries.freeze();

for shard_uid in [vec![source_shard_uid], target_shard_uids].concat() {
Contributor

I'm curious, why do you need to also freeze the source shard uid? It makes sense by all means, I'm just trying to understand the logic.

Member Author

I don't think it's "also". The source shard's memtrie always has to be frozen so it can be shared with the children. Here we just put the frozen memtrie back into ShardTries so that it can be used for proper fork handling.

}

// If resharding happened, also check that each shard has non-empty state.
for shard_id in 0..3 {
Contributor

nit: iterate over shard_layout.shard_ids() or shard_uids() (substituting the right shard layout)

Member Author

Done in #12221

assert!(items > 0);
}

return true;
};

test_loop.run_until(
Contributor

Not for this PR, but I'd be tempted to pretty-print the state (perhaps accounts only) before and after resharding.
