feat(resharding): flat storage resharding children catchup #12312
Conversation
}

fn handle_memtrie_reload(&self, _shard_uid: ShardUId) {
    // TODO(resharding)
I thought this may be the entrypoint for memtrie rebuild.
@@ -1514,4 +1765,200 @@ mod tests {
    );
    assert_eq!(flat_store.get(right_child_shard, &buffered_receipt_key), Ok(None));
}

/// Base test scenario for testing children catchup.
fn children_catchup_base(with_restart: bool) {
The most relevant change in tests is .. the addition of this test.
    merged_changes.merge(changes);
    store_update.remove_delta(shard_uid, flat_head_block_hash);
}
// TODO (resharding): if flat_head_block_hash == state sync hash -> do snapshot
@marcelo-gonzalez This one is the designated point to trigger the snapshot. We can discuss how it can be done in practice, the important consideration is that here the flat storage has all the state needed.
Ideally we should do this in client. This is perhaps not the best place due to potential race conditions.
Codecov Report
Attention: Patch coverage is

@@            Coverage Diff             @@
##           master   #12312      +/-   ##
==========================================
+ Coverage   71.24%   71.29%    +0.05%
==========================================
  Files         838      838
  Lines      169346   169705      +359
  Branches   169346   169705      +359
==========================================
+ Hits       120651   120998      +347
+ Misses      43449    43446        -3
- Partials     5246     5261       +15

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Would love others to take a look as well
// Shard catchup task is delayed and could get postponed several times. This must be
// done to cover the scenario in which catchup is triggered so fast that the initial
// state of the new flat storage is beyond the chain final tip.
ctx.run_later(
This isn't how run_later works. The only thing the current code does is call handle_flat_storage_shard_catchup after 100 ms.
What we ideally want is something like:
- Check some condition to see if we can run handle_flat_storage_shard_catchup. If true, then just run handle_flat_storage_shard_catchup and return.
- If false, then call ctx.run_later recursively so that the same condition can be checked again later.
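A minimal sketch of that retry pattern, assuming a run_later(delay, callback)-style scheduler as in the diff; the condition check, the scheduler type, and the retry delay are illustrative, not the PR's actual API:

// Sketch only: `DelayedScheduler`, `catchup_can_start` and the delay are assumptions;
// the point is the check-or-reschedule shape rather than the concrete signatures.
fn try_shard_catchup(actor: &mut ReshardingActor, ctx: &mut DelayedScheduler, shard_uid: ShardUId) {
    if actor.catchup_can_start(shard_uid) {
        // Condition met: run the catchup immediately and return.
        actor.handle_flat_storage_shard_catchup(shard_uid);
    } else {
        // Condition not met yet: re-schedule this same check instead of blocking,
        // so other actors keep making progress in the meantime.
        ctx.run_later(RETRY_DELAY_MS, move |actor, ctx| {
            try_shard_catchup(actor, ctx, shard_uid);
        });
    }
}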
Maybe run_later is misused here. This has been my tentative solution to the following problem.
Originally, I was calling handle_flat_storage_shard_catchup directly and waiting some time inside the function. But it was failing in the resharding_v3 test loop test because the test was getting stuck inside this function call: the other actors weren't progressing at all, so the condition to resume catchup was never satisfied.
Basically, I need a way to postpone execution of catchup until the canonical chain makes enough progress. I found that run_later achieves that by allowing other actors to make progress in the meantime.
Should I rather change the test_loop test to avoid this problem?
Please feel free to suggest a better approach (even offline).
This is an example of how run_later is used. We can discuss this offline
    &self,
    shard_uid: ShardUId,
    flat_head_block_hash: CryptoHash,
    chain_store: &ChainStore,
Aghh... another instance of ChainStore being passed around! :(
Now I really want to land #12159 soon!
(Just personal rant)
I'd love to not have to pass this chain_store thingy around 🤣
merged_changes.apply_to_flat_state(&mut store_update, shard_uid);
store_update.set_flat_storage_status(
    shard_uid,
    FlatStorageStatus::Resharding(FlatStorageReshardingStatus::CatchingUp(
Quick sanity check, when do we set the status to CatchingUp? Should this be at the beginning of the apply_deltas function or here? Also, shouldn't we do this once instead of doing it for each batch?
We set the status to CatchingUp the first time in split_shard_task_postprocessing, after all key-value pairs have been split successfully.
Then we do it in each batch to update the block hash of the flat head.
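To picture what a single batch does, here is a rough sketch based on the diff fragments above; the adapter type and the exact payload of CatchingUp are assumptions:

// Sketch, not the actual PR code: apply one batch of merged deltas, then record the
// new flat head hash while keeping the CatchingUp status, so a node restart can
// resume catchup from the right block.
fn apply_catchup_batch(
    store_update: &mut FlatStoreUpdateAdapter,
    shard_uid: ShardUId,
    merged_changes: FlatStateChanges,
    new_flat_head_hash: CryptoHash,
) {
    merged_changes.apply_to_flat_state(store_update, shard_uid);
    store_update.set_flat_storage_status(
        shard_uid,
        FlatStorageStatus::Resharding(FlatStorageReshardingStatus::CatchingUp(
            new_flat_head_hash,
        )),
    );
}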
};
if height > chain_final_head.height {
    info!(target = "resharding", ?height, chain_final_height = ?chain_final_head.height, "flat head beyond chain final tip: postponing flat storage shard catchup task");
    self.scheduler.send(ReshardingRequest::FlatStorageShardCatchup {
Wait, if this check is saying height is already beyond final head, why are we calling the schedule here?
Oh, this is a surprise I got from the test loop resharding test.
It shouldn't happen in practice, but the logic is the following:
If the split operation is very fast, since we split the parent flat head and deltas together, the children's flat storage after the split may have its flat head beyond the final block height. We don't want that because of the invariant flat head <= final block height, so we delay the catchup until the canonical chain contains the children's flat head hash.
Hmm, I see, well, I don't think this should be the solution. Instead what we should do is delay and retry the original split flat storage request (the run_later logic) for FlatStorageSplitShard and then we can hopefully remove this code and the retry logic for FlatStorageShardCatchup
(assuming FlatStorageShardCatchup is called AFTER split is completed)
*retry when height <= chain head height
let mut deltas_gc_count = 0;
for delta_metadata in deltas_metadata {
    if delta_metadata.block.height <= flat_head.height {
        store_update.remove_delta(shard_uid, delta_metadata.block.hash);
I would like @Longarithm to take a look at this function and check whether it looks good
/// Dedicated actor for resharding V3.
-pub struct ReshardingActor {}
+pub struct ReshardingActor {
+    chain_store: ChainStore,
This doesn't look right. If resharder is the entity that requires the chain store, shouldn't it hold chain store instead of ReshardingActor?
Originally I thought chain_store might be used outside of resharder. If passing it around is not too complicated I can move it inside resharder
Looks good at first glance; I added a few comments for now.
    )),
);
self.scheduler.send(ReshardingRequest::FlatStorageShardCatchup {
Is this going from the resharding actor to itself?
It becomes really really hard to figure out which of these functions are called from within resharding actor and which aren't. That's why I wanted us to keep all functions executed in the actor as part of the actor instead of here.
Even just for this review, I had to bounce around several times to see who is calling who and when does a resharding request get converted to a function call in flat_storage_resharder
Yeah now that you mentioned it I also struggled to grasp what's happening where. Do you think this should be addressed in this PR or as a follow up? @shreyan-gupta @Trisfald
I think we can follow up in a separate PR, try to get this in first
// If the flat head is not in the canonical chain this task has failed.
match chain_store.get_block_hash_by_height(height) {
That isn't what I expected the check for canonical to look like but alright :)
What were the alternatives @wacban?
Oh, I don't know, it just surprised me. There is a function is_on_current_chain in chain that seems to work in a similar way.
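For reference, a minimal sketch of this style of canonicity check, built around the get_block_hash_by_height lookup shown in the diff; the helper name and the error handling are illustrative:

// Sketch: a block is considered on the canonical chain if the chain store's height
// index maps its height back to the same hash. Any lookup error is treated as
// "not canonical" here for simplicity.
fn is_flat_head_canonical(
    chain_store: &ChainStore,
    flat_head_hash: &CryptoHash,
    flat_head_height: BlockHeight,
) -> bool {
    match chain_store.get_block_hash_by_height(flat_head_height) {
        Ok(hash_at_height) => &hash_at_height == flat_head_hash,
        Err(_) => false,
    }
}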
let new_account_left_child = account!(format!("oo{}", height));
let state_changes_left_child = vec![RawStateChangesWithTrieKey {
    trie_key: TrieKey::Account { account_id: new_account_left_child.clone() },
    changes: vec![RawStateChange {
        cause: StateChangeCause::InitialState,
        data: Some(new_account_left_child.as_bytes().to_vec()),
    }],
}];
manager
    .save_flat_state_changes(
        block_hash,
        prev_hash,
        height,
        left_child_shard,
        &state_changes_left_child,
    )
    .unwrap()
    .commit()
    .unwrap();
nit: This is copy pasted, can you refactor to a method?
Done!
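For context, the extracted helper could look roughly like this; the helper name is hypothetical and the body simply mirrors the copy-pasted snippet above:

// Hypothetical helper shape: create a new account via a flat state change and
// commit it to the given child shard at the given block.
fn save_new_account_flat_state_change(
    manager: &FlatStorageManager,
    account: AccountId,
    block_hash: CryptoHash,
    prev_hash: CryptoHash,
    height: BlockHeight,
    shard_uid: ShardUId,
) {
    let state_changes = vec![RawStateChangesWithTrieKey {
        trie_key: TrieKey::Account { account_id: account.clone() },
        changes: vec![RawStateChange {
            cause: StateChangeCause::InitialState,
            data: Some(account.as_bytes().to_vec()),
        }],
    }];
    manager
        .save_flat_state_changes(block_hash, prev_hash, height, shard_uid, &state_changes)
        .unwrap()
        .commit()
        .unwrap();
}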
// Shard catchup task is delayed and could get postponed several times. This must be
// done to cover the scenario in which catchup is triggered so fast that the initial
// state of the new flat storage is beyond the chain final tip.
omg, thanks for the comment
I have addressed the most immediate feedback in the PR. As discussed offline, major changes such as delaying the split shard task instead of the catchup task will be done in another PR.
PR to add a children catchup step for flat storages created as a result of a parent shard split.
In previous iterations, the two children shards were populated in a background task from the flat storage of the parent at the height of the last block of the old shard layout (post-processing). Since that task takes a long time and the children are active shards in the first block of the new shard layout, their flat storage accumulates a lot of deltas.
The catchup step applies deltas in the background, then finalizes the creation of the child flat storage, and triggers a possible memtrie rebuild.
Part of #12174
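At a high level, the catchup flow described above does something along these lines; the method names below are assumptions used for illustration (except handle_memtrie_reload, which appears in the diff):

// High-level sketch of the children catchup flow, not the literal PR code.
fn shard_catchup(resharder: &FlatStorageResharder, shard_uid: ShardUId) {
    // 1. Apply the accumulated deltas in batches; after each batch the flat head
    //    recorded in the CatchingUp status is advanced, so the task can resume
    //    from the right block after a restart.
    while resharder.has_pending_deltas(shard_uid) {
        resharder.apply_deltas_batch(shard_uid);
    }
    // 2. Finalize creation of the child flat storage: mark it Ready and
    //    garbage-collect deltas at or below the new flat head.
    resharder.finalize_child_flat_storage(shard_uid);
    // 3. Trigger a possible memtrie rebuild for the child shard.
    resharder.handle_memtrie_reload(shard_uid);
}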