Enable in-memory trie when state sync #10820

Merged
merged 6 commits into from
Mar 27, 2024

Conversation

staffik
Contributor

@staffik staffik commented Mar 18, 2024

Issue: #10564

Summary
Adds logic to load / unload in-memory tries that works with state sync. Enables in-memory trie with single shard tracking.

Changes

  • Add optional state_root parameter for memtrie loading logic - it's needed when we cannot read the state root from chunk extra.
  • Add load_mem_tries_for_tracked_shards config parameter.
  • Add methods for loading / unloading in-memory tries.
  • Remove obsolete tries from memory before each new state sync.
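The changes listed above can be sketched as a small registry of loaded tries. This is a minimal illustration only: `ShardUid`, `StateRoot`, `MemTrie`, `MemTries`, and `record_chunk_extra_root` are hypothetical stand-ins and not the types or signatures used in the PR. It shows the optional `state_root` fallback, a flag in the spirit of `load_mem_tries_for_tracked_shards`, and the load / unload / retain operations.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a shard identifier (not the real type).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ShardUid(pub u32);

/// Hypothetical stand-in for a 32-byte state root hash.
pub type StateRoot = [u8; 32];

/// Placeholder for a loaded in-memory trie; it only remembers its root.
#[derive(Debug, PartialEq, Eq)]
pub struct MemTrie {
    pub root: StateRoot,
}

pub struct MemTries {
    /// Assumed to mirror the idea of `load_mem_tries_for_tracked_shards`.
    load_for_tracked_shards: bool,
    loaded: HashMap<ShardUid, MemTrie>,
    /// Stand-in for state roots that can be read from chunk extra.
    chunk_extra_roots: HashMap<ShardUid, StateRoot>,
}

impl MemTries {
    pub fn new(load_for_tracked_shards: bool) -> Self {
        Self {
            load_for_tracked_shards,
            loaded: HashMap::new(),
            chunk_extra_roots: HashMap::new(),
        }
    }

    pub fn record_chunk_extra_root(&mut self, shard: ShardUid, root: StateRoot) {
        self.chunk_extra_roots.insert(shard, root);
    }

    /// Loads a trie. An explicit `state_root` wins; otherwise the root is
    /// looked up in the chunk-extra stand-in. Fails if neither is available.
    pub fn load_mem_trie(
        &mut self,
        shard: ShardUid,
        state_root: Option<StateRoot>,
    ) -> Result<(), String> {
        if !self.load_for_tracked_shards {
            return Ok(()); // Feature disabled: nothing to load.
        }
        let root = state_root
            .or_else(|| self.chunk_extra_roots.get(&shard).copied())
            .ok_or_else(|| format!("no state root available for {:?}", shard))?;
        self.loaded.insert(shard, MemTrie { root });
        Ok(())
    }

    /// Drops the trie for one shard, if it is loaded.
    pub fn unload_mem_trie(&mut self, shard: &ShardUid) {
        self.loaded.remove(shard);
    }

    /// Drops every loaded trie whose shard is not in `keep`.
    pub fn retain_mem_tries(&mut self, keep: &[ShardUid]) {
        self.loaded.retain(|shard, _| keep.contains(shard));
    }

    pub fn is_loaded(&self, shard: &ShardUid) -> bool {
        self.loaded.contains_key(shard)
    }
}
```

In this sketch, a shard that was just state-synced has no chunk extra yet, so the caller would pass `Some(root)` explicitly; a shard that is already tracked can pass `None`.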

Follow up tasks

@staffik staffik added the A-stateless-validation Area: stateless validation label Mar 18, 2024

codecov bot commented Mar 18, 2024

Codecov Report

Attention: Patch coverage is 69.07895%, with 47 lines in your changes missing coverage. Please review.

Project coverage is 71.52%. Comparing base (c2f9695) to head (9360578).
Report is 4 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| core/store/src/trie/shard_tries.rs | 71.42% | 8 Missing and 2 partials ⚠️ |
| chain/chain/src/runtime/mod.rs | 55.00% | 9 Missing ⚠️ |
| chain/chain/src/test_utils/kv_runtime.rs | 10.00% | 9 Missing ⚠️ |
| chain/client/src/client_actions.rs | 0.00% | 7 Missing ⚠️ |
| chain/client/src/sync_jobs_actions.rs | 0.00% | 3 Missing ⚠️ |
| chain/chain/src/chain.rs | 87.50% | 0 Missing and 2 partials ⚠️ |
| core/store/src/trie/mem/loading.rs | 71.42% | 1 Missing and 1 partial ⚠️ |
| integration-tests/src/genesis_helpers.rs | 0.00% | 2 Missing ⚠️ |
| chain/client/src/client.rs | 94.11% | 0 Missing and 1 partial ⚠️ |
| tools/fork-network/src/cli.rs | 0.00% | 1 Missing ⚠️ |

... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10820      +/-   ##
==========================================
- Coverage   71.65%   71.52%   -0.13%     
==========================================
  Files         758      759       +1     
  Lines      151950   151585     -365     
  Branches   151950   151585     -365     
==========================================
- Hits       108880   108428     -452     
- Misses      38533    38665     +132     
+ Partials     4537     4492      -45     
| Flag | Coverage Δ |
|---|---|
| backward-compatibility | 0.24% <0.00%> (+<0.01%) ⬆️ |
| db-migration | 0.24% <0.00%> (+<0.01%) ⬆️ |
| genesis-check | 1.43% <0.69%> (+<0.01%) ⬆️ |
| integration-tests | 37.09% <63.15%> (-0.26%) ⬇️ |
| linux | 70.00% <68.42%> (-0.12%) ⬇️ |
| linux-nightly | 71.02% <69.07%> (-0.14%) ⬇️ |
| macos | 54.45% <59.33%> (-0.16%) ⬇️ |
| pytests | 1.66% <0.69%> (+<0.01%) ⬆️ |
| sanity-checks | 1.44% <0.69%> (+<0.01%) ⬆️ |
| unittests | 67.16% <60.00%> (-0.14%) ⬇️ |
| upgradability | 0.29% <0.00%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.


@staffik staffik requested review from wacban and robin-near March 21, 2024 00:22
@staffik staffik force-pushed the memtrie-integration-2 branch from 60650ef to 60b6f2f Compare March 21, 2024 13:07
@wacban
Contributor

wacban commented Mar 21, 2024

@staffik It's marked as draft, please open it when it's ready for review.

chain/chain/src/garbage_collection.rs (outdated, resolved)
chain/chain/src/garbage_collection.rs (outdated, resolved)
chain/chain/src/garbage_collection.rs (outdated, resolved)
@staffik staffik force-pushed the memtrie-integration-2 branch from 128a3ea to c4f0908 Compare March 21, 2024 23:03
@staffik staffik marked this pull request as ready for review March 21, 2024 23:04
@staffik staffik requested a review from a team as a code owner March 21, 2024 23:04
@staffik staffik changed the title [Draft] Memtrie integration Enable in-memory trie for state sync Mar 21, 2024
@staffik staffik changed the title Enable in-memory trie for state sync Enable in-memory trie when state sync Mar 21, 2024
@staffik staffik requested a review from robin-near March 22, 2024 06:25
@staffik staffik force-pushed the memtrie-integration-2 branch from c4f0908 to 5709629 Compare March 22, 2024 07:05
@wacban wacban requested a review from VanBarbascu March 22, 2024 07:26
@wacban
Contributor

wacban commented Mar 22, 2024

@VanBarbascu Can you have a look as well?

@staffik staffik force-pushed the memtrie-integration-2 branch 3 times, most recently from d11728c to 3ab877d Compare March 22, 2024 10:03
Contributor

@shreyan-gupta shreyan-gupta left a comment

Nice! Left a couple of comments

chain/chain/src/runtime/mod.rs (resolved)
chain/client/src/client.rs (resolved)
chain/chunks/src/logic.rs (resolved)
chain/client/src/client_actions.rs (outdated, resolved)
chain/epoch-manager/src/adapter.rs (outdated, resolved)
chain/chain/src/chain.rs (outdated, resolved)
core/store/src/trie/shard_tries.rs (outdated, resolved)
Contributor

@wacban wacban left a comment

Can you make sure the following cases are covered, and can you add some integration or nayduck tests for them?

  • node needs to state sync after being offline for too long
  • node needs to state sync to get the shard tracked next epoch
  • node is restarted in the middle of state sync
  • node is restarted during catchup
  • node is restarted after state sync

chain/chain/src/resharding.rs (outdated, resolved)
chain/client/src/client.rs (outdated, resolved)
core/store/src/trie/shard_tries.rs (resolved)
Contributor

@wacban wacban left a comment

Can you also add some debug logs and metrics for loading and unloading? This is likely to come under scrutiny in the future :)

core/store/src/config.rs (outdated, resolved)
core/store/src/trie/shard_tries.rs (resolved)
@staffik staffik force-pushed the memtrie-integration-2 branch from 6eac8cd to c256de7 Compare March 26, 2024 14:44
@staffik staffik force-pushed the memtrie-integration-2 branch from ebcf133 to 0fe2430 Compare March 26, 2024 22:31
@@ -2770,6 +2783,8 @@ impl Chain {
);
store_update.commit()?;
flat_storage_manager.create_flat_storage_for_shard(shard_uid).unwrap();
// Flat storage is ready, load memtrie if it is enabled.
Contributor

Is this going to execute synchronously? If so, does this happen only during startup?

If this does not happen only during startup, is there a way to run it in a different thread? I don't think it is acceptable to pause the chain for 2 minutes.
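One way to picture the "different thread" option raised here is a background worker that reports back over a channel. The sketch below is a hedged illustration, not the approach taken in the codebase: `spawn_memtrie_load`, `load_memtrie_blocking`, and `LoadMemtrieResponse` are hypothetical names, and the "load" only counts entries instead of building a trie.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;

/// Hypothetical message sent back once a memtrie load has finished.
#[derive(Debug, PartialEq, Eq)]
pub struct LoadMemtrieResponse {
    pub shard_id: u64,
    pub entries_loaded: usize,
}

/// Stand-in for the expensive, blocking load. A real load would build the
/// trie from flat storage; here we only count the entries.
fn load_memtrie_blocking(
    shard_id: u64,
    flat_state: Vec<(Vec<u8>, Vec<u8>)>,
) -> LoadMemtrieResponse {
    LoadMemtrieResponse { shard_id, entries_loaded: flat_state.len() }
}

/// Starts the load on a separate thread and returns immediately.
/// The caller keeps processing and checks the receiver later.
pub fn spawn_memtrie_load(
    shard_id: u64,
    flat_state: Vec<(Vec<u8>, Vec<u8>)>,
) -> Receiver<LoadMemtrieResponse> {
    let (sender, receiver) = channel();
    thread::spawn(move || {
        let response = load_memtrie_blocking(shard_id, flat_state);
        // The receiver may already be dropped; ignoring the error is fine here.
        let _ = sender.send(response);
    });
    receiver
}
```

The caller can poll with `try_recv()` between other work, or block with `recv()` only at the point where the loaded trie is actually required.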

chain/chunks/src/logic.rs (outdated, resolved)
.iter()
.map(|id| self.epoch_manager.shard_id_to_uid(*id, &epoch_id).unwrap())
.collect();
self.runtime_adapter.retain_mem_tries(&shard_uids);
Contributor

If for this epoch we track shard 1, and for the next epoch we track 2, we would be starting a state sync for shard 2, but here we would still retain [1, 2], right? So that wouldn't actually unload shard 2?

Contributor Author

Yes

Contributor

What I'm saying is that we should unload shard 2.

Contributor Author

That can happen if shard 2 was tracked in the previous epoch, is not tracked in this epoch, and is tracked in the next epoch. Then shard 2 will indeed not be unloaded by retain_mem_tries. That's why we have unload_mem_trie, which unloads shard 2 before we apply state parts for shard 2 during state sync in this epoch.

chain/chunks/src/logic.rs (resolved)
@@ -81,6 +81,9 @@ impl SyncJobsActions {
}

pub fn handle_apply_state_parts_request(&mut self, msg: ApplyStatePartsRequest) {
// Unload mem-trie (in case it is still loaded) before we apply state parts.
msg.runtime_adapter.unload_mem_trie(&msg.shard_uid);
Contributor

How does this interact with the retain_mem_tries earlier? Do I understand this correctly:

  • If in epoch T-1 we track 1, in epoch T we track 2, and in epoch T+1 we track 3, then in epoch T, we retain [2, 3] (unload 1), and here in state sync we unload 3 (in case it's still loaded)
  • If in epoch T-1 we track 1, in epoch T we track 2, and in epoch T+1 we track 1, then in epoch T, we retain [2, 1] (unload nothing), and here in state sync we unload 1

If that understanding is correct, would it work instead to just call retain_mem_tries using only the "this epoch cares" shards, and then omit the unload_mem_trie call here? Or is there some issue with that?

Contributor Author

@staffik staffik Mar 27, 2024

Your understanding is correct. We need unload_mem_trie for the situation described in #10820 (comment).
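The two scenarios discussed in this thread can be traced with a small simulation. This is a sketch under stated assumptions: loaded tries are modeled as a plain set of shard ids, the free functions below are hypothetical simplifications of `retain_mem_tries` and `unload_mem_trie`, and "shards that need state sync" is assumed to mean shards tracked in the next epoch but not in the current one.

```rust
use std::collections::BTreeSet;

type ShardId = u64;

/// Shards whose tries are kept: those tracked in this epoch or the next.
pub fn shards_to_retain(this_epoch: &[ShardId], next_epoch: &[ShardId]) -> BTreeSet<ShardId> {
    this_epoch.iter().chain(next_epoch.iter()).copied().collect()
}

/// Simplified retain: drop every loaded shard that is not in `keep`.
pub fn retain_mem_tries(loaded: &mut BTreeSet<ShardId>, keep: &BTreeSet<ShardId>) {
    loaded.retain(|shard| keep.contains(shard));
}

/// Simplified unload: drop one shard if it is loaded.
pub fn unload_mem_trie(loaded: &mut BTreeSet<ShardId>, shard: ShardId) {
    loaded.remove(&shard);
}

/// Returns the loaded set in epoch T after both steps have run:
/// first the retain, then the per-shard unload before applying state parts.
pub fn before_state_sync(
    mut loaded: BTreeSet<ShardId>,
    this_epoch: &[ShardId],
    next_epoch: &[ShardId],
) -> BTreeSet<ShardId> {
    let keep = shards_to_retain(this_epoch, next_epoch);
    retain_mem_tries(&mut loaded, &keep);
    // Assumption: state sync targets shards tracked next epoch but not now.
    for shard in next_epoch.iter().copied().filter(|s| !this_epoch.contains(s)) {
        unload_mem_trie(&mut loaded, shard);
    }
    loaded
}
```

Starting from loaded shards {1, 2} in epoch T (shard 1 left over from T-1), both scenarios end with only shard 2 loaded. In the first (next epoch tracks 3), the retain step removes shard 1. In the second (next epoch tracks 1 again), the retain step removes nothing and the explicit unload is what drops the stale trie for shard 1.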

core/store/src/trie/config.rs (resolved)
@staffik staffik added this pull request to the merge queue Mar 27, 2024
Merged via the queue into master with commit d4d1b82 Mar 27, 2024
29 of 31 checks passed
@staffik staffik deleted the memtrie-integration-2 branch March 27, 2024 20:34
github-merge-queue bot pushed a commit that referenced this pull request Apr 15, 2024
**Context**
Issue: #10982
Follow up to: #10820.

Modifies StateSync state machine so that memtrie load happens
asynchronously on catchup.

**Summary**
* Split `chain.set_state_finalize()` into:
  * `create_flat_storage_for_shard()`
  * `schedule_load_memtrie()`
  * actual `set_state_finalize()`
  * This split is needed because creating flat storage and finalizing state require `chain`, which cannot be passed in a message to a separate thread.
* Add code to trigger memtrie load in a separate thread, analogously to how apply state parts is done.
* Modify shard sync stages:
  * `StateDownloadScheduling` --> `StateApplyScheduling`
    * Only the name changed, as it was confusing. This stage schedules the application of state parts.
  * `StateDownloadApplying` --> `StateApplyComplete`
    * Previously it initialized flat storage and finalized the state update after the state apply from the previous stage.
    * Now it only initializes flat storage and schedules memtrie loading.
  * `StateDownloadComplete` --> `StateApplyFinalizing`
    * Previously it only decided which stage to transition into next.
    * Now it also contains the state update finalizing logic that used to live in the previous stage.

Integration tests are to be done as a part of:
#10844.
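The renamed stages described in the follow-up commit can be read as a simple linear state machine. The sketch below is illustrative only: it models just the three stages named above, `Done` and the `StageWork` enum are hypothetical additions, and the real state machine is assumed to have further stages and to wait on asynchronous results before advancing.

```rust
/// The three renamed stages from the description, plus a hypothetical
/// terminal stage `Done` added for this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ShardSyncStage {
    StateApplyScheduling,
    StateApplyComplete,
    StateApplyFinalizing,
    Done,
}

/// Hypothetical labels for the work each stage performs.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum StageWork {
    ScheduleApplyStateParts,
    CreateFlatStorageAndScheduleMemtrieLoad,
    FinalizeStateUpdate,
    Nothing,
}

/// Returns the work done in `stage` and the stage that follows it.
pub fn step(stage: ShardSyncStage) -> (StageWork, ShardSyncStage) {
    match stage {
        ShardSyncStage::StateApplyScheduling => {
            (StageWork::ScheduleApplyStateParts, ShardSyncStage::StateApplyComplete)
        }
        ShardSyncStage::StateApplyComplete => (
            StageWork::CreateFlatStorageAndScheduleMemtrieLoad,
            ShardSyncStage::StateApplyFinalizing,
        ),
        ShardSyncStage::StateApplyFinalizing => {
            (StageWork::FinalizeStateUpdate, ShardSyncStage::Done)
        }
        ShardSyncStage::Done => (StageWork::Nothing, ShardSyncStage::Done),
    }
}

/// Walks the stages from `start` until `Done`, collecting the work performed.
pub fn run_to_completion(start: ShardSyncStage) -> Vec<StageWork> {
    let mut stage = start;
    let mut work = Vec::new();
    while stage != ShardSyncStage::Done {
        let (done_here, next) = step(stage);
        work.push(done_here);
        stage = next;
    }
    work
}
```

Keeping the memtrie scheduling in its own stage, separate from finalization, is what lets the load run off the main thread while the parts that need `chain` stay on it.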
Labels
A-stateless-validation Area: stateless validation
6 participants