Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: stateless validation jobs in test mode #10248

Merged
merged 17 commits into from
Dec 12, 2023
Merged

Conversation

Longarithm
Copy link
Member

@Longarithm Longarithm commented Nov 24, 2023

This is a next step for #9982.
Here I introduce jobs which will perform stateless validation of newly received chunk by executing txs and receipts.
Later they should be executed against state witness, but for now I just set a foundation by running these jobs against state data in DB. All passing tests verify that old and new jobs generate the same result.
The final switch will happen when stateful jobs will be replaced with stateless ones.

Details

This doesn't introduce any load on stable version. On nightly version there will be num_shards extra jobs which will check that stateless validation results are consistent with stateful execution. But as we use nightly only for testing, it shouldn't mean much overhead.

I add more fields to ShardContext structure to simplify code. Some of them are needed to break early if there is resharding, and the logic is the same for both kinds of jobs.

StorageDataSource::DbTrieOnly is introduced to read data only from trie in stateless jobs. This is annoying but still needed if there are a lot of missing chunks and flat storage head moved above the block at which previous chunk was created. When state witness will be implemented, Recorded will be used instead.

Testing

  • Failure to update current_chunk_extra on the way leads to >20 tests failing in process_blocks, with errors like assertion left == right failed: For stateless validation, chunk extras for block CMV88CBcnKoxa7eTnkG64psLoJzpW9JeAhFrZBVv6zDc and shard s3.v2 do not match...
  • If I update current_chunk_extra only once, tests::client::resharding::test_latest_protocol_missing_chunks_high_missing_prob fails which was specifically introduced for that. Actually this helped to realize that validate_chunk_with_chunk_extra is still needed but I will introduce it later.
  • Nayduck: https://nayduck.near.org/#/run/3293 - +10 nightly tests failing, will take a look https://nayduck.near.org/#/run/3300

Copy link
Contributor

@pugachAG pugachAG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall everything makes sense, added a couple of suggestions

})
.collect()
.collect::<Result<Vec<Vec<UpdateShardJob>>, Error>>()?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using functional style iterator at this point only makes it more confusing, I suggest rewriting this using plain for loop

chain/chain/src/update_shard.rs Show resolved Hide resolved
@@ -232,8 +252,8 @@ fn apply_old_chunk(

let storage_config = RuntimeStorageConfig {
state_root: *prev_chunk_extra.state_root(),
use_flat_storage: true,
source: crate::types::StorageDataSource::Db,
use_flat_storage: shard_context.storage_data_source == StorageDataSource::Db,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to keep use_flat_storage: true here and move this logic as part of apply_transactions implementation:

StorageDataSource::DbTrieOnly => {
   let mut trie = self.get_trie_for_shard(
                shard_id,
                prev_block_hash,
                storage_config.state_root,
                false,
            );
  trie.dont_charge_gas_for_trie_node_access();
  trie;
}

just to avoid duplication and changing it back after we have state witness

@Longarithm
Copy link
Member Author

I think I understood why some tests failed in Nayduck - because I didn't comment not supported cases well enough. New attempt is here: https://nayduck.near.org/#/run/3298

@dsuggs-near
Copy link
Contributor

I've turned off automatic merging altogether until I understand this problem.

@dsuggs-near dsuggs-near closed this Dec 4, 2023
@Longarithm Longarithm reopened this Dec 8, 2023
@Longarithm Longarithm marked this pull request as ready for review December 8, 2023 03:45
@Longarithm Longarithm requested a review from a team as a code owner December 8, 2023 03:45
@Longarithm Longarithm requested a review from akhi3030 December 8, 2023 03:45
@Longarithm Longarithm changed the title draft: stateless validation jobs in test mode feat: stateless validation jobs in test mode Dec 8, 2023
@Longarithm Longarithm requested a review from pugachAG December 8, 2023 03:48
@Longarithm
Copy link
Member Author

New logic also breaks hidden assumption that order of jobs must correspond to range of shard ids, which is not desired anyway, so I remove this assumption by returning shard id with a job explicitly.

@akhi3030
Copy link
Collaborator

I am not planning on reviewing this PR so removing my review request. Let me know if my input is actually needed.

@akhi3030 akhi3030 removed their request for review December 11, 2023 10:01
Copy link
Contributor

@pugachAG pugachAG left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!


/// Result for a shard update for a single block.
#[derive(Debug)]
pub enum ShardBlockUpdateResult {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe ShardChunkUpdateResult would be more precise

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose ShardBlockUpdateResult explicitly because chunk is not necessarily present. Especially if there is a state split. So I'll bet on that for now

@@ -106,15 +131,25 @@ pub(crate) fn process_shard_update(
epoch_manager: &dyn EpochManagerAdapter,
shard_update_reason: ShardUpdateReason,
shard_context: ShardContext,
state_patch: SandboxStatePatch,
) -> Result<ApplyChunkResult, Error> {
storage_context: StorageContext,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would move storage_context to NewChunkData and OldChunkData since it is only needed there

@@ -115,7 +115,7 @@ impl FlatStorageManager {
?new_flat_head,
?err,
?shard_uid,
block_hash = ?block.header().hash(),
// block_hash = ?block.header().hash(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops

let state_patch = state_patch.take();

// Insert job into list of all jobs to run, if job is valid.
let mut insert_job = |job: Result<Option<UpdateShardJob>, Error>| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this looks weird 🙃 can we just accumulate job_results: Vec<Result<Option<UpdateShardJob>, Error>> and then extract jobs and invalid_chunks after this loop?

Copy link

codecov bot commented Dec 11, 2023

Codecov Report

Attention: 53 lines in your changes are missing coverage. Please review.

Comparison is base (5ce0d8a) 71.83% compared to head (c694b0e) 71.83%.

Files Patch % Lines
chain/chain/src/chain.rs 85.57% 19 Missing and 27 partials ⚠️
core/primitives/src/challenge.rs 0.00% 3 Missing ⚠️
chain/chain/src/update_shard.rs 95.83% 1 Missing ⚠️
chain/client/src/client_actor.rs 0.00% 1 Missing ⚠️
core/store/src/trie/mod.rs 75.00% 0 Missing and 1 partial ⚠️
nearcore/src/runtime/mod.rs 87.50% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master   #10248    +/-   ##
========================================
  Coverage   71.83%   71.83%            
========================================
  Files         712      712            
  Lines      143171   143442   +271     
  Branches   143171   143442   +271     
========================================
+ Hits       102843   103043   +200     
- Misses      35627    35669    +42     
- Partials     4701     4730    +29     
Flag Coverage Δ
backward-compatibility 0.08% <0.00%> (-0.01%) ⬇️
db-migration 0.08% <0.00%> (-0.01%) ⬇️
genesis-check 1.26% <0.00%> (-0.01%) ⬇️
integration-tests 36.24% <85.39%> (+0.15%) ⬆️
linux 71.59% <33.33%> (-0.12%) ⬇️
linux-nightly 71.40% <85.39%> (-0.01%) ⬇️
macos 55.10% <31.95%> (-0.66%) ⬇️
pytests 1.49% <0.00%> (-0.01%) ⬇️
sanity-checks 1.28% <0.00%> (-0.01%) ⬇️
unittests 68.21% <82.64%> (+<0.01%) ⬆️
upgradability 0.13% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Longarithm Longarithm enabled auto-merge December 12, 2023 00:18
@Longarithm Longarithm added this pull request to the merge queue Dec 12, 2023
Merged via the queue into near:master with commit 5595df6 Dec 12, 2023
16 checks passed
@Longarithm Longarithm deleted the new-jobs branch December 12, 2023 01:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants