
fix(state-sync): Make partial chunks message deterministic #9427

Merged: 11 commits into master from nikurt-order-partial-chunks, Aug 17, 2023

Conversation

@nikurt nikurt (Contributor) commented Aug 14, 2023

The issue only happens if a node tracks a subset of shards.
The order of shards is arbitrary because:

  • Shard ids are stored in a HashSet
  • In one case, the node first adds the shards that are cached, and only later the shards that are only available on disk (see the sketch below).
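
Purely as an illustration of the approach (a Python sketch, not the actual Rust chain code; every name below is made up), sorting the shard ids before building the message makes the output independent of where each id came from and of the set's iteration order:

def build_partial_chunk_response(cached_shard_ids, disk_shard_ids):
    # Illustrative only: combine shard ids from two sources and emit
    # their parts in a deterministic order, regardless of which source
    # each id came from or how the underlying set iterates.
    shard_ids = set(cached_shard_ids) | set(disk_shard_ids)  # unordered
    response = []
    for shard_id in sorted(shard_ids):  # deterministic order
        response.append({'shard_id': shard_id})
    return response

# The same inputs always produce the same output, even if the ids
# arrive in a different order or from a different source.
assert build_partial_chunk_response([3, 1], [2]) == \
       build_partial_chunk_response([2, 3], [1])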

@nikurt nikurt requested a review from a team as a code owner August 14, 2023 12:07
@nikurt nikurt requested a review from jakmeier August 14, 2023 12:07
@nikurt nikurt marked this pull request as draft August 14, 2023 12:08
@nikurt nikurt removed the request for review from jakmeier August 14, 2023 12:08
@nikurt nikurt marked this pull request as ready for review August 14, 2023 12:46
@nikurt nikurt requested a review from jakmeier August 14, 2023 12:50
@wacban wacban (Contributor) left a comment:

Rather than sorting here and there, can you use a sorted data structure?

Comment on lines +613 to +614
(self.1.from_shard_id, self.1.to_shard_id)
.cmp(&(other.1.from_shard_id, other.1.to_shard_id))
Contributor:

This doesn't seem sufficiently ordered: there can be multiple receipts with the same from and to shard ids, so the result could still be non-deterministic. How about also comparing the hashes?

Contributor Author:

The receipts are already grouped by the outgoing and incoming shard ids:

pub struct ReceiptProof(pub Vec<Receipt>, pub ShardProof);

Contributor Author:

@wacban or is your suggestion to order the receipts inside ReceiptProof? 🤔

Contributor:

I see now what's going on. You're claiming that there is at most one ReceiptProof per (from, to) pair, and as such the comparison is actually strict. As long as that assumption is correct, I'm fine with this implementation.

I was not suggesting sorting the receipts, but that doesn't sound too bad just to be extra cautious. I lack the context to say whether it is actually needed, so I'll leave it up to your best judgment.
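
To make the "extra cautious" variant concrete, here is a small self-contained sketch (Python, with simplified stand-in types rather than the real ReceiptProof/ShardProof definitions) of ordering receipt proofs by (from_shard_id, to_shard_id) and breaking any hypothetical tie with a digest of the receipts:

import hashlib
from dataclasses import dataclass

@dataclass
class ShardProofLike:    # stand-in for ShardProof
    from_shard_id: int
    to_shard_id: int

@dataclass
class ReceiptProofLike:  # stand-in for ReceiptProof(Vec<Receipt>, ShardProof)
    receipts: list
    shard_proof: ShardProofLike

def receipts_digest(proof):
    # Illustrative tie-breaker: hash a canonical encoding of the receipts.
    return hashlib.sha256(repr(sorted(proof.receipts)).encode()).hexdigest()

def sort_key(proof):
    return (proof.shard_proof.from_shard_id,
            proof.shard_proof.to_shard_id,
            receipts_digest(proof))

proofs = [
    ReceiptProofLike(['r2'], ShardProofLike(1, 0)),
    ReceiptProofLike(['r1'], ShardProofLike(0, 1)),
]
proofs.sort(key=sort_key)
assert [p.shard_proof.from_shard_id for p in proofs] == [0, 1]

If the assumption of at most one ReceiptProof per (from, to) pair holds, the digest never decides the order; it only guards against that assumption being violated.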

@nikurt nikurt requested a review from wacban August 14, 2023 13:12
pytest/tests/sanity/state_sync_then_catchup.py (outdated comment thread, resolved)

state_parts_dir = str(pathlib.Path(tempfile.gettempdir()) / 'state_parts')

config0 = {
Contributor:

nit: refactor config generation into a helper function (the two configs are mostly identical)

@nikurt nikurt (Contributor Author) commented Aug 15, 2023:

More than half of the fields are different or have different values:

  • state_sync.dump vs state_sync.sync
  • store.state_snapshot_enabled
  • tracked_shards
  • tracked_shard_schedule
  • state_sync_enabled
  • consensus.state_sync_timeout
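
For reference, a common pattern for partially overlapping node configs is a small helper that applies per-node overrides to shared defaults; whether that pays off with this many differing fields is exactly the judgment call above. The helper and all values below are hypothetical; only the field names are taken from this thread.

def make_config(overrides):
    # Hypothetical helper: start from shared defaults, then apply
    # per-node overrides. Values here are placeholders, not the
    # settings used by the actual test.
    config = {
        'state_sync_enabled': True,
        'tracked_shards': [0, 1, 2, 3],
    }
    config.update(overrides)
    return config

config0 = make_config({'store.state_snapshot_enabled': True})
config1 = make_config({
    'tracked_shards': [],
    'consensus.state_sync_timeout': 2,
})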

@@ -0,0 +1,200 @@
#!/usr/bin/env python3
Contributor:

nit: It's good practice to wrap the code in a unittest and a __main__ like so:
https://github.com/near/nearcore/tree/master/pytest#creating-new-tests

Contributor Author:

Added main(), thanks!
Nayduck doesn't expect these tests to be unittests.

Contributor:

I think as long as you run unittest.main() from within main it should work fine. Did you try it and did it not work? You may also need to put the test logic in a method whose name begins with test_. Not a biggie though, don't worry about it too much.
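
A minimal skeleton of the suggested pattern, assuming only the standard library unittest module; the class and method names are illustrative and the real test logic would replace the placeholder body:

#!/usr/bin/env python3
import unittest


class StateSyncThenCatchup(unittest.TestCase):  # illustrative name

    def test_state_sync_then_catchup(self):
        # The existing test logic would move here; unittest only picks up
        # methods whose names start with test_.
        pass


if __name__ == '__main__':
    unittest.main()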

if block_height == 0:
    return 0
if block_height <= EPOCH_LENGTH:
    # According to the protocol specifications, there are two epochs with height 1.
Contributor:

TIL & WTF

Comment on lines 139 to 162
if (len(keys) > 100 and random.random() < 0.2) or len(keys) > 1000:
    key = keys[random.randint(0, len(keys) - 1)]
    call_function('read', key, nonce, boot_node.signer_key,
                  last_block_hash)
    call_function('read', key, nonce, node1.signer_key, last_block_hash)
elif random.random() < 0.5:
    if random.random() < 0.3:
        key_from, account_to = boot_node.signer_key, node1.signer_key.account_id
    elif random.random() < 0.3:
        key_from, account_to = boot_node.signer_key, "near"
    elif random.random() < 0.5:
        key_from, account_to = node1.signer_key, boot_node.signer_key.account_id
    else:
        key_from, account_to = node1.signer_key, "near"
    payment_tx = transaction.sign_payment_tx(key_from, account_to, 1,
                                             nonce, last_block_hash)
    boot_node.send_tx(payment_tx).get('result')
else:
    key = random_u64()
    keys.append(key)
    call_function('write', key, nonce, boot_node.signer_key,
                  last_block_hash)
    call_function('write', key, nonce, node1.signer_key,
                  last_block_hash)
Contributor:

Can you add a small comment on what exactly you are doing here?
Also, nit: early returns may flatten this code a bit.

Contributor Author:

Restructured, simplified, and commented.

node1_height = node1.get_latest_block().height
logger.info(f'started node1@{node1_height}')

nonce, keys = random_workload_until(int(EPOCH_LENGTH * 3.7), nonce, keys, node1)
Contributor:

Are there any checks that you also need to do here or does the workload do it?

Contributor Author:

No checks needed. Without the fix the node simply crashes and doesn't reach the target height.

Shreyan Gupta and others added 2 commits August 15, 2023 12:21
@nikurt nikurt requested a review from wacban August 15, 2023 10:23
@wacban wacban (Contributor) left a comment:

Approving as it looks good, but you might have mixed in some extra code from another PR; please review carefully.

chain/chain/src/resharding.rs (outdated comment thread, resolved)
nearcore/src/runtime/mod.rs (outdated comment thread, resolved)
@near-bulldozer near-bulldozer bot merged commit 0759d2f into master Aug 17, 2023
2 checks passed
@near-bulldozer near-bulldozer bot deleted the nikurt-order-partial-chunks branch August 17, 2023 11:20
nikurt added a commit to nikurt/nearcore that referenced this pull request Aug 24, 2023
nikurt added commits that referenced this pull request Aug 28 – Aug 30, 2023