
fix(state-sync): Make partial chunks message deterministic #9427

Merged: 11 commits into master from nikurt-order-partial-chunks, Aug 17, 2023

Conversation

@nikurt nikurt (Contributor) commented Aug 14, 2023

The issue only happens if a node tracks a subset of shards.
The order of shards is arbitrary because:

  • Shard ids are stored in a HashSet
  • In one case, the node first adds the shards that are cached, and only later the shards that are only available on disk (see the sketch below).
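
Purely as an illustration of the approach (a Python sketch, not the actual Rust chain code; every name below is made up), sorting the shard ids before building the message makes the output independent of where each id came from and of the set's iteration order:

def build_partial_chunk_response(cached_shard_ids, disk_shard_ids):
    # Illustrative only: combine shard ids from two sources and emit
    # their parts in a deterministic order, regardless of which source
    # each id came from or how the underlying set iterates.
    shard_ids = set(cached_shard_ids) | set(disk_shard_ids)  # unordered
    response = []
    for shard_id in sorted(shard_ids):  # deterministic order
        response.append({'shard_id': shard_id})
    return response

# The same inputs always produce the same output, even if the ids
# arrive in a different order or from a different source.
assert build_partial_chunk_response([3, 1], [2]) == \
       build_partial_chunk_response([2, 3], [1])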

@nikurt nikurt requested a review from a team as a code owner August 14, 2023 12:07
@nikurt nikurt requested a review from jakmeier August 14, 2023 12:07
@nikurt nikurt marked this pull request as draft August 14, 2023 12:08
@nikurt nikurt removed the request for review from jakmeier August 14, 2023 12:08
@nikurt nikurt marked this pull request as ready for review August 14, 2023 12:46
@nikurt nikurt requested a review from jakmeier August 14, 2023 12:50
@wacban wacban (Contributor) left a comment:

Rather than sorting here and there, can you use a sorted data structure?

Comment on lines +613 to +614
(self.1.from_shard_id, self.1.to_shard_id)
.cmp(&(other.1.from_shard_id, other.1.to_shard_id))
Contributor:

This doesn't seem sufficiently ordered: there can be multiple receipts with the same from and to shard ids, so the result could still be non-deterministic. How about also comparing the hashes?

Contributor Author:

The receipts are already grouped by the outgoing and incoming shard ids:

pub struct ReceiptProof(pub Vec<Receipt>, pub ShardProof);

Contributor Author:

@wacban or is your suggestion to order the receipts inside ReceiptProof? 🤔

Contributor:

I see now what's going on. You're claiming that there is at most one ReceiptProof per (from, to) pair, and as such the comparison is actually strict. As long as that assumption is correct, I'm fine with this implementation.

I was not suggesting sorting the receipts, but that doesn't sound too bad just to be extra cautious. I lack the context to say whether it is actually needed, so I'll leave it up to your best judgment.
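
To make the "extra cautious" variant concrete, here is a small self-contained sketch (Python, with simplified stand-in types rather than the real ReceiptProof/ShardProof definitions) of ordering receipt proofs by (from_shard_id, to_shard_id) and breaking any hypothetical tie with a digest of the receipts:

import hashlib
from dataclasses import dataclass

@dataclass
class ShardProofLike:    # stand-in for ShardProof
    from_shard_id: int
    to_shard_id: int

@dataclass
class ReceiptProofLike:  # stand-in for ReceiptProof(Vec<Receipt>, ShardProof)
    receipts: list
    shard_proof: ShardProofLike

def receipts_digest(proof):
    # Illustrative tie-breaker: hash a canonical encoding of the receipts.
    return hashlib.sha256(repr(sorted(proof.receipts)).encode()).hexdigest()

def sort_key(proof):
    return (proof.shard_proof.from_shard_id,
            proof.shard_proof.to_shard_id,
            receipts_digest(proof))

proofs = [
    ReceiptProofLike(['r2'], ShardProofLike(1, 0)),
    ReceiptProofLike(['r1'], ShardProofLike(0, 1)),
]
proofs.sort(key=sort_key)
assert [p.shard_proof.from_shard_id for p in proofs] == [0, 1]

If the assumption of at most one ReceiptProof per (from, to) pair holds, the digest never decides the order; it only guards against that assumption being violated.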

@nikurt nikurt requested a review from wacban August 14, 2023 13:12
pytest/tests/sanity/state_sync_then_catchup.py (outdated comment thread, resolved)

state_parts_dir = str(pathlib.Path(tempfile.gettempdir()) / 'state_parts')

config0 = {
Contributor:

nit: refactor config generation into a helper function (the two configs are mostly identical)

@nikurt nikurt (Contributor Author) commented Aug 15, 2023:

More than half of the fields are different or have different values:

  • state_sync.dump vs state_sync.sync
  • store.state_snapshot_enabled
  • tracked_shards
  • tracked_shard_schedule
  • state_sync_enabled
  • consensus.state_sync_timeout
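
For reference, a common pattern for partially overlapping node configs is a small helper that applies per-node overrides to shared defaults; whether that pays off with this many differing fields is exactly the judgment call above. The helper and all values below are hypothetical; only the field names are taken from this thread.

def make_config(overrides):
    # Hypothetical helper: start from shared defaults, then apply
    # per-node overrides. Values here are placeholders, not the
    # settings used by the actual test.
    config = {
        'state_sync_enabled': True,
        'tracked_shards': [0, 1, 2, 3],
    }
    config.update(overrides)
    return config

config0 = make_config({'store.state_snapshot_enabled': True})
config1 = make_config({
    'tracked_shards': [],
    'consensus.state_sync_timeout': 2,
})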

@@ -0,0 +1,200 @@
#!/usr/bin/env python3
Contributor:

nit: It's good practice to wrap the code in a unittest and a __main__ like so:
https://github.com/near/nearcore/tree/master/pytest#creating-new-tests

Contributor Author:

Added main(), thanks!
Nayduck doesn't expect these tests to be unittests.

Contributor:

I think as long as you run unittest.main() from within main it should work fine. Did you try it and did it not work? You may also need to put the test logic in a method whose name begins with test_. Not a biggie though, don't worry about it too much.
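
A minimal skeleton of the suggested pattern, assuming only the standard library unittest module; the class and method names are illustrative and the real test logic would replace the placeholder body:

#!/usr/bin/env python3
import unittest


class StateSyncThenCatchup(unittest.TestCase):  # illustrative name

    def test_state_sync_then_catchup(self):
        # The existing test logic would move here; unittest only picks up
        # methods whose names start with test_.
        pass


if __name__ == '__main__':
    unittest.main()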

if block_height == 0:
    return 0
if block_height <= EPOCH_LENGTH:
    # According to the protocol specifications, there are two epochs with height 1.
Contributor:

TIL & WTF

Comment on lines 139 to 162
if (len(keys) > 100 and random.random() < 0.2) or len(keys) > 1000:
    key = keys[random.randint(0, len(keys) - 1)]
    call_function('read', key, nonce, boot_node.signer_key,
                  last_block_hash)
    call_function('read', key, nonce, node1.signer_key, last_block_hash)
elif random.random() < 0.5:
    if random.random() < 0.3:
        key_from, account_to = boot_node.signer_key, node1.signer_key.account_id
    elif random.random() < 0.3:
        key_from, account_to = boot_node.signer_key, "near"
    elif random.random() < 0.5:
        key_from, account_to = node1.signer_key, boot_node.signer_key.account_id
    else:
        key_from, account_to = node1.signer_key, "near"
    payment_tx = transaction.sign_payment_tx(key_from, account_to, 1,
                                             nonce, last_block_hash)
    boot_node.send_tx(payment_tx).get('result')
else:
    key = random_u64()
    keys.append(key)
    call_function('write', key, nonce, boot_node.signer_key,
                  last_block_hash)
    call_function('write', key, nonce, node1.signer_key,
                  last_block_hash)
Contributor:

Can you add a small comment on what exactly you are doing here?
Also, nit: early returns may flatten this code a bit.

Contributor Author:

Restructured, simplified, and commented.

node1_height = node1.get_latest_block().height
logger.info(f'started node1@{node1_height}')

nonce, keys = random_workload_until(int(EPOCH_LENGTH * 3.7), nonce, keys, node1)
Contributor:

Are there any checks that you also need to do here or does the workload do it?

Contributor Author:

No checks needed. Without the fix the node simply crashes and doesn't reach the target height.

Shreyan Gupta and others added 2 commits August 15, 2023 12:21
@nikurt nikurt requested a review from wacban August 15, 2023 10:23
@wacban wacban (Contributor) left a comment:

Approving as it looks good, but you might have mixed in some extra code from another PR; please review carefully.

chain/chain/src/resharding.rs (outdated comment thread, resolved)
nearcore/src/runtime/mod.rs (outdated comment thread, resolved)
@near-bulldozer near-bulldozer bot merged commit 0759d2f into master Aug 17, 2023
2 checks passed
@near-bulldozer near-bulldozer bot deleted the nikurt-order-partial-chunks branch August 17, 2023 11:20
nikurt added a commit to nikurt/nearcore that referenced this pull request Aug 24, 2023
nikurt added commits that referenced this pull request Aug 28 – Aug 30, 2023