fix(state-sync): Make partial chunks message deterministic #9427
Conversation
Rather than sorting here and there, can you use a sorted data structure?
(self.1.from_shard_id, self.1.to_shard_id)
    .cmp(&(other.1.from_shard_id, other.1.to_shard_id))
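For illustration, a minimal sketch of the sorted-data-structure alternative, using simplified placeholder types rather than the real nearcore definitions: keeping the proofs in a BTreeMap keyed by (from_shard_id, to_shard_id) makes iteration order deterministic by construction, with no explicit sort at the read side.

use std::collections::BTreeMap;

// Simplified placeholders; the real types live in nearcore.
type ShardPair = (u64, u64); // (from_shard_id, to_shard_id)
type ProofBytes = Vec<u8>;

// A BTreeMap iterates in key order, so callers always see the
// proofs in the same deterministic order.
fn collect_proofs(proofs: Vec<(ShardPair, ProofBytes)>) -> BTreeMap<ShardPair, ProofBytes> {
    proofs.into_iter().collect()
}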
This doesn't seem sufficiently ordered: there can be multiple receipts with the same from and to shard ids, so the result can still be non-deterministic. How about you also compare the hashes?
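For what it's worth, a sketch of what a fully strict ordering could look like, using hypothetical, simplified stand-ins for the real types (receipts reduced to their 32-byte hashes): compare by the shard pair first, then fall back to the receipt hashes.

use std::cmp::Ordering;

// Hypothetical, simplified stand-ins for the real nearcore types.
#[derive(PartialEq, Eq)]
struct ShardProof { from_shard_id: u64, to_shard_id: u64 }
#[derive(PartialEq, Eq)]
struct ReceiptProof(Vec<[u8; 32]>, ShardProof); // receipts reduced to their hashes

impl Ord for ReceiptProof {
    fn cmp(&self, other: &Self) -> Ordering {
        // Compare by (from, to) first; fall back to the receipt hashes so
        // two proofs for the same shard pair still get a stable total order.
        (self.1.from_shard_id, self.1.to_shard_id, &self.0)
            .cmp(&(other.1.from_shard_id, other.1.to_shard_id, &other.0))
    }
}

impl PartialOrd for ReceiptProof {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}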
The receipts are grouped by the outgoing and incoming shard ids:
pub struct ReceiptProof(pub Vec<Receipt>, pub ShardProof);
@wacban or is your suggestion to order the receipts inside ReceiptProof? 🤔
I see now what's going on. You're claiming that there is at most one ReceiptProof for every pair (from, to), and as such the comparison is actually strict. As long as that assumption holds, I'm fine with this implementation.
I was not suggesting sorting the receipts, but that doesn't sound too bad just to be extra cautious. I lack the context to say whether it is actually needed, so I'll leave it up to your best judgment.
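For the extra-cautious variant mentioned here, a minimal sketch (placeholder types again, a receipt reduced to its 32-byte id): sort the receipts inside each group before sorting the groups themselves, so the result stays canonical even if the one-proof-per-pair assumption ever breaks.

// Each group is a (from_shard_id, to_shard_id, receipt ids) triple.
fn canonicalize(groups: &mut Vec<(u64, u64, Vec<[u8; 32]>)>) {
    for (_, _, receipt_ids) in groups.iter_mut() {
        receipt_ids.sort_unstable(); // receipts by id within a group
    }
    groups.sort_unstable(); // groups by (from, to), then by ids
}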
state_parts_dir = str(pathlib.Path(tempfile.gettempdir()) / 'state_parts')
config0 = {
nit: refactor config generation into a helper function (the two configs are mostly identical)
More than half of the fields are different or have different values:
- state_sync.dump vs state_sync.sync
- store.state_snapshot_enabled
- tracked_shards
- tracked_shard_schedule
- state_sync_enabled
- consensus.state_sync_timeout
@@ -0,0 +1,200 @@
#!/usr/bin/env python3
nit: It's good practice to wrap the code in a unittest and a __main__ guard, like so:
https://github.com/near/nearcore/tree/master/pytest#creating-new-tests
Added main(), thanks!
Nayduck doesn't expect these tests to be unittests.
I think as long as you run unittest.main() from within main it should work fine. Did you try it, and did it not work? Maybe you also need to put the test logic in a method whose name begins with test_. Not a biggie though, don't worry about it too much.
if block_height == 0:
    return 0
if block_height <= EPOCH_LENGTH:
    # According to the protocol specifications, there are two epochs with height 1.
TIL & WTF
if (len(keys) > 100 and random.random() < 0.2) or len(keys) > 1000:
    key = keys[random.randint(0, len(keys) - 1)]
    call_function('read', key, nonce, boot_node.signer_key,
                  last_block_hash)
    call_function('read', key, nonce, node1.signer_key, last_block_hash)
elif random.random() < 0.5:
    if random.random() < 0.3:
        key_from, account_to = boot_node.signer_key, node1.signer_key.account_id
    elif random.random() < 0.3:
        key_from, account_to = boot_node.signer_key, "near"
    elif random.random() < 0.5:
        key_from, account_to = node1.signer_key, boot_node.signer_key.account_id
    else:
        key_from, account_to = node1.signer_key, "near"
    payment_tx = transaction.sign_payment_tx(key_from, account_to, 1,
                                             nonce, last_block_hash)
    boot_node.send_tx(payment_tx).get('result')
else:
    key = random_u64()
    keys.append(key)
    call_function('write', key, nonce, boot_node.signer_key,
                  last_block_hash)
    call_function('write', key, nonce, node1.signer_key,
                  last_block_hash)
Can you add a small comment on what exactly you are doing here?
Also nit: early returns may flatten this code a bit.
Restructured, simplified, and commented.
node1_height = node1.get_latest_block().height
logger.info(f'started node1@{node1_height}')
nonce, keys = random_workload_until(int(EPOCH_LENGTH * 3.7), nonce, keys, node1)
Are there any checks that you also need to do here or does the workload do it?
No checks needed. Without the fix the node simply crashes and doesn't reach the target height.
Approving as it looks good, but you might have mixed in some extra code from another PR; please review carefully.
The issue only happens if a node tracks a subset of shards. The order of shards is arbitrary because:
* Shard ids are in a HashSet
* In one case the node first adds the shards that are cached, and later the shards that are only available on disk
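To illustrate the root cause and the shape of the fix, a minimal sketch (not the actual nearcore code): HashSet iteration order is arbitrary, so any message built by iterating the set directly is non-deterministic; collecting into a Vec and sorting restores a canonical order.

use std::collections::HashSet;

fn deterministic_shard_order(tracked: &HashSet<u64>) -> Vec<u64> {
    // HashSet::iter yields shard ids in arbitrary order.
    let mut shard_ids: Vec<u64> = tracked.iter().copied().collect();
    // Sorting makes the resulting message identical across nodes and runs.
    shard_ids.sort_unstable();
    shard_ids
}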