-
Notifications
You must be signed in to change notification settings - Fork 664
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
(fix): Making stress.py with node_restart mode pass, and fixing #2916 #3036
Conversation
Inviting @mfornet to review the changes to the network tests infra. |
pytest/tests/sanity/repro_2916.py
Outdated
await conn0.send(peer_message) | ||
|
||
received_response = False | ||
# make several attempts, in case the node sends us a block or some other message before sending the response |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this robust? should we wait until a PartialEncodedChunkResponse is sent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal of this loop is to see whether it is ever sent. I can have a wall clock timeout instead of the number of messages, but generally with the current nearcore it is very robust (the PECR message is either first or second, if a block was produced in the meantime).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general changes to python infrastructure looks good, however wrote some suggestions about how to improve robustness for test repro_2916.py.
pytest/lib/peer.py
Outdated
async def recv(self, expected=None): | ||
while True: | ||
response_raw = await self.recv_raw() | ||
if response_raw is None: | ||
return None | ||
|
||
response = BinarySerializer(schema).deserialize( | ||
response_raw, PeerMessage) | ||
|
||
if expected is None or response.enum == expected: | ||
return response |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change expected
(enum name) to expected_fn
a function that determines if this is the message we are waiting for.
async def recv(self, expected=None): | |
while True: | |
response_raw = await self.recv_raw() | |
if response_raw is None: | |
return None | |
response = BinarySerializer(schema).deserialize( | |
response_raw, PeerMessage) | |
if expected is None or response.enum == expected: | |
return response | |
async def recv(self, expected_fn=None): | |
while True: | |
response_raw = await self.recv_raw() | |
if response_raw is None: | |
return None | |
response = BinarySerializer(schema).deserialize( | |
response_raw, PeerMessage) | |
if expected_fn is None or expected_fn(response): | |
return response |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't want to break existing users, so made it both compare and try to call.
pytest/tests/sanity/repro_2916.py
Outdated
for i in range(5): | ||
response = await conn0.recv() | ||
if response is None: # None means timeout | ||
break | ||
elif response.enum == 'Routed' and response.Routed.body.enum == 'PartialEncodedChunkResponse': | ||
print("Received response for shard %s" % shard_ord) | ||
received_response = True | ||
break | ||
else: | ||
print(response.enum) | ||
else: | ||
print("Didn't receive response for shard %s" % shard_ord) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as mentioned by @bowenwang1996 this might not be very reliable. Instead we can pass a filter to recv
such that it waits until the expected message arrives. As per suggestion above, we can use expect_fn
like the following snippet:
for i in range(5): | |
response = await conn0.recv() | |
if response is None: # None means timeout | |
break | |
elif response.enum == 'Routed' and response.Routed.body.enum == 'PartialEncodedChunkResponse': | |
print("Received response for shard %s" % shard_ord) | |
received_response = True | |
break | |
else: | |
print(response.enum) | |
else: | |
print("Didn't receive response for shard %s" % shard_ord) | |
try: | |
expected_fn = lambda response : response.enum == 'Routed' and response.Routed.body.enum == 'PartialEncodedChunkResponse' | |
_response = await asyncio.wait_for(conn0.recv(expected_fn), 5) | |
recieved_response = True | |
break | |
except concurrent.futures.TimeoutError: | |
print("Didn't receive response for shard %s" % shard_ord) |
This way we only have a timeout in one place, and to wait exactly for the message we are expecting, this will work even if we receive in between several not relevant messages, as long as the expected message arrives before the timeout.
After this change stress.py passes consistently, and is reintroduced to nightly. Nearcore fixes: [v] We had a bug in the syncing logic (with a low chance of being triggered in the wild): if a block is produced, and between 1/3 and 2/3 of block producers received it, and the rest have not, the system stalls, because no 2/3 of block producers have the same head, but also nobody is two blocks behind the highest peer to start syncing. Fixing it by forcing sync if we've been 1 block behind for too long. stress.py was reproducing this issue in every run Test fixes [v] Fixing a scenario in which a failure to send a transaction to all validators resulted in recording an incorrect tx hash alongside the tx. Later when checking balances using the incorrect hash resulted in getting incorrect success value, and thus applying incorrect corrections to the expected balances; [ ] Removing the old infrastructure for network interference, which relied on certain node setup, and instead using the new network proxy infrastructure. [ ] Adding a new argument that controls the percentage of dropped messages between nodes. [v] Changing the order of magnitude of staking transactions, so that the validator set actually changes. [ ] Altering `node_restart` process to ocassionally wipe out the data folder of the node, so that we stress state sync (and syncing in general) more Other issues discovered while fixing stress.py: - #2906
Co-authored-by: Bowen Wang <[email protected]>
3b63b2e
to
b17ccc3
Compare
Codecov Report
@@ Coverage Diff @@
## master #3036 +/- ##
=======================================
Coverage 87.64% 87.64%
=======================================
Files 212 212
Lines 41861 41861
=======================================
Hits 36691 36691
Misses 5170 5170 Continue to review full report at Codecov.
|
…#3036) After this change stress.py node_restart passes relatively consistently, and is reintroduced to nightly. Nearcore fixes: - We had a bug in the syncing logic (with a low chance of being triggered in the wild): if a block is produced, and between 1/3 and 2/3 of block producers received it, and the rest have not, the system stalls, because no 2/3 of block producers have the same head, but also nobody is two blocks behind the highest peer to start syncing. Fixing it by forcing sync if we've been 1 block behind for too long. stress.py was reproducing this issue in every run - (#2916) we had an issue that if a node produced a chunk, and then crashed, on recovery it was not able to serve it because it didn't have all the parts and receipts stored in the storage from which we recover cache entries in the shards manager. Fixing it by always storing all the parts and receipts (redundantly) for chunks in the shards we care about. Test fixes [v] Fixing a scenario in which a failure to send a transaction to all validators resulted in recording an incorrect tx hash alongside the tx. Later when checking balances using the incorrect hash resulted in getting incorrect success value, and thus applying incorrect corrections to the expected balances; [v] Changing the order of magnitude of staking transactions, so that the validator set actually changes. Other issues discovered while fixing stress.py: - #2906
Since #3036, nodes will persist all the partial encoded chunk parts for the shard they track. However, on mainnet there are quite a few blocks produced before the PR landed and therefore we need to do a migration for archival nodes that contain all the historical data. Test plan ---------- Tested the migration on a mainnet archival node.
Since #3036, nodes will persist all the partial encoded chunk parts for the shard they track. However, on mainnet there are quite a few blocks produced before the PR landed and therefore we need to do a migration for archival nodes that contain all the historical data. Test plan ---------- Tested the migration on a mainnet archival node.
After this change stress.py node_restart passes relatively consistently, and is reintroduced to nightly.
Nearcore fixes:
We had a bug in the syncing logic (with a low chance of being
triggered in the wild): if a block is produced, and between 1/3 and 2/3
of block producers received it, and the rest have not, the system
stalls, because no 2/3 of block producers have the same head, but also
nobody is two blocks behind the highest peer to start syncing. Fixing it
by forcing sync if we've been 1 block behind for too long.
stress.py was reproducing this issue in every run
(After a restart, a node doesn't respond to chunk requests that have non-empty list of requested receipts #2916) we had an issue that if a node produced a chunk, and then
crashed, on recovery it was not able to serve it because it didn't have
all the parts and receipts stored in the storage from which we recover
cache entries in the shards manager. Fixing it by always storing all the
parts and receipts (redundantly) for chunks in the shards we care about.
Test fixes
[v] Fixing a scenario in which a failure to send a transaction to all
validators resulted in recording an incorrect tx hash alongside the tx.
Later when checking balances using the incorrect hash resulted in
getting incorrect success value, and thus applying incorrect corrections
to the expected balances;
[v] Changing the order of magnitude of staking transactions, so that the
validator set actually changes.
Other issues discovered while fixing stress.py: