invalid_execution_proof_should_not_work fails under high CPU load #1352

Closed
nazar-pc opened this issue Apr 3, 2023 · 4 comments · Fixed by #1357
nazar-pc (Member) commented Apr 3, 2023

It can be reproduced with busy CPUs. For example, run openssl speed -multi 24 rsa, and then run the test individually. You may need multiple runs, as it sometimes passes even with 100% CPU usage on my machine.

Originally posted by @liuchengxu in #1347 (comment)

@nazar-pc nazar-pc added bug Something isn't working execution Subspace execution labels Apr 3, 2023
@nazar-pc nazar-pc added this to the Gemini 3 milestone Apr 3, 2023
@NingLin-P NingLin-P self-assigned this Apr 3, 2023
NingLin-P (Member) commented Apr 6, 2023

The issue is probably due to unexpected behavior of async-channel (smol-rs/async-channel#23): even when all the receivers are dropped and the unconsumed messages can never be accessed again, those messages are not dropped (perhaps not until all the senders are also dropped). That upstream issue actually describes a pattern similar to our slot/block import notification usage.
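
A minimal sketch (not Subspace code, assuming the async-channel crate as a dependency) of the behavior reported in smol-rs/async-channel#23:

```rust
// Minimal sketch of the behavior reported in smol-rs/async-channel#23:
// a message that is already buffered reportedly is not dropped when the
// last receiver is dropped, only when the channel itself goes away.
use std::sync::Arc;

fn main() {
    let payload = Arc::new(());
    let (tx, rx) = async_channel::bounded(1);

    // Buffer one message inside the channel.
    tx.try_send(Arc::clone(&payload)).unwrap();
    println!("after send:     {}", Arc::strong_count(&payload)); // 2

    // Dropping the only receiver closes the channel, but the buffered
    // message (and anything it owns, e.g. an acknowledgement sender)
    // reportedly stays alive inside the channel on affected versions.
    drop(rx);
    println!("after drop(rx): {}", Arc::strong_count(&payload)); // still 2

    // Only when the sender is also dropped is the buffered message freed.
    drop(tx);
    println!("after drop(tx): {}", Arc::strong_count(&payload)); // 1
}
```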

Root cause of failing tests

In both the test invalid_execution_proof_should_not_work and test_executor_full_node_catching_up, we have set up one system domain authority node alice, and one system domain full node bob.

Because bob will not produce bundles, its handle_slot_notifications_fut (which contains the new_slot_notification_stream) is dropped during initialization:
https://github.com/subspace/subspace/blob/67875dd520eb88abcc08a43d5fb05e614b1fde8d/domains/client/domain-executor/src/system_domain_worker.rs#L183

Also, because the node is initialized asynchronously, if the first slot is produced before bob's handle_slot_notifications_fut is dropped, the slot notification (which contains a slot_acknowledgement_sender) can be successfully sent to the new_slot_notification_stream. Even though handle_slot_notifications_fut is dropped later, the slot_acknowledgement_sender stays alive in the channel buffer and is never dropped.

Thus the test blocks waiting for the slot acknowledgment, because there is a slot_acknowledgement_sender that is never closed but also never handled.
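
For illustration, a simplified sketch of that pattern (hypothetical names, not the actual test or executor code), assuming the async-channel crate:

```rust
// Simplified sketch of the deadlock: the acknowledgement sender gets
// stuck in the buffer of a channel whose receiver has been dropped.
fn main() {
    // The "slot notification stream" a subscriber such as
    // handle_slot_notifications_fut would read from.
    let (slot_tx, slot_rx) = async_channel::bounded(10);

    // The acknowledgement channel the producer waits on after sending
    // a slot notification.
    let (ack_tx, ack_rx) = async_channel::bounded::<()>(1);

    // The slot notification carrying the acknowledgement sender is
    // buffered before the subscriber gets dropped...
    slot_tx.try_send((42u64, ack_tx)).unwrap();

    // ...and then the only receiver is dropped during initialization,
    // e.g. because bob does not produce bundles.
    drop(slot_rx);

    // With an async-channel version exhibiting the reported behavior,
    // the buffered notification, and the ack_tx inside it, stay alive in
    // the channel, so the ack channel neither delivers a message nor
    // reports itself closed; a `recv().await` here would wait forever.
    println!("{:?}", ack_rx.try_recv()); // Err(Empty) on affected versions
}
```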

Potential damage beyond the test

MockPrimaryNode's slot/block import notification is more or less ported from sc-consensus-subspace, so it may also be affected by this unexpected behavior of async-channel.

The block import pipeline also waits for the block import acknowledgments from all subscribers before finishing; if a subscriber is dropped after the block import notification is sent to it but before it handles the notification, the block import pipeline will also block forever.

But in our current usage there are only a few subscribers of the block import notification stream (the archiver, the system domain executor, and the core domain executor), all of which are essential and should not be dropped except on fatal errors. So the worst damage is that the node may hang unexpectedly when some fatal error occurs, and this behavior may also be a pitfall for future subscribers of the block import notification stream.

nazar-pc (Member, Author) commented Apr 6, 2023

I think this should be reported back to Substrate, and we should ping upstream async-channel to fix this. It seems like a legitimate issue that we either need to resolve or revert Substrate to using the futures channel.

NingLin-P added a commit that referenced this issue Apr 6, 2023
…nt in MockPrimaryNode

It is possible that a slot/block import subscriber is dropped after a notification is sent, in which case we will block waiting for the acknowledgement forever; see #1352 for more details.
This commit fixes the issue by dynamically adjusting the expected acknowledgements while waiting.

Signed-off-by: linning <[email protected]>
NingLin-P (Member) commented Apr 6, 2023

I think this should be reported back to Substrate and we should ping upstream async-channel to fix this

These actions have been taken. I also opened #1357 to fix this issue in order to unblock our work. I can also apply it to sc-consensus-subspace if necessary, WDYT @nazar-pc?

nazar-pc (Member, Author) commented Apr 6, 2023

sc-consensus-subspace is long-running; I don't think we need it there unless we have actual bug reports. Also, I'd rather implement the fix upstream if we have time for that, so we don't need to write those workarounds in the first place.

NingLin-P added a commit that referenced this issue Apr 6, 2023
…cknowledgements in MockPrimaryNode

It is possible that a slot/block import subscriber is dropped after a notification is sent, in which case we will block waiting for the acknowledgement forever; see #1352 for more details.
This commit fixes the issue by proactively dropping closed subscribers while waiting for acknowledgements.

Signed-off-by: linning <[email protected]>
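
For illustration, a hedged sketch of what "proactively dropping closed subscribers while waiting for acknowledgements" could look like (hypothetical names and structure, not the actual #1357 change):

```rust
// Hypothetical sketch: skip any subscriber whose notification channel is
// already closed (its receiver was dropped) instead of waiting forever
// for an acknowledgement that can never arrive.
use async_channel::{Receiver, Sender, TryRecvError};

struct Subscriber {
    // Channel the block import notifications are sent on.
    notification_sender: Sender<u64>,
    // Channel the acknowledgement for the last notification comes back on.
    acknowledgement_receiver: Receiver<()>,
}

/// Returns true once no subscriber is left to wait on for this block:
/// acknowledged subscribers are done, and dropped subscribers (closed
/// notification channel) are removed proactively.
fn poll_acknowledgements(pending: &mut Vec<Subscriber>) -> bool {
    pending.retain(|subscriber| {
        match subscriber.acknowledgement_receiver.try_recv() {
            Ok(()) => false,                    // acknowledged
            Err(TryRecvError::Closed) => false, // ack channel gone
            Err(TryRecvError::Empty) => {
                // Still pending: keep it only if the subscriber is alive.
                !subscriber.notification_sender.is_closed()
            }
        }
    });
    pending.is_empty()
}

fn main() {
    // One subscriber that drops its receiver before acknowledging.
    let (notification_sender, notification_receiver) = async_channel::bounded::<u64>(1);
    let (_ack_sender, acknowledgement_receiver) = async_channel::bounded::<()>(1);
    drop(notification_receiver);

    let mut pending = vec![Subscriber { notification_sender, acknowledgement_receiver }];
    // The dropped subscriber is removed instead of blocking the pipeline.
    assert!(poll_acknowledgements(&mut pending));
}
```

A real implementation would await the acknowledgements rather than poll; the point is only that a subscriber whose notification channel is closed is removed from the pending set instead of being waited on forever.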