core::test_rpc_subscriptions and core::test_rpc_slot_updates are flaky #16970
Comments
Starting to work on it. The debug notes are at https://docs.google.com/document/d/17kJ122_wZbEA8SWqke6tUAyKTksAnYYGSubV-ytn2rw/edit?usp=sharing
I think I know the cause now, but I am not sure about the fix yet. The test's main thread sets up a tokio async runtime thread and runs the tokio tasks in it. The tasks send info (signatures, accounts, etc.) to crossbeam channels, and the last task sends a ready message to a separate ready crossbeam channel. The main thread waits for the ready message and then retrieves the info from the crossbeam channels. The problem is that the tokio tasks do not run in order, so the ready message is not necessarily sent after all the signatures have been sent. That makes the ready message the wrong way to signal the main thread that the signatures are ready to be received. In most cases the signatures are still sent before the main thread retrieves them, but in rare cases the sends are delayed and the main thread times out. Hence the flakiness.
Waiting for all previous tokio tasks to finish before sending the ready message should be the correct solution. I tested this in an isolated environment and verified that it works. But in this test, the “ready_sender.send(())” somehow causes sig_notifications.next() to fail, so the signature tasks fail to send the signatures. I could debug further why it fails. Another way to synchronize is to simply poll status_receiver.len() in the main thread and wait until all signatures are ready to be received.
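The "wait for all senders, then signal ready" idea can be sketched roughly as follows. This is not the test's actual code: it swaps in std threads and std::sync::mpsc for the tokio tasks and crossbeam channels so that it compiles without external crates, and the function names send_then_signal and collect_ready are hypothetical. Only status_receiver and ready_sender echo names from the discussion above.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Spawns `n` senders (std-thread stand-ins for the tokio tasks), waits for
/// every one of them to finish, and only then sends the ready message.
/// Hypothetical helper, not taken from the real test.
fn send_then_signal(n: usize) -> (mpsc::Receiver<usize>, mpsc::Receiver<()>) {
    let (status_sender, status_receiver) = mpsc::channel();
    let (ready_sender, ready_receiver) = mpsc::channel();

    thread::spawn(move || {
        let handles: Vec<_> = (0..n)
            .map(|i| {
                let tx = status_sender.clone();
                thread::spawn(move || {
                    // Uneven delays so the senders finish out of order.
                    thread::sleep(Duration::from_millis(((n - i) * 5) as u64));
                    tx.send(i).unwrap();
                })
            })
            .collect();

        // Join every sender first; the ready message is sent only after
        // all values are already in the channel.
        for handle in handles {
            handle.join().unwrap();
        }
        ready_sender.send(()).unwrap();
    });

    (status_receiver, ready_receiver)
}

/// Main-thread side: block on the ready message, then drain the channel
/// without any further waiting.
fn collect_ready(n: usize) -> Vec<usize> {
    let (status_receiver, ready_receiver) = send_then_signal(n);
    ready_receiver
        .recv_timeout(Duration::from_secs(5))
        .expect("ready message not received in time");
    let mut received: Vec<usize> = status_receiver.try_iter().collect();
    received.sort();
    received
}
```

Because the ready message is sent only after every sender has been joined, the non-blocking drain on the main thread sees all n values regardless of the order in which the senders ran.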
Increased the signature waiting timeout with #27008
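For illustration, waiting on the expected number of messages under a single overall deadline, rather than relying on a separate ready signal, might look like the sketch below. It uses std::sync::mpsc instead of the crossbeam channels in the test, and recv_exactly is a hypothetical helper name; it does not show what #27008 changed.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Receives exactly `expected` items, or gives up once `timeout` has elapsed
/// in total. On failure, returns whatever was received so far as the error.
/// Hypothetical helper for illustration only.
fn recv_exactly<T>(
    rx: &mpsc::Receiver<T>,
    expected: usize,
    timeout: Duration,
) -> Result<Vec<T>, Vec<T>> {
    let deadline = Instant::now() + timeout;
    let mut items = Vec::with_capacity(expected);
    while items.len() < expected {
        // Time left before the overall deadline (zero once it has passed).
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(item) => items.push(item),
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => {
                return Err(items);
            }
        }
    }
    Ok(items)
}
```

A single deadline shared across all receives keeps the total wait bounded, while a pause between individual sends costs nothing as long as everything arrives before the deadline.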
This test failed for me in CI again 😢
Problem
The following two commands eventually fail. My estimate is that the tests fail less than 1 in 10 iterations.
It produces the following errors
(also observed with different values than 569)
Proposed Solution
I don’t know enough about what these tests do to propose a solution here.