Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase timeout to reduce the flakyness of rpc signature receving test #27008

Merged
merged 2 commits into from
Aug 9, 2022

Conversation

xiangzhu70
Copy link
Contributor

@xiangzhu70 xiangzhu70 commented Aug 8, 2022

Problem

The test sends the signatures and then a ready message all in tokio async tasks. The problem is that tokio async tasks are not run in order, so receiving the ready message does not guarantee that the signature sending is done.
This test is too flaky at 15-second timeout. Debugging show occasional multi-second delay which could come from multiple sources -- other tokio tasks, tokio scheduler, OS scheduler. The async nature makes it hard to track down the origin of the delay.

Summary of Changes

Without fundamentally changing the test design, this change only increases the timeout to reduce the flakiness of the test.

Fixes # 16970

thread 'test_rpc_subscriptions' panicked at 'recv_timeout, 569/1000 signatures remaining', core/tests/rpc.rs:354:17

@xiangzhu70 xiangzhu70 marked this pull request as ready for review August 9, 2022 05:24
@xiangzhu70 xiangzhu70 requested a review from brooksprumo August 9, 2022 05:25
@xiangzhu70 xiangzhu70 self-assigned this Aug 9, 2022
@brooksprumo
Copy link
Contributor

I understand this PR does not fix the underlying problem with the test, but does this PR meaningfully reduce the frequency this test fails? I'd guess my CI probably fails due to this test 1 in 10 times. If this PR improves that ratio, then I'm onboard for approving this change.

@brooksprumo
Copy link
Contributor

Optionally, could the test be fixed by adding in an "ACK"? After Thread 1 says "READY", can it wait until Thread 2 responds with an "ACK" before starting the timeout countdown?

@xiangzhu70
Copy link
Contributor Author

I understand this PR does not fix the underlying problem with the test, but does this PR meaningfully reduce the frequency this test fails? I'd guess my CI probably fails due to this test 1 in 10 times. If this PR improves that ratio, then I'm onboard for approving this change.

Yes, this definitely reduces the flakiness down to almost 0 error rate. I cannot reproduce this failure any more.

@xiangzhu70
Copy link
Contributor Author

Optionally, could the test be fixed by adding in an "ACK"? After Thread 1 says "READY", can it wait until Thread 2 responds with an "ACK" before starting the timeout countdown?

I thought about adding an intermediate sync step like you suggested. But then that "wait until Thread 2 responds with an ACK" is another wait for which a timeout need to be set up and handled. It is pretty much the same as waiting in the original wait with an increased timeout.

Copy link
Contributor

@brooksprumo brooksprumo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving for the sake of reducing CI failures. If this test starts failing frequently again, I do not think we should indefinitely increase the timeout, but instead fix outright.

@brooksprumo
Copy link
Contributor

Heh, just hit this again... A proper fix is likely required.

https://buildkite.com/solana-labs/solana/builds/79632#01829260-52a4-4e55-bb0f-cd67d72b439b

@steviez
Copy link
Contributor

steviez commented Aug 12, 2022

I've been running into this as well on v1.11 (0/4 at the moment 😢 ); if this test was found to reduce incidence rate, maybe we should backport ?
https://buildkite.com/solana-labs/solana/builds/79663#_

image

mergify bot pushed a commit that referenced this pull request Aug 15, 2022
…st (#27008)

* Increase timeout to reduce the flakyness of rpc signature receving test

* Minor fmt fix

(cherry picked from commit 52d8a20)
CriesofCarrots pushed a commit that referenced this pull request Aug 15, 2022
…st (backport #27008) (#27156)

Increase timeout to reduce the flakyness of rpc signature receving test (#27008)

* Increase timeout to reduce the flakyness of rpc signature receving test

* Minor fmt fix

(cherry picked from commit 52d8a20)

Co-authored-by: Xiang Zhu <[email protected]>
xiangzhu70 added a commit to xiangzhu70/solana that referenced this pull request Aug 17, 2022
…st (solana-labs#27008)

* Increase timeout to reduce the flakyness of rpc signature receving test

* Minor fmt fix
@bw-solana
Copy link
Contributor

Heh, just hit this again... A proper fix is likely required.

https://buildkite.com/solana-labs/solana/builds/79632#01829260-52a4-4e55-bb0f-cd67d72b439b

I believe this is the next wait in the same test (not the initial issue mitigated by Xiang). Need to increase that timeout as well (CC #27214)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants