Skip to content
This repository has been archived by the owner on Jan 13, 2025. It is now read-only.

RPC unit test timeout fix #28998

Merged
merged 1 commit into from
Dec 2, 2022

Conversation

bw-solana
Copy link
Contributor

Problem

rpc_subscriptions::tests::test_check_account_subscribe fails sporadically in CI with a receiver timeout.

This is caused by a race condition between completing processing of the subscribed notification and processing bank transaction. We need the subscription to fully process so they will be notified of the bank transaction, but occasionally (~10% of the time) the bank transaction completes first because these activities are handled in separate threads.

Note this is essentially the same thing observed in #27480 and partially mitigated by #27481

Proposed Solution

Add a short delay in the test_check_account_subscribe test to allow for the subscribed notification to be handled before proceeding with processing the bank transaction.

@bw-solana bw-solana marked this pull request as ready for review November 30, 2022 19:38
// subscribed notification before the bank transaction is processed. Without this
// sleep, the bank transaction ocassionally completes first and we hang forever
// waiting to receive a bank notification.
std::thread::sleep(Duration::from_millis(100));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible to poll here on the subscribe notification getting processed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be better than naked sleep.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe there's existing instrumentation we can poll to check for completion (but I would love to be wrong!).

We can certainly add the instrumentation, but I'm not crazy about it. It adds a fair amount of complexity for something we would only use for unit tests

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the assert_subscribed method is about the best we have. I wonder if there's some ordering in subscription creation that should be tweaked so that we only pass that assert when everything is completed.
Brennan, could you tell where in the subscription we get to when the tx is processed, in the failure cases?

Copy link
Contributor Author

@bw-solana bw-solana Nov 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure exactly how far we get, but I know we didn't make it to the NotificationEntry::Subscribed type match in fn process_notifications because we didn't hit the print I placed there. I'm sure it varies run-to-run due to the nature of being async.

The assert only ensures completion of the part that is serialized with respect to bank transaction, so it doesn't provide much of anything

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. For sure it will vary, I was just curious what you saw.
Yeah, I am thinking if we can rework the subscription registration so that the map entry checked by the assert is the last thing that happens, we could make the assert actually useful. I'll take a look at some point, and we can make this test more robust, if that approach works.
Appreciate you following up on this!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. For sure it will vary, I was just curious what you saw.

Yeah, I am thinking if we can rework the subscription registration so that the map entry checked by the assert is the last thing that happens, we could make the assert actually useful. I'll take a look at some point, and we can make this test more robust, if that approach works.

Appreciate you following up on this!

Sounds good! Thank you!

@bw-solana bw-solana merged commit 9cb0151 into solana-labs:master Dec 2, 2022
@bw-solana bw-solana deleted the rpc_test_timeout_fix branch December 2, 2022 16:47
gnapoli23 pushed a commit to gnapoli23/solana that referenced this pull request Dec 16, 2022
nickfrosty pushed a commit to nickfrosty/solana that referenced this pull request Jan 4, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants