Attempt to dodge flakiness in heavy_tasks_doesnt_block_graphql test #2437
base: master
Conversation
I'm a bit hesitant about having retry logic here. Isn't the point of the test to ensure that the service is available despite the load? Adding retry logic seems like it defeats the purpose of the test.
What reason is there for the health query to time out, other than the heavy tasks actually blocking the queries? I'd take it that if this test still fails occasionally in CI, we'd need to look into whether we can further reduce the risk of the heavy tasks blocking the other requests.
The reason why I did it like that is that we do not give strict guarantees about the service's responsiveness under load. One could ask why we allowed exactly 5 seconds in the original test implementation. Does it mean that we should always respond within 5 sec., regardless of the load and the machine we're running on? Probably not. I figured that having a couple of short retries would be better than just increasing the timeout arbitrarily.
The question here is: reduce to what level? In fact, no matter what improvements you make, you'll always observe some randomness in response time based on the load of the machine, etc. Anyway, I'm happy to bench and work on potential improvements in this regard if we see the need, especially if we observe this test still failing with the retries, which would mean we have a deeper issue. For this PR, though, the goal is to make CI runs less flaky.
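For illustration, a minimal sketch of the short-retry idea: the `probe` closure, attempt count, and per-attempt timeout are assumptions made for this sketch, not the PR's actual code.

```rust
use std::{future::Future, time::Duration};

/// Sketch: instead of one long timeout, bound each health probe with a
/// short per-attempt timeout and retry a few times. A probe that hangs
/// or reports unhealthy simply counts as a failed attempt.
async fn health_with_retries<F, Fut>(mut probe: F, attempts: usize, per_attempt: Duration) -> bool
where
    F: FnMut() -> Fut,
    Fut: Future<Output = bool>,
{
    for _ in 0..attempts {
        // `tokio::time::timeout` returns Err(Elapsed) if the probe stalls.
        if let Ok(true) = tokio::time::timeout(per_attempt, probe()).await {
            return true;
        }
    }
    false
}
```

The point of this shape is that each attempt stays bounded (e.g. 250 ms), so a transient stall is tolerated without granting the node an unbounded response budget.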
Fair enough. I guess part of the problem here is that the original test is inherently flaky, and that the responsiveness behavior we're testing isn't well defined. I'd question whether we should even keep the test or just disable or remove it. But since this actually makes our CI behave better, I'll approve.
One could ask why we allowed exactly 5 seconds in the original test implementation. Does it mean that we should always respond within 5 sec., regardless of the load and the machine we're running on? Probably not.
The machine should answer in 5 seconds regardless of its load. Otherwise, the liveness check will fail.
Maybe we should consider running this test without other tests in parallel to avoid thread-blocking tasks or having too many threads.
Also, maybe we just need to increase number_of_threads in the config.
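If the test runs on a tokio multi-thread runtime, the thread count can be raised per test. Here is a generic sketch of the effect; the worker and task counts are arbitrary, and whether number_of_threads maps onto the runtime like this is an assumption.

```rust
use std::time::Duration;

// Sketch: CPU-heavy tasks that never yield each pin one worker thread,
// but with spare workers a lightweight probe still gets scheduled.
#[tokio::test(flavor = "multi_thread", worker_threads = 8)]
async fn probe_survives_busy_workers() {
    for _ in 0..4 {
        tokio::spawn(async {
            // Busy loop that never yields, occupying one worker.
            let start = std::time::Instant::now();
            while start.elapsed() < Duration::from_secs(2) {}
        });
    }
    // A trivial "probe" should still run well within the budget.
    let probed = tokio::time::timeout(Duration::from_millis(500), tokio::task::yield_now()).await;
    assert!(probed.is_ok());
}
```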
With the original 5 sec timeout I was never able to reproduce the problem locally (neither with #2401 nor without it); it was happening on CI only. Even when I reduce the timeout, the test is still not flaky on my local machine. In short: no change in behavior observed.
Closes #2435
Description
This is an attempt to resolve the observed flakiness in the heavy_tasks_doesnt_block_graphql test. When I tested the fix locally, with much smaller (250 ms) timeouts and debug prints, I could observe the following outcome:
Changes:
3. Reduce timeout from 5 to 4 seconds
4. Spam the node with 3 times as many requests in the background (50 -> 150).
This is an effort to make the test more resilient to the performance of the machine it is executed on.
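Schematically, the reshaped test looks something like the sketch below; dummy_request is a hypothetical stand-in for the real GraphQL calls, not fuel-core's actual client API.

```rust
use std::time::Duration;

// Placeholder standing in for a real GraphQL request (an assumption
// for this sketch, not the test's real client code).
async fn dummy_request() {
    tokio::time::sleep(Duration::from_millis(10)).await;
}

#[tokio::main]
async fn main() {
    // 150 background requests, three times the original 50.
    for _ in 0..150 {
        tokio::spawn(dummy_request());
    }
    // The health query must still answer within the reduced 4 s budget.
    let healthy = tokio::time::timeout(Duration::from_secs(4), dummy_request()).await;
    assert!(healthy.is_ok(), "health query did not answer under load");
}
```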
Before requesting review