
Attempt to dodge flakiness in heavy_tasks_doesnt_block_graphql test #2437

Open. Wants to merge 2 commits into base: master.
Conversation

@rafal-ch (Contributor) commented Nov 14, 2024

Closes #2435

Description

This is an attempt to resolve the observed flakiness in the heavy_tasks_doesnt_block_graphql test.

When I tested the fix locally with much smaller timeouts (250 ms) and debug prints, I observed the following outcome:

running 1 test
ERR
ERR
OK
test dos::heavy_tasks_doesnt_block_graphql ... ok

Changes:

  1. Allow retrying the "health" check request 3 times.
  2. Because we now allow 3 tries instead of just 1:
     • Reduce the timeout from 5 to 4 seconds.
     • Spam the node with 3 times as many requests in the background (50 -> 150).

This is an effort to make the test more resilient to the performance of the machine it is executed on.
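
For illustration, here is a minimal sketch of the retry-with-timeout idea in Rust using tokio. The helper name, the generic probe parameter, and the error handling are hypothetical; only the numbers (3 attempts, a 4-second per-attempt timeout) come from the list above, and the actual test in the repository may be structured differently.

```rust
use std::{future::Future, time::Duration};

/// A minimal sketch (not the real test code): run the health probe up to
/// 3 times, each attempt bounded by its own 4-second timeout, and succeed
/// as soon as any attempt answers in time.
async fn health_with_retries<F, Fut, E>(mut probe: F) -> bool
where
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<bool, E>>,
{
    const ATTEMPTS: usize = 3;                                    // was a single attempt
    const PER_ATTEMPT_TIMEOUT: Duration = Duration::from_secs(4); // was 5 seconds

    for attempt in 1..=ATTEMPTS {
        match tokio::time::timeout(PER_ATTEMPT_TIMEOUT, probe()).await {
            Ok(Ok(true)) => return true,              // healthy response arrived in time
            _ => eprintln!("ERR (attempt {attempt})"), // timed out or errored; try again
        }
    }
    false
}
```

With this shape, the locally observed "ERR / ERR / OK" output corresponds to the first two attempts timing out and the third succeeding.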

Before requesting review

  • I have reviewed the code myself

@rafal-ch added the "no changelog" label (Skip the CI check of the changelog modification) on Nov 14, 2024
@rafal-ch requested a review from a team on November 14, 2024, 16:38
@netrome (Contributor) left a comment


I'm a bit hesitant about having retry logic here. Isn't the point of the test to ensure that the service is available despite the load? Adding retry logic seems like it defeats the purpose of the test.

What reason is there for the health query to time out other than the heavy tasks actually blocking the queries? My take is that if this test still fails occasionally in CI, we'd need to look into whether we can do further improvements in reducing the risk of the heavy tasks blocking the other requests.

@rafal-ch (Contributor, Author) commented Nov 15, 2024

> I'm a bit hesitant about having retry logic here. Isn't the point of the test to ensure that the service is available despite the load? Adding retry logic seems like it defeats the purpose of the test.

The reason I did it like that is that we do not give strict guarantees about the service's responsiveness under load. One could ask why we allowed exactly 5 seconds in the original test implementation. Does that mean we should always respond within 5 seconds, regardless of the load and the machine we're running on? Probably not. I figured that having a couple of short retries would be better than just increasing the timeout arbitrarily.

> we can do further improvements in reducing the risk of the heavy tasks blocking the other requests

The question here is: reduce it to what level? In fact, no matter what improvements you make, you'll always observe some randomness in response time depending on the load of the machine, etc.

Anyway, I'm happy to bench and work on potential improvements in this regard if we see the need, especially if we observe that this test still fails with the retries, which would mean we have a deeper issue. For this PR, though, the goal is to make CI runs less flaky.

@netrome (Contributor) commented Nov 15, 2024

> > I'm a bit hesitant about having retry logic here. Isn't the point of the test to ensure that the service is available despite the load? Adding retry logic seems like it defeats the purpose of the test.

> The reason I did it like that is that we do not give strict guarantees about the service's responsiveness under load. One could ask why we allowed exactly 5 seconds in the original test implementation. Does that mean we should always respond within 5 seconds, regardless of the load and the machine we're running on? Probably not. I figured that having a couple of short retries would be better than just increasing the timeout arbitrarily.

> > we can do further improvements in reducing the risk of the heavy tasks blocking the other requests

> The question here is: reduce it to what level? In fact, no matter what improvements you make, you'll always observe some randomness in response time depending on the load of the machine, etc.

> Anyway, I'm happy to bench and work on potential improvements in this regard if we see the need, especially if we observe that this test still fails with the retries, which would mean we have a deeper issue. For this PR, though, the goal is to make CI runs less flaky.

Fair enough. I guess part of the problem here is that the original test is inherently flaky, and that the responsiveness behavior we're testing isn't well defined. I'd question whether we should even keep the test or just disable or remove it. But since this actually makes our CI behave better, I'll approve.

@xgreenx (Collaborator) left a comment


> One could ask why we allowed exactly 5 seconds in the original test implementation. Does that mean we should always respond within 5 seconds, regardless of the load and the machine we're running on? Probably not.

The machine should answer in 5 seconds regardless of its load. Otherwise, the liveness check will fail.

Maybe we should consider running this test without other tests in parallel to avoid thread-blocking tasks or having too many threads.

Also, maybe we just need to increase number_of_threads in the config.

(screenshot of the relevant config attached)
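
For context, this suggestion amounts to giving the service more worker threads so that a few heavy, blocking tasks cannot occupy every worker and starve the health queries. The sketch below shows the general idea with tokio's runtime builder; how fuel-core's number_of_threads setting actually maps onto the runtime is an assumption here, and the value 4 is only illustrative. As for running the test in isolation, one option is libtest's thread flag, e.g. cargo test heavy_tasks_doesnt_block_graphql -- --test-threads=1.

```rust
fn main() {
    // Hypothetical value; the point is just to raise it from a low default.
    let number_of_threads = 4;

    // Build a multi-threaded runtime with the requested number of workers.
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(number_of_threads)
        .enable_all()
        .build()
        .expect("failed to build the runtime");

    runtime.block_on(async {
        // With more worker threads, a handful of heavy tasks are less likely
        // to occupy every worker and block the health endpoint.
    });
}
```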

@xgreenx (Collaborator) left a comment


@rafal-ch Could you check whether you still have the problem after #2401?

@rafal-ch (Contributor, Author) commented

> @rafal-ch Could you check whether you still have the problem after #2401?

With the original 5-second timeout I was never able to reproduce the problem locally (neither with #2401 nor without it). It was happening on CI only.

When I reduce the timeout the test is still flaky on my local machine.

In short: no change in the behavior observed.

Successfully merging this pull request may close these issues:

Flaky test: heavy_tasks_doesnt_block_graphql