
Attempt to fix flaky util-app test in GitHub CI #310

Open

felixbr wants to merge 1 commit into develop
Conversation

felixbr (Contributor) commented Aug 2, 2022

As you can see in many PRs (e.g. #308), one of the build-matrix jobs is often red. As far as I can tell, it's always the same flaky test case in AppTest (in util-app).

I tried many things but couldn't reproduce the timeout locally. As an attempt to fight the failures in CI, I'm increasing the timeout for the grace period and also adding retries to the test case.

I understand that this is mostly a band-aid and not a great solution, so if you don't like it, please let me know how we should tackle this instead.

Cheers
~ Felix
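For illustration, the grace-period part of the change might look roughly like the sketch below. The names `shutdownFuture` and the 30-second figure mirror the discussion, but this is an assumption-laden stand-in, not the actual util-app API:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object GracePeriodSketch {
  // Hypothetical stand-in for whatever future util-app completes on close.
  val shutdownFuture: Future[Unit] = Future { Thread.sleep(50) }

  def main(args: Array[String]): Unit = {
    // A tight timeout (e.g. 100.millis) passes locally but can flake on a
    // starved CI runner. A generous 30.seconds adds head-room without
    // slowing the happy path, because Await.result returns as soon as the
    // future completes -- it only sleeps the full duration on failure.
    Await.result(shutdownFuture, 30.seconds)
    println("shutdown completed within grace period")
  }
}
```

The key property is that raising the bound only affects the failing path: a healthy shutdown still returns in milliseconds.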

bryce-anderson (Contributor) commented Aug 2, 2022

Does just increasing the defaultCloseGracePeriod to 30 seconds like you've done do the trick? It seems like a real bug if we cannot ever get the right exit behavior regardless of the grace period.

felixbr (Contributor, Author) commented Aug 5, 2022

Well, locally it always works, even with a very low grace period. My suspicion is that a different timeout somewhere (e.g. from Await.ready) fires the TimeoutException, so increasing the grace period doesn't really "fix" CI.

That is pure speculation, of course, as I couldn't reproduce the failure even once locally.

I've seen similar problems at work: tests were effectively starved in CI because sbt parallelizes heavily while the machine has very little CPU, so futures in tests were created but then paused for very long stretches, leading to timeouts.

If you want me to, I can remove the retry part, but `eventually` only retries for a while and then fails the test anyway. So if the test were really broken, the outcome would not change.
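To make the retry argument concrete, here is a minimal self-contained sketch of the kind of `eventually` helper being discussed (the real one would come from ScalaTest; this hypothetical version only uses the standard library). Note the bounded attempts: a genuinely broken test still fails, just a little later, with the original exception:

```scala
object Eventually {
  // Re-run a flaky assertion up to maxAttempts times, sleeping between
  // attempts, and rethrow the last failure once attempts are exhausted.
  def eventually[T](maxAttempts: Int = 5, delayMs: Long = 10)(body: => T): T = {
    var attempt = 1
    var last: Throwable = null
    while (attempt <= maxAttempts) {
      try return body
      catch {
        case e: Throwable =>
          last = e           // remember the most recent failure
          Thread.sleep(delayMs)
          attempt += 1
      }
    }
    throw last               // truly broken: surface the real error
  }
}
```

Under this scheme a test that flakes once or twice goes green on a later attempt, while a deterministic failure exhausts `maxAttempts` and fails as before.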

bryce-anderson (Contributor) commented

> If you want me to, I can remove the retry part, but `eventually` only retries for a while and then fails the test anyway. So if the test were really broken, the outcome would not change.

I would prefer to try it without the retries. If the problem isn't simply "not enough CPU time" (though I think you're right that it is) and the test is actually flaky for other reasons, then it's genuinely broken and should be fixed; retrying would only mask the flakiness.

felixbr (Contributor, Author) commented May 18, 2023

I've rebased this onto develop and reverted the addition of the `eventually` retries. So it's only an increased timeout and a comment now.

