
CI: timeout in new kube tests #17076

Closed
edsantiago opened this issue Jan 11, 2023 · 8 comments
Labels: flakes (Flakes from Continuous Integration), locked - please file new issue/PR

Comments

@edsantiago (Member)

PR #16880 broke CI. There's a new timeout flake, cause unknown, but it is very likely caused by #16880. @ygalblum please fix or revert ASAP. I looked more closely at the logs this morning, and it's not the known checkpoint bug; this is a new one.

As a side note, this was a bad merge. We should not merge PRs in which there are six retries on a new unexplained flake.

edsantiago added the flakes (Flakes from Continuous Integration) label Jan 11, 2023
@ygalblum (Contributor)

@edsantiago The retries were added following this comment #16880 (comment) from @Luap99. The reason for the retry is that podman kube play may return before the container is ready to accept connections. As a result, in CI, the test fails to connect to the port (while locally it passes every time).

Furthermore, I don't understand why you think this change causes the test to hang. Looking at the test logs you've provided, these tests pass and most of them are not even marked as "SLOW TEST". Moreover, this PR was merged only yesterday and I've been seeing such issues with the CI before that.

I don't mind reverting the test code, but I don't understand why you think this code is the source of the issue.

@edsantiago (Member, Author)

Post hoc, ergo propter hoc. Imperfect (to say the least!) reasoning, but I have a pretty good sense for CI flakes and these are new ones. They started with your PR. We are now seeing them in other PRs. There could be a common factor just slightly before your PR, in which the timeout didn't trigger, but the smart money is on it being something in your PR.

@ygalblum (Contributor)

But look at the commits following the merge of my PR. While they are red, they fail on other tests, and the test that times out for me passes.

@edsantiago (Member, Author)

Ah, sorry, this may be a terminology issue. A "test flake" is a condition in which a test is unreliable: it passes sometimes, fails other times, for reasons yet unknown. Test flakes are very hard to reproduce, hence hard to understand or fix.

@Luap99 (Member) commented Jan 11, 2023

All I said is that the single test should have a retry for TCP/UDP connections; the full suite should never hang like that.

@Luap99 (Member) commented Jan 11, 2023

Looking at the code, we most likely need to add conn.SetDeadline(); otherwise a read can block forever.

@ygalblum (Contributor)

Please see #17083

@edsantiago (Member, Author)

#17083 is merged, closing this in hopes that the problem is fixed. Thanks @ygalblum.

github-actions bot added the locked - please file new issue/PR label Sep 4, 2023
github-actions bot locked as resolved and limited conversation to collaborators Sep 4, 2023