GuardDogActionsTest flake (ARM optimized) #12638
Comments
@KBaichoo do you mind taking a look? I've seen a few flakes on these tests. You should be able to repro on x86 with |
Running with fastbuild on x86 1000 runs: no issues. Running on x86 with -c opt on 1000 runs, I ran into failure on 2 of them. Both were timeouts that occurred when running under Error log of failure:
|
I'm able to find some flakes in |
I have a draft fix that should be less flaky (i.e. passes 1000x in tsan with -c opt), but because of how Thoughts, @mattklein123? |
Optimally we should fix the test to be flake free. Can you describe the problem? Can you use simulated time? |
I think the reason we’re having trouble with the We have two time systems within the test, and primarily advance them using
Simulated time thus lets the test thread and the guarddog thread stay in sync. With real time there are potential paths of execution where the test thread “oversleeps”, or gets switched out between instructions and isn’t scheduled, etc., which could introduce races. It seems a lot of this ties into #6465 (where a lot of test cases that are a bit more time sensitive are disabled for So we have a few possible solutions:
|
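To illustrate the point above about simulated time keeping the two threads in sync, here is a minimal sketch. It is not Envoy's simulated time system or the GuardDog code: `SimClock` and `runToyWatchdog` are hypothetical names invented for this example. The toy watchdog thread can only observe time that the test thread has explicitly advanced, so the time at which it fires does not depend on how the OS schedules either thread.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical simulated clock (an assumption for illustration, not an Envoy
// class): time only moves when the test thread calls advance().
class SimClock {
public:
  using Ms = std::chrono::milliseconds;

  // Test thread: move simulated time forward and wake any waiters.
  void advance(Ms delta) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      now_ += delta;
    }
    cv_.notify_all();
  }

  // Watchdog thread: block until simulated time reaches the deadline.
  void waitUntil(Ms deadline) {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [&] { return now_ >= deadline; });
  }

  Ms now() {
    std::lock_guard<std::mutex> lock(mu_);
    return now_;
  }

private:
  std::mutex mu_;
  std::condition_variable cv_;
  Ms now_{0};
};

// Runs a toy watchdog that fires once simulated time reaches `timeout`, while
// the calling (test) thread advances the clock in increments of `step`.
// Returns the simulated time at which the watchdog fired. `step` must be > 0.
inline SimClock::Ms runToyWatchdog(SimClock::Ms timeout, SimClock::Ms step) {
  SimClock clock;
  SimClock::Ms fired_at{-1};
  std::thread watchdog([&] {
    clock.waitUntil(timeout);
    fired_at = clock.now();
  });
  // The test thread stops advancing as soon as the timeout has been reached,
  // so the value the watchdog reads is the same on every run.
  for (SimClock::Ms t{0}; t < timeout; t += step) {
    clock.advance(step);
  }
  watchdog.join();
  return fired_at;
}
```

With a real clock the equivalent test would have to sleep and hope the watchdog thread was scheduled in time, which is the kind of race described above.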
I will leave it up to you on how to handle. If you want to better synchronize with real time you could potentially use https://github.com/envoyproxy/envoy/blob/master/source/common/common/thread_synchronizer.h. Otherwise using simtime seems like a good solution. If you have a patch that makes it better I would take that now while you work on a better solution since the flake rate is pretty high. |
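As a rough illustration of what explicit synchronization under real time could look like, the sketch below forces an ordering between a worker thread and the test thread with one-shot gates instead of sleeps. This is not the API of the `thread_synchronizer.h` linked above: `Gate` and `runOrderedSteps` are hypothetical names, and the event strings are made up for the example.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical one-shot gate (an assumption for illustration only).
class Gate {
public:
  // Open the gate and release every thread blocked in wait().
  void open() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      open_ = true;
    }
    cv_.notify_all();
  }

  // Block until open() has been called; returns immediately if already open.
  void wait() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [&] { return open_; });
  }

private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool open_{false};
};

// Returns the order in which events happened. The order is fixed by the
// gates, not by thread scheduling or sleep durations.
inline std::vector<std::string> runOrderedSteps() {
  Gate worker_ready;
  Gate test_done;
  std::vector<std::string> events;
  std::mutex events_mu;
  auto record = [&](const std::string& event) {
    std::lock_guard<std::mutex> lock(events_mu);
    events.push_back(event);
  };

  std::thread worker([&] {
    record("worker: armed");
    worker_ready.open();  // Tell the test thread the worker is in position.
    test_done.wait();     // Do not check anything until the test has acted.
    record("worker: checked");
  });

  worker_ready.wait();    // The test acts only after the worker is armed.
  record("test: advanced");
  test_done.open();
  worker.join();
  return events;
}
```

Each handoff is explicit, so an "oversleeping" test thread can delay the sequence but cannot reorder it.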
Added a patch that makes it better. Perhaps we can see how much that decreases flakiness in practice and go from there. |
New flake after your fix (on ARM release):
|
I tried all combinations of @mattklein123 what can I do then to try to reproduce it? / is it still an issue? I'm building from commit 49aedfd |
If you can't repro I would ignore for now. I can try to repro sometime this week and get back to you. |
I'm able to trivially repro lots of failures via |
Running with Guess I didn't cast a wide enough net the first time? All of the flakes were timeouts under Now that I'm able to reproduce the issue I'll be able to debug it. Thanks for the pointers on reproducing the issue, @mattklein123. I'll give it a look tomorrow. |
After digging around, I think I've hit on a potential reason. The failures I've encountered have all been under This makes things a tad more difficult to poke around since these are death tests, so when we wrap the I believe the issue relates to some recent changes in simulated time for the following reasons:
Thoughts, @mattklein123 and @antoniovicente? Thanks |
Hmm. I will defer to @antoniovicente on the event loop changes, but wouldn't surprise me if there have been subtle changes here. Regarding the death test forking, shouldn't writing to std::cerr still work? Doesn't the fork still maintain access to output? |
/assign @antoniovicente Just so I remember to look, feeling like issues like this are piling on. |
Upon further investigation I believe I've also tried doing But this is good news! So we can now have print outs using I annotated Attached below are some runs: Example of non-hanging annotations:
Example of hanging:
So the function seems to be blocked on Is this similar to other bugs you have @antoniovicente ? |
This is probably our most active flake right now as it reproduces regularly on ARM release, probably due to speed. @KBaichoo @antoniovicente any thoughts on interim steps here? Maybe mark this test flaky manual? Disable on ARM, etc.? Thoughts? |
I think #12609 should fix this issue. Without the fix: 20 failures in opt mode out of 1k runs. |
Hopefully fixed by #12609. 🤞 |