GuardDogActionsTest flake (ARM optimized) #12638

Closed
mattklein123 opened this issue Aug 13, 2020 · 20 comments

@mattklein123
Member

[ RUN      ] TimeSystemType/GuardDogActionsTest.MissShouldOnlyReportRelevantThreads/0
test/server/guarddog_impl_test.cc:613: Failure
Value of: events_
Expected: has 1 element that is equal to "MISS : 32206"
  Actual: { "MISS : 32206,32206" }, whose element #0 doesn't match
Stack trace:
  0xb4e7e4: testing::internal::HandleExceptionsInMethodIfSupported<>()
  0xb4e6fc: testing::Test::Run()
  0xb4f63c: testing::TestInfo::Run()
... Google Test internal frames ...

[  FAILED  ] TimeSystemType/GuardDogActionsTest.MissShouldOnlyReportRelevantThreads/0, where GetParam() = 4-byte object <00-00 00-00> (123 ms)
@mattklein123
Member Author

@KBaichoo do you mind taking a look? I've seen a few flakes on these tests. You should be able to repro on x86 with -c opt if it doesn't repro on fastbuild. Thank you!

@KBaichoo
Contributor

KBaichoo commented Aug 14, 2020

Running with fastbuild on x86 for 1000 runs: no issues.

Running on x86 with -c opt for 1000 runs, I ran into failures on 2 of them.

Both were timeouts that occurred when running under SimulatedTimeSystem, while the ARM failure above seems to be under RealTimeSystem.

Error log of failure:

Executing tests from //test/server:guarddog_impl_test
-----------------------------------------------------------------------------
WARNING: Perftools heap leak checker is active -- Performance may suffer
[==========] Running 44 tests from 5 test suites.
[----------] Global test environment set-up.
[----------] 6 tests from TimeSystemType/GuardDogDeathTest
[ RUN      ] TimeSystemType/GuardDogDeathTest.KillDeathTest/0
[       OK ] TimeSystemType/GuardDogDeathTest.KillDeathTest/0 (255 ms)
[ RUN      ] TimeSystemType/GuardDogDeathTest.KillDeathTest/1
TestRandomGenerator running with seed 404803088
[       OK ] TimeSystemType/GuardDogDeathTest.KillDeathTest/1 (158 ms)
[ RUN      ] TimeSystemType/GuardDogDeathTest.MultiKillDeathTest/0
[       OK ] TimeSystemType/GuardDogDeathTest.MultiKillDeathTest/0 (657 ms)
[ RUN      ] TimeSystemType/GuardDogDeathTest.MultiKillDeathTest/1
TestRandomGenerator running with seed 404803088
-- Test timed out at 2020-08-14 12:59:30 UTC --

@KBaichoo
Contributor

KBaichoo commented Aug 14, 2020

I'm able to find some flakes in RealTimeSystem running under -c opt with tsan. Looking into it.

@KBaichoo
Contributor

KBaichoo commented Aug 14, 2020

I have a draft fix that should be less flaky (i.e. it passes 1000x in tsan with -c opt), but because of how RealTimeSystem works with advanceTimeWait, which sleeps the thread, we aren't 100% guaranteed to have no races / flakes. Would this be sufficiently deflaked?

Thoughts @mattklein123 ?

@mattklein123
Member Author

Ideally we should fix the test to be flake-free. Can you describe the problem? Can you use simulated time?

@KBaichoo
Contributor

I think the reason we're having trouble with the RealTimeSystem is that there's no good mechanism to synchronize the guarddog thread and the test thread when using real time. The guarddog, on construction, creates its own thread running its own dispatcher, on which it runs a timer that calls step().

We have two time systems within the test, and we primarily advance them using advanceTimeWait; however, the two time systems have different behaviors:

  • Real -- sleeps the current thread for at least the given time and then returns.

  • Simulated -- sets the time to the advanced time (i.e. time += duration_to_advance) and waits until all pending timers that should fire during that advancement have been called.

Simulated time thus lets the test thread and the guarddog thread stay in sync. With real time there are potential paths of execution where the test thread "oversleeps", or gets switched off between instructions and isn't scheduled, etc., which could introduce races.
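
To make the difference concrete, here is a minimal toy model contrasting the two behaviors; it is only a sketch and does not use Envoy's Event::TestTimeSystem types (all names below are illustrative):

// Toy model of the two advance behaviors described above; illustrative only,
// not Envoy's Event::TestTimeSystem API.
#include <chrono>
#include <functional>
#include <map>
#include <thread>

using MonotonicTime = std::chrono::steady_clock::time_point;
using Duration = std::chrono::milliseconds;

// "Real" flavor: just sleep; other threads keep running on their own schedule,
// so the caller learns nothing about whether their timers actually fired.
void realAdvanceTimeWait(Duration d) { std::this_thread::sleep_for(d); }

// "Simulated" flavor: bump a fake clock and synchronously run every timer that
// became due, so when this returns the caller knows the callbacks have run.
struct SimulatedClock {
  MonotonicTime now{};
  std::multimap<MonotonicTime, std::function<void()>> timers;

  void advanceTimeWait(Duration d) {
    now += d;
    auto end = timers.upper_bound(now);
    for (auto it = timers.begin(); it != end; ++it) {
      it->second(); // run due callbacks before returning
    }
    timers.erase(timers.begin(), end);
  }
};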

It seems a lot of this ties into #6465 (where a lot of testcases that are a bit more time sensitive are disabled for RealTimeSystem).

So we have a few possible solutions:

@mattklein123
Member Author

I will leave it up to you on how to handle this. If you want to better synchronize with real time you could potentially use https://github.com/envoyproxy/envoy/blob/master/source/common/common/thread_synchronizer.h. Otherwise using simtime seems like a good solution.

If you have a patch that makes it better I would take that now while you work on a better solution since the flake rate is pretty high.
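
For illustration, the kind of explicit handshake being suggested could look roughly like the sketch below; it uses absl::Notification as a stand-in rather than the thread_synchronizer.h API itself, so treat the shape as an assumption:

// Sketch: synchronize a test thread with a worker thread explicitly instead of
// sleeping "long enough" in real time. Uses absl::Notification as a stand-in;
// this is not the thread_synchronizer.h API.
#include <thread>
#include "absl/synchronization/notification.h"

void exampleHandshake() {
  absl::Notification watchdog_checked_in;

  std::thread worker([&]() {
    // ... the work the test wants to observe happens here ...
    watchdog_checked_in.Notify(); // signal that the sync point was reached
  });

  // The test thread blocks until the worker reaches the sync point.
  watchdog_checked_in.WaitForNotification();
  worker.join();
}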

@KBaichoo
Contributor

Added a patch that makes it better. Perhaps we can see how much that decreases flakiness in practice and go from there.

@mattklein123
Member Author

New flake after your fix (on ARM release):

[ RUN      ] TimeSystemType/GuardDogDeathTest.MultiKillThresholdDeathTest/1
TestRandomGenerator running with seed -1060420032
-- Test timed out at 2020-08-15 00:44:12 UTC --

@KBaichoo
Contributor

I tried all combinations of clang-tsan and -c opt, running 500x on x86, to see if I could reproduce this timeout issue, but I haven't been able to.

@mattklein123 what else can I do to try to reproduce it, or is it still an issue? I'm building from commit 49aedfd.

@mattklein123
Member Author

If you can't repro I would ignore for now. I can try to repro sometime this week and get back to you.

@mattklein123
Member Author

I'm able to trivially repro lots of failures via bazel test //test/server:guarddog_impl_test -c opt --runs_per_test=5000 --test_output=errors --jobs=200 --local_test_jobs=200 on my 36 core machine. I will take a look tomorrow.

@KBaichoo
Contributor

Running with bazel test //test/server:guarddog_impl_test -c opt --runs_per_test=5000 --test_output=errors I was able to get 3 flakes out of 5000.

Guess I didn't cast a wide enough net the first time?

All of the flakes were timeouts under SimulatedTime and they were also all Death Tests.

Now that I'm able to reproduce the issue I'll be able to debug it. Thanks for the pointers on reproducing it @mattklein123.

I'll give it a look tomorrow.

@KBaichoo
Contributor

KBaichoo commented Aug 19, 2020

After digging around, I think I've hit on a potential reason.

The failures I've encountered have all been under SimulatedTime when I run bazel test //test/server:guarddog_impl_test -c opt --runs_per_test=5000 --test_output=errors, and they're timeouts within tests based on the test fixture GuardDogDeathTest.

This makes things a tad more difficult to poke around in, since these are death tests: when we wrap the die_function() in an EXPECT_DEATH, it essentially forks and execs the test that runs the die function. This, however, means std::cerr, std::cout, etc. are unavailable to us from the test fixture. If you have thoughts / insights on how to dig into that, I'd love to hear :).

I believe the issue relates to some recent changes in simulated time for the following reasons:

  1. All failures were encountered under simulated time.
  2. There were recent changes to simulated time, and most of these death tests have been around far longer (years) than the recent change (days); if they were flaky before, it likely would have been a persistent issue.
  3. When I disable the new feature in simulated time via the runtime flag envoy.reloadable_features.activate_timers_next_event_loop, I get no flakes running with the command above (a possible way to flip the flag in a test is sketched below). It's my understanding that this behavior was introduced in PR #11663 ([dispatcher] Refactor how callbacks are scheduled in the event loop) and PR #11823 (dispatcher: Run zero-delay timeout timers on the next iteration of the event loop), which changed timer->enable(0ms) from executing in the current libevent iteration to executing in the next iteration.
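
One way such a flag can be flipped in a test is sketched below; this assumes the TestScopedRuntime helper and the Runtime::LoaderSingleton::getExisting()->mergeValues() pattern used in Envoy tests around this time, so treat the exact calls as an assumption rather than what was actually run here:

// Sketch only: assumes TestScopedRuntime (test/test_common/test_runtime.h) and
// the runtime mergeValues() pattern in use at the time.
#include "gtest/gtest.h"
#include "test/test_common/test_runtime.h"

namespace Envoy {

TEST(GuardDogFlakeRepro, DisableNewTimerBehavior) {
  TestScopedRuntime scoped_runtime;
  Runtime::LoaderSingleton::getExisting()->mergeValues(
      {{"envoy.reloadable_features.activate_timers_next_event_loop", "false"}});
  // ... the rest of the test now runs with the pre-#11823 timer scheduling ...
}

} // namespace Envoy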

Thoughts @mattklein123 and @antoniovicente?

Thanks

@mattklein123
Member Author

Hmm. I will defer to @antoniovicente on the event loop changes, but it wouldn't surprise me if there have been subtle changes here. Regarding the death test forking, shouldn't writing to std::cerr still work? Doesn't the fork still maintain access to output?

@antoniovicente
Contributor

/assign @antoniovicente

Just so I remember to look; it feels like issues like this are piling up.

@KBaichoo
Contributor

Upon further investigation, I believe EXPECT_DEATH() actually hijacks stderr in order to do regex matching on death, but not stdout.

I've also tried doing ENVOY_MISC_LOG() and flushing the misc logger, but that didn't work (I'm not familiar enough with the logger system to know what's going on there).

But this is good news! It means we can get printouts using std::cout.
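
For instance, a standalone sketch (a hypothetical test, not the guarddog fixture) of the behavior described: the forked child's stderr is consumed for the regex match, while output written to stdout still reaches the test log:

// Hypothetical standalone example: EXPECT_DEATH runs the statement in a forked
// child and captures the child's stderr for the regex match, so debug output
// written to stdout is what still shows up in the test log.
#include <cstdlib>
#include <iostream>
#include "gtest/gtest.h"

TEST(DeathOutputDemo, StdoutSurvivesStderrIsMatched) {
  EXPECT_DEATH(
      {
        std::cout << "debug: about to abort" << std::endl; // visible in the log
        std::cerr << "fatal error";                        // consumed by the matcher
        std::abort();
      },
      "fatal error");
}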

I annotated advanceTimeWaitImpl in simulated time and noticed that the tests that time out get stuck waiting on pending locks.

Attached below are some runs:

Example of non-hanging annotations:

[ RUN      ] TimeSystemType/GuardDogDeathTest.KillDeathTest/1
TestRandomGenerator running with seed -290074928
Death Test started...
Death Test started...
die_function called...
Starting setting up for Death.
AdvanceTimeWaitImpl enter
AdvanceTimeWaitImpl advancing time
AdvanceTimeWaitImpl advanced time, waiting on pending locks
AdvanceTimeWaitImpl exit
Finished setting up for Death.
AdvanceTimeWaitImpl enter
AdvanceTimeWaitImpl advancing time
AdvanceTimeWaitImpl advanced time, waiting on pending locks
[       OK ] TimeSystemType/GuardDogDeathTest.KillDeathTest/1 (160 ms)

Example of hanging:

[ RUN      ] TimeSystemType/GuardDogDeathTest.MultiKillDeathTest/1
TestRandomGenerator running with seed -290074928
Death Test started...
Death Test started...
die_function called
Starting setting up for MultiDeath.
AdvanceTimeWaitImpl enter
AdvanceTimeWaitImpl advancing time
AdvanceTimeWaitImpl advanced time, waiting on pending locks
-- Test timed out at 2020-08-20 15:22:25 UTC --

So the function seems to be blocked on the waitForNoPendingLockHeld() line.
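
To illustrate the shape of the hang, here is a toy version of a "wait for no pending locks" step; it is not Envoy's SimulatedTimeSystem internals, just a sketch of why the wait never returns if the activated timer callbacks are never run:

// Toy sketch (not SimulatedTimeSystem internals): the waiter returns only when
// the pending counter drops to zero, so if the thread that owns the activated
// timers never runs them, the advanceTimeWait-style call blocks forever.
#include <condition_variable>
#include <mutex>

struct PendingLockTracker {
  std::mutex mu;
  std::condition_variable cv;
  int pending{0};

  void add() { // a timer was activated during the time advance
    std::lock_guard<std::mutex> lock(mu);
    ++pending;
  }
  void release() { // the owning thread actually ran the callback
    std::lock_guard<std::mutex> lock(mu);
    --pending;
    cv.notify_all();
  }
  void waitForNoPending() { // the step the timed-out tests never get past
    std::unique_lock<std::mutex> lock(mu);
    cv.wait(lock, [this] { return pending == 0; });
  }
};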

Is this similar to other bugs you have, @antoniovicente?

@mattklein123
Member Author

This is probably our most active flake right now as it reproduces regularly on ARM release, probably due to speed. @KBaichoo @antoniovicente any thoughts on interim steps here? Maybe mark this test as flaky or manual? Disable it on ARM, etc.? Thoughts?

@antoniovicente
Contributor

I think #12609 should fix this issue.

Without the fix: 20 failures in opt mode out of 1k runs.
With the fix: 0 failures in opt mode out of 10k runs.

@mattklein123
Member Author

Hopefully fixed by #12609. 🤞
