GuardDogActionsTest flake (ARM optimized) #12638

Closed
mattklein123 opened this issue Aug 13, 2020 · 20 comments

@mattklein123
Member

[ RUN      ] TimeSystemType/GuardDogActionsTest.MissShouldOnlyReportRelevantThreads/0
test/server/guarddog_impl_test.cc:613: Failure
Value of: events_
Expected: has 1 element that is equal to "MISS : 32206"
  Actual: { "MISS : 32206,32206" }, whose element #0 doesn't match
Stack trace:
  0xb4e7e4: testing::internal::HandleExceptionsInMethodIfSupported<>()
  0xb4e6fc: testing::Test::Run()
  0xb4f63c: testing::TestInfo::Run()
... Google Test internal frames ...

[  FAILED  ] TimeSystemType/GuardDogActionsTest.MissShouldOnlyReportRelevantThreads/0, where GetParam() = 4-byte object <00-00 00-00> (123 ms)
@mattklein123
Member Author

@KBaichoo do you mind taking a look? I've seen a few flakes on these tests. You should be able to repro on x86 with -c opt if it doesn't repro on fastbuild. Thank you!

@KBaichoo
Contributor

KBaichoo commented Aug 14, 2020

Running with fastbuild on x86 for 1000 runs: no issues.

Running on x86 with -c opt for 1000 runs, I ran into failures on 2 of them.

Both were timeouts that occurred when running under SimulatedTimeSystem, while the ARM failure above seems to be under RealTimeSystem.

Error log of failure:

Executing tests from //test/server:guarddog_impl_test
-----------------------------------------------------------------------------
WARNING: Perftools heap leak checker is active -- Performance may suffer
[==========] Running 44 tests from 5 test suites.
[----------] Global test environment set-up.
[----------] 6 tests from TimeSystemType/GuardDogDeathTest
[ RUN      ] TimeSystemType/GuardDogDeathTest.KillDeathTest/0
[       OK ] TimeSystemType/GuardDogDeathTest.KillDeathTest/0 (255 ms)
[ RUN      ] TimeSystemType/GuardDogDeathTest.KillDeathTest/1
TestRandomGenerator running with seed 404803088
[       OK ] TimeSystemType/GuardDogDeathTest.KillDeathTest/1 (158 ms)
[ RUN      ] TimeSystemType/GuardDogDeathTest.MultiKillDeathTest/0
[       OK ] TimeSystemType/GuardDogDeathTest.MultiKillDeathTest/0 (657 ms)
[ RUN      ] TimeSystemType/GuardDogDeathTest.MultiKillDeathTest/1
TestRandomGenerator running with seed 404803088
-- Test timed out at 2020-08-14 12:59:30 UTC --

@KBaichoo
Contributor

KBaichoo commented Aug 14, 2020

I'm able to find some flakes in RealTimeSystem running under -c opt with tsan. Looking into it.

@KBaichoo
Contributor

KBaichoo commented Aug 14, 2020

I have a draft fix that should be less flaky (i.e. it passes 1000x in tsan with -c opt), but because of how RealTimeSystem works with advanceTimeWait, which sleeps the thread, we aren't 100% guaranteed to have no races / flakes. Would this be sufficiently deflaked?

Thoughts @mattklein123 ?

@mattklein123
Member Author

Ideally we should fix the test to be flake-free. Can you describe the problem? Can you use simulated time?

@KBaichoo
Contributor

I think the reason we're having trouble with the RealTimeSystem is that there's no good mechanism to synchronize the guarddog thread and the test thread when using real time. The guarddog, on construction, creates its own thread running its own dispatcher, on which it runs a timer that calls step().

We have two time systems within the test, and we primarily advance them using advanceTimeWait; however, the two time systems have different behaviors:

  • Real -- sleeps the current thread for at least the given time and then returns.

  • Simulated -- sets the time to the advanced time (i.e. time += duration_to_advance) and waits until all pending timers that should fire during that advancement have been called.

Simulated time thus lets the test thread and the guarddog thread stay in sync. With real time there are potential paths of execution where the test thread "oversleeps", or gets switched off between instructions and isn't scheduled, etc., which could introduce races.
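
To make the difference concrete, here is a minimal toy model contrasting the two behaviors; it is only a sketch and does not use Envoy's Event::TestTimeSystem types (all names below are illustrative):

// Toy model of the two advance behaviors described above; illustrative only,
// not Envoy's Event::TestTimeSystem API.
#include <chrono>
#include <functional>
#include <map>
#include <thread>

using MonotonicTime = std::chrono::steady_clock::time_point;
using Duration = std::chrono::milliseconds;

// "Real" flavor: just sleep; other threads keep running on their own schedule,
// so the caller learns nothing about whether their timers actually fired.
void realAdvanceTimeWait(Duration d) { std::this_thread::sleep_for(d); }

// "Simulated" flavor: bump a fake clock and synchronously run every timer that
// became due, so when this returns the caller knows the callbacks have run.
struct SimulatedClock {
  MonotonicTime now{};
  std::multimap<MonotonicTime, std::function<void()>> timers;

  void advanceTimeWait(Duration d) {
    now += d;
    auto end = timers.upper_bound(now);
    for (auto it = timers.begin(); it != end; ++it) {
      it->second(); // run due callbacks before returning
    }
    timers.erase(timers.begin(), end);
  }
};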

It seems a lot of this ties into #6465 (where a lot of testcases that are a bit more time sensitive are disabled for RealTimeSystem).

So we have a few possible solutions:

@mattklein123
Member Author

I will leave it up to you on how to handle this. If you want to better synchronize with real time you could potentially use https://github.com/envoyproxy/envoy/blob/master/source/common/common/thread_synchronizer.h. Otherwise using simtime seems like a good solution.

If you have a patch that makes it better I would take that now while you work on a better solution since the flake rate is pretty high.
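
For illustration, the kind of explicit handshake being suggested could look roughly like the sketch below; it uses absl::Notification as a stand-in rather than the thread_synchronizer.h API itself, so treat the shape as an assumption:

// Sketch: synchronize a test thread with a worker thread explicitly instead of
// sleeping "long enough" in real time. Uses absl::Notification as a stand-in;
// this is not the thread_synchronizer.h API.
#include <thread>
#include "absl/synchronization/notification.h"

void exampleHandshake() {
  absl::Notification watchdog_checked_in;

  std::thread worker([&]() {
    // ... the work the test wants to observe happens here ...
    watchdog_checked_in.Notify(); // signal that the sync point was reached
  });

  // The test thread blocks until the worker reaches the sync point.
  watchdog_checked_in.WaitForNotification();
  worker.join();
}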

@KBaichoo
Contributor

Added a patch that makes it better. Perhaps we can see how much that decreases flakiness in practice and go from there.

@mattklein123
Member Author

New flake after your fix (on ARM release):

[ RUN      ] TimeSystemType/GuardDogDeathTest.MultiKillThresholdDeathTest/1
TestRandomGenerator running with seed -1060420032
-- Test timed out at 2020-08-15 00:44:12 UTC --

@KBaichoo
Contributor

I tried all combinations of clang-tsan and -c opt, running 500x on x86, to see if I could reproduce this timeout issue, but I haven't been able to.

@mattklein123 what else can I do to try to reproduce it, or is it still an issue? I'm building from commit 49aedfd.

@mattklein123
Member Author

If you can't repro I would ignore for now. I can try to repro sometime this week and get back to you.

@mattklein123
Member Author

I'm able to trivially repro lots of failures via bazel test //test/server:guarddog_impl_test -c opt --runs_per_test=5000 --test_output=errors --jobs=200 --local_test_jobs=200 on my 36 core machine. I will take a look tomorrow.

@KBaichoo
Contributor

Running with bazel test //test/server:guarddog_impl_test -c opt --runs_per_test=5000 --test_output=errors I was able to get 3 flakes out of 5000.

Guess I didn't cast a wide enough net the first time?

All of the flakes were timeouts under SimulatedTime and they were also all Death Tests.

Now that I'm able to reproduce the issue I'll be able to debug it. Thanks for the pointers on reproducing it @mattklein123.

I'll give it a look tomorrow.

@KBaichoo
Contributor

KBaichoo commented Aug 19, 2020

After digging around, I think I've hit on a potential reason.

The failures I've encountered have all been under SimulatedTime when I run bazel test //test/server:guarddog_impl_test -c opt --runs_per_test=5000 --test_output=errors, and they're timeouts within tests based on the test fixture GuardDogDeathTest.

This makes things a tad more difficult to poke around in, since these are death tests: when we wrap the die_function() in an EXPECT_DEATH, it essentially forks and execs the test that runs the die function. This, however, means std::cerr, std::cout, etc. are unavailable to us from the test fixture. If you have thoughts / insights on how to dig into that, I'd love to hear :).

I believe the issue relates to some recent changes in simulated time for the following reasons:

  1. All failures were encountered under simulated time.
  2. There were recent changes to simulated time, and most of these death tests have been around far longer (years) than the recent change (days); if they were flaky before, it likely would have been a persistent issue.
  3. When I disable the new feature in simulated time via the runtime flag envoy.reloadable_features.activate_timers_next_event_loop, I get no flakes running with the command above (a possible way to flip the flag in a test is sketched below). It's my understanding that this behavior was introduced in PR #11663 ([dispatcher] Refactor how callbacks are scheduled in the event loop) and PR #11823 (dispatcher: Run zero-delay timeout timers on the next iteration of the event loop), which changed timer->enable(0ms) from executing in the current libevent iteration to executing in the next iteration.
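
One way such a flag can be flipped in a test is sketched below; this assumes the TestScopedRuntime helper and the Runtime::LoaderSingleton::getExisting()->mergeValues() pattern used in Envoy tests around this time, so treat the exact calls as an assumption rather than what was actually run here:

// Sketch only: assumes TestScopedRuntime (test/test_common/test_runtime.h) and
// the runtime mergeValues() pattern in use at the time.
#include "gtest/gtest.h"
#include "test/test_common/test_runtime.h"

namespace Envoy {

TEST(GuardDogFlakeRepro, DisableNewTimerBehavior) {
  TestScopedRuntime scoped_runtime;
  Runtime::LoaderSingleton::getExisting()->mergeValues(
      {{"envoy.reloadable_features.activate_timers_next_event_loop", "false"}});
  // ... the rest of the test now runs with the pre-#11823 timer scheduling ...
}

} // namespace Envoy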

Thoughts @mattklein123 and @antoniovicente?

Thanks

@mattklein123
Member Author

Hmm. I will defer to @antoniovicente on the event loop changes, but it wouldn't surprise me if there have been subtle changes here. Regarding the death test forking, shouldn't writing to std::cerr still work? Doesn't the fork still maintain access to output?

@antoniovicente
Contributor

/assign @antoniovicente

Just so I remember to look; it feels like issues like this are piling up.

@KBaichoo
Contributor

Upon further investigation, I believe EXPECT_DEATH() actually hijacks stderr in order to do regex matching on death, but not stdout.

I've also tried doing ENVOY_MISC_LOG() and flushing the misc logger, but that didn't work (I'm not familiar enough with the logger system to know what's going on there).

But this is good news! It means we can get printouts using std::cout.
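
For instance, a standalone sketch (a hypothetical test, not the guarddog fixture) of the behavior described: the forked child's stderr is consumed for the regex match, while output written to stdout still reaches the test log:

// Hypothetical standalone example: EXPECT_DEATH runs the statement in a forked
// child and captures the child's stderr for the regex match, so debug output
// written to stdout is what still shows up in the test log.
#include <cstdlib>
#include <iostream>
#include "gtest/gtest.h"

TEST(DeathOutputDemo, StdoutSurvivesStderrIsMatched) {
  EXPECT_DEATH(
      {
        std::cout << "debug: about to abort" << std::endl; // visible in the log
        std::cerr << "fatal error";                        // consumed by the matcher
        std::abort();
      },
      "fatal error");
}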

I annotated advanceTimeWaitImpl in simulated time and noticed that the tests that time out get stuck waiting on pending locks.

Attached below are some runs:

Example of non-hanging annotations:

[ RUN      ] TimeSystemType/GuardDogDeathTest.KillDeathTest/1
TestRandomGenerator running with seed -290074928
Death Test started...
Death Test started...
die_function called...
Starting setting up for Death.
AdvanceTimeWaitImpl enter
AdvanceTimeWaitImpl advancing time
AdvanceTimeWaitImpl advanced time, waiting on pending locks
AdvanceTimeWaitImpl exit
Finished setting up for Death.
AdvanceTimeWaitImpl enter
AdvanceTimeWaitImpl advancing time
AdvanceTimeWaitImpl advanced time, waiting on pending locks
[       OK ] TimeSystemType/GuardDogDeathTest.KillDeathTest/1 (160 ms)

Example of hanging:

[ RUN      ] TimeSystemType/GuardDogDeathTest.MultiKillDeathTest/1
TestRandomGenerator running with seed -290074928
Death Test started...
Death Test started...
die_function called
Starting setting up for MultiDeath.
AdvanceTimeWaitImpl enter
AdvanceTimeWaitImpl advancing time
AdvanceTimeWaitImpl advanced time, waiting on pending locks
-- Test timed out at 2020-08-20 15:22:25 UTC --

So the function seems to be blocked on the waitForNoPendingLockHeld() line.
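
To illustrate the shape of the hang, here is a toy version of a "wait for no pending locks" step; it is not Envoy's SimulatedTimeSystem internals, just a sketch of why the wait never returns if the activated timer callbacks are never run:

// Toy sketch (not SimulatedTimeSystem internals): the waiter returns only when
// the pending counter drops to zero, so if the thread that owns the activated
// timers never runs them, the advanceTimeWait-style call blocks forever.
#include <condition_variable>
#include <mutex>

struct PendingLockTracker {
  std::mutex mu;
  std::condition_variable cv;
  int pending{0};

  void add() { // a timer was activated during the time advance
    std::lock_guard<std::mutex> lock(mu);
    ++pending;
  }
  void release() { // the owning thread actually ran the callback
    std::lock_guard<std::mutex> lock(mu);
    --pending;
    cv.notify_all();
  }
  void waitForNoPending() { // the step the timed-out tests never get past
    std::unique_lock<std::mutex> lock(mu);
    cv.wait(lock, [this] { return pending == 0; });
  }
};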

Is this similar to other bugs you have, @antoniovicente?

@mattklein123
Member Author

This is probably our most active flake right now as it reproduces regularly on ARM release, probably due to speed. @KBaichoo @antoniovicente any thoughts on interim steps here? Maybe mark this test as flaky or manual? Disable it on ARM, etc.? Thoughts?

@antoniovicente
Contributor

I think #12609 should fix this issue.

Without the fix: 20 failures in opt mode out of 1k runs.
With the fix: 0 failures in opt mode out of 10k runs.

@mattklein123
Member Author

Hopefully fixed by #12609. 🤞
