Commit
Merge pull request #2292 from eisenhauer/DataManReliable
Turn off DataMan.Reliable test as unreliable
248454e
@eisenhauer @chuckatkins
Can we bring back the serial test? I guess it's the MPI stuff that is not stable. The serial test should work, because we have been using the serial reliable mode in our KSTAR workflow intensively and we haven't had this issue.
@JasonRuonanWang, I'm not overly familiar with the test setup here, but from the comments in gtest_add_tests_helper, the MPI_NOEXEC flag here says: "MPI_NOEXEC - build MPI test, execute without MPI; also build Serial test"
I.e., it wasn't being run with MPI when we were seeing failures; it was only being built with MPI linked in.
Speaking from personal experience with SST, I've seen a lot of timing failures and race conditions in CI that I've never seen in "real life". CI is good and bad that way. It's incredibly useful because the variably-loaded machines expose your code to hundreds of runs in different timing situations, which surfaces bugs. But it's also annoying because those are really hard bugs to find and fix.
I think the impetus to turn off this test was failures that showed up when we turned off ADIOS_TEST_REPEAT. There's some wonkiness in CI: in the win2016-vs2017-msmpi-ninja build, MPI runs randomly fail MPI_Init() some small percentage of the time, and in el7-gnu8-openmpi-ohpc a small percentage of MPI jobs simply don't start. Both result in test failures that are due to problems in the platform rather than the code. ADIOS_TEST_REPEAT tells ctest to rerun failed tests up to N times (usually N=5) to see if the failure is reliable. That's great for hiding these random platform failures and lets us keep those platforms in CI despite their issues. But it's also very effective at hiding timing bugs and race conditions in our code. If an engine has a bug that makes it deadlock 10% of the time in CI, and you only report a failure if it fails 5 times in a row, it's going to look like you're 99.999% reliable instead of 90%.
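The 99.999% figure follows from the retry-until-pass semantics: a test is only reported as failed if it fails every one of its N attempts. A quick sketch of that arithmetic (using the illustrative 10% failure rate and N=5 from above, and assuming runs fail independently):

```python
# Apparent reliability of a flaky test under retry-until-pass reporting:
# the test is reported failed only if all N attempts fail in a row.
def apparent_reliability(true_failure_rate: float, retries: int) -> float:
    """Probability the test is reported as passing, assuming independent runs."""
    return 1.0 - true_failure_rate ** retries

# A test that deadlocks 10% of the time, retried up to 5 times:
print(f"{apparent_reliability(0.10, 5):.3%}")  # prints "99.999%"
```

Of course, the independence assumption may not hold on a loaded CI machine, where one slow run makes the next one more likely to time out too; in that case the masking effect is weaker, but the qualitative point stands.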
My take: the test failures we saw probably mean that there's a timing bug somewhere in DataManReliable. If you want to find it, you can do what I'm doing with PR #2299 (and before that #2252, etc.) and run CI with ADIOS_TEST_REPEAT off to try to diagnose it. But we can also just add the test back to CI, and ADIOS_TEST_REPEAT will effectively hide the random failures...