Skip to content

Commit

Permalink
Merge pull request #2292 from eisenhauer/DataManReliable
Browse files Browse the repository at this point in the history
Turn off DataMan.Reliable test as unreliable
  • Loading branch information
eisenhauer authored and Chuck Atkins committed May 29, 2020
2 parents 42e44ff + be7e261 commit 248454e
Showing 1 changed file with 2 additions and 1 deletion.
3 changes: 2 additions & 1 deletion testing/adios2/engine/dataman/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -11,5 +11,6 @@ if(ADIOS2_HAVE_MPI)
gtest_add_tests_helper(WriterSingleBuffer MPI_NOEXEC DataMan Engine.DataMan. "")
gtest_add_tests_helper(ReaderDoubleBuffer MPI_NOEXEC DataMan Engine.DataMan. "")
gtest_add_tests_helper(ReaderSingleBuffer MPI_NOEXEC DataMan Engine.DataMan. "")
gtest_add_tests_helper(Reliable MPI_NOEXEC DataMan Engine.DataMan. "")
# Turning off DataMan.Reliable test as unstable - May 27, 2020
# gtest_add_tests_helper(Reliable MPI_NOEXEC DataMan Engine.DataMan. "")
endif()

2 comments on commit 248454e

@JasonRuonanWang
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eisenhauer @chuckatkins
Can we bring back the serial test? I guess it's the MPI stuff that is not stable. The serial test should work, because we have been using the serial reliable mode in our KSTAR workflow intensively and we haven't had this issue.

@eisenhauer
Copy link
Member Author

@eisenhauer eisenhauer commented on 248454e Jun 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JasonRuonanWang, I'm not overly familiar with the test setup here, but from the comments in gtest_add_tests_helper, the MPI_NOEXEC flag here says: "MPI_NOEXEC - build MPI test, execute without MPI; also build Serial test"
I.E. it wasn't being run with MPI when we were seeing failures, but only being built with MPI linked in.

Speaking from personal experience with SST, I've see a lot of timing failures and race conditions in CI that I've never seen in "real life". CI is good and bad that way. It's incredibly useful because the variably-loaded machines expose your code to hundreds of runs in different timing situations, which exposes bugs. But it's also annoying because those are really hard bugs to find and fix.

I think the impetus to turn off this test were failures that showed up when we turn off ADIOS_TEST_REPEAT. There's some wonkiness in CI: In the win2016-vs2017-msmpi-ninja build, MPI runs randomly fail MPI_Init() some small percentage of the time, and in el7-gnu8-openmpi-ohpc some small percentages of MPI jobs simply don't start. Both result in test failures that are due to problems in the platform rather than the code. ADIOS_TEST_REPEAT tells ctest to rerun failed tests up to N times (usually N=5) to see if the failure is reliable. That's great for hiding these random platform failures and lets us include them in CI despite their issues. But it's also very effective at hiding timing bugs and race conditions in our code. If an engine has a bug that made it deadlock 10% of the time in CI, and you only report it if it fails 5 times in a row, it's going to look like you're 99.999% reliable, instead of 90%.

My take: the test failures we saw probably mean that there's a timing bug somewhere in DataManReliable. If you want to find it, you can do like I'm doing with PR #2299 (and before that #2252, etc.) and do CI runs with ADIOS_TEST_REPEAT off to try to diagnose it. But, we can also just add the test back to CI and the ADIOS_TEST_REPEAT will effectively hide the random failures...

Please sign in to comment.