New Kokkos, KokkosKernels, and Panzer test failures on CUDA 8.0 and CUDA 9.0 builds after Kokkos and KokkosKernels update #2827
Comments
@trilinos/kokkos The failing test
for example, as shown at: This test fails this way for the builds:
on 'hansen', 'white', and 'ride'. This sounds a lot like the prior failure reported and addressed in #2471. Does this one unit test just need to be disabled on these machines as well? |
@trilinos/panzer The failing test, which shows the output:
The same failure is shown for the following Panzer tests in the 'opt' builds:
Is this an error in Kokkos or an error in the way that Panzer is using Kokkos? |
@trilinos/kokkos, @trilinos/kokkos-kernels, @trilinos/panzer, Should this Kokkos and KokkosKernels update be backed out, or can it be fixed pretty quickly? |
@bartlettroscoe I'm rebuilding a cuda8 debug build with your config instructions. Cuda debug builds are especially slow, so I may not be able to dig too far into this until tomorrow; I'm not sure from the cdash output exactly what is causing the error in the Panzer examples. As for the Kokkos and KokkosKernels tests, they should be disabled for the debug builds. The nested parallelism test runs into problems with GPU resources (I don't recall the specifics; I'll have to dig back through the issues where this was discussed for a reminder and reference), and the serial spgemm test is just really slow and even worse in debug mode. I'm not sure how the running of these tests is wired into the testing harness or scripts, but for Kokkos adding |
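For reference, the Kokkos and KokkosKernels unit tests are GoogleTest based, so individual test cases can be excluded at run time with a --gtest_filter argument; a minimal sketch (the executable and test-case names below are only illustrative, not the exact disables being proposed):

```bash
# Run the KokkosKernels serial sparse unit tests but skip any spgemm cases.
# Executable and test-case names are illustrative only; a leading '-' in the
# filter means "exclude tests matching this pattern".
./KokkosKernels_sparse_serial.exe --gtest_filter='-*spgemm*'
```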
@ndellingwood, I will see about adding those disables to the KokkosCore and KokkosContainers tests. The other issue is the failing Panzer tests showing:
Who should debug those? |
@bartlettroscoe I've been able to gather some info on one of the Panzer failures by enabling Panzer's examples in a cuda build I had on White (a release build). I have a separate debug build going based on the config info you provided. I'm not able to use xterm on White to run cuda-gdb for the 4-MPI-proc tests, which makes pinning down the issue, and thus determining who gets to help fix it, a bit of work. Running this failing test
Looking at the debug stack trace and adding a few print statements in the Intrepid2 file, here is the potential issue: @crtrott, could the recent ETI changes have resulted in different behavior than before? I started going through the changes but haven't made it far enough through to reconcile with the earlier code. In any case, it's not clear to me why this would be choking here. I probably won't be able to do much more tonight; I can pick up tomorrow. |
@ndellingwood thanks for looking into this. Let me and Kyungjoo know if you need help with that. |
Thanks @mperego ! The issue is manifesting in Intrepid2. I did not have much time to look further into this, but here are a couple additional pieces of info I've gathered for reference: I added some print statements to the |
Why did this stuff not trigger in our integration builds? Do we have any idea? |
Panzer examples are not enabled in the integration builds. |
We can add this atdm cuda-dbg build as part of integration testing from here on. |
Actually, I recommend we replace our current CUDA integration configuration with one or more of these ATDM configurations. |
Since shepard will be taken out of service soon we should also use the ATDM configurations for those builds as well. |
What is interesting is that the CUDA 9.0 build on 'hansen' does not show any failing tests for Kokkos or KokkosKernels, as shown at: However, we don't have results for the Panzer tests because the Jenkins builds timed out before they could run. (I am addressing that in #2832.) Are Kokkos and KokkosKernels tested with CUDA 8.0? |
Kokkos and KokkosKernels are both tested with Cuda 8.0 and 9.0 as individual packages, but only 8.0 was used during the Trilinos Integration testing. |
One cause of timeouts is running CUDA tests in parallel with CTest: multiple tests can compete for GPU resources and run out of memory, even though the same tests would not run out of memory when run one at a time. UVM will exacerbate this. |
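If the goal is just to keep CUDA tests from running concurrently, CTest already has test properties that can express this; a minimal CMake sketch, where MyCudaTest is a hypothetical test name:

```cmake
# Hypothetical CUDA test name; RUN_SERIAL keeps CTest from running it
# concurrently with any other test, even under "ctest -j <N>".
set_tests_properties(MyCudaTest PROPERTIES RUN_SERIAL TRUE)

# Alternatively, declare a larger nominal processor count so that fewer tests
# share the node (and its GPUs) at once under a fixed "ctest -j" budget.
set_tests_properties(MyCudaTest PROPERTIES PROCESSORS 4)
```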
Ok another issue is that some of those DEBUG failures only happen on Kepler, while our debug builds of Kokkos proper happen on Pascal or Volta. I guess we need more testing ..... |
I think @rppawlo confirmed that that is happening. But these tests passed in the debug build before this update. I will try to disable the specific unit tests as recommended by @ndellingwood above. But since these pass with CUDA 9.0 I will try to disable them only for the CUDA 8.0 builds. |
I think it would be more appropriate not to use CTest parallelism for CUDA builds, until the parallel test system is sophisticated enough to associate each test with a different GPU. This will entirely prevent non-deterministic failures due to resource exhaustion. |
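Until CTest can assign GPUs itself, one workaround is a small wrapper script that pins each test invocation to a single device through CUDA_VISIBLE_DEVICES; a rough sketch (the script name and the idea of passing the GPU id explicitly are assumptions, not part of any existing harness):

```bash
#!/bin/bash
# run-on-gpu.sh: pin the wrapped test to a single GPU so that tests launched
# in parallel do not exhaust memory on the same device.
# Usage: run-on-gpu.sh <gpu-id> <test-executable> [args...]
gpu_id="$1"
shift
export CUDA_VISIBLE_DEVICES="${gpu_id}"
exec "$@"
```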
We could experiment to see, but I think the increase in wall-clock time would be pretty substantial. Also, not every Trilinos test uses CUDA. |
@bartlettroscoe CUDA builds are special; I'm OK if tests take longer. I want deterministic tests :D |
I've been following this conversation with some interest. I believe the solution of turning off certain tests in select builds is fine (I would prefer smaller tests, but...). I feel that running with "-O0 -g" is important given the design choices being made in Kokkos, Tpetra, and up. The heavy use of templates and inlining makes things hard to understand; when nothing is optimized out, being able to walk through a complete stack is useful (also, some people use C-style asserts...). |
I'm with @eric-c-cyr -- I think it could help to have at least one Dashboard build with "-O0 -g". |
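A minimal sketch of the kind of configure options such a dashboard build might use; these are generic Trilinos/CMake flags, not the actual ATDM cuda-debug configuration:

```bash
# Generic flags only; the real ATDM debug configurations set much more than this.
cmake \
  -D CMAKE_BUILD_TYPE=DEBUG \
  -D CMAKE_CXX_FLAGS="-O0 -g" \
  -D Trilinos_ENABLE_DEBUG=ON \
  -D Trilinos_ENABLE_TESTS=ON \
  $TRILINOS_DIR
```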
Ok, we will take a look at the sparse tests in Kokkos Kernels. |
…os-kernels-unit-tests Selective disables of a few individual unit tests in Kokkos and KokkosKernels for the cuda-debug build on 'white' and 'ride'. Should address the remaining failures for #2827.
FYI: PR #2927 was just merged. That should address the last of the Kokkos and KokkosKernels failures/timeouts. We should get confirmation for the ATDM Trilinos builds run on 6/13/2018 and then we can close this issue. |
There were some more timed-out Kokkos and KokkosKernels tests shown here that did not time out the last two days after my commit e6a1b58, which was merged in PR #2927 on 6/12/2018. I went back and did a more detailed analysis of the Kokkos and KokkosKernels test suites for debug builds over the last week, looking for timeouts or near-timeouts. From that detailed analysis of CDash data (which took me about 3 hours to complete), I would like to suggest we disable the following additional unit tests:
What is important to note is that this would only disable these individual unit tests, not the full test executables. I will create a PR with these very targeted individual unit test disables. DETAILED NOTES: Starting today 6/15/2018, we are seeing some new Kokkos and KokkosKernels test timeouts in
As I write this, full test results are not in for all of the Trilinos ATDM builds, but we are also seeing the test
These same tests were timing out there as well. It looks like we are going to need to disable these more expensive individual unit tests in other 'debug' builds too. But first, let's look at the history for these tests over the last few days:
I don't think any changes to Trilinos would have impacted these builds on 'white' or 'ride'. What seems to have caused these timeouts to show up today but not the last couple of days is that we did not get test results on 'white' or 'ride' due to the 'bsub' command crashing (like it does about half of the time and has for the last 4 months) and the fact that 'ride' was offline for a while. As for the builds on 'hansen', what changed today is that these builds are now properly running on the 'hansen' compute nodes 'hansen02'-'hansen04'. Now to do a more thorough search for which KokkosKernels tests might be in trouble over the last week, to try to make sure that I identify which tests in which builds are timing out (or are very close to timing out) and need to have individual unit tests disabled. This query shows all of the KokkosKernels tests for all of the 'debug' builds between 6/9/2018 and 6/15/2018. If you sort by test name and then "Proc Time", you see the following:
Now let's examine the most expensive Kokkos tests over this time period and look for trouble in this query:
So with that analysis complete, I think that we should add the following additional individual unit test disables:
|
I do wish we would get rid of the |
@mhoemmen, Yes! Just one global ordinal type that is guaranteed to be 64-bit on a 64-bit machine and will be a signed type, so you don't need to worry about strange behavior from computing < 0. How do we make that happen? |
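A small sketch of what such a single ordinal type could look like; this is purely illustrative and not an existing Trilinos or Tpetra typedef:

```cpp
#include <cstdint>
#include <type_traits>

// One global ordinal type: 64-bit and signed, so expressions like (i - 1 < 0)
// behave as expected instead of wrapping around as they would for unsigned types.
using global_ordinal_type = std::int64_t;

static_assert(std::is_signed<global_ordinal_type>::value,
              "global ordinal must be signed");
static_assert(sizeof(global_ordinal_type) == 8,
              "global ordinal must be 64 bits");
```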
One word on the test run times: quite a few of the longer-running tests run that long because we have to catch non-deterministic error sources (e.g. race conditions). There is no 100% reliable way of doing this, but in most cases the likelihood of catching one is some kind of asymptotic thing (e.g. doubling the runtime catches the next 50% of errors). -O0 and debug both kill all the inlining and add bounds checking, so every data access gets exorbitantly expensive. So I think the only way of handling this is turning off tests. |
@crtrott, catching race conditions and other non-deterministic behavior is not the only class of defects that can exist. There are also off-by-one errors, incorrect memory deallocation, and other invalid usage that can be caught with debug-mode runtime checking but may not be caught with a fully optimized build (with debug checking disabled). Therefore, it would be good if we could run the entire test suite in a full debug-mode build as well to catch those types of errors, but perhaps with smaller arrays and fewer iterations (i.e. not trying to catch non-deterministic failures, just trying to catch these other types of failures). Could these unit test executables be given some type of command-line argument that could be used to reduce the size of arrays or reduce the number of iterations? That way, a full debug-mode build could run them at reduced cost. Could that be supported? There are just a few problematic tests that would need to be addressed. I think we are losing testing by just disabling tests, but at the same time, we can't have individual tests that take 20+ minutes to run. |
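A rough sketch of the kind of hook being proposed here, assuming a GoogleTest-style unit test and a hypothetical environment variable (none of these names exist in Kokkos or KokkosKernels):

```cpp
#include <algorithm>
#include <cstdlib>
#include <gtest/gtest.h>

// Hypothetical hook: a debug-mode dashboard could export
// UNIT_TEST_SIZE_SCALE=10 to shrink problem sizes by 10x.
static int size_scale() {
  const char* s = std::getenv("UNIT_TEST_SIZE_SCALE");
  return (s != nullptr) ? std::max(1, std::atoi(s)) : 1;
}

// Illustrative test only (not an existing Kokkos/KokkosKernels test name).
TEST(sparse, spgemm_scaled) {
  const int nrows   = 10000 / size_scale();   // smaller matrices in debug runs
  const int nrepeat = 20 / size_scale() + 1;  // fewer repetitions in debug runs
  // ... build matrices with nrows rows and run the spgemm check nrepeat times ...
  (void)nrows;
  (void)nrepeat;
}
```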
…some debug builds (trilinos#2827) These very targeted disables should allow these tests to all complete in well under 10 minutes in all of these debug builds on all of these platforms. See the diffs to see exactly what unit tests are disabled in what unit test executables in what builds on what platforms. For details on why these are being disabled, see trilinos#2827.
@bartlettroscoe wrote:
Tpetra has explicitly declared its intention to deprecate and remove support for all of those. The smarter thing for kokkos-kernels to do would be to instantiate only for the type that Tpetra uses. Currently, this is the default offset type, but in the future, we plan to change Tpetra's offset type to |
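Assuming KokkosKernels' explicit template instantiation of offset types is controlled by CMake options, limiting the instantiations to the one type Tpetra uses would come down to the configure line; a sketch under that assumption (the option names vary across KokkosKernels versions and are shown here only as illustration):

```bash
# Illustrative only: restrict KokkosKernels ETI to the one offset type Tpetra
# uses; confirm the option names for the KokkosKernels version in Trilinos.
cmake \
  -D KokkosKernels_INST_OFFSET_INT=OFF \
  -D KokkosKernels_INST_OFFSET_SIZE_T=ON \
  <other configure options> \
  <path-to-source>
```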
@crtrott, @srajama1, and @ibaned, what about the idea that I floated above to limit the matrix and array sizes and the number of iterations for these tests in debug builds? I could perhaps prototype what I am talking about in a PR for a Kokkos and KokkosKernels unit test to show you what I mean. Does that sound reasonable? |
There were no failing or timing-out Kokkos or KokkosKernels tests for the past two days, as shown in this query. And we can see, for example, that the previously failing test now passes. This issue is resolved. Closing as complete. |
FYI: it looks like the selective disable of some of the KokkosKernels unit tests in PR #2964, merged to 'develop' on 6/19/2018, did not eliminate all of the timeouts of these tests, as shown in this query, which includes the timeout:
But since this is just one timeout, we should leave this for now and see if it is a recurring problem. Again, random failures like that across all of the various packages and builds will add up and cause the automated processes that update Trilinos versions on 'master' or the ATDM APP Trilinos mirror repos to fail more frequently. Therefore, we have to be on top of every randomly failing test in every package in every ATDM build. |
same as what was done in trilinos#2964
CC: @trilinos/kokkos, @trilinos/kokkos-kernels, @trilinos/panzer, @ndellingwood
Next Action Status
Kokkos, KokkosKernels, and Panzer failing and timing-out tests have been fixed by PRs #2863, #2874, #2927, and #2964 . No Panzer, Kokkos or KokkosKernels failures observed 6/19 or 6/20/2018.
Description
The Kokkos and KokkosKernels updates in the recent commits 51cb7c5 and 816e703:
seem to have triggered several new test failures and timeouts in the Kokkos, KokkosKernels, and Panzer packages, as shown in:
The new failing and timing-out tests are:
which failed in one or more of the unique builds:
These are all basically CUDA 8.0 builds.
These commits were shown as pulled in on this testing day at:
Steps to Reproduce
The most failures are produced in the Trilinos-atdm-white-ride-cuda-debug build on 'white' and 'ride', so that is likely the best bet to use to reproduce these failures. Therefore, as described in: after logging into 'white' or 'ride', cloning the Trilinos Git repo (pointed to by TRILINOS_DIR), and getting on the 'develop' branch, one would do:
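The exact commands were not captured above, but a rough sketch of the usual ATDM configure/build/test flow on one of these machines would look something like the following; the build name, options, and bsub arguments are assumptions based on the ATDM documentation rather than the stripped instructions:

```bash
# Assumed ATDM workflow (build name, core counts, and bsub arguments are
# illustrative; see Trilinos/cmake/std/atdm/README.md for the exact steps).
cd <some-build-directory>
source $TRILINOS_DIR/cmake/std/atdm/load-env.sh cuda-debug
cmake \
  -D Trilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -D Trilinos_ENABLE_TESTS=ON \
  -D Trilinos_ENABLE_Panzer=ON \
  $TRILINOS_DIR
make -j16
# On 'white' and 'ride' the tests must run on a compute node, e.g. via bsub:
bsub -x -Is -n 16 ctest -j8
```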