ctest -j and kokkos and hwloc #1104
@bathmatt
I don't have a better answer, but this works OK for me: I queue these as batch jobs that configure and build. On Cray, if you don't want to use batch jobs, you need to configure inside an interactive job, because the path to APRUN in the batch environment differs from the one on login. I've requested that they make this path consistent, because the mismatch breaks the ability to configure/compile on login, then create an interactive job and run ctest. I suspect I am the only person experiencing this headache.
Thanks for the suggestion. I haven't gotten to mutrino yet; I'm still working on other platforms :)
I had a conversation with @crtrott about this issue. In a nutshell, the parallel tests are getting bound to the same core, hence the oversubscription and subsequent performance degradation. His strategy is to bind to the socket and then let the OS manage placement and migration within the socket. This can look different across the spectrum of OpenMP, Kokkos, etc. and across different implementations of MPI, so we could talk about the specific configurations you are trying.
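For readers who want something concrete, here is a minimal sketch of what socket-level binding could look like for a plain OpenMP test, expressed as the environment a test launcher would pass to the executable. The helper name `socket_binding_env` is hypothetical, and mapping the strategy above onto the standard OpenMP variables `OMP_PLACES` and `OMP_PROC_BIND` is an assumption, not something confirmed in this thread. Kokkos built with hwloc, or an MPI launcher, may apply its own binding on top of this.

```python
import os


def socket_binding_env(threads_per_test, base_env=None):
    """Return a copy of the environment with OpenMP socket-level binding set.

    Assumption: the test is a plain OpenMP program that honors the standard
    OpenMP 4.0 environment variables. Nothing here is Kokkos- or ctest-specific.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(threads_per_test)
    # Each "place" is a whole socket, so a thread may migrate between the
    # cores of its socket instead of being pinned to a single core.
    env["OMP_PLACES"] = "sockets"
    # Bind threads to places, keeping them near the master thread's place.
    env["OMP_PROC_BIND"] = "close"
    return env


if __name__ == "__main__":
    env = socket_binding_env(4, base_env={})
    for name in sorted(env):
        print(f"{name}={env[name]}")
```

The returned dictionary could be passed as `env=` to `subprocess.run` when launching a test by hand. Whether this is enough on a given platform would still need to be checked against the configurations discussed here.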
@olivier-snl I don't really have a standard configuration. I can look at removing hwloc if you think that would help. Ideally I'd want some procedure that works on the major test beds (mutrino/ellis/shiller/rhel6) and allows me to parallelize my tests. If you have time next week, maybe we can sit down and chat over code? We can do this virtually.
@bathmatt Sure. Let's follow up off-thread to arrange.
@bathmatt @olivier-snl one of the issues here is that Kokkos utilizes HWLOC ahead of OpenMP. There is a plan to back off on this by checking the environment for |
@nmhamster Yes, I was aware of something along those lines. I think it addresses part of the problem, which is Kokkos/OpenMP. The other parts of the problem seem to be ctest itself, and MPI when used.
@olivier-snl - @bartlettroscoe mentioned that we might be able to use the KitWare contract to ask them to look into this a little bit. What we are really asking is for them to be scheduler aware. One method this could work is for them to check the environment for |
@nmhamster That would be extremely helpful. In my discussions with @crtrott, he indicated that he is seeing ctest launch multiple simultaneous test executions on the same core(s). My reading of some of the ctest docs is that some test users actually want this oversubscription, presumably because they are testing for correctness but not for performance. Oversubscription is not a good fit for us, of course.
@olivier-snl - I think what is happening is that |
@nmhamster Yes, it seems that either the same Trilinos-configured MPI binding, or a default binding chosen by the MPI implementation, is being replicated across the tests, so the same cores end up oversubscribed.
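To illustrate the alternative to replicating one binding across every test, here is a small sketch that splits a node's cores into disjoint slots, one per concurrently running test. The function name `disjoint_core_slots` is hypothetical, and the sketch assumes cores are numbered contiguously from 0 and ignores sockets and hyperthreads. It is not how ctest or any MPI launcher works today.

```python
def disjoint_core_slots(total_cores, cores_per_test):
    """Split cores 0..total_cores-1 into non-overlapping groups.

    Each group is the core set one test may use. The number of groups is
    also the largest number of tests that can run at once without sharing
    a core. Leftover cores that do not fill a whole group stay unused.
    """
    if total_cores <= 0 or cores_per_test <= 0:
        raise ValueError("core counts must be positive")
    n_slots = total_cores // cores_per_test
    return [
        list(range(i * cores_per_test, (i + 1) * cores_per_test))
        for i in range(n_slots)
    ]


if __name__ == "__main__":
    # Example: a 16-core node, each test using 4 cores (e.g. 4 OpenMP threads).
    slots = disjoint_core_slots(16, 4)
    print(len(slots))  # 4
    print(slots[1])    # [4, 5, 6, 7]
    # On Linux, a launcher could apply a slot to a test process with
    # os.sched_setaffinity(pid, slots[i]) before the test starts its threads.
```

In this example, four tests could run side by side on separate cores, which suggests a parallel level of 4 for such a node rather than falling back to `-j1`.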
Yes, this would fall under the current Kitware contract, which supports an SNL project that we are not allowed to name here but that cares a lot about this stuff. Can we set up a short meeting to discuss this so that I understand what is really needed from CTest and how our CMake projects (e.g. using TriBITS) will be able to hook into that (hopefully seamlessly)? Who needs to attend this meeting? Once I understand what is needed from CTest, I can bring this up at a future Kitware meeting and get something put on the backlog for them to work on. But this will require upgrading the version of CMake/CTest being used on all platforms where HWLOC is used (and conditional logic will need to be added to TriBITS to check whether the CTest feature is there or not). Is everyone ready for that? @bathmatt, is your team ready to upgrade CMake/CTest to take advantage of this? In the past, you expressed some trepidation about upgrading CMake/CTest on various machines (for example, to take better advantage of Ninja).
@bartlettroscoe On the SNL side, I'd suggest inviting @crtrott @nmhamster @bathmatt @olivier-snl, though not all of them may be necessary.
We had a meeting with Kitware staff and they will add support to ctest to better handle pinning tests to cores to not overlap tests on the same cores. This will be tracked in: |
@nmhamster @rppawlo
Is there any procedure for testing in parallel on the various systems?
kokkos/kokkos#630 points out an issue with OpenMP and thread binding.
I'm looking for a recipe: what do I configure with, and how do I test on the different platforms? Particularly for openmpi/RHEL, ellis, ride, and shiller (CPU and GPU).
What are people using? Are you just reverting to -j1?