alpaka-related updates #41340

fwyzard · 2023-04-14T00:46:01Z

I'm collecting some "to do" items to improve the alpaka-related developments:

provide a simple way to gracefully skip tests when a given back-end does not have any devices (e.g. no AMD gpus while testing the ROCm back-end); something similar to cms::cudatest::requireDevices(); possibly also some explicit support for Catch2 tests.
remove from the release support for the "old" GPU framework: DataFormats/Portable/interface/Product.h, HeterogeneousCore/AlpakaCore/interface/ScopedContext.h, etc.

The text was updated successfully, but these errors were encountered:

fwyzard · 2023-04-14T00:46:11Z

assign heterogeneous

cmsbuild · 2023-04-14T00:46:22Z

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild · 2023-04-14T00:46:24Z

A new Issue was created by @fwyzard Andrea Bocci.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

fwyzard · 2023-04-14T01:05:00Z

#41341 should address the second bullet, at least in part.
To be checked if there are any other leftovers.

makortel · 2023-04-14T01:30:23Z

On the first point I wonder if we could (at least eventually) utilize scram itself to avoid running the tests in the first place that would fail because of missing hardware? @smuzaffar

fwyzard · 2023-04-14T07:39:43Z

Interesting idea.

For a CUDA-only test we could do something like this:

<bin name="cudaTest" file="cudaTest.cu">
  <use name="cuda"/>
  <flags TEST_RUNNER_CMD="cudaIsEnabled &amp;&amp; ./cudaTest || echo 'Failed to initialise the CUDA runtime, the test will be skipped.'"/>
</bin>

For an Alpaka-based test there are two problems:

the name of the test binary is different for each back-end
the name of the check could be different for each back-end

The first problem could be solved if SCRAM could provide an environment variable that expands to the name of the binary (or to its full path) that can be used inside TEST_RUNNER_CMD, something like

  <flags TEST_RUNNER_CMD="cudaIsEnabled &amp;&amp; ./$TEST_BINARY_NAME || echo 'Failed to initialise the CUDA runtime, the test $TEST_BINARY_NAME will be skipped.'"/>

The second problem could be solved is SCRAM could provide an environment variable with the name of the Alpaka backend being tested, like SerialSync or ROCmAsync, something like

  <flags TEST_RUNNER_CMD="alpakaIsEnabled$TEST_ALPAKA_BACKEND &amp;&amp; ./$TEST_BINARY_NAME || echo 'Failed to initialise the $TEST_ALPAKA_BACKEND backend, the test $TEST_BINARY_NAME will be skipped.'"/>

Of course, it would be even nicer if SCRAM could automate this with a single flag, something like

  <flags TEST_REQUIRE_ALPAKA_BACKEND/>

and report these tests as Skip in the console, instead of Pass/Fail.

fwyzard · 2023-04-14T07:40:35Z

And to clarify: alpakaIsEnabled$TEST_ALPAKA_BACKEND does not exist yet, but it's relatively trivial to write; I'll see if I can make a PR for it.

smuzaffar · 2023-04-14T11:37:49Z

@fwyzard, I was also thinking along the lines of using cudaIsEnabled and rocmIsEnabled or better alpakaIsEnabled$TEST_ALPAKA_BACKEND or alpakaIsEnabled rocm|cuda|serial|tbb|.... We can change build rules so that unit test is run only if it explicitly depend on

rocm and rocmIsEnabled or alpakaIsEnabledROCmAsync runs successfully
cuda and cudaIsEnabled or alpakaIsEnabledCudaAsync runs successfully
alpaka and alpakaIsEnabled$TEST_ALPAKA_BACKEND (for selected backend) runs successfully.
We should also have unit tests to actually make sure that alpakaIsEnabled$TEST_ALPAKA_BACKEND actually runs if hardware is available e.g. unit tests like

# For GPU IBs or if explicitly enabled cuda tests via env i.e. SCRAM_ENABLE_TESTS=cuda
<iftest name="cuda">
  <test name="alpakaIsEnabledCudaAsync" command="alpakaIsEnabledCudaAsync"/>
</iftest>
# For ROMc Ibs or if explicitly enabled rocm tests via enve.g. SCRAM_ENABLE_TESTS=rocm
<iftest name="rocm">
  <test name="alpakaIsEnabledROCmAsync" command="alpakaIsEnabledROCmAsync"/>
</iftest>

SCRAM should run alpakaIsEnabled$TEST_ALPAKA_BACKEND for each supported backend once ( before the start of tests) and reuse the results.

@aandvalenzuela is interested in improving the GPU/Alpaka tests. if the above sounds good then she can already look in to implement it.

fwyzard · 2023-04-14T11:57:38Z

Looks good to me.

I would suggest alpakaIsEnabled$TEST_ALPAKA_BACKEND rather than alpakaIsEnabled rocm|cuda|serial|tbb|..., because it is much simpler to make the back-end choice at compile time (and so end up with a different binary for each back-end) than at runtime (to support the choice based on the command line argument).

We have actually done the latter for pixeltrack-standalone, but not in CMSSW where we are relying on the plug-in system.

smuzaffar · 2023-04-18T16:45:28Z

@makortel @fwyzard , cms-sw/cmssw-config#95 implements the new unit test rules for cuda/rocm/alpaka backends. See cms-sw/cmssw-config#95 (comment) and let me know if this is sufficient?

fwyzard · 2023-04-19T10:56:43Z

@smuzaffar from the description it looks good.

Since the change impacts both "cuda" and "alpaka-cuda" tests, maybe it would actually be better to check cudaIsEnabled (and rocmIsEnabled for ROCm) instead of introducing an alpaka-specific check ?

makortel · 2023-04-19T13:21:58Z

@smuzaffar From the descriptions it looks good to me too. My only question is about tests that run cmsRun (e.g. directly or via a shell script). From cms-sw/cmssw-config#95 I understand the behavior is driven by the dependencies of the test. Is the <iftest name="cuda"> (from #41340 (comment)) the way to control the execution of those tests?

smuzaffar · 2023-04-19T13:46:27Z

@makortel , <iftest name="cuda"> is not needed any more. Execution of the test is driven by the test dependency itself. All tests with direct cuda deps (i.e. they have either <use name="cuda"> or alpaka-cuda backend enabled) will be skipped if cudaIsEnabled does not run successfully. Same for rocm based tests.

For dedicated GPU IBs, we can set USER_UNIT_TESTS=cuda which will force run only unit tests with cuda or alpaka-cuda dependency (all other tests will be skipped). This will also allow us to run GPU unit tests for PRs :-)

makortel · 2023-04-19T14:17:28Z

@smuzaffar Thanks, but I'm still confused :) Let's take a concrete example

cmssw/HeterogeneousCore/AlpakaTest/test/BuildFile.xml

Lines 6 to 10 in 5b979f2

    
           <ifrelease name="_GPU_"> 
        
             <test name="testHeterogeneousCoreAlpakaTestModules" command="testAlpakaModules.sh cuda"/> 
        
           <else/> 
        
             <test name="testHeterogeneousCoreAlpakaTestModules" command="testAlpakaModules.sh cpu"/> 
        
           </ifrelease>

Here the goal is to run the test script in one way when cuda tests should be run (i.e. either the machine has a GPU or always in the GPU IBs), and in other way when there are no GPUs. The script internally calls cudaIsEnabled for the cpu option to enable the CUDA code path outside of GPU IBs if the machine has a GPU.

Should this be expressed along

<test name="testHeterogeneousCoreAlpakaTestModulesCUDA" command="testAlpakaModules.sh cuda">
  <use name="cuda"/>
</test>
<test name="testHeterogeneousCoreAlpakaTestModulesCPU" command="testAlpakaModules.sh cpu"/>

then? (and probably making the cpu option to not do anything if the machine has a GPU)

smuzaffar · 2023-04-19T14:44:14Z

yes @makortel that fregment of BuildFile should be change as you suggested.

With explicit USER_UNIT_TESTS=cuda (e.g. for GPU IBs), scram will only run testHeterogeneousCoreAlpakaTestModulesCUDA and skip testHeterogeneousCoreAlpakaTestModulesCPU (as this does not depend on cuda)
For non-GPU IBs (i.e. no explicit USER_UNIT_TESTS), scram will run testHeterogeneousCoreAlpakaTestModulesCUDA if cudaIsEnabled is successful e.g. on our ppc64le nodes (where we have GPU). scram will always run testHeterogeneousCoreAlpakaTestModulesCPU .

testAlpakaModules.sh script can be updated to not check for cudaIsEnabled and always do what is passed via command-line.

fwyzard · 2023-04-19T14:45:47Z

testAlpakaModules.sh script can be updated to not check for cudaIsEnabled and always do what is passed via command-line.

... and adding a ROCm version.

Profit from cms-sw#41340

smuzaffar · 2023-04-24T10:48:04Z

By the way, new build rules, to make use of USER_UNIT_TESTS=cuda to run cuda specific tests , are in 13.0.X and 13.1.X IBs.

Profit from cms-sw#41340

makortel · 2024-02-14T22:18:50Z

@fwyzard This issue has been completed, right?

fwyzard · 2024-02-14T23:38:36Z

I don't remember what it was about... I'll have to re-read the thread and remind me of the details.

fwyzard · 2024-02-19T13:44:10Z

Looking at the tests, I think we should still introduce a way to make them fail gracefully instead of crashing, when no GPU is available.

For example:

$ ./deviceVertexFinderByDensity_tROCmAsync

 *** Break *** segmentation violation



===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
   0x7f06884fb731 <waitpid+33>: ja     0x7f06884fb788 <waitpid+120>
   0x7f06884fb733 <waitpid+35>: ret    
   0x7f06884fb734 <waitpid+36>: nop    DWORD PTR [rax+0x0]
   0x7f06884fb738 <waitpid+40>: push   r12
   0x7f06884fb73a <waitpid+42>: mov    r12d,edx
   0x7f06884fb73d <waitpid+45>: push   rbp
   0x7f06884fb73e <waitpid+46>: mov    rbp,rsi
#0  0x00007f06884fb72b in waitpid () from /lib64/libc.so.6
#1  0x00007f068845ccf7 in do_system () from /lib64/libc.so.6
#2  0x00007f068fd1881d in TUnixSystem::StackTrace() () from /data/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/external/el8_amd64_gcc12/lib/libCore.so
#3  0x00007f068fd181d4 in TUnixSystem::DispatchSignals(ESignals) () from /data/cmssw/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/external/el8_amd64_gcc12/lib/libCore.so
#4  <signal handler called>
#5  std::__shared_ptr<alpaka::detail::QueueRegistry<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRtImpl<alpaka::ApiHipRt> >, (__gnu_cxx::_Lock_policy)2>::__shared_ptr (this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0-el8_amd64_gcc12/build/CMSSW_14_0_0-build/el8_amd64_gcc12/external/alpaka/1.1.0-0e0b978d445f7af747cf00064c146356/include/alpaka/dev/DevUniformCudaHipRt.hpp:56
#6  std::shared_ptr<alpaka::detail::QueueRegistry<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRtImpl<alpaka::ApiHipRt> > >::shared_ptr (this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0-el8_amd64_gcc12/build/CMSSW_14_0_0-build/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/include/c++/12.3.1/bits/shared_ptr.h:204
#7  alpaka::DevUniformCudaHipRt<alpaka::ApiHipRt>::DevUniformCudaHipRt (this=0x7ffca4511f00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0-el8_amd64_gcc12/build/CMSSW_14_0_0-build/el8_amd64_gcc12/external/alpaka/1.1.0-0e0b978d445f7af747cf00064c146356/include/alpaka/dev/DevUniformCudaHipRt.hpp:56
#8  main () at src/RecoTracker/PixelVertexFinding/test/alpaka/VertexFinder_t.cc:28
===========================================================


The lines below might hint at the cause of the crash. If you see question
marks as part of the stack trace, try to recompile with debugging information
enabled and export CLING_DEBUG=1 environment variable before running.
You may get help by asking at the ROOT forum https://root.cern/forum
preferably using the command (.forum bug) in the ROOT prompt.
Only if you are really convinced it is a bug in ROOT then please submit a
report at https://root.cern/bugs or (preferably) using the command (.gh bug) in
the ROOT prompt. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#5  std::__shared_ptr<alpaka::detail::QueueRegistry<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRtImpl<alpaka::ApiHipRt> >, (__gnu_cxx::_Lock_policy)2>::__shared_ptr (this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0-el8_amd64_gcc12/build/CMSSW_14_0_0-build/el8_amd64_gcc12/external/alpaka/1.1.0-0e0b978d445f7af747cf00064c146356/include/alpaka/dev/DevUniformCudaHipRt.hpp:56
#6  std::shared_ptr<alpaka::detail::QueueRegistry<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRtImpl<alpaka::ApiHipRt> > >::shared_ptr (this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0-el8_amd64_gcc12/build/CMSSW_14_0_0-build/el8_amd64_gcc12/external/gcc/12.3.1-40d504be6370b5a30e3947a6e575ca28/include/c++/12.3.1/bits/shared_ptr.h:204
#7  alpaka::DevUniformCudaHipRt<alpaka::ApiHipRt>::DevUniformCudaHipRt (this=0x7ffca4511f00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0-el8_amd64_gcc12/build/CMSSW_14_0_0-build/el8_amd64_gcc12/external/alpaka/1.1.0-0e0b978d445f7af747cf00064c146356/include/alpaka/dev/DevUniformCudaHipRt.hpp:56
#8  main () at src/RecoTracker/PixelVertexFinding/test/alpaka/VertexFinder_t.cc:28
===========================================================

makortel · 2024-02-19T15:22:56Z

Playing around with scram b runtests_<test_name> I think having scram b runtests_deviceVertexFinderByDensity_t to run (or skip) the tests of all the backends of the test could be useful as well, rather than having to explicitly call scram b runtests_deviceVertexFinderByDensity_tROCmAsync.

fwyzard · 2024-02-20T23:18:21Z

Looking at the tests, I think we should still introduce a way to make them fail gracefully instead of crashing, when no GPU is available.

This is actually as simple as

  if (cms::alpakatools::devices<Platform>().empty()) {
    std::cout << "No devices found, the test will be skipped.\n";
    exit(EXIT_SUCCESS);
  }

@makortel, do you think we need to wrap it in a function like cms::alpakatools::requireDevices<Platform>(), or is it simple enough that people can use those four lines directly ?

fwyzard · 2024-02-20T23:39:45Z

I've updated an example in #44036 .

makortel · 2024-02-20T23:41:54Z

This is actually as simple as

  if (cms::alpakatools::devices<Platform>().empty()) {
    std::cout << "No devices found, the test will be skipped.\n";
    exit(EXIT_SUCCESS);
  }

This is similar to what we had at one point (and still have with direct CUDA). My concern for this specific approach is that an infrastructure problem, e.g. in GPU IBs, would go unnoticed because the tests report success (by default scram uses exit code of cudaIsEnabled to decide where to run or skip cuda-dependent tests, but in GPU IB tests the USER_UNIT_TESTS=cuda is set to require all the GPU tests to be run).

If the test would return EXIT_FAILURE in this case (i.e. the test still fails, but with a meaningful message), I'd be fine with it.

I don't have strong feelings whether to abstract or copy-paste. Four lines isn't much, but it's still something. On the other hand, e.g. in Catch2-based tests one could do

  REQUIRE(not cms::alpakatools::devices<Platform>().empty());

which I wouldn't consider to be worth of abstracting.

fwyzard · 2024-02-20T23:43:59Z

Sure, I'm OK with returning EXIT_FAILURE.

In fact I coded it that way, then I switched to EXIT_SUCCESS because that's what we used in CUDA.

fwyzard · 2024-02-20T23:44:34Z

(will fix tomorrow)

fwyzard · 2024-02-21T07:53:10Z

For Catch2, currently we have a message at the beginning of the output coming from a cout:

(original)

I'd like to move it to use a Catch2 macro like FAIL or INFO.
Which one do you like better:

(option 1)

or

(option2)

or the original one ?

fwyzard · 2024-02-21T08:20:50Z

An other option, using WARN:

(option 3)

fwyzard · 2024-02-21T14:32:22Z

#44036 / #44042 should change all alpaka-based tests to fail with a warning message when there are no devices.

makortel · 2024-02-21T14:47:14Z

I think

would be sufficient, and the message is most concise.

After that, I'd prefer

over

fwyzard · 2024-02-21T15:04:42Z

Mhm... I don't see the quoted images :-(

Note that the message should be the same in all cases, it's just a matter of choice.

Profit from cms-sw#41340

makortel · 2024-02-21T15:11:08Z

Ok, third attempt to quote pictures...

makortel · 2024-02-21T15:13:43Z

Going back to text then. My (not very strong) order of preferences would be

option 2
option 1
option 3

fwyzard · 2024-02-21T15:15:07Z

OK, so directly FAIL with the message, instead of issuing a INFO/WARN message and failing on the REQUIRE.

fwyzard · 2024-02-21T15:15:22Z

I'll update the PRs accordingly.

Profit from cms-sw#41340

cmsbuild added pending-signatures heterogeneous-pending labels Apr 14, 2023

iarspider added a commit to iarspider-cmssw/cmssw that referenced this issue Apr 20, 2023

Update BuildFile.xml

070ac49

Profit from cms-sw#41340

makortel mentioned this issue May 5, 2023

Migrate AlpakaTest unit tests to use the new test declarations cms-sw/framework-team#564

Closed

makortel mentioned this issue Oct 24, 2023

Ports prefixScan, OneToManyAssoc and HistoContainer from CUDAUtilities. #43064

Merged

makortel mentioned this issue Nov 6, 2023

Make AlpakaTest tests to use their dependencies for controlling their execution, and extend module tests to ROCm. #43204

Merged

valsdav pushed a commit to valsdav/cmssw that referenced this issue Dec 19, 2023

Update BuildFile.xml

69a1caf

Profit from cms-sw#41340

rbhattacharya04 pushed a commit to rbhattacharya04/cmssw that referenced this issue Jan 26, 2024

Update BuildFile.xml

5feaf44

Profit from cms-sw#41340

zhenbinwu pushed a commit to zhenbinwu/cmssw that referenced this issue Feb 14, 2024

Update BuildFile.xml

bbfe675

Profit from cms-sw#41340

rbhattacharya04 pushed a commit to rbhattacharya04/cmssw that referenced this issue Feb 21, 2024

Update BuildFile.xml

ee304f4

Profit from cms-sw#41340

rbhattacharya04 pushed a commit to rbhattacharya04/cmssw that referenced this issue Mar 7, 2024

Update BuildFile.xml

c1de096

Profit from cms-sw#41340

alpaka-related updates #41340

alpaka-related updates #41340

Comments

fwyzard commented Apr 14, 2023 • edited Loading

fwyzard commented Apr 14, 2023

cmsbuild commented Apr 14, 2023

cmsbuild commented Apr 14, 2023 • edited Loading

fwyzard commented Apr 14, 2023

makortel commented Apr 14, 2023

fwyzard commented Apr 14, 2023

fwyzard commented Apr 14, 2023

smuzaffar commented Apr 14, 2023 • edited Loading

fwyzard commented Apr 14, 2023

smuzaffar commented Apr 18, 2023

fwyzard commented Apr 19, 2023

makortel commented Apr 19, 2023

smuzaffar commented Apr 19, 2023 • edited Loading

makortel commented Apr 19, 2023

smuzaffar commented Apr 19, 2023 • edited Loading

fwyzard commented Apr 19, 2023

smuzaffar commented Apr 24, 2023

makortel commented Feb 14, 2024

fwyzard commented Feb 14, 2024

fwyzard commented Feb 19, 2024 • edited Loading

makortel commented Feb 19, 2024

fwyzard commented Feb 20, 2024

fwyzard commented Feb 20, 2024

makortel commented Feb 20, 2024

fwyzard commented Feb 20, 2024

fwyzard commented Feb 20, 2024

fwyzard commented Feb 21, 2024 • edited Loading

fwyzard commented Feb 21, 2024 • edited Loading

fwyzard commented Feb 21, 2024

makortel commented Feb 21, 2024 • edited Loading

fwyzard commented Feb 21, 2024

makortel commented Feb 21, 2024

makortel commented Feb 21, 2024

fwyzard commented Feb 21, 2024

fwyzard commented Feb 21, 2024

fwyzard commented Apr 14, 2023 •

edited

Loading

cmsbuild commented Apr 14, 2023 •

edited

Loading

smuzaffar commented Apr 14, 2023 •

edited

Loading

smuzaffar commented Apr 19, 2023 •

edited

Loading

smuzaffar commented Apr 19, 2023 •

edited

Loading

fwyzard commented Feb 19, 2024 •

edited

Loading

fwyzard commented Feb 21, 2024 •

edited

Loading

fwyzard commented Feb 21, 2024 •

edited

Loading

makortel commented Feb 21, 2024 •

edited

Loading