
Examples fail on Kepler GPU #313

Closed
rainli323 opened this issue Apr 12, 2022 · 19 comments

@rainli323

I'm trying to use this project but many examples fail on machines that have older GPUs. I have tried on a few machines with Tesla K40c, and one Titan V. Most tests passed on the Titan V except two, but many tests do not pass on K40c, including many flavors of VectorAdd. Could you please help? Here are my specs:

--failed calculations and configurations--
Tesla K40c (compute capability 3.5)
NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4
cmake version 3.23.0
c++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
failed tests: asyncAPI, binaryPartitionCG, event_management, execution_control, inlinePTX, p2pBandwidthLatencyTest, simpleIPC, simpleStreams, stream_management, vectorAdd, vectorAddManaged, vectorAddMapped, vectorAddMMAP

--fewer failed calculations, and configurations--
Titan V (compute capability 7.0)
NVIDIA-SMI 470.74 Driver Version: 470.74 CUDA Version: 11.4
cmake version 3.23.0
c++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
failed tests: simpleStreams, vectorAddMMAP

@eyalroz
Owner

eyalroz commented Apr 12, 2022

First of all - thank you for reporting this. My work on cuda-api-wrappers is not supported by NVIDIA, nor by some GPU-specializing lab, so I don't have access to machines with a range of GPUs to test this on.

Now, the first thing I'm going to need is the exact error messages you're getting from each of the failing example programs. Let's start with the Titan V - what are the failures?

I'll also mention that older cards may simply not support some of the features I'm trying to use with some of my examples. In that case, I'll need to characterize what it is, exactly, that they're unable to do, and work around that.

Finally - please try using the development branch, just in case some recent fix has somehow affected what you're seeing.

@rainli323
Author

Thank you for getting back! Here is the output from some of the tests. The Titan machine errors may be related to the fact that we have 2 GPUs, and the tests ran/failed on device 0, which is a GeForce GTX 1050 Ti. The Tesla K40c machines also have 2 GPUs, but there device 0 is the K40c.

--Titan V, simpleStreams, error may have to do with the fact we have 2 GPUs on the computer--

Device synchronization method set to: heuristic
Setting reps to 100 to demonstrate steady state

> GPU Device 1: "NVIDIA GeForce GTX 1050 Ti" with compute capability 6.1
Device: <NVIDIA GeForce GTX 1050 Ti> canMapHostMemory: Yes
> CUDA Capable: SM 6.1 hardware
> 6 Multiprocessor(s) x 128 (Cores/Multiprocessor) = 768 (Cores)
> scale_factor = 1
> array_size   = 2097152

> Using CPU/GPU Device Synchronization method heuristic

Starting Test
terminate called after throwing an instance of 'cuda::runtime_error'
  what():  Failed recording event 0x055bb23801760 on event 0x0: invalid resource handle
Aborted (core dumped)

--Titan V, vectorAddMMAP, note we have 2 GPUs on the computer, device 0 is NVIDIA GeForce GTX 1050 Ti --

Vector Addition (using virtual memory mapping)
terminate called after throwing an instance of 'cuda::runtime_error'
  what():  Failed loading a module from memory location 0x0560dd38c8bf0 within context 0x0560dd384cf50 on device 0: no kernel image is available for execution on the device
Aborted (core dumped)

--Tesla K40c, vectorAdd, also have 2 GPUs on the computer, Device 0 is Tesla K40c--

[Vector addition of 50000 elements]
CUDA kernel launch with 196 blocks of 256 threads
Result verification failed at element 0

--Tesla K40c, simpleStream, also have 2 GPUs on the computer, Device 0 is Tesla K40c--

Device synchronization method set to: heuristic
Setting reps to 100 to demonstrate steady state

> GPU Device 0: "Tesla K40c" with compute capability 3.5
Device: <Tesla K40c> canMapHostMemory: Yes
> CUDA Capable: SM 3.5 hardware
> 15 Multiprocessor(s) x 192 (Cores/Multiprocessor) = 2880 (Cores)
> scale_factor = 1
> array_size   = 2097152

> Using CPU/GPU Device Synchronization method heuristic

Starting Test
memcopy:	1.28864
kernel:		0.041728
non-streamed:	1.27919
4 streams:	1.2807
-------------------------------
0: 0 2500
Result check FAILED.

@eyalroz
Owner

eyalroz commented Apr 12, 2022

So, there seems to be some kind of issue with cuMemCpy()ing two regions allocated in the primary context of a device which is not the current one. At least - that's the problem with simpleStreams. I will be looking into it over the next few days.

@eyalroz
Owner

eyalroz commented Apr 13, 2022

The first two bugs this has exposed are not too bad. The third one will need a little more work. Most of them can be overcome by making some appropriate device the "current" device, but I'm intentionally sparing my users having to know about a global "current device".

For now, please retry with the HEAD of the development branch, and let me know if/what has changed. Or you can wait until I'm done with #316, which can take another while.

@rainli323
Author

Now the Titan V simpleStreams test is fixed! However, the other 3 tests still behave the same, with the same output and error messages (only the memory location values differ).

@eyalroz
Owner

eyalroz commented Apr 14, 2022

Ok, about vectorAddMMAP: What is CMAKE_CUDA_ARCHITECTURES set to? (If you haven't set it, please run ccmake in your build directory and check.) It should be something like 37;61 if your cards are a Kepler K40c and a Titan V. If you don't see any 30-something value, add ;30 to the value. I suspect the issue is that the fatbin file is generated for the wrong microarchitecture - which would be a CMake issue rather than an issue with my library.
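For reference, the architectures can also be set at configure time rather than through ccmake; a minimal sketch, where the values are assumptions covering the cards mentioned in this thread (35 for the K40c, 61 for the 1050 Ti, 70 for the Titan V):

```shell
# Configure with explicit CUDA architectures so fatbins get generated
# for every card in the machine (values assumed from this thread's GPUs):
cmake -S . -B build -DBUILD_EXAMPLES=ON -DCMAKE_CUDA_ARCHITECTURES="35;61;70"

# Verify what the cache actually holds:
grep CMAKE_CUDA_ARCHITECTURES build/CMakeCache.txt
```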

@eyalroz
Owner

eyalroz commented Apr 14, 2022

Also, about #316 - I should clarify that it's not actually a bug, nor something that you can't work with. The only problem there is that some API calls require you to have set the current context / current device somehow. So, for example, if you've allocated some pinned host memory with one device being current, and you want to copy that memory, that device has to be current or you might get an error.

@eyalroz
Owner

eyalroz commented Apr 14, 2022

About the vectorAdd results verification error - that's the most puzzling issue for me. It supposedly succeeded in everything, except the numbers are wrong. Could you tweak the code to:

  1. Print the disagreeing values and
  2. Count the number of disagreements?

@rainli323
Author

Regarding vectorAdd, I printed h_A, h_B, h_C before and after the kernel, they never changed. h_A and h_B have random values, h_C elements are 0.

Regarding CMAKE_CUDA_ARCHITECTURES: both the Titan V and Tesla K40c machines had 52, as CMakeCache.txt indicated. Changing the numbers (to 37 and 61 respectively) did not make a difference in the output of simpleStream.

@eyalroz
Owner

eyalroz commented Apr 14, 2022

Changing the numbers (to 37 and 61 respectively) did not make a difference in the output of simpleStream

It shouldn't have. But - if you:

  1. re-configure with 37/61 (press c in ccmake)
  2. generate (press g in ccmake)
  3. rebuild (outside of ccmake)

does that affect the other programs? Specifically, vectorAddMMAP?

eyalroz added a commit that referenced this issue Apr 15, 2022
…ext, primary contexts, and ensuring their existence in various circumstances:

* Renamed: `context::current::detail_::scoped_current_device_fallback_t` -> `scoped_existence_ensurer_t` `context::current::detail_::scoped_context_existence_ensurer`
* `context::current::scoped_override_t` now has a ctor which accepts `primary_context_t&&`'s - to hold on to their PC reference which they are about to let go of.
* Moved: `context::current::scoped_override_t` is now implemented in the multi-wrapper implementations directory; consequently
    * Moved the implementations of  `module_t::get_kernel()` and `module::create<Creator>` to the multi-wrapper directory, since they use `context::current::scoped_override_t`.
    * Added inclusion of `cuda/api/multi_wrapper_impls/module.hpp` to some example code.
* Made a device current in some examples to avoid having no current context when executing certain operations with no wrappers (e.g. memcpy with host-side addresses)
* When allocating managed or pinned-host memory, now increasing the reference of some  context by 1 (choosing the primary context of device 0 since that's the safest), and decreasing it again on destruction. That guarantees that operations involving that allocated memory will not occur with no constructed contexts.
    * Corresponding comment changes on the `allocate()` and `free()` methods for pinned-host and managed memory.
* Factored out the code in `context_t::is_primary()` to a function, `cuda::context::current::detail_::is_primary`, which can now also be used via `cuda::context::current::is_primary()`.
* Kernel launch functions now ensure a launch only occurs / is enqueued within a current context (any context).
* Getting the current device now ensures its primary context is also active (which getting an arbitrary device does not do).
* Added doxygen comment for `device::detail_::wrap()` mentioning the primary context reference behavior.
@eyalroz eyalroz added the bug label Apr 15, 2022
eyalroz added a commit that referenced this issue Apr 15, 2022
eyalroz added a commit that referenced this issue Apr 16, 2022
eyalroz added a commit that referenced this issue Apr 21, 2022
@eyalroz
Owner

eyalroz commented Apr 21, 2022

Reporter, can you please re-check with the latest version of the code (or beta release 0.5.1b3)?

@rainli323
Author

Same behavior. With vectorAddMMAP on the Titan V, I got
what(): Failed loading a module from memory location 0x05592436f61f0 within context 0x05592430d3f60 on device 0: no kernel image is available for execution on the device
Let me make sure what I did is correct:

login to the machine with Titan V
cd cuda-api-wrappers/
git log (commit dfbeb0c0311c7d221a0018fb5336a85d2fc762c6 (HEAD -> development, tag: v0.5.1b3, origin/development))
cmake -S . -B build -DBUILD_EXAMPLES=ON .
cd build
ccmake .
CMAKE_CUDA_ARCHITECTURES         52
press c, then i, change 52 to 61, enter, press c, press g
cd ../
cmake --build build
cd build/example/bin
./vectorAddMMAP

@eyalroz
Owner

eyalroz commented Apr 22, 2022

cmake -S . -B build -DBUILD_EXAMPLES=ON .

This builds in the current directory, not in build. Drop the extra dot argument at the end of the line.
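A sketch of the intended invocation, with explicit source and build directories and no stray trailing path argument:

```shell
# Source tree in the current directory, build tree in ./build:
cmake -S . -B build -DBUILD_EXAMPLES=ON
cmake --build build
```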

@rainli323
Author

Typo - I didn't have the dot in the first place. Results are still the same. The successful simpleStreams test suggests the test is run on the GeForce, not the Titan V.

@eyalroz
Owner

eyalroz commented Apr 22, 2022

@rainli323 : Can you run a verbose build (e.g. make VERBOSE=yes) of do_build_vectorAddMMAP_kernel? I want to see what exact command line is being used.

@rainli323
Author

Here's what I did:

git log (commit dfbeb0c0311c7d221a0018fb5336a85d2fc762c6 (HEAD -> development, tag: v0.5.1b3, origin/development))
cmake -S . -B build -DBUILD_EXAMPLES=ON
ccmake build/
CMAKE_CUDA_ARCHITECTURES         52->70, 
CUDA_ARCH_FLAGS * -> 70
cd build
make VERBOSE=yes do_build_vectorAddMMAP_kernel

Here's what I got:

/edfs/users/rainli/Packages/cmake/cmake-3.23.0-linux-x86_64/bin/cmake -S/edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers -B/edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build --check-build-system CMakeFiles/Makefile.cmake 0
make  -f CMakeFiles/Makefile2 do_build_vectorAddMMAP_kernel
make[1]: Entering directory '/edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build'
/edfs/users/rainli/Packages/cmake/cmake-3.23.0-linux-x86_64/bin/cmake -S/edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers -B/edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build --check-build-system CMakeFiles/Makefile.cmake 0
/edfs/users/rainli/Packages/cmake/cmake-3.23.0-linux-x86_64/bin/cmake -E cmake_progress_start /edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build/CMakeFiles 1
make  -f CMakeFiles/Makefile2 examples/CMakeFiles/do_build_vectorAddMMAP_kernel.dir/all
make[2]: Entering directory '/edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build'
make  -f examples/CMakeFiles/do_build_vectorAddMMAP_kernel.dir/build.make examples/CMakeFiles/do_build_vectorAddMMAP_kernel.dir/depend
make[3]: Entering directory '/edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build'
cd /edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build && /edfs/users/rainli/Packages/cmake/cmake-3.23.0-linux-x86_64/bin/cmake -E cmake_depends "Unix Makefiles" /edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers /edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/examples /edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build /edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build/examples /edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build/examples/CMakeFiles/do_build_vectorAddMMAP_kernel.dir/DependInfo.cmake --color=
make[3]: Leaving directory '/edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build'
make  -f examples/CMakeFiles/do_build_vectorAddMMAP_kernel.dir/build.make examples/CMakeFiles/do_build_vectorAddMMAP_kernel.dir/build
make[3]: Entering directory '/edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build'
[100%] Generating vectorAddMMAP_kernel.fatbin
cd /edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build/examples && /usr/local/cuda/bin/nvcc -fatbin --generate-code arch=compute_70,code=sm_70 -o bin/vectorAddMMAP_kernel.fatbin /edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/examples/modified_cuda_samples/vectorAddMMAP/vectorAdd_kernel.cu
make[3]: Leaving directory '/edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build'
[100%] Built target do_build_vectorAddMMAP_kernel
make[2]: Leaving directory '/edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build'
/edfs/users/rainli/Packages/cmake/cmake-3.23.0-linux-x86_64/bin/cmake -E cmake_progress_start /edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build/CMakeFiles 0
make[1]: Leaving directory '/edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build'

This does not generate the executable in build/examples/bin/. Is this what you wanted me to do?

@eyalroz
Owner

eyalroz commented Apr 22, 2022

Well, if you build the kernel for a Volta card (7.0), then vectorAddMMAP, which is hard-coded to use your first GPU, the Pascal 6.1 card, will indeed fail to load the kernel.

The thing is, the kernel might not get auto-rebuilt if you change CMAKE_CUDA_ARCHITECTURES. I'm not sure why, exactly.
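One way to be sure the fatbin is regenerated after changing the architecture setting is to force the kernel target to rebuild; a sketch, using the target and source path that appear in the verbose log above:

```shell
# Touching the kernel source invalidates the custom command's output,
# forcing the fatbin to be regenerated on the next build:
touch examples/modified_cuda_samples/vectorAddMMAP/vectorAdd_kernel.cu
cmake --build build --target do_build_vectorAddMMAP_kernel

# Or, bluntly, reconfigure from a clean tree to rule out stale artifacts:
# rm -rf build && cmake -S . -B build -DBUILD_EXAMPLES=ON
```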

@eyalroz
Owner

eyalroz commented Jun 10, 2022

Ping.

eyalroz added a commit that referenced this issue Jun 20, 2022
@eyalroz
Owner

eyalroz commented Jun 20, 2022

Well, assuming this is resolved. Reopen if you still see failures.

@eyalroz eyalroz closed this as completed Jun 20, 2022