
Examples fail on Kepler GPU #313

Closed
rainli323 opened this issue Apr 12, 2022 · 19 comments

@rainli323

I'm trying to use this project but many examples fail on machines that have older GPUs. I have tried on a few machines with Tesla K40c, and one Titan V. Most tests passed on the Titan V except two, but many tests do not pass on K40c, including many flavors of VectorAdd. Could you please help? Here are my specs:

--failed calculations and configurations--
Tesla K40c (compute capability 3.5)
NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4
cmake version 3.23.0
c++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
failed tests: asyncAPI, binaryPartitionCG, event_management, execution_control, inlinePTX, p2pBandwidthLatencyTest, simpleIPC, simpleStreams, stream_management, vectorAdd, vectorAddManaged, vectorAddMapped, vectorAddMMAP

--fewer failed calculations, and configurations--
Titan V (compute capability 7.0)
NVIDIA-SMI 470.74 Driver Version: 470.74 CUDA Version: 11.4
cmake version 3.23.0
c++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
failed tests: simpleStreams, vectorAddMMAP

@eyalroz
Owner

eyalroz commented Apr 12, 2022

First of all - thank you for reporting this. My work on cuda-api-wrappers is not supported by NVIDIA, nor by some GPU-specializing lab, so I don't have access to machines with a range of GPUs to test this on.

Now, the first thing I'm going to need is the exact error messages you're getting from each of the failing example programs. Let's start with the Titan V - what are the failures?

I'll also mention that older cards may simply not support some of the features I'm trying to use with some of my examples. In that case, I'll need to characterize what it is, exactly, that they're unable to do, and work around that.

Finally - please try using the development branch, just in case some recent fix has somehow affected what you're seeing.

@rainli323
Author

Thank you for getting back! Here is the output from some of the tests. The Titan machine errors may be related to the fact that we have 2 GPUs, and the tests ran/failed on device 0, which is a GeForce GTX 1050 Ti. The Tesla K40c machines also have 2 GPUs, but there device 0 is the K40c.

--Titan V, simpleStreams, error may have to do with the fact we have 2 GPUs on the computer--

Device synchronization method set to: heuristic
Setting reps to 100 to demonstrate steady state

> GPU Device 1: "NVIDIA GeForce GTX 1050 Ti" with compute capability 6.1
Device: <NVIDIA GeForce GTX 1050 Ti> canMapHostMemory: Yes
> CUDA Capable: SM 6.1 hardware
> 6 Multiprocessor(s) x 128 (Cores/Multiprocessor) = 768 (Cores)
> scale_factor = 1
> array_size   = 2097152

> Using CPU/GPU Device Synchronization method heuristic

Starting Test
terminate called after throwing an instance of 'cuda::runtime_error'
  what():  Failed recording event 0x055bb23801760 on event 0x0: invalid resource handle
Aborted (core dumped)

--Titan V, vectorAddMMAP, note we have 2 GPUs on the computer, device 0 is NVIDIA GeForce GTX 1050 Ti --

Vector Addition (using virtual memory mapping)
terminate called after throwing an instance of 'cuda::runtime_error'
  what():  Failed loading a module from memory location 0x0560dd38c8bf0 within context 0x0560dd384cf50 on device 0: no kernel image is available for execution on the device
Aborted (core dumped)

--Tesla K40c, vectorAdd, also have 2 GPUs on the computer, Device 0 is Tesla K40c--

[Vector addition of 50000 elements]
CUDA kernel launch with 196 blocks of 256 threads
Result verification failed at element 0

--Tesla K40c, simpleStream, also have 2 GPUs on the computer, Device 0 is Tesla K40c--

Device synchronization method set to: heuristic
Setting reps to 100 to demonstrate steady state

> GPU Device 0: "Tesla K40c" with compute capability 3.5
Device: <Tesla K40c> canMapHostMemory: Yes
> CUDA Capable: SM 3.5 hardware
> 15 Multiprocessor(s) x 192 (Cores/Multiprocessor) = 2880 (Cores)
> scale_factor = 1
> array_size   = 2097152

> Using CPU/GPU Device Synchronization method heuristic

Starting Test
memcopy:	1.28864
kernel:		0.041728
non-streamed:	1.27919
4 streams:	1.2807
-------------------------------
0: 0 2500
Result check FAILED.

@eyalroz
Owner

eyalroz commented Apr 12, 2022

So, there seems to be some kind of issue with cuMemCpy()ing two regions allocated in the primary context of a device which is not the current one. At least - that's the problem with simpleStreams. I will be looking into it over the next few days.

@eyalroz
Owner

eyalroz commented Apr 13, 2022

The first two bugs this has exposed are not too bad. The third one will need a little more work. Most of them can be overcome by making some appropriate device the "current" device, but I'm intentionally sparing my users having to know about a global "current device".

For now, please retry with the HEAD of the development branch, and let me know if/what has changed. Or you can wait until I'm done with #316, which can take another while.

@rainli323
Author

Now the Titan V simpleStreams test is fixed! However, the other 3 tests still behave the same, with the same output and error messages (only the memory location values differ).

@eyalroz
Owner

eyalroz commented Apr 14, 2022

Ok, about vectorAddMMAP: What is CMAKE_CUDA_ARCHITECTURES set to? (If you haven't set it, please run ccmake in your build directory and check.) It should be something like 37;61 if your cards are a Kepler K40c and a Titan V. If you don't see any 30-something value, add ;30 to the value. I suspect the issue is that the fatbin file is generated for the wrong microarchitecture - which would be a CMake issue rather than an issue with my library.
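For reference, the architectures can also be set at configure time rather than through ccmake; a minimal sketch, where the values are assumptions covering the cards mentioned in this thread (35 for the K40c, 61 for the 1050 Ti, 70 for the Titan V):

```shell
# Configure with explicit CUDA architectures so fatbins get generated
# for every card in the machine (values assumed from this thread's GPUs):
cmake -S . -B build -DBUILD_EXAMPLES=ON -DCMAKE_CUDA_ARCHITECTURES="35;61;70"

# Verify what the cache actually holds:
grep CMAKE_CUDA_ARCHITECTURES build/CMakeCache.txt
```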

@eyalroz
Owner

eyalroz commented Apr 14, 2022

Also, about #316 - I should clarify that it's not actually a bug, nor something that you can't work with. The only problem there is that some API calls require you to have set the current context / current device somehow. So, for example, if you've allocated some pinned host memory with one device being current, and you want to copy that memory, that device has to be current or you might get an error.

@eyalroz
Owner

eyalroz commented Apr 14, 2022

About the vectorAdd results verification error - that's the most puzzling issue for me. It supposedly succeeded in everything, except the numbers are wrong. Could you tweak the code to:

  1. Print the disagreeing values and
  2. Count the number of disagreements?

@rainli323
Author

Regarding vectorAdd, I printed h_A, h_B, h_C before and after the kernel, they never changed. h_A and h_B have random values, h_C elements are 0.

Regarding CMAKE_CUDA_ARCHITECTURES: both the Titan V and Tesla K40c machines had 52, as CMakeCache.txt indicated. Changing the numbers (to 37 and 61 respectively) did not make a difference in the output of simpleStream.

@eyalroz
Owner

eyalroz commented Apr 14, 2022

Changing the numbers (to 37 and 61 respectively) did not make a difference in the output of simpleStream

It shouldn't have. But - if you:

  1. re-configure with 37/61 (press c in ccmake)
  2. generate (press g in ccmake)
  3. rebuild (outside of ccmake)

does that affect the other programs? Specifically, vectorAddMMAP?

eyalroz added a commit that referenced this issue Apr 15, 2022
…ext, primary contexts, and ensuring their existence in various circumstances:

* Renamed: `context::current::detail_::scoped_current_device_fallback_t` -> `scoped_existence_ensurer_t` `context::current::detail_::scoped_context_existence_ensurer`
* `context::current::scoped_override_t` now has a ctor which accepts `primary_context_t&&`'s - to hold on to their PC reference which they are about to let go of.
* Moved: `context::current::scoped_override_t` is now implemented in the multi-wrapper implementations directory; consequently
    * Moved the implementations of  `module_t::get_kernel()` and `module::create<Creator>` to the multi-wrapper directory, since they use `context::current::scoped_override_t`.
    * Added inclusion of `cuda/api/multi_wrapper_impls/module.hpp` to some example code.
* Made a device current in some examples to avoid having no current context when executing certain operations with no wrappers (e.g. memcpy with host-side addresses)
* When allocating managed or pinned-host memory, now increasing the reference of some  context by 1 (choosing the primary context of device 0 since that's the safest), and decreasing it again on destruction. That guarantees that operations involving that allocated memory will not occur with no constructed contexts.
    * Corresponding comment changes on the `allocate()` and `free()` methods for pinned-host and managed memory.
* Factored out the code in `context_t::is_primary()` to a function, `cuda::context::current::detail_::is_primary`, which can now also be used via `cuda::context::current::is_primary()`.
* Kernel launch functions now ensure a launch only occurs / is enqueued within a current context (any context).
* Getting the current device now ensures its primary context is also active (which getting an arbitrary device does not do).
* Added doxygen comment for `device::detail_::wrap()` mentioning the primary context reference behavior.
@eyalroz eyalroz added the bug label Apr 15, 2022
eyalroz added a commit that referenced this issue Apr 15, 2022
eyalroz added a commit that referenced this issue Apr 16, 2022
eyalroz added a commit that referenced this issue Apr 21, 2022
@eyalroz
Owner

eyalroz commented Apr 21, 2022

Reporter, can you please re-check with the latest version of the code (or beta release 0.5.1b3)?

@rainli323
Author

Same behavior. With vectorAddMMAP on the Titan V, I got
what(): Failed loading a module from memory location 0x05592436f61f0 within context 0x05592430d3f60 on device 0: no kernel image is available for execution on the device
Let me make sure what I did is correct:

login to the machine with Titan V
cd cuda-api-wrappers/
git log (commit dfbeb0c0311c7d221a0018fb5336a85d2fc762c6 (HEAD -> development, tag: v0.5.1b3, origin/development))
cmake -S . -B build -DBUILD_EXAMPLES=ON .
cd build
ccmake .
CMAKE_CUDA_ARCHITECTURES         52
press c, then i, change 52 to 61, enter, press c, press g
cd ../
cmake --build build
cd build/example/bin
./vectorAddMMAP

@eyalroz
Owner

eyalroz commented Apr 22, 2022

cmake -S . -B build -DBUILD_EXAMPLES=ON .

This builds in the current directory, not in build. Drop the extra dot argument at the end of the line.
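A sketch of the intended invocation, with explicit source and build directories and no stray trailing path argument:

```shell
# Source tree in the current directory, build tree in ./build:
cmake -S . -B build -DBUILD_EXAMPLES=ON
cmake --build build
```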

@rainli323
Author

Typo - I didn't have the dot in the first place. Results are still the same. The successful simpleStreams test suggests the test is run on the GeForce, not the Titan V.

@eyalroz
Owner

eyalroz commented Apr 22, 2022

@rainli323 : Can you run a verbose build (e.g. make VERBOSE=yes) of do_build_vectorAddMMAP_kernel? I want to see what exact command line is being used.

@rainli323
Author

Here's what I did:

git log (commit dfbeb0c0311c7d221a0018fb5336a85d2fc762c6 (HEAD -> development, tag: v0.5.1b3, origin/development))
cmake -S . -B build -DBUILD_EXAMPLES=ON
ccmake build/
CMAKE_CUDA_ARCHITECTURES         52->70, 
CUDA_ARCH_FLAGS * -> 70
cd build
make VERBOSE=yes do_build_vectorAddMMAP_kernel

Here's what I got:

/edfs/users/rainli/Packages/cmake/cmake-3.23.0-linux-x86_64/bin/cmake -S/edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers -B/edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build --check-build-system CMakeFiles/Makefile.cmake 0
make  -f CMakeFiles/Makefile2 do_build_vectorAddMMAP_kernel
make[1]: Entering directory '/edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build'
/edfs/users/rainli/Packages/cmake/cmake-3.23.0-linux-x86_64/bin/cmake -S/edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers -B/edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build --check-build-system CMakeFiles/Makefile.cmake 0
/edfs/users/rainli/Packages/cmake/cmake-3.23.0-linux-x86_64/bin/cmake -E cmake_progress_start /edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build/CMakeFiles 1
make  -f CMakeFiles/Makefile2 examples/CMakeFiles/do_build_vectorAddMMAP_kernel.dir/all
make[2]: Entering directory '/edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build'
make  -f examples/CMakeFiles/do_build_vectorAddMMAP_kernel.dir/build.make examples/CMakeFiles/do_build_vectorAddMMAP_kernel.dir/depend
make[3]: Entering directory '/edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build'
cd /edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build && /edfs/users/rainli/Packages/cmake/cmake-3.23.0-linux-x86_64/bin/cmake -E cmake_depends "Unix Makefiles" /edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers /edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/examples /edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build /edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build/examples /edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build/examples/CMakeFiles/do_build_vectorAddMMAP_kernel.dir/DependInfo.cmake --color=
make[3]: Leaving directory '/edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build'
make  -f examples/CMakeFiles/do_build_vectorAddMMAP_kernel.dir/build.make examples/CMakeFiles/do_build_vectorAddMMAP_kernel.dir/build
make[3]: Entering directory '/edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build'
[100%] Generating vectorAddMMAP_kernel.fatbin
cd /edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build/examples && /usr/local/cuda/bin/nvcc -fatbin --generate-code arch=compute_70,code=sm_70 -o bin/vectorAddMMAP_kernel.fatbin /edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/examples/modified_cuda_samples/vectorAddMMAP/vectorAdd_kernel.cu
make[3]: Leaving directory '/edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build'
[100%] Built target do_build_vectorAddMMAP_kernel
make[2]: Leaving directory '/edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build'
/edfs/users/rainli/Packages/cmake/cmake-3.23.0-linux-x86_64/bin/cmake -E cmake_progress_start /edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build/CMakeFiles 0
make[1]: Leaving directory '/edfs/users/rainli/cuda-api-wrappers_developement/cuda-api-wrappers/build'

This does not generate the executable in build/examples/bin/. Is this what you wanted me to do?

@eyalroz
Owner

eyalroz commented Apr 22, 2022

Well, if you build the kernel for a Volta card (7.0), then vectorAddMMAP, which is hard-coded to use your first GPU, the Pascal 6.1 card, will indeed fail to load the kernel.

The thing is, the kernel might not get auto-rebuilt if you change CMAKE_CUDA_ARCHITECTURES. I'm not sure why, exactly.
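One way to be sure the fatbin is regenerated after changing the architecture setting is to force the kernel target to rebuild; a sketch, using the target and source path that appear in the verbose log above:

```shell
# Touching the kernel source invalidates the custom command's output,
# forcing the fatbin to be regenerated on the next build:
touch examples/modified_cuda_samples/vectorAddMMAP/vectorAdd_kernel.cu
cmake --build build --target do_build_vectorAddMMAP_kernel

# Or, bluntly, reconfigure from a clean tree to rule out stale artifacts:
# rm -rf build && cmake -S . -B build -DBUILD_EXAMPLES=ON
```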

@eyalroz
Owner

eyalroz commented Jun 10, 2022

Ping.

eyalroz added a commit that referenced this issue Jun 20, 2022
@eyalroz
Owner

eyalroz commented Jun 20, 2022

Well, assuming this is resolved. Reopen if you still see failures.

@eyalroz eyalroz closed this as completed Jun 20, 2022