Cannot compile with GCC 10 or later for the CUDA backend if the kernel functor is templated #1472

Closed
fwyzard opened this issue Nov 17, 2021 · 10 comments · Fixed by #1474

Comments

@fwyzard
Contributor

fwyzard commented Nov 17, 2021

It looks like a change in GCC 10 breaks the compilation of the CUDA backend for kernels written with Alpaka, if the operator() is templated on arguments other than TAcc.

Here is an extract of the code:

namespace cms {
  namespace alpakatools {

    struct countFromVector {
      template <typename T_Acc, typename Histo, typename T>
      ALPAKA_FN_ACC void operator()(const T_Acc &acc,
                                    Histo *__restrict__ h,
                                    uint32_t nh,
                                    T const *__restrict__ v,
                                    uint32_t const *__restrict__ offsets) const {
        ...
      }
    };

    template <typename Histo, typename T>
    ALPAKA_FN_HOST ALPAKA_FN_INLINE __attribute__((always_inline)) void fillManyFromVector(
        Histo *__restrict__ h,
        uint32_t nh,
        T const *__restrict__ v,
        uint32_t const *__restrict__ offsets,
        uint32_t totSize,
        unsigned int nthreads,
        ::ALPAKA_ACCELERATOR_NAMESPACE::Queue &queue) {
      launchZero(h, queue);

      ...

      alpaka::enqueue(queue,
                      alpaka::createTaskKernel<::ALPAKA_ACCELERATOR_NAMESPACE::Acc1D>(
                          workDiv, countFromVector(), h, nh, v, offsets));
      ...
    }

  }  // namespace alpakatools
}  // namespace cms

When used later on in some application code, e.g.

::cms::alpakatools::fillManyFromVector(hits_d.phiBinner(), 10, hits_d.c_iphi(), hits_d.c_hitsLayerStart(), nHits, 256, queue);

this results in the error

.../external/alpaka/include/alpaka/kernel/TaskKernelGpuUniformCudaHipRt.hpp: In instantiation of ‘static void alpaka::traits::Enqueue<alpaka::QueueUniformCudaHipRtNonBlocking, alpaka::TaskKernelGpuUniformCudaHipRt<TAcc, TDim, TIdx, TKernelFnObj, TArgs ...>, void>::enqueue(alpaka::QueueUniformCudaHipRtNonBlocking&, const alpaka::TaskKernelGpuUniformCudaHipRt<TAcc, TDim, TIdx, TKernelFnObj, TArgs ...>&) [with TAcc = alpaka::AccGpuCudaRt<std::integral_constant<long unsigned int, 1>, unsigned int>; TDim = std::integral_constant<long unsigned int, 1>; TIdx = unsigned int; TKernelFnObj = cms::alpakatools::countFromVector; TArgs = {cms::alpakatools::HistoContainer<short int, 128, 49152, 16, short unsigned int, 10>* __restrict__&, unsigned int&, const short int* __restrict__&, const unsigned int* __restrict__&}]’:
.../external/alpaka/include/alpaka/queue/Traits.hpp:48:58:   required from ‘void alpaka::enqueue(TQueue&, TTask&&) [with TQueue = alpaka::QueueUniformCudaHipRtNonBlocking; TTask = alpaka::TaskKernelGpuUniformCudaHipRt<alpaka::AccGpuCudaRt<std::integral_constant<long unsigned int, 1>, unsigned int>, std::integral_constant<long unsigned int, 1>, unsigned int, cms::alpakatools::countFromVector, cms::alpakatools::HistoContainer<short int, 128, 49152, 16, short unsigned int, 10>* __restrict__&, unsigned int&, const short int* __restrict__&, const unsigned int* __restrict__&>]’
.../src/alpaka/AlpakaCore/HistoContainer.h:113:16:   required from ‘void cms::alpakatools::fillManyFromVector(Histo*, uint32_t, const T*, const uint32_t*, uint32_t, unsigned int, alpaka_cuda_async::Queue&) [with Histo = cms::alpakatools::HistoContainer<short int, 128, 49152, 16, short unsigned int, 10>; T = short int; uint32_t = unsigned int; alpaka_cuda_async::Queue = alpaka::QueueUniformCudaHipRtNonBlocking]’
.../src/alpaka/plugin-SiPixelRecHits/alpaka/PixelRecHits.cc:72:125:   required from here
.../external/alpaka/include/alpaka/kernel/TaskKernelGpuUniformCudaHipRt.hpp:243:6: error: no matches converting function ‘uniformCudaHipKernel’ to type ‘void (*)(class alpaka::Vec<std::integral_constant<long unsigned int, 1>, unsigned int>, struct cms::alpakatools::countFromVector, class cms::alpakatools::HistoContainer<short int, 128, 49152, 16, short unsigned int, 10>*, unsigned int, const short int*, const unsigned int*)’
  243 |                 auto kernelName = uniform_cuda_hip::detail::
      |      ^         
.../external/alpaka/include/alpaka/kernel/TaskKernelGpuUniformCudaHipRt.hpp:74:1: note: candidate is: ‘template<class TAcc, class TDim, class TIdx, class TKernelFnObj, class ... TArgs> void alpaka::uniform_cuda_hip::detail::uniformCudaHipKernel(alpaka::Vec<TDim, TVal>, TKernelFnObj, TArgs ...)’
   74 |             __global__ void uniformCudaHipKernel(
      | ^           ~~~~~~~~

The error happens with GCC 10.2 and 11.2; the same code compiles fine with GCC 9.

To reproduce:

git clone git@github.com:cms-patatrack/pixeltrack-standalone.git
cd pixeltrack-standalone
git checkout 2e46f69b77d8e0f4c1afd2b316e6b6b5dd884243
# edit the Makefile to point to a recent version of CUDA (default is /usr/local/cuda) and g++ from GCC 10
make alpaka

After setting up the environment and building the dependencies, a simpler way to reproduce the problem is

nvcc -x cu  \
  -std=c++17  -O3  \
  -DALPAKA_ACC_GPU_CUDA_ENABLED  \
  -DALPAKA_ACC_GPU_CUDA_ONLY_MODE  \
  --expt-relaxed-constexpr --expt-extended-lambda  \
  -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored  \
  --diag-suppress 20014 \
  --cudart=shared  \
  -gencode arch=compute_75,code=[sm_75,compute_75]  \
  -ccbin g++-10  \
  --compiler-options '-fPIC -fdiagnostics-show-option -felide-constructors -fmessage-length=0 -fno-math-errno -ftree-vectorize -fvisibility-inlines-hidden --param vect-max-version-for-alias-checks=50 -msse3 -pipe -pthread'  \
  -DGPU_DEBUG  \
  -Isrc/alpaka  \
  -DSRC_DIR=src/alpaka  \
  -DLIB_DIR=lib/alpaka  \
  -isystem external/alpaka/include  \
  -isystem external/boost/include  \
  -isystem external/libbacktrace/include  \
  -dc \
  -c src/alpaka/test/alpaka/HistoContainer_t.cc  \
  -o HistoContainer_t.cc.cuda.o
@fwyzard
Contributor Author

fwyzard commented Nov 17, 2021

The HEAD of our master branch uses Alpaka 0.8.0-rc1, but the same error also occurs with Alpaka 0.7.0.

@fwyzard
Contributor Author

fwyzard commented Nov 18, 2021

After going through plenty of intermediate code generated by nvcc and preprocessed code generated by g++, I now suspect that the issue may be related to template arguments with the __restrict__ attribute.

Looking at the code snippets above, the kernel has the signature

template <typename T_Acc, typename Histo, typename T> ALPAKA_FN_ACC void countFromVector::operator()(
    const T_Acc &acc,
    Histo *__restrict__ h,
    uint32_t nh,
    T const *__restrict__ v,
    uint32_t const *__restrict__ offsets) const;

but the compiler is trying to cast the function that calls it to a function pointer of type (here T = short int)

void (*)(class alpaka::Vec<std::integral_constant<long unsigned int, 1>, unsigned int>,
    struct cms::alpakatools::countFromVector,
    class cms::alpakatools::HistoContainer<short int, 128, 49152, 16, short unsigned int, 10>*,
    unsigned int,
    const short int*,
    const unsigned int*)

without the __restrict__ attributes.
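A standalone sketch (not from the original report; plain g++ with the __restrict__ extension is enough to build it) of how the qualifier ends up in the deduced template arguments. It only shows that __restrict__ survives forwarding-reference deduction, matching the __restrict__& entries visible in the TArgs of the error message above.

#include <type_traits>

template <typename T>
void probe(T&&) {
  // For an lvalue of type "int* __restrict__", T is deduced as a reference to the
  // restrict-qualified pointer, not as a plain "int*&".
  static_assert(std::is_same<T, int* __restrict__ &>::value,
                "the __restrict__ qualifier is part of the deduced type");
}

int main() {
  int x = 0;
  int* __restrict__ p = &x;
  probe(p);
  return 0;
}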

@fwyzard
Contributor Author

fwyzard commented Nov 18, 2021

I found a much simpler way to reproduce the problem, modifying the vectorAdd example to exhibit the same issue:

diff --git a/example/vectorAdd/src/vectorAdd.cpp b/example/vectorAdd/src/vectorAdd.cpp
index 3b5de083a72d..07dcc547f66a 100644
--- a/example/vectorAdd/src/vectorAdd.cpp
+++ b/example/vectorAdd/src/vectorAdd.cpp
@@ -163,12 +163,15 @@ auto main() -> int
     VectorAddKernel kernel;
 
     // Create the kernel execution task.
+    Data const* __restrict__ const pBufAccA(alpaka::getPtrNative(bufAccA));
+    Data const* __restrict__ const pBufAccB(alpaka::getPtrNative(bufAccB));
+    Data* __restrict__ const pBufAccC(alpaka::getPtrNative(bufAccC));
     auto const taskKernel = alpaka::createTaskKernel<Acc>(
         workDiv,
         kernel,
-        alpaka::getPtrNative(bufAccA),
-        alpaka::getPtrNative(bufAccB),
-        alpaka::getPtrNative(bufAccC),
+        pBufAccA,
+        pBufAccB,
+        pBufAccC,
         numElements);
 
     // Enqueue the kernel execution task

The modified example compiles fine with GCC 9:

$ g++ --version
g++ (GCC) 9.2.1 20191120 (Red Hat 9.2.1-2)
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ nvcc -std=c++17 -O3 -x cu -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored --expt-relaxed-constexpr --expt-extended-lambda --generate-line-info --source-in-ptx -Iinclude -DALPAKA_ACC_GPU_CUDA_ENABLED -DALPAKA_ACC_GPU_CUDA_ONLY_MODE -DALPAKA_DEBUG=ALPAKA_DEBUG_FULL example/vectorAdd/src/vectorAdd.cpp -o vectorAdd

but fails with GCC 10:

$ g++ --version
g++ (GCC) 10.2.1 20201112 (Red Hat 10.2.1-8)
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ nvcc -std=c++17 -O3 -x cu -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored --expt-relaxed-constexpr --expt-extended-lambda --generate-line-info --source-in-ptx -Iinclude -DALPAKA_ACC_GPU_CUDA_ENABLED -DALPAKA_ACC_GPU_CUDA_ONLY_MODE -DALPAKA_DEBUG=ALPAKA_DEBUG_FULL example/vectorAdd/src/vectorAdd.cpp -o vectorAdd
include/alpaka/kernel/TaskKernelGpuUniformCudaHipRt.hpp: In instantiation of ‘static void alpaka::traits::Enqueue<alpaka::QueueUniformCudaHipRtBlocking, alpaka::TaskKernelGpuUniformCudaHipRt<TAcc, TDim, TIdx, TKernelFnObj, TArgs ...>, void>::enqueue(alpaka::QueueUniformCudaHipRtBlocking&, const alpaka::TaskKernelGpuUniformCudaHipRt<TAcc, TDim, TIdx, TKernelFnObj, TArgs ...>&) [with TAcc = alpaka::AccGpuCudaRt<std::integral_constant<long unsigned int, 1>, long unsigned int>; TDim = std::integral_constant<long unsigned int, 1>; TIdx = long unsigned int; TKernelFnObj = VectorAddKernel; TArgs = {const unsigned int* const __restrict__&, const unsigned int* const __restrict__&, unsigned int* const __restrict__&, const long unsigned int&}]’:
include/alpaka/queue/Traits.hpp:48:58:   required from ‘void alpaka::enqueue(TQueue&, TTask&&) [with TQueue = alpaka::QueueUniformCudaHipRtBlocking; TTask = const alpaka::TaskKernelGpuUniformCudaHipRt<alpaka::AccGpuCudaRt<std::integral_constant<long unsigned int, 1>, long unsigned int>, std::integral_constant<long unsigned int, 1>, long unsigned int, VectorAddKernel, const unsigned int* const __restrict__&, const unsigned int* const __restrict__&, unsigned int* const __restrict__&, const long unsigned int&>&]’
example/vectorAdd/src/vectorAdd.cpp:180:34:   required from here
include/alpaka/kernel/TaskKernelGpuUniformCudaHipRt.hpp:374:6: error: no matches converting function ‘uniformCudaHipKernel’ to type ‘void (*)(class alpaka::Vec<std::integral_constant<long unsigned int, 1>, long unsigned int>, struct VectorAddKernel, const unsigned int*, const unsigned int*, unsigned int*, long unsigned int)’
  374 |                 auto kernelName = uniform_cuda_hip::detail::
      |      ^         
include/alpaka/kernel/TaskKernelGpuUniformCudaHipRt.hpp:74:1: note: candidate is: ‘template<class TAcc, class TDim, class TIdx, class TKernelFnObj, class ... TArgs> void alpaka::uniform_cuda_hip::detail::uniformCudaHipKernel(alpaka::Vec<TDim, TVal>, TKernelFnObj, TArgs ...)’
   74 |             __global__ void uniformCudaHipKernel(
      | ^           ~~~~~~~~

@j-stephan
Member

Thanks for reporting this. Which CUDA version are you using here?

@psychocoderHPC
Member

@fwyzard I can reproduce the error with your mini example but currently do not understand why it is failing.

@psychocoderHPC
Member

If I change the alpaka code to be explicit and accept only the signature for the vector add kernel, everything works.

diff --git a/example/vectorAdd/src/vectorAdd.cpp b/example/vectorAdd/src/vectorAdd.cpp
index 3b5de083a7..7434805164 100644
--- a/example/vectorAdd/src/vectorAdd.cpp
+++ b/example/vectorAdd/src/vectorAdd.cpp
@@ -163,13 +163,10 @@ auto main() -> int
     VectorAddKernel kernel;
 
     // Create the kernel execution task.
-    auto const taskKernel = alpaka::createTaskKernel<Acc>(
-        workDiv,
-        kernel,
-        alpaka::getPtrNative(bufAccA),
-        alpaka::getPtrNative(bufAccB),
-        alpaka::getPtrNative(bufAccC),
-        numElements);
+    Data const *  const    pBufAccA(alpaka::getPtrNative(bufAccA));
+    Data const*   const  pBufAccB(alpaka::getPtrNative(bufAccB));
+    Data*  __restrict__ const  pBufAccC(alpaka::getPtrNative(bufAccC));
+    auto const taskKernel = alpaka::createTaskKernel<Acc>(workDiv, kernel, pBufAccA, pBufAccB, pBufAccC, numElements);
 
     // Enqueue the kernel execution task
     {
diff --git a/include/alpaka/kernel/TaskKernelGpuUniformCudaHipRt.hpp b/include/alpaka/kernel/TaskKernelGpuUniformCudaHipRt.hpp
index 6af0b62bfe..42632db89f 100644
--- a/include/alpaka/kernel/TaskKernelGpuUniformCudaHipRt.hpp
+++ b/include/alpaka/kernel/TaskKernelGpuUniformCudaHipRt.hpp
@@ -74,7 +74,11 @@ namespace alpaka
             __global__ void uniformCudaHipKernel(
                 Vec<TDim, TIdx> const threadElemExtent,
                 TKernelFnObj const kernelFnObj,
+#if 0
                 TArgs... args)
+#else
+                uint32_t const * const a,uint32_t const * const b,uint32_t* const  c,size_t s)
+#endif
             {
 #    if BOOST_ARCH_PTX && (BOOST_ARCH_PTX < BOOST_VERSION_NUMBER(2, 0, 0))
 #        error "Device capability >= 2.0 is required!"
@@ -83,12 +87,11 @@ namespace alpaka
                 const TAcc acc(threadElemExtent);
 
 // with clang it is not possible to query std::result_of for a pure device lambda created on the host side
-#    if !(BOOST_COMP_CLANG_CUDA && BOOST_COMP_CLANG)
-                static_assert(
-                    std::is_same<decltype(kernelFnObj(const_cast<TAcc const&>(acc), args...)), void>::value,
-                    "The TKernelFnObj is required to return void!");
-#    endif
+#if 0
                 kernelFnObj(const_cast<TAcc const&>(acc), args...);
+#else
+                kernelFnObj(const_cast<TAcc const&>(acc), a,b,c,s);
+#endif
             }
 
             template<typename TDim, typename TIdx>

IMO the error is not because __restrict__ and plain pointers do not match in their signatures. Am I overlooking something in my diff?

@psychocoderHPC
Member

@fwyzard Is there a benefit to declaring the pointers __restrict__ on the host side before the kernel is called? Is it not enough to mark the parameters of the kernel (functor) with __restrict__?

@psychocoderHPC
Member

I wrote my own remove_restrict type trait and patched alpaka, and the mini app now works.

diff --git a/example/vectorAdd/src/vectorAdd.cpp b/example/vectorAdd/src/vectorAdd.cpp
index 3b5de083a7..7434805164 100644
--- a/example/vectorAdd/src/vectorAdd.cpp
+++ b/example/vectorAdd/src/vectorAdd.cpp
@@ -163,13 +163,10 @@ auto main() -> int
     VectorAddKernel kernel;
 
     // Create the kernel execution task.
-    auto const taskKernel = alpaka::createTaskKernel<Acc>(
-        workDiv,
-        kernel,
-        alpaka::getPtrNative(bufAccA),
-        alpaka::getPtrNative(bufAccB),
-        alpaka::getPtrNative(bufAccC),
-        numElements);
+    Data const *  const    pBufAccA(alpaka::getPtrNative(bufAccA));
+    Data const*   const  pBufAccB(alpaka::getPtrNative(bufAccB));
+    Data*  __restrict__ const  pBufAccC(alpaka::getPtrNative(bufAccC));
+    auto const taskKernel = alpaka::createTaskKernel<Acc>(workDiv, kernel, pBufAccA, pBufAccB, pBufAccC, numElements);
 
     // Enqueue the kernel execution task
     {
diff --git a/include/alpaka/kernel/TaskKernelGpuUniformCudaHipRt.hpp b/include/alpaka/kernel/TaskKernelGpuUniformCudaHipRt.hpp
index 6af0b62bfe..ee98ba31b3 100644
--- a/include/alpaka/kernel/TaskKernelGpuUniformCudaHipRt.hpp
+++ b/include/alpaka/kernel/TaskKernelGpuUniformCudaHipRt.hpp
@@ -66,6 +66,18 @@ namespace alpaka
 {
     namespace uniform_cuda_hip
     {
+        template<typename A>
+        struct remove_restrict
+        {
+            using type = A;
+        };
+
+        template<typename A>
+        struct remove_restrict<A __restrict__>
+        {
+            using type = A;
+        };
+
         namespace detail
         {
             //! The GPU CUDA/HIP kernel entry point.
@@ -74,7 +86,7 @@ namespace alpaka
             __global__ void uniformCudaHipKernel(
                 Vec<TDim, TIdx> const threadElemExtent,
                 TKernelFnObj const kernelFnObj,
-                TArgs... args)
+                typename remove_restrict<TArgs>::type... args)
             {
 #    if BOOST_ARCH_PTX && (BOOST_ARCH_PTX < BOOST_VERSION_NUMBER(2, 0, 0))
 #        error "Device capability >= 2.0 is required!"

I will provide a PR.
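For reference, a self-contained sketch of the remove_restrict idea from the diff above, compilable with plain g++ (the specialization here matches restrict-qualified pointers directly; the exact form merged into alpaka by the pull request may differ):

#include <type_traits>

template <typename T>
struct remove_restrict {
    // default: leave the type untouched
    using type = T;
};

template <typename T>
struct remove_restrict<T* __restrict__> {
    // strip the __restrict__ qualifier from pointer types
    using type = T*;
};

template <typename T>
using remove_restrict_t = typename remove_restrict<T>::type;

// As in the diff above, the kernel entry point can declare its parameter pack as
// "typename remove_restrict<TArgs>::type... args", so that the generated signature
// matches the plain-pointer function pointer type expected on the host side.
static_assert(std::is_same<remove_restrict_t<int* __restrict__>, int*>::value, "");
static_assert(std::is_same<remove_restrict_t<int const* __restrict__>, int const*>::value, "");
static_assert(std::is_same<remove_restrict_t<float>, float>::value, "");

int main() { return 0; }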

psychocoderHPC added a commit to psychocoderHPC/alpaka that referenced this issue Nov 18, 2021
fix alpaka-group#1472

Provide a type trait to remove __restrict__ from a type.
@fwyzard
Contributor Author

fwyzard commented Nov 18, 2021

@fwyzard Is there a benefit to declaring the pointers __restrict__ on the host side before the kernel is called?

Probably not :-)
Checking whether it makes any difference is still on my to-do list ...
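If the host-side qualifier really brings no benefit, a possible application-side workaround (hedged sketch with made-up stand-in names, not alpaka API) is to pass plain-pointer copies to the task-kernel factory, so that the deduced TArgs no longer carry __restrict__ while the kernel's operator() keeps its own __restrict__ parameters:

#include <cstdint>

// Stand-in for alpaka::createTaskKernel, which deduces TArgs from its arguments
// via forwarding references; only the deduction behaviour matters here.
template <typename... TArgs>
void createTaskKernelLike(TArgs&&...) {}

// Stand-in for a host-side helper like fillManyFromVector.
void fillManyFromVectorLike(std::uint32_t* __restrict__ h,
                            std::uint32_t const* __restrict__ v) {
  // Copy into unqualified locals: the deduced TArgs become plain pointer types,
  // while the kernel functor can still declare its own __restrict__ parameters.
  std::uint32_t* hPlain = h;
  std::uint32_t const* vPlain = v;
  createTaskKernelLike(hPlain, vPlain);
}

int main() { return 0; }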

@fwyzard
Contributor Author

fwyzard commented Nov 18, 2021

I can confirm that #1474 fixes the compilation problem with GCC 10.

j-stephan pushed a commit that referenced this issue Nov 29, 2021
fix #1472

Provide a type trait to remove __restrict__ from a type.
psychocoderHPC added a commit that referenced this issue Dec 8, 2021
fix #1472

Provide a type trait to remove __restrict__ from a type.