
Enable computation of CUDA global kernels derivative in reverse mode #1059

Merged: 12 commits into vgvassilev:master on Sep 5, 2024

Conversation

kchristin22 (Collaborator)

This PR closes #1036. CUDA kernels can now be passed as arguments to clad::gradient, and the derived kernels can be executed successfully.

Details on issues faced and the final approach can be found here: #1036 (comment).

The old PR's approach involved using CUDA API libraries to compile and load the computed kernel on the GPU.
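
For reference, a minimal sketch of the usage this PR enables (the kernel, the buffer names, the seeding, and the derived-function argument order are illustrative assumptions rather than code taken from the clad test suite):

```cpp
#include "clad/Differentiator/Differentiator.h"
#include <cuda_runtime.h>

__global__ void square(double* x) { *x *= *x; }

int main() {
  // Reverse-mode differentiation of a __global__ kernel.
  auto grad = clad::gradient(square);

  double x = 3.0, seed = 1.0;
  double *d_x = nullptr, *d_dx = nullptr;
  cudaMalloc(&d_x, sizeof(double));
  cudaMalloc(&d_dx, sizeof(double));
  cudaMemcpy(d_x, &x, sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(d_dx, &seed, sizeof(double), cudaMemcpyHostToDevice);

  // Derived kernels are launched with execute_kernel (not execute);
  // grid and block dimensions are ordinary runtime values.
  grad.execute_kernel(dim3(1), dim3(1), d_x, d_dx);
  cudaDeviceSynchronize();

  cudaFree(d_x);
  cudaFree(d_dx);
  return 0;
}
```

An overload that also takes a shared-memory size and a cudaStream_t is discussed further down in this conversation.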

kchristin22 self-assigned this Aug 25, 2024
codecov bot commented Aug 25, 2024

Codecov Report

Attention: Patch coverage is 92.59259% with 2 lines in your changes missing coverage. Please review.

Project coverage is 94.38%. Comparing base (685bcbf) to head (c42fd53).
Report is 1 commit behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| include/clad/Differentiator/Differentiator.h | 84.61% | 2 Missing ⚠️ |

@@            Coverage Diff             @@
##           master    #1059      +/-   ##
==========================================
- Coverage   94.40%   94.38%   -0.02%     
==========================================
  Files          55       55              
  Lines        8430     8445      +15     
==========================================
+ Hits         7958     7971      +13     
- Misses        472      474       +2     
| Files with missing lines | Coverage Δ |
|---|---|
| lib/Differentiator/DiffPlanner.cpp | 98.77% <100.00%> (+0.02%) ⬆️ |
| lib/Differentiator/ReverseModeVisitor.cpp | 97.92% <100.00%> (+<0.01%) ⬆️ |
| include/clad/Differentiator/Differentiator.h | 80.00% <84.61%> (-3.34%) ⬇️ |

github-actions bot left a comment

clang-tidy made some suggestions

@@ -210,10 +235,34 @@
printf("CladFunction is invalid\n");
return static_cast<return_type_t<F>>(return_type_t<F>());
}
if (m_CUDAkernel) {
printf("Use execute_kernel() for global CUDA kernels\n");

warning: do not call c-style vararg functions [cppcoreguidelines-pro-type-vararg]

        printf("Use execute_kernel() for global CUDA kernels\n");
        ^

lib/Differentiator/DiffPlanner.cpp (outdated review thread, resolved)
int numArgs = static_cast<int>(call->getNumArgs());
if (numArgs > 4) {
auto kernelArgIdx = numArgs - 1;
auto cudaKernelFlag = new (C) CXXBoolLiteralExpr(

warning: initializing non-owner 'CXXBoolLiteralExpr *' with a newly created 'gsl::owner<>' [cppcoreguidelines-owning-memory]

      auto cudaKernelFlag = new (C) CXXBoolLiteralExpr(
      ^

github-actions bot left a comment

clang-tidy made some suggestions

lib/Differentiator/DiffPlanner.cpp (outdated review thread, resolved)
kchristin22 (Collaborator, Author) commented Aug 27, 2024

Codecov Report

Attention: Patch coverage is 89.47368% with 2 lines in your changes missing coverage. Please review.

Project coverage is 94.32%. Comparing base (6f4b081) to head (610850f).
Report is 4 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| include/clad/Differentiator/Differentiator.h | 75.00% | 2 Missing ⚠️ |

@@            Coverage Diff             @@
##           master    #1059      +/-   ##
==========================================
- Coverage   94.34%   94.32%   -0.02%     
==========================================
  Files          55       55              
  Lines        8311     8323      +12     
==========================================
+ Hits         7841     7851      +10     
- Misses        470      472       +2     

| Files | Coverage Δ |
|---|---|
| lib/Differentiator/DiffPlanner.cpp | 98.76% <100.00%> (+0.01%) ⬆️ |
| lib/Differentiator/VisitorBase.cpp | 96.88% <100.00%> (+0.01%) ⬆️ |
| include/clad/Differentiator/Differentiator.h | 80.00% <75.00%> (-3.34%) ⬇️ |

I have added such a test in GradientKernels.cu to cover these two lines indicated in Differentiator.h:
test.execute(d_a, d_square);
But Codecov doesn't trace it. Does it need to be included in an already present test file?

vgvassilev (Owner)

> I have added such a test in GradientKernels.cu to cover these two lines indicated in Differentiator.h:
> test.execute(d_a, d_square);
> But Codecov doesn't trace it. Does it need to be included in an already present test file?

I think the Differentiator.h is difficult to test right now. Let's ignore this report as we know it's covered. I need to figure out how to run the cuda tests, too.

kchristin22 (Collaborator, Author)

> I think the Differentiator.h is difficult to test right now. Let's ignore this report as we know it's covered.

Agreed.

> I need to figure out how to run the cuda tests, too.

You mean locally?

vgvassilev (Owner)

> I need to figure out how to run the cuda tests, too.
>
> You mean locally?

No, adding a bot that has a CUDA device to run the code we develop. That's an action item on me :)

typename std::enable_if<EnablePadding, bool>::type = true>
CUDA_HOST_DEVICE void
execute_with_default_args(list<Rest...>, F f, list<fArgTypes...>, dim3 grid,
dim3 block, size_t shared_mem, cudaStream_t stream,
vgvassilev (Owner)


Can we wrap that in a CUDA-specific macro, say CUDA_ARGS, which gets expanded only in CUDA mode?

This will allow you to wrap cudaLaunchKernel and friends with #ifdef __CUDACC__ and avoid duplicating the already overly complicated templates...
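
A minimal sketch of that CUDA_ARGS idea, assuming a launch through cudaLaunchKernel (the macro and parameter names are illustrative, not clad's actual implementation):

```cpp
#include <cstddef>
#include <utility>
#ifdef __CUDACC__
#include <cuda_runtime.h>
// In CUDA mode the launch configuration becomes part of the signature.
#define CUDA_ARGS dim3 grid, dim3 block, size_t shared_mem, cudaStream_t stream,
#else
// In host-only mode the macro expands to nothing, so the same template
// declaration compiles without any CUDA types.
#define CUDA_ARGS
#endif

template <typename F, typename... Args>
void execute_with_default_args(F f, CUDA_ARGS Args&&... args) {
#ifdef __CUDACC__
  // Kernel launch path (assumes at least one kernel argument for brevity).
  void* arg_ptrs[] = {(void*)&args...};
  cudaLaunchKernel((void*)f, grid, block, arg_ptrs, shared_mem, stream);
#else
  // Plain host call when not compiling as CUDA.
  f(std::forward<Args>(args)...);
#endif
}
```

This keeps a single template definition while confining cudaLaunchKernel and friends to the __CUDACC__ branch.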

vgvassilev (Owner)

Alternatively, we can go even deeper and assume that in CUDA mode Args contains size_t shared_mem and cudaStream_t stream, and extract them before forwarding the rest.

kchristin22 (Collaborator, Author)

Done

parth-07 (Collaborator) left a comment

Thank you for this pull request. Overall, this looks great!

test/CUDA/GradientKernels.cu (review thread, resolved)
@@ -210,10 +235,34 @@ inline CUDA_HOST_DEVICE unsigned int GetLength(const char* code) {
printf("CladFunction is invalid\n");
return static_cast<return_type_t<F>>(return_type_t<F>());
}
if (m_CUDAkernel) {
printf("Use execute_kernel() for global CUDA kernels\n");
parth-07 (Collaborator)

We should probably assert-out if users use execute instead of execute_kernel for CUDA kernels. 'printf' seems to be too subtle for reporting such a big error. @vgvassilev What do you think?

vgvassilev (Owner)

Is that a programmatic error or a user error? We can use assert/abort if the error is a programmer error.

kchristin22 (Collaborator, Author)

It is a user error basically. They should use the appropriate execute function depending on whether their function is a CUDA kernel or not.

template <typename... Args, class FnType = CladFunctionType>
typename std::enable_if<!std::is_same<FnType, NoFunction*>::value,
return_type_t<F>>::type
execute_kernel(dim3 grid, dim3 block, size_t shared_mem,
parth-07 (Collaborator)

Do you think it would be a good idea to provide overloads of execute_kernel that take default values for the shared_mem and stream args?

kchristin22 (Collaborator, Author)

Sure, I can use the default values of 0 and nullptr.

kchristin22 (Collaborator, Author)

Done, but I needed to use constexpr for this, which is C++17. So the CUDA tests cannot run with an older C++ version. @vgvassilev, is that too big of a problem?

parth-07 (Collaborator)

Why do we need constexpr for having default values for shared_mem and stream?

kchristin22 (Collaborator, Author)

Since the signature of execute_kernel with default args is dim3, dim3, Args&&..., the overload chosen when the args should not be defaulted was that one, instead of the correct dim3, dim3, size_t, cudaStream_t, Args&&....

Hence, I kept only the first signature and check whether a cudaStream_t arg is included in Args; otherwise, I use default values when calling execute_with_default_args.
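
A minimal sketch of that dispatch, assuming a small trait to detect a cudaStream_t in the pack and a hypothetical launch helper standing in for execute_with_default_args (all names and details are illustrative, not clad's actual code; it needs C++17 for if constexpr, matching the constraint mentioned above):

```cpp
#include <cstddef>
#include <cuda_runtime.h>
#include <type_traits>
#include <utility>

// Detect whether the argument pack carries a cudaStream_t anywhere.
template <typename... Args> struct contains_stream : std::false_type {};
template <typename T, typename... Rest>
struct contains_stream<T, Rest...>
    : std::conditional_t<std::is_same_v<std::decay_t<T>, cudaStream_t>,
                         std::true_type, contains_stream<Rest...>> {};

// Hypothetical launch helper (stands in for execute_with_default_args).
template <typename F, typename... Args>
void launch(F f, dim3 grid, dim3 block, size_t shared_mem,
            cudaStream_t stream, Args&&... args) {
  void* arg_ptrs[] = {(void*)&args...}; // assumes at least one kernel argument
  cudaLaunchKernel((void*)f, grid, block, arg_ptrs, shared_mem, stream);
}

// Single user-facing signature: dim3, dim3, Args&&...
template <typename F, typename... Args>
void execute_kernel(F f, dim3 grid, dim3 block, Args&&... args) {
  if constexpr (contains_stream<Args...>::value) {
    // The caller passed shared_mem and stream at the front of Args.
    launch(f, grid, block, std::forward<Args>(args)...);
  } else {
    // Fall back to defaults: 0 bytes of shared memory, the default stream.
    launch(f, grid, block, static_cast<size_t>(0),
           static_cast<cudaStream_t>(nullptr), std::forward<Args>(args)...);
  }
}
```

Clad's real execute_kernel is a CladFunction member and does not take the kernel as a parameter; the free-function form here only keeps the sketch self-contained.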

kchristin22 (Collaborator, Author)

@vgvassilev I think it's better to not require the grid parameters to be evaluated at compile time. CUDA doesn't require that in general. So, we can't do this:
execute_kernel<grid, block, shared_mem, stream>(...)

lib/Differentiator/VisitorBase.cpp (outdated review thread, resolved)
lib/Differentiator/DiffPlanner.cpp (outdated review thread, resolved)
parth-07 (Collaborator) left a comment

Looks good to me.

parth-07 (Collaborator) commented Sep 5, 2024

@kchristin22 Can you please rebase the PR on top of master?

vgvassilev merged commit 431791f into vgvassilev:master on Sep 5, 2024
88 of 89 checks passed

Successfully merging this pull request may close these issues.

Call to clad::gradient for CUDA kernels cannot be compiled for GPU
3 participants