Use of std::sin and std::cos in device code generates unwanted FP64 instructions #337
Comments
Bull#$^... 😕 Double- and triple-check that we are not mistakenly providing [...]. If anything, we may want to switch to using [...]. But in the end we shouldn't be using any of those. We'll need to make all of them use the trigonometric functions from: [...]. sin/cos is not there yet, but there are for instance a number of places in our code where [...]
I invite you to compile the following extremely trivial CUDA code and inspect the PTX:

#include <cmath>

__global__ void sins(float * f) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    f[tid] = std::sin(f[tid]);
}
Then please consider the following lines: https://github.com/acts-project/detray/blob/main/core/include/detray/tracks/detail/track_helper.hpp#L89

That should clear up what's happening. Looks like it's detray and not algebra plugins, but potato/potato.

Pinging @niermann999 @beomki-yeo.
Something else to mention: this effect goes away when using [...]
Okay, so after looking into this a bit more, the use of double-precision in the single-precision trigonometry functions is a relatively uncommon branch to cover subnormal floating point numbers. The difference in performance between [...]

The way in which we proceed here should depend on our attitude towards the use of double-precision floating point numbers and our willingness to sacrifice performance for convenience. If we wish to completely eliminate double-precision floating point numbers, we should exclusively use the approximation intrinsics. For performance, this would also be preferable, but this would involve some work to incorporate it into detray and algebra plugins. The alternative would be to enable fast math, but this may have other unintended side effects.
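As a rough illustration of the intrinsic route, the sketch below rewrites the trivial kernel shown earlier in this thread so that device code calls `__sinf`, while host code falls back to `std::sin`. The names `TRIG_HOST_DEVICE`, `approx_sin` and `sins_approx`, and the extra bounds argument `n`, are illustrative assumptions and not part of detray or algebra-plugins. `__sinf` trades accuracy for speed, so this only applies if the reduced precision is acceptable.

```cpp
#include <cmath>

// Builds as CUDA with nvcc, or as plain C++ for host-only checks.
#ifdef __CUDACC__
#define TRIG_HOST_DEVICE __host__ __device__
#else
#define TRIG_HOST_DEVICE
#endif

// Hypothetical helper (not a detray/algebra-plugins function):
// single-precision sine that uses the approximation intrinsic on the device.
TRIG_HOST_DEVICE inline float approx_sin(float x) {
#ifdef __CUDA_ARCH__
    return __sinf(x);    // device: fast, reduced-accuracy intrinsic
#else
    return std::sin(x);  // host: float overload from <cmath>
#endif
}

#ifdef __CUDACC__
// Same kernel as above, with a bounds check added and the helper swapped in.
__global__ void sins_approx(float* f, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        f[tid] = approx_sin(f[tid]);
    }
}
#endif
```

Compiling this to PTX and searching for `.f64` should show whether any double-precision instructions remain for this kernel.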
@krasznaa has recently been on a crusade to make traccc work with his non-FP64-compatible GPU (see e.g. #333 and #335). Instead of hunting these errors down manually, we can do this automatically (see #336). However, the way we have decided to program traccc and its dependencies (in particular detray) will make it difficult to completely eliminate the slow 64-bit instructions. Consider the following source code that is generated in fitting_algorithm.ptx: [...]

It is not hard to identify that the 64-bit floating point instructions are being generated as a result of the use of std::sin. There is a similar case with the use of std::cos. The canonical way of implementing this in CUDA, if single-precision does indeed provide sufficient precision, is to use the __sinf compiler intrinsic. Currently, we don't really have a way of controlling the implementation that is used, as this is abstracted away behind detray and algebra-plugins.
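One conceivable way to regain control over the implementation would be a compile-time switch in the math layer that the tracking code calls into. The sketch below is purely hypothetical: `trig_policy`, `math_sin`, `math_cos` and `MATH_HOST_DEVICE` do not exist in detray or algebra-plugins, and the design is only one of several options.

```cpp
#include <cmath>

#ifdef __CUDACC__
#define MATH_HOST_DEVICE __host__ __device__
#else
#define MATH_HOST_DEVICE
#endif

// Hypothetical policy tag; not an existing detray/algebra-plugins type.
enum class trig_policy { precise, approximate };

// Sine with a selectable device implementation. The host always uses std::sin.
template <trig_policy P = trig_policy::precise>
MATH_HOST_DEVICE inline float math_sin(float x) {
#ifdef __CUDA_ARCH__
    // approximate: intrinsic; precise: sinf, which may still contain the
    // FP64 slow path reported in this issue.
    return (P == trig_policy::approximate) ? __sinf(x) : sinf(x);
#else
    return std::sin(x);
#endif
}

// Cosine counterpart, following the same pattern.
template <trig_policy P = trig_policy::precise>
MATH_HOST_DEVICE inline float math_cos(float x) {
#ifdef __CUDA_ARCH__
    return (P == trig_policy::approximate) ? __cosf(x) : cosf(x);
#else
    return std::cos(x);
#endif
}
```

A caller would then pick the variant explicitly, for example `math_sin<trig_policy::approximate>(phi)`, and a project-wide default could be set in one place instead of through a global fast-math flag.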