Use of `std::sin` and `std::cos` in device code generates unwanted FP64 instructions #337

stephenswat · 2023-03-09T15:07:15Z

@krasznaa has recently been on a crusade to make traccc work with his non-FP64-compatible GPU (see e.g. #333 and #335). Instead of hunting these errors down manually, we can do this automatically (see #336). However, the way we have decided to program traccc and its dependencies (in particular detray) will make it difficult to completely eliminate the slow 64-bit instructions. Consider the following source code that is generated in fitting_algorithm.ptx:

.func  (.param .b32 func_retval0) cosf(
    .param .b32 cosf_param_0
)
{
    .local .align 4 .b8     __local_depot1092[28];
    .reg .b64   %SP;
    .reg .b64   %SPL;
    .reg .pred  %p<22>;
    .reg .f32   %f<44>;
    .reg .b32   %r<69>;
    .reg .f64   %fd<3>;
    .reg .b64   %rd<28>;


    mov.u64     %SPL, __local_depot1092;
    cvta.local.u64  %SP, %SPL;
    ld.param.f32    %f16, [cosf_param_0];
    bra.uni     $L__BB1092_1;

   ...

$L__BB1092_13:
    mov.u32     %r28, %r66;
    mov.u32     %r27, %r65;
    mov.u32     %r26, %r64;
    cvt.u64.u32     %rd15, %r27;
    shl.b64     %rd16, %rd15, 32;
    cvt.u64.u32     %rd17, %r28;
    or.b64      %rd18, %rd16, %rd17;
    cvt.rn.f64.s64  %fd1, %rd18;
    mul.f64     %fd2, %fd1, 0d3BF921FB54442D19;
    cvt.rn.f32.f64  %f3, %fd2;
    setp.ne.s32     %p13, %r26, 0;
    not.pred    %p14, %p13;
    mov.f32     %f38, %f3;
    @%p14 bra   $L__BB1092_15;
    bra.uni     $L__BB1092_14;

It is not hard to identify that the 64-bit floating point instructions are being generated as a result of the use of std::sin. There is a similar case with the use of std::cos. The canonical way of implementing this in CUDA, if single-precision does indeed provide sufficient precision, is to use the __sinf compiler intrinsic. Currently, we don't really have a way of controlling the implementation that is used, as this is abstracted away behind detray and algebra-plugins.

The text was updated successfully, but these errors were encountered:

krasznaa · 2023-03-09T15:26:38Z

Bull#$^... 😕 Double- and triple-check that we are not mistakenly providing double inputs to those trigonometric functions. I very much suspect that we are.

If anything, we may want to switch to using std::sinf and friends.

But in the end we shouldn't be using any of those. We'll need to make all of them use the trigonometric functions from:

https://github.com/acts-project/algebra-plugins/blob/main/math/common/include/algebra/math/common.hpp

sin/cos is not there yet, but there are for instance a number of places in our code where std::sqrt is used instead of algebra::math::sqrt.

stephenswat · 2023-03-09T15:36:25Z

I invite you to compile the following extremely trivial CUDA code and inspect the PTX:

#include <cmath>

__global__ void sins(float * f) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    f[tid] = std::sin(f[tid]);
}

nvcc -c --keep test.cu && cat test.ptx and you should find those 64-bit floating point instructions.

Then please consider the following lines:

https://github.com/acts-project/detray/blob/main/core/include/detray/tracks/detail/track_helper.hpp#L89
https://github.com/acts-project/detray/blob/main/core/include/detray/definitions/math.hpp#L24

That should clear up what's happening.

Looks like it's detray and not algebra plugins, but potato/potato.

stephenswat · 2023-03-09T15:40:37Z

Pinging @niermann999 @beomki-yeo.

stephenswat · 2023-03-09T17:48:53Z

Something else to mention: this effect goes away when using --use_fast_math. We'll need to check whether using pure 32-bit trigonometry provides us with the precision we need. If it does, we should consider whether we want to switch to using intrinsic trig functions whether we want to enable non-compliant math.

stephenswat · 2023-03-09T19:37:42Z

Okay, so after looking into this a bit more, the use of double-precision in the single-precision trigonometry functions is a relatively uncommon branch to cover subnormal floating point numbers. The difference in performance between std::sin and __sinf is small but certainly present; taking the sine of one billion floating point numbers on an A6000 is approximately 3% faster using the intrinsic. However, this is in a very simple kernel. Interestingly, the implementation using the standard library may have significantly higher register pressure. I cannot currently say anything about the liveness of those registers, but this effect could potentially impact the performance of trigonometry functions in code that is already bound by register pressure. According to the PTX standard, the use of the sine and cosine have a maximum absolute error of 2^-20.9 or 5.1 × 10^-7.

The way in which we proceed here should depend on our attitude towards the use of double-precision floating point numbers and our willingness to sacrifice performance for convenience. If we wish to completely eliminate double-precision floating point numbers, we should exclusively use the approximation intrinsics. For performance, this would also be preferable, but this would involve some work to incorporate it into detray and algebra plugins. The alternative would be to enable fast math, but this may have other unintended side-effects.

stephenswat added cuda Changes related to CUDA bug Something isn't working labels Mar 9, 2023

stephenswat added performance Performance-relevant changes and removed bug Something isn't working labels Mar 9, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use of `std::sin` and `std::cos` in device code generates unwanted FP64 instructions #337

Use of `std::sin` and `std::cos` in device code generates unwanted FP64 instructions #337

stephenswat commented Mar 9, 2023 •

edited

Loading

krasznaa commented Mar 9, 2023

stephenswat commented Mar 9, 2023 •

edited

Loading

stephenswat commented Mar 9, 2023

stephenswat commented Mar 9, 2023

stephenswat commented Mar 9, 2023 •

edited

Loading

Use of std::sin and std::cos in device code generates unwanted FP64 instructions #337

Use of std::sin and std::cos in device code generates unwanted FP64 instructions #337

Comments

stephenswat commented Mar 9, 2023 • edited Loading

krasznaa commented Mar 9, 2023

stephenswat commented Mar 9, 2023 • edited Loading

stephenswat commented Mar 9, 2023

stephenswat commented Mar 9, 2023

stephenswat commented Mar 9, 2023 • edited Loading

Use of `std::sin` and `std::cos` in device code generates unwanted FP64 instructions #337

Use of `std::sin` and `std::cos` in device code generates unwanted FP64 instructions #337

stephenswat commented Mar 9, 2023 •

edited

Loading

stephenswat commented Mar 9, 2023 •

edited

Loading

stephenswat commented Mar 9, 2023 •

edited

Loading