Testing and Benchmarking #1

orkolorko · 2022-11-08T22:33:11Z

This issue tracks testing and benchmarking of setting rounding modes through LLVM intrinsics.
At the moment the test set checks that directed rounding works; the tests pass on Linux, Windows and macOS,
and the CI runs them on all of these platforms.

Further tests

Check consistency with subnormal numbers
Test the different behaviour of RoundToNearest and RoundToZero
Check how llvm_setrounding behaves with respect to GPU (since it is using llvm intrinsics, this may be well behaved)
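One hedged way to validate the first two checks independently of the hardware rounding mode is to compare against an exact-arithmetic reference. The Python sketch below is an illustration only: `reference_add` is a hypothetical helper, not part of this package, and it ignores overflow, NaN and the sign of zero. It computes the correctly rounded sum for each mode from exact rationals, so the results obtained after changing the rounding mode could be checked against it. It assumes Python 3.9+ for `math.nextafter`.

```python
import math
from fractions import Fraction


def reference_add(a: float, b: float, mode: str) -> float:
    """Correctly rounded a + b for mode in {"nearest", "up", "down", "zero"}.

    Hypothetical test oracle: overflow, NaN and signed zeros are not handled.
    """
    s = a + b                          # Python floats round to nearest, ties to even
    exact = Fraction(a) + Fraction(b)  # exact rational value of the sum
    got = Fraction(s)                  # exact value of the rounded result
    if mode == "zero":
        # Rounding toward zero is rounding down for positive sums, up for negative ones.
        mode = "down" if exact > 0 else "up"
    if mode == "up" and got < exact:
        return math.nextafter(s, math.inf)
    if mode == "down" and got > exact:
        return math.nextafter(s, -math.inf)
    return s


# Nearest and toward-zero agree when the exact sum lies just above a float...
print(reference_add(1.0, 1e-17, "nearest"), reference_add(1.0, 1e-17, "zero"))
# ...and differ when it lies just below one.
print(reference_add(1.0, -1e-17, "nearest"), reference_add(1.0, -1e-17, "zero"))
```

Sums of subnormal numbers are exact in binary floating point, so for addition every mode should return the same value there; a test can assert exactly that.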

Benchmarks

Test against SetRounding.jl (I expect the performance to be essentially the same)
Test against RoundingEmulator.jl (probably faster)
Implement and test the various options in "Interval arithmetic with fixed rounding mode"; I think their algorithm with only rounding to nearest is similar to the one in RoundingEmulator.jl


lucaferranti · 2022-11-09T06:39:18Z

> Test against RoundingEmulator.jl (probably faster)

For individual operations, e.g. 0.1 + 0.1, I expect RoundingEmulator.jl to be faster. However, I expect that for vector operations (e.g. summing two vectors of length 1000) changing the rounding mode will become faster: you change the mode only once and then use ordinary floating-point operations, whereas the rounding emulator needs at least 4x as many operations.

> I think their algorithm with only rounding to nearest is similar to the one in RoundingEmulator.jl

The round-to-nearest algorithms in the paper actually emulate the use of prevfloat and nextfloat, whereas RoundingEmulator.jl is based on error-free transformations (EFT).
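To make the distinction concrete, here is a minimal Python sketch of the two ideas for an upward-rounded addition. It is an illustration under stated assumptions, not the code of RoundingEmulator.jl or of the paper: the function names are made up, `math.nextafter` (Python 3.9+) stands in for Julia's prevfloat/nextfloat, and overflow is ignored.

```python
import math


def two_sum(a: float, b: float):
    """Knuth's TwoSum, an error-free transformation.

    Returns (s, e) with s = fl(a + b) and a + b == s + e exactly
    (barring overflow). It costs 6 floating-point operations.
    """
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e


def add_up_eft(a: float, b: float) -> float:
    """EFT-style emulation: step up only when the rounded sum is too small."""
    s, e = two_sum(a, b)
    return math.nextafter(s, math.inf) if e > 0 else s


def add_up_widen(a: float, b: float) -> float:
    """nextfloat-style emulation: always step up, even when the sum is exact."""
    return math.nextafter(a + b, math.inf)


for a, b in [(0.1, 0.1), (0.1, 0.2), (1.0, 1e-17)]:
    print(a, b, add_up_eft(a, b), add_up_widen(a, b))
```

Both functions return a valid upper bound. The EFT version returns the tightest one at the price of extra operations per addition, which is consistent with the cost argument above; the widening version is cheaper but can overestimate by one unit in the last place.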

lucaferranti · 2022-11-09T06:46:01Z

Also, I think testing/benchmarking with multiple threads would be interesting (to my understanding, one issue with the deprecated setrounding is that it wasn't thread-safe).

lucaferranti · 2022-11-09T12:24:25Z

cc @lbenet, I think you'll find this very interesting

orkolorko · 2022-11-10T03:46:37Z

I think thread safety is going to be platform dependent; from the LLVM discussion:

> This change implements only IR intrinsic. Lowering it to machine code is target-specific and will be implemented latter.

Some architectures, such as AMD and NVIDIA GPUs, and AVX-512 implement a static rounding mode (see the Intel® Architecture Instruction Set Extensions Programming Reference, page 2-8), i.e., the rounding mode is specified instruction by instruction. I would like to understand how LLVM behaves in these cases, i.e., whether setting the rounding mode through the compiler intrinsic causes the static rounding mode to be compiled into the code.

What worries me most is that having these two mechanisms coexist could lead to serious problems:

if the rounding mode is set on the processor once and for all, we get the expected behavior; the question is then the granularity of the control (is it set per thread? per core? per processor? per machine?)
if the rounding mode is compiled into the code by using the static rounding mode (I don't know if more recent machines have FPGA that support this), everything should be thread-safe, but this would be a problem for Rump matrix multiplication, since we are not recompiling LAPACK but using it as an external library
even worse, we could get a mix of the two behaviors depending on the platform used (or even on the registers inside the processor), i.e., we could see different behavior depending on whether LAPACK is compiled with support for AVX-512 registers or not

orkolorko · 2022-11-10T05:47:46Z

Hi @lucaferranti, I implemented some benchmarks (some of them are now run as tests by the CI).
Some good news:

RoundingEmulator.jl is faster on a single operation, as you predicted, and slower on a batch of operations
changing the rounding mode seems to be thread-safe: the test assigns different rounding modes to different threads and checks that every thread rounds the way it is expected to, i.e., that changing the rounding mode on one thread does not interfere with the other threads. Correction: ~~the tests are failing on Windows when multithreading is involved~~ the tests run fine on my local Windows machine, so there may be some problem in the CI test environment (maybe a wrong environment variable, or some virtual machine issue)
I tested BLAS.dot to see whether changing the rounding mode also changes the rounding mode of the BLAS library, and it behaves as expected

The benchmarks are scripts in the benchmark directory, to be called from the command line; there are new testsets in runtest.jl to check the multithreading behavior and the behavior of BLAS.

CUDA: It is possible to get the rounding mode by using the llvm.flt.rounds intrinsic, but the llvm.set.rounding intrinsic is not working.
