Testing and Benchmarking #1

Open
orkolorko opened this issue Nov 8, 2022 · 5 comments

Comments

@orkolorko
Owner

This issue tracks testing and benchmarking of setting rounding modes through LLVM intrinsics.
At the moment the test set checks that directed rounding works; the tests pass on Linux, Windows and macOS,
and the CI runs them on all of these platforms.

Further tests

  1. Check consistency with subnormal numbers
  2. Test the different behaviour of RoundToNearest and RoundToZero
  3. Check how llvm_setrounding behaves on GPUs (since it uses LLVM intrinsics, it may be well behaved)
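
For items 1 and 2, one possible way to obtain reference values that do not depend on the hardware rounding mode is to evaluate the operation exactly in BigFloat and round once to Float64. This is only a sketch: `ref_mul` is a hypothetical helper, not part of this package, and it assumes the product of two Float64 values fits exactly in the default BigFloat precision (it needs at most 106 bits).

```julia
# Hypothetical reference oracle (not part of the package): evaluate the product
# exactly in BigFloat, then round once to Float64 in the requested mode.
ref_mul(a::Float64, b::Float64, mode::RoundingMode) = Float64(big(a) * big(b), mode)

tiny = nextfloat(0.0)   # smallest positive subnormal, 2^-1074

# tiny * 0.5 = 2^-1075 lies strictly between 0 and the smallest subnormal,
# so every rounding mode has to choose between 0 and tiny (or -tiny).
for mode in (RoundNearest, RoundToZero, RoundUp, RoundDown)
    println(mode, ": ", ref_mul(tiny, 0.5, mode), "  ", ref_mul(-tiny, 0.5, mode))
end
```

The values produced under a hardware rounding mode could then be compared against this oracle; on the negative input RoundToZero should agree with RoundUp, and on the positive input with RoundDown.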

Benchmarks

  1. Test against SetRounding.jl (I expect the performance to be essentially the same)
  2. Test against RoundingEmulator.jl (probably faster)
  3. Implement and test the various options in "Interval arithmetic with fixed rounding mode"; I think their algorithm with only rounding to nearest is similar to the one in RoundingEmulator.jl
@lucaferranti
Collaborator

lucaferranti commented Nov 9, 2022

> Test against RoundingEmulator.jl (probably faster)

For individual operations, e.g. 0.1 + 0.1, I expect RoundingEmulator.jl to be faster. However, for vector operations (e.g. summing two vectors of length 1000) I expect changing the rounding mode to become faster: you change the mode only once and then use ordinary floating-point operations, whereas the rounding emulator needs at least 4x as many operations.
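
To make the operation count concrete, here is a minimal sketch of an emulated upward-rounded addition built on Knuth's TwoSum. The names `twosum` and `add_up_emulated` are illustrative, this is not RoundingEmulator.jl's actual code, and overflow is deliberately ignored. Each element costs six floating-point operations plus a comparison, versus a single addition once the hardware rounding mode has been switched.

```julia
# Knuth's TwoSum: s is a + b rounded to nearest, e is the exact rounding error,
# so that a + b == s + e holds exactly (assuming no overflow).
function twosum(a::Float64, b::Float64)
    s = a + b
    bv = s - a
    e = (a - (s - bv)) + (b - bv)
    return s, e
end

# Emulated round-up addition: if the exact sum lies above s, step to the next float.
function add_up_emulated(a::Float64, b::Float64)
    s, e = twosum(a, b)
    return e > 0 ? nextfloat(s) : s
end

x = rand(1000)
y = rand(1000)
z_up = add_up_emulated.(x, y)   # per-element emulation
z_nearest = x .+ y              # one addition per element
@assert all(z_up .>= z_nearest)
```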

> I think their algorithm with only rounding to nearest is similar to the one in RoundingEmulator.jl

The round-to-nearest algorithms in the paper actually emulate the use of prevfloat and nextfloat, whereas RoundingEmulator.jl is based on error-free transformations (EFT).
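
As a rough illustration of the prevfloat / nextfloat idea (using the built-in functions directly, not the paper's arithmetic-only emulation of them), the round-to-nearest result can be widened by one float on each side. `add_enclosure` is a hypothetical name, and the check assumes the sum of the two inputs is exact at the default BigFloat precision.

```julia
# Enclose a + b using only round-to-nearest: the exact sum is within half an ulp
# of s, so [prevfloat(s), nextfloat(s)] always contains it.
function add_enclosure(a::Float64, b::Float64)
    s = a + b
    return prevfloat(s), nextfloat(s)
end

lo, hi = add_enclosure(0.1, 0.2)
exact = big(0.1) + big(0.2)   # exact for these inputs at the default precision
@assert lo <= exact <= hi
```

The price is width: the enclosure spans two floats even when the sum is exact (for 1.0 + 2.0, say), whereas an EFT-based approach can detect a zero rounding error and return the result unchanged.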

@lucaferranti
Collaborator

lucaferranti commented Nov 9, 2022

Also, I think testing / benchmarking with multiple threads would be interesting (to my understanding, one issue with the deprecated setrounding is that it wasn't thread safe).

@lucaferranti
Collaborator

cc @lbenet, I think you'll find this very interesting

@orkolorko
Owner Author

I think thread safety is going to be platform dependent; from the LLVM discussion:

> This change implements only IR intrinsic. Lowering it to machine code is
> target-specific and will be implemented latter.

Some architectures, such as AMD and NVIDIA GPUs and the AVX-512 instruction set, implement a static rounding mode (Intel® Architecture Instruction Set Extensions Programming Reference, page 2-8), i.e., the rounding mode is specified instruction by instruction. I would like to understand how LLVM behaves in these cases, i.e., whether setting the rounding mode through the compiler intrinsic causes the static rounding mode to be compiled into the code.

What worries me most is that having these two mechanisms coexist could lead to problems:

  • if the rounding mode is set on the processor once and for all, we get the expected behavior; the question is then the granularity of the control (is it set per thread? per core? per processor? per machine?)
  • if the rounding mode is compiled into the code by using the static rounding mode (I don't know if more recent machines have FPGA that support this), everything should be thread safe, but this would be a problem for Rump's matrix multiplication, since we are not recompiling LAPACK but using it as an external library
  • even worse, we could get a mix of the two behaviors depending on the platform used (or even on the registers inside the processor), i.e., we could see different behavior depending on whether LAPACK is compiled with AVX-512 support or not

@orkolorko
Owner Author

orkolorko commented Nov 10, 2022

Hi @lucaferranti, I implemented some benchmarks (some of them are now run as tests by the CI).
Some good news:

  • RoundingEmulator.jl is faster on a single operation, as you predicted, and slower on a batch of operations
  • changing the rounding mode seems to be thread safe: the test assigns different rounding modes to different threads and checks that every thread rounds the way it is expected to, so that changing the rounding mode on one thread does not interfere with the other threads. Correction: the tests are failing on Windows when multithreading is involved. They run fine on my local Windows machine, so this may be a problem in the CI test environment (maybe a wrong environment variable, or a virtual machine issue)
  • I tested BLAS.dot to see whether changing the rounding mode also changes the rounding mode of the BLAS library, and it behaves as expected

The benchmarks are scripts in the benchmark directory, to be called from the command line; there are new testsets in runtest.jl to check the multithreading behavior and the behavior of BLAS.

CUDA: It is possible to get the rounding mode by using the llvm.flt.rounds intrinsic, but the llvm.set.rounding intrinsic is not working.
