Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for simde #7118

Open
wants to merge 2 commits into
base: dev
Choose a base branch
from
Open

Conversation

Calandracas606
Copy link

I noticed that the performance of rawtherapee on aarch64 was somewhat disappointing.

This is understandable, since it is highly optimized with SSE2 intrinsics.

I have attempted to improve the performance by using the simde library, which provides portable SIMD intrinsics, and translates them to native intrinsics for a different platform, in my case, arm NEON. This will allow other platforms to make use of the tremendous effort rawtherapee has put into writing highly optimized algorithms

I have observed > 2x performance improvements in some scenarios.

For example on my M2 Macbook Air, running linux, with a 50MP ILCE-1 raw image:

  • amaze demosaic (before) ~800ms
  • amaze demosaic (after) ~250ms

Admittedly, this patch is pretty hacky, and mostly just a proof of concept.

The replacement of #ifdef __SSE2__ with #if defined(__SSE2__) || defined(RT_SIMDE) is quite ham fisted, and should probably be replaced with a much more concise macro.

If this is something that you are interested in, please let me know what changes need to be made to the PR (I'm sure there will be many)

@Benitoite
Copy link
Contributor

Benitoite commented Jun 28, 2024

Would this also be applicable to macOS? I tried compiling with Apple Clang but get about a dozen of the following:

In file included from /Users/rb/repo-rt/rtengine/canon_cr3_decoder.cc:67:
In file included from /Users/rb/repo-rt/rtengine/dcraw.h:25:
In file included from /Users/rb/repo-rt/rtengine/myfile.h:26:
In file included from /Users/rb/repo-rt/rtengine/opthelper.h:26:
In file included from /Users/rb/repo-rt/rtengine/sleefsseavx.h:18:
/Users/rb/repo-rt/rtengine/helpersse2.h:264:12: error: argument to '__builtin_neon_vshrq_n_v' must be a constant integer
  264 |     return _mm_srli_epi32(x, c);
      |            ^~~~~~~~~~~~~~~~~~~~
/opt/homebrew/include/simde/x86/sse2.h:6312:35: note: expanded from macro '_mm_srli_epi32'
 6312 |   #define _mm_srli_epi32(a, imm8) simde_mm_srli_epi32(a, imm8)
      |                                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/homebrew/include/simde/x86/sse2.h:6291:11: note: expanded from macro 'simde_mm_srli_epi32'
 6291 |           vshrq_n_u32(simde__m128i_to_neon_u32(a), ((imm8) & 31) | (((imm8) & 31) == 0))))
      |           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/16/include/arm_neon.h:25271:24: note: expanded from macro 'vshrq_n_u32'
 25271 |   __ret = (uint32x4_t) __builtin_neon_vshrq_n_v((int8x16_t)__s0, __p1, 50); \
       |                        ^                                         ~~~~

@Calandracas606
Copy link
Author

Would this also be applicable to macOS? I tried compiling with Apple Clang but get about a dozen of the following:

I have originally only tested with gcc on linux, both x86_64 and aarch64

Using clang18 on aarch64 linux I'm getting the same errors as you. I will need to investigate further

@Calandracas606
Copy link
Author

gcc vs clang: https://godbolt.org/z/E31zfMMvj

@Lawrence37 Lawrence37 added the scope: performance Performance issues and improvements label Jun 29, 2024
@Calandracas606
Copy link
Author

Would this also be applicable to macOS? I tried compiling with Apple Clang ...

It looks like a fix for this on clang would be non-trivial. I'm not sure why gcc allows __builtin_neon_vshrq_n_v to be called with a non-const argument, but clang does not.

I will look at sleef for an alternate aarch64 native implementation of the required functions.

If this is something that Rawtherapee would be interested using, I think the best approach would be to make this gcc only (for now). That would keep this PR simple, and a future PR could add clang support.

Thoughts?

@Benitoite
Copy link
Contributor

gcc allows __builtin_neon_vshrq_n_v to be called with a non-const argument, but clang does not.

Clang is following the NEON intrinsic definition, which is for passing an immediate, which means compile-time constant (not an integer). Each of those particular failing neon calling functions would, if I understand it correctly, require passing those rejected integer values as compile-time constants.
Not sure if this error is affecting every call in all of the modified files.

cf
https://developer.arm.com/documentation/den0018/a/NEON-Intrinsics-Reference/NEON-intrinsics-description.
facebookresearch/faiss#1880 (comment)
https://developer.arm.com/documentation/den0018/a/NEON-Intrinsics/Variables-and-constants-in-NEON-code

@Benitoite
Copy link
Contributor

The macOS system headers and gcc don't really get along any more. I tried with all the SDKs I could find for Apple Silicon but none will build out. In the recent past RT was buildable with gcc on mac, but the succession of major macos versions have outpaced the fine tuning required on apple's part to make the system headers compatible with gcc.

@Calandracas606
Copy link
Author

Calandracas606 commented Jul 5, 2024

Not sure if this error is affecting every call in all of the modified files.

it is only affecting two functions, vslli and vsrli

it turns out that those two are already being called with compile-time constants, so as a quick hack I've redefined them as macros, and it now builds successfully for me on clang.

clang version 18.1.8 Target: aarch64-unknown-linux-musl

@Benitoite
Copy link
Contributor

@Calandracas606 much appreciated. I will be resuming my Apple Silicon testing in mid-August. In the meantime it might be worthwhile to have the .github CI build an arm64 macOS app in addition to the x86_64 for build testing.

@Benitoite
Copy link
Contributor

Benitoite commented Jul 19, 2024

PR generated to add arm64 to macOS CI action:
#7132

A successful arm64 macOS clang build with this patch:
https://github.com/Benitoite/RawTherapee/actions/runs/10002276125/job/27647490804

macOS Build

@Lawrence37 Lawrence37 added this to the v6.0 milestone Jul 28, 2024
@Lawrence37 Lawrence37 modified the milestones: v6.0, v5.12 Sep 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
scope: performance Performance issues and improvements
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants