[Enhancement]Support select_if in arm #53093
Signed-off-by: before-Sunrise <[email protected]>
}

template <typename T, bool left_const = false, bool right_const = false>
inline void neon_select_if_common_implement(uint8_t*& selector, T*& dst, const T*& a, const T*& b, int size) {
Why does this need a const T*&? Can it be simplified to a const T*, or even a const void*?
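One way to read the question above: if the references exist only so that the caller's pointers end up at the unprocessed tail, the helper could take plain pointers and report how far it got. The sketch below is a portable illustration under that assumption, not the PR's code. The names select_if_blocks and select_if are hypothetical, the block loop is scalar where the PR uses NEON, and the semantics dst[i] = selector[i] ? a[i] : b[i] are assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: handles only whole 16-byte blocks and returns the number
// of elements it processed. Pointers are passed by value, so the caller's
// pointers are never modified. (A real version would use NEON in this loop.)
template <typename T>
inline std::size_t select_if_blocks(const uint8_t* selector, T* dst, const T* a, const T* b,
                                    std::size_t size) {
    constexpr std::size_t kBlock = 16 / sizeof(T); // elements per 128-bit register
    const std::size_t processed = size - size % kBlock;
    for (std::size_t i = 0; i < processed; ++i) {
        dst[i] = selector[i] ? a[i] : b[i];
    }
    return processed;
}

// Hypothetical caller: runs the block helper, then finishes the tail by index.
template <typename T>
inline void select_if(const uint8_t* selector, T* dst, const T* a, const T* b, std::size_t size) {
    assert(size == 0 || (selector && dst && a && b));
    const std::size_t done = select_if_blocks<T>(selector, dst, a, b, size);
    for (std::size_t i = done; i < size; ++i) {
        dst[i] = selector[i] ? a[i] : b[i];
    }
}
```

Returning the processed count keeps the signature free of references to pointers; whether that fits the surrounding code depends on how the caller consumes the advanced pointers.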
vec_a = vld1q_u16(reinterpret_cast<const uint16_t*>(a) + i * 8);
} else {
// vdupq_n_u16: Copy a 16-bit value to all elements in the register
vec_a = vdupq_n_u16(*reinterpret_cast<const uint16_t*>(a));
If it's a constant, you don't need to populate the register on every iteration, do you?
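The suggestion amounts to hoisting the loop-invariant broadcast out of the loop. The sketch below shows one possible shape for the 16-bit case; it is not the PR's implementation. The function name select_if_u16 is hypothetical, only the left operand is treated as a possible constant, and the semantics dst[i] = selector[i] ? a[i] : b[i] are assumed. The NEON path compiles only where __ARM_NEON is defined; elsewhere the scalar loop does all the work.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical 16-bit select. When left_const is true, `a` points at a single
// value that is used for every element.
template <bool left_const>
inline void select_if_u16(const uint8_t* selector, uint16_t* dst, const uint16_t* a,
                          const uint16_t* b, std::size_t size) {
    std::size_t i = 0;
#if defined(__ARM_NEON)
    // Loop-invariant: broadcast the constant once, before the loop starts.
    uint16x8_t vec_a = vdupq_n_u16(left_const ? *a : 0);
    for (; i + 8 <= size; i += 8) {
        if (!left_const) {
            vec_a = vld1q_u16(a + i); // only the non-constant case reloads
        }
        const uint16x8_t vec_b = vld1q_u16(b + i);
        // Widen 8 selector bytes to 16-bit lanes, then turn non-zero into all-ones.
        const uint16x8_t sel = vmovl_u8(vld1_u8(selector + i));
        const uint16x8_t mask = vtstq_u16(sel, sel);
        // vbslq_u16 takes bits from vec_a where mask is set, else from vec_b.
        vst1q_u16(dst + i, vbslq_u16(mask, vec_a, vec_b));
    }
#endif
    // Scalar tail (and full fallback on non-NEON targets).
    for (; i < size; ++i) {
        const uint16_t lhs = left_const ? *a : a[i];
        dst[i] = selector[i] ? lhs : b[i];
    }
}
```

A compiler may already hoist the vdupq_n_u16 on its own, but writing it outside the loop makes the intent explicit and does not rely on the optimizer.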
}

uint8x16_t index = {0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1};
uint8x16_t mask = vqtbl1q_u8(loaded_mask, index);
This piece of code is almost the same as the other variants except for the data_size parameter; it would be better to extract the common part.
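As a rough illustration of what the extracted common part could look like, the index table passed to vqtbl1q_u8 can be derived from the element size instead of being written out per type. The sketch below is portable and hypothetical: make_expand_index and table_lookup are illustrative names, table_lookup only emulates the lookup that vqtbl1q_u8 performs, and the 64-bit table in the hunk above is the only one confirmed by the source.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Bytes16 = std::array<uint8_t, 16>;

// Hypothetical helper: lane j of the table picks selector byte j / data_size,
// so each selector byte is repeated data_size times across the register.
template <std::size_t data_size>
inline Bytes16 make_expand_index() {
    static_assert(data_size == 1 || data_size == 2 || data_size == 4 || data_size == 8,
                  "unsupported element size");
    Bytes16 index{};
    for (std::size_t j = 0; j < 16; ++j) {
        index[j] = static_cast<uint8_t>(j / data_size);
    }
    return index;
}

// Portable stand-in for vqtbl1q_u8: out[j] = table[index[j]], or 0 when the
// index is out of range.
inline Bytes16 table_lookup(const Bytes16& table, const Bytes16& index) {
    Bytes16 out{};
    for (std::size_t j = 0; j < 16; ++j) {
        out[j] = index[j] < 16 ? table[index[j]] : static_cast<uint8_t>(0);
    }
    return out;
}
```

With data_size = 8 the generated table matches the literal in the diff; on an ARM target the array could be loaded with vld1q_u8(index.data()) and passed to vqtbl1q_u8 directly.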
[Java-Extensions Incremental Coverage Report] ✅ pass: 0 / 0 (0%)
[FE Incremental Coverage Report] ✅ pass: 0 / 0 (0%)
[BE Incremental Coverage Report] ✅ pass: 0 / 0 (0%)
Why I'm doing:
What I'm doing:
Benchmark for uint8_t (100000000 elements):
SIMD time: 15.9133 ms
Non-SIMD time: 290.615 ms
Speedup: 18.2624x
Benchmark for int16_t (100000000 elements):
SIMD time: 27.9742 ms
Non-SIMD time: 295.018 ms
Speedup: 10.5461x
Benchmark for int32_t (100000000 elements):
SIMD time: 51.5047 ms
Non-SIMD time: 291.931 ms
Speedup: 5.66804x
Benchmark for int64_t (100000000 elements):
SIMD time: 98.8005 ms
Non-SIMD time: 290.183 ms
Speedup: 2.93706x
Benchmark for double (100000000 elements):
SIMD time: 97.1446 ms
Non-SIMD time: 291.176 ms
Speedup: 2.99734x
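The PR does not show the benchmark harness. For readers who want to reproduce a comparison of this kind, the following is a minimal sketch of how the non-SIMD side might be timed. Everything in it is an assumption: the names select_if_scalar, time_ms and bench_scalar are hypothetical, the input pattern is arbitrary, and a single timed run without warm-up is shown only for brevity.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar baseline (assumed semantics): dst[i] = selector[i] ? a[i] : b[i].
template <typename T>
void select_if_scalar(const uint8_t* selector, T* dst, const T* a, const T* b, std::size_t size) {
    for (std::size_t i = 0; i < size; ++i) {
        dst[i] = selector[i] ? a[i] : b[i];
    }
}

// Runs fn once and returns the elapsed wall-clock time in milliseconds.
template <typename Fn>
double time_ms(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Times the scalar baseline on n elements of type T.
template <typename T>
double bench_scalar(std::size_t n) {
    std::vector<uint8_t> selector(n);
    std::vector<T> a(n, T(1)), b(n, T(2)), dst(n);
    for (std::size_t i = 0; i < n; ++i) {
        selector[i] = static_cast<uint8_t>(i & 1);
    }
    const double ms = time_ms([&] {
        select_if_scalar<T>(selector.data(), dst.data(), a.data(), b.data(), n);
    });
    // Read one result back so the timed work stays observable.
    assert(n == 0 || dst[n - 1] == (((n - 1) & 1) ? T(1) : T(2)));
    return ms;
}
```

The reported speedup is the ratio of the two timings, for example 290.615 ms / 15.9133 ms ≈ 18.26x for uint8_t. The SIMD time grows roughly in proportion to the element width while the non-SIMD time stays near 290 ms, which is why the ratio falls from about 18x to about 3x.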