
Unary and Binary functions trait #726

Merged
merged 24 commits into from
Sep 5, 2024
Conversation

AdamGS
Contributor

@AdamGS AdamGS commented Sep 4, 2024

Introducing specialized traits for unary (like map) and binary (like &) operations on arrays. These will let us have specialized, ergonomic implementations for cases where we know the encoding of an array.
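A minimal sketch of the idea, assuming hypothetical trait and type names (UnaryFn, BinaryFn, PrimitiveArray are illustrative, not the PR's actual API):

```rust
// Illustrative sketch only; names do not reflect the PR's real definitions.
trait UnaryFn<T> {
    fn unary<O, F: Fn(T) -> O>(&self, f: F) -> Vec<O>;
}

trait BinaryFn<T> {
    fn binary<O, F: Fn(T, T) -> O>(&self, rhs: &Self, f: F) -> Vec<O>;
}

// A toy "encoding" whose primitive values are directly accessible,
// so the implementations can be specialized for it.
struct PrimitiveArray(Vec<u32>);

impl UnaryFn<u32> for PrimitiveArray {
    fn unary<O, F: Fn(u32) -> O>(&self, f: F) -> Vec<O> {
        self.0.iter().copied().map(f).collect()
    }
}

impl BinaryFn<u32> for PrimitiveArray {
    fn binary<O, F: Fn(u32, u32) -> O>(&self, rhs: &Self, f: F) -> Vec<O> {
        self.0.iter().zip(rhs.0.iter()).map(|(&a, &b)| f(a, b)).collect()
    }
}

fn main() {
    let a = PrimitiveArray(vec![1, 2, 3]);
    let b = PrimitiveArray(vec![10, 20, 30]);
    assert_eq!(a.unary(|x| x + 1), vec![2, 3, 4]);
    assert_eq!(a.binary(&b, |x, y| x + y), vec![11, 22, 33]);
    println!("ok");
}
```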

Benchmarks

As of 08bbd8d, running on my M3 Max MacBook:

arrow_unary_add         time:   [228.22 µs 233.98 µs 239.93 µs]

vortex_unary_add        time:   [1.5322 µs 1.5469 µs 1.5604 µs]

arrow_binary_add        time:   [237.74 µs 240.93 µs 243.98 µs]

vortex_binary_add       time:   [93.821 µs 94.796 µs 95.918 µs]

@@ -58,6 +61,30 @@ fn vortex_iter_flat(c: &mut Criterion) {
});
}

fn vortex_binary_add(c: &mut Criterion) {
Contributor Author

@AdamGS AdamGS Sep 4, 2024


As of dc09067d, the results are (reproducible by running cargo bench --bench iter unary_add):

vortex_unary_add        time:   [662.83 µs 665.39 µs 668.35 µs]
arrow_unary_add         time:   [199.29 µs 203.01 µs 207.45 µs]

so Arrow is faster, but it's a much smaller difference than with the previous iterators. I tried some approaches using BytesMut and building everything through a buffer, but that was much slower (1-2 ms). I'm still trying to restructure things, so this isn't the final performance number.

@robert3005
Member

robert3005 commented Sep 4, 2024

Couple of things:

  • Maybe we should have an iterator over the underlying values, ignoring validity? I think the validity check is likely branch-predicted perfectly, but maybe there's a simpler way.
  • We also shouldn't force a flatten; instead, do the per-batch loop inside the function, so you can control whether you have a full batch or a partial one.
  • I think you'll want a macro for getting the iterator based on ptype, so you avoid casting and boxing.
  • For binary functions, if the other operand is a scalar, it can be captured in the closure and evaluated as a unary function.
  • One minor thing: there's already TryFrom<&DType> for PType.
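The scalar-capture suggestion can be sketched like this (function names are illustrative, not the PR's API):

```rust
// Generic unary path: apply f to every element.
fn unary_op(values: &[u32], f: impl Fn(u32) -> u32) -> Vec<u32> {
    values.iter().copied().map(f).collect()
}

// When the right-hand operand of a binary op is a scalar, move it into
// the closure and reuse the unary path instead of a full binary kernel.
fn binary_op_with_scalar(values: &[u32], scalar: u32, f: impl Fn(u32, u32) -> u32) -> Vec<u32> {
    unary_op(values, move |x| f(x, scalar))
}

fn main() {
    let out = binary_op_with_scalar(&[1, 2, 3], 10, |a, b| a + b);
    assert_eq!(out, vec![11, 12, 13]);
    println!("ok");
}
```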

@AdamGS
Contributor Author

AdamGS commented Sep 4, 2024

Handling known fixed sizes gives us a nice boost; I also split the benchmarks into a new file. It seems like just removing all the bounds checks yields a pretty nice improvement.

arrow_unary_add         time:   [201.46 µs 204.01 µs 206.92 µs]
vortex_unary_add        time:   [197.59 µs 199.92 µs 202.51 µs]

arrow_binary_add        time:   [226.86 µs 229.16 µs 231.32 µs]
vortex_binary_add       time:   [438.23 µs 440.34 µs 442.35 µs]
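The "known fixed sizes" trick can be sketched with chunks_exact: converting each chunk to a fixed-length array reference pins the length at compile time, which lets the compiler elide per-element bounds checks in the hot loop. CHUNK_SIZE and the function name here are illustrative:

```rust
const CHUNK_SIZE: usize = 1024;

fn unary_add(input: &[u32]) -> Vec<u32> {
    let mut out = Vec::with_capacity(input.len());
    let mut chunks = input.chunks_exact(CHUNK_SIZE);
    for chunk in &mut chunks {
        // try_into proves the chunk length at compile time, so the
        // iteration below needs no per-element bounds checks.
        let fixed: &[u32; CHUNK_SIZE] = chunk.try_into().unwrap();
        out.extend(fixed.iter().map(|x| x + 1));
    }
    // The trailing partial chunk goes through the generic slice path.
    out.extend(chunks.remainder().iter().map(|x| x + 1));
    out
}

fn main() {
    let input: Vec<u32> = (0..1500).collect();
    let out = unary_add(&input);
    assert_eq!(out.len(), 1500);
    assert_eq!(out[0], 1);
    assert_eq!(out[1499], 1500);
    println!("ok");
}
```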

@robert3005
Member

Nice, you got rid of the vec push; that was the one other thing I noticed initially that could have been better.

@AdamGS
Contributor Author

AdamGS commented Sep 5, 2024

Pushed a version with macro-based dispatch for the different iterators. If anything, performance got worse at first, though the variance is high enough that it was hard to tell whether that was the cause. I've since found a version that Miri accepts, and perf is as good as it's been with this approach.

@robert3005
Member

Interesting, I thought not boxing the iterator would be an improvement, but maybe not. If there's no difference in perf, the simpler version might be better.

@AdamGS
Contributor Author

AdamGS commented Sep 5, 2024

Agreed. I'll try to find some additional optimizations here, but if I can't find anything significant, I'd rather maintain a large match statement than a macro.
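The match-over-macro trade-off can be sketched like this: each match arm on a type tag monomorphizes into a tight, concretely typed loop, with no boxed iterator and no macro machinery. The enum here is illustrative, not vortex's actual PType:

```rust
// Illustrative tagged container of typed values.
enum TypedValues {
    U32(Vec<u32>),
    F64(Vec<f64>),
}

fn add_one(values: &TypedValues) -> TypedValues {
    // A plain match dispatches to a concretely typed loop per variant;
    // this is what a dispatch macro would otherwise generate.
    match values {
        TypedValues::U32(v) => TypedValues::U32(v.iter().map(|x| x + 1).collect()),
        TypedValues::F64(v) => TypedValues::F64(v.iter().map(|x| x + 1.0).collect()),
    }
}

fn main() {
    let v = TypedValues::U32(vec![1, 2, 3]);
    match add_one(&v) {
        TypedValues::U32(out) => assert_eq!(out, vec![2, 3, 4]),
        _ => unreachable!(),
    }
    println!("ok");
}
```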

@robert3005
Member

I guess we already have one indirection when loading the data to decompress, so another indirection to the iterator doesn't fundamentally change the performance profile; i.e., not having a completely stack-allocated iterator is all the difference. Let me know if I can help, but I doubt we'll find anything here.

@AdamGS
Contributor Author

AdamGS commented Sep 5, 2024

Never say never

[Screenshot: 2024-09-05 at 13:41:13]

@AdamGS AdamGS marked this pull request as ready for review September 5, 2024 12:58
@AdamGS AdamGS changed the title [WIP] unary/binary fn trait Unary and Binary functions trait Sep 5, 2024
@AdamGS
Contributor Author

AdamGS commented Sep 5, 2024

No reason binary should be slower than unary :)

vortex_unary_add        time:   [1.4555 µs 1.4690 µs 1.4791 µs]

@@ -314,8 +316,141 @@ impl Array {
}
}

// This is a somewhat arbitrary value; after trying a few, it performed better than smaller ones.
// There's presumably some hardware dependency here, but this seems good enough.
const CHUNK_SIZE: usize = 1024;
Member


FWIW, given that FastLanes uses 1024, this will likely be the best value overall.

Member

@robert3005 robert3005 left a comment


Just a note about code to remove

vortex-dtype/src/dtype.rs Outdated Show resolved Hide resolved
@AdamGS AdamGS merged commit 55220c0 into develop Sep 5, 2024
4 checks passed
@AdamGS AdamGS deleted the adamg/elementwise branch September 5, 2024 16:52