
Unary and Binary functions trait #726

Merged
merged 24 commits into from
Sep 5, 2024
Conversation

AdamGS
Contributor

@AdamGS AdamGS commented Sep 4, 2024

Introducing specialized traits for unary (like map) and binary (like &) operations on arrays. These will let us have specialized, ergonomic implementations for cases where we know the encoding of an array.
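A minimal sketch of the idea, assuming hypothetical trait and type names (UnaryFn, BinaryFn, PrimitiveArray are illustrative, not the PR's actual API):

```rust
// Illustrative sketch only; names do not reflect the PR's real definitions.
trait UnaryFn<T> {
    fn unary<O, F: Fn(T) -> O>(&self, f: F) -> Vec<O>;
}

trait BinaryFn<T> {
    fn binary<O, F: Fn(T, T) -> O>(&self, rhs: &Self, f: F) -> Vec<O>;
}

// A toy "encoding" whose primitive values are directly accessible,
// so the implementations can be specialized for it.
struct PrimitiveArray(Vec<u32>);

impl UnaryFn<u32> for PrimitiveArray {
    fn unary<O, F: Fn(u32) -> O>(&self, f: F) -> Vec<O> {
        self.0.iter().copied().map(f).collect()
    }
}

impl BinaryFn<u32> for PrimitiveArray {
    fn binary<O, F: Fn(u32, u32) -> O>(&self, rhs: &Self, f: F) -> Vec<O> {
        self.0.iter().zip(rhs.0.iter()).map(|(&a, &b)| f(a, b)).collect()
    }
}

fn main() {
    let a = PrimitiveArray(vec![1, 2, 3]);
    let b = PrimitiveArray(vec![10, 20, 30]);
    assert_eq!(a.unary(|x| x + 1), vec![2, 3, 4]);
    assert_eq!(a.binary(&b, |x, y| x + y), vec![11, 22, 33]);
    println!("ok");
}
```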

Benchmarks

As of 08bbd8d, running on my M3 Max MacBook:

arrow_unary_add         time:   [228.22 µs 233.98 µs 239.93 µs]

vortex_unary_add        time:   [1.5322 µs 1.5469 µs 1.5604 µs]

arrow_binary_add        time:   [237.74 µs 240.93 µs 243.98 µs]

vortex_binary_add       time:   [93.821 µs 94.796 µs 95.918 µs]

@@ -58,6 +61,30 @@ fn vortex_iter_flat(c: &mut Criterion) {
});
}

fn vortex_binary_add(c: &mut Criterion) {
Contributor Author

@AdamGS AdamGS Sep 4, 2024


As of dc09067d, the results are (reproducible by running cargo bench --bench iter unary_add):

vortex_unary_add        time:   [662.83 µs 665.39 µs 668.35 µs]
arrow_unary_add         time:   [199.29 µs 203.01 µs 207.45 µs]

so Arrow is faster, but it's a much smaller difference than with the previous iterators. I tried some approaches using BytesMut and building everything through a buffer, but that was much slower (1-2 ms). I'm still trying to restructure things, so this isn't the final performance number.

@robert3005
Member

robert3005 commented Sep 4, 2024

Couple of things:

  • Maybe we should have an iterator over the underlying values, ignoring validity? I think the validity check is likely branch-predicted perfectly, but maybe there's a simpler way.
  • We also shouldn't force a flatten; instead, do the per-batch loop inside the function, so you can control whether you have a full batch or a partial one.
  • I think you'll want a macro for getting the iterator based on ptype, so you avoid casting and boxing.
  • For binary functions, if the other operand is a scalar, it can be captured in the closure and evaluated as a unary function.
  • One minor thing: there's already TryFrom<&DType> for PType.
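The scalar-capture suggestion can be sketched like this (function names are illustrative, not the PR's API):

```rust
// Generic unary path: apply f to every element.
fn unary_op(values: &[u32], f: impl Fn(u32) -> u32) -> Vec<u32> {
    values.iter().copied().map(f).collect()
}

// When the right-hand operand of a binary op is a scalar, move it into
// the closure and reuse the unary path instead of a full binary kernel.
fn binary_op_with_scalar(values: &[u32], scalar: u32, f: impl Fn(u32, u32) -> u32) -> Vec<u32> {
    unary_op(values, move |x| f(x, scalar))
}

fn main() {
    let out = binary_op_with_scalar(&[1, 2, 3], 10, |a, b| a + b);
    assert_eq!(out, vec![11, 12, 13]);
    println!("ok");
}
```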

@AdamGS
Contributor Author

AdamGS commented Sep 4, 2024

Handling known fixed sizes gives us a nice boost; I also split the benchmarks into a new file. It seems like just removing all the bounds checks yields a pretty nice improvement.

arrow_unary_add         time:   [201.46 µs 204.01 µs 206.92 µs]
vortex_unary_add        time:   [197.59 µs 199.92 µs 202.51 µs]

arrow_binary_add        time:   [226.86 µs 229.16 µs 231.32 µs]
vortex_binary_add       time:   [438.23 µs 440.34 µs 442.35 µs]
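The "known fixed sizes" trick can be sketched with chunks_exact: converting each chunk to a fixed-length array reference pins the length at compile time, which lets the compiler elide per-element bounds checks in the hot loop. CHUNK_SIZE and the function name here are illustrative:

```rust
const CHUNK_SIZE: usize = 1024;

fn unary_add(input: &[u32]) -> Vec<u32> {
    let mut out = Vec::with_capacity(input.len());
    let mut chunks = input.chunks_exact(CHUNK_SIZE);
    for chunk in &mut chunks {
        // try_into proves the chunk length at compile time, so the
        // iteration below needs no per-element bounds checks.
        let fixed: &[u32; CHUNK_SIZE] = chunk.try_into().unwrap();
        out.extend(fixed.iter().map(|x| x + 1));
    }
    // The trailing partial chunk goes through the generic slice path.
    out.extend(chunks.remainder().iter().map(|x| x + 1));
    out
}

fn main() {
    let input: Vec<u32> = (0..1500).collect();
    let out = unary_add(&input);
    assert_eq!(out.len(), 1500);
    assert_eq!(out[0], 1);
    assert_eq!(out[1499], 1500);
    println!("ok");
}
```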

@robert3005
Member

Nice, you got rid of the vec push; that was the one other thing I noticed initially that could have been better.

@AdamGS
Contributor Author

AdamGS commented Sep 5, 2024

Pushed a version with macro-based dispatch for the different iterators. If anything, performance got worse at first, though the variance is high enough that it was hard to tell whether that was the cause. I've since found a version that Miri accepts, and perf is as good as it's been with this approach.

@robert3005
Member

Interesting, I thought not boxing the iterator would be an improvement, but maybe not. If there's no difference in perf, the simpler version might be better.

@AdamGS
Contributor Author

AdamGS commented Sep 5, 2024

Agreed. I'll try to find some additional optimizations here, but if I can't find anything significant, I'd rather maintain a large match statement than a macro.
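The match-over-macro trade-off can be sketched like this: each match arm on a type tag monomorphizes into a tight, concretely typed loop, with no boxed iterator and no macro machinery. The enum here is illustrative, not vortex's actual PType:

```rust
// Illustrative tagged container of typed values.
enum TypedValues {
    U32(Vec<u32>),
    F64(Vec<f64>),
}

fn add_one(values: &TypedValues) -> TypedValues {
    // A plain match dispatches to a concretely typed loop per variant;
    // this is what a dispatch macro would otherwise generate.
    match values {
        TypedValues::U32(v) => TypedValues::U32(v.iter().map(|x| x + 1).collect()),
        TypedValues::F64(v) => TypedValues::F64(v.iter().map(|x| x + 1.0).collect()),
    }
}

fn main() {
    let v = TypedValues::U32(vec![1, 2, 3]);
    match add_one(&v) {
        TypedValues::U32(out) => assert_eq!(out, vec![2, 3, 4]),
        _ => unreachable!(),
    }
    println!("ok");
}
```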

@robert3005
Member

I guess we already have one indirection when loading the data to decompress, so another indirection to the iterator doesn't fundamentally change the performance profile; i.e., not having a completely stack-allocated iterator is all the difference. Let me know if I can help, but I doubt we'll find anything here.

@AdamGS
Contributor Author

AdamGS commented Sep 5, 2024

Never say never

[Screenshot: 2024-09-05 at 13:41:13]

@AdamGS AdamGS marked this pull request as ready for review September 5, 2024 12:58
@AdamGS AdamGS changed the title [WIP] unary/binary fn trait Unary and Binary functions trait Sep 5, 2024
@AdamGS
Contributor Author

AdamGS commented Sep 5, 2024

No reason binary should be slower than unary :)

vortex_unary_add        time:   [1.4555 µs 1.4690 µs 1.4791 µs]

@@ -314,8 +316,141 @@ impl Array {
}
}

// This is a somewhat arbitrary value; after trying a few, it performed better than smaller ones.
// There's presumably some hardware dependency here, but this seems good enough.
const CHUNK_SIZE: usize = 1024;
Member


FWIW, given that FastLanes uses 1024, this will likely be the best value overall.

Member

@robert3005 robert3005 left a comment


Just a note about code to remove

vortex-dtype/src/dtype.rs Outdated Show resolved Hide resolved
@AdamGS AdamGS merged commit 55220c0 into develop Sep 5, 2024
4 checks passed
@AdamGS AdamGS deleted the adamg/elementwise branch September 5, 2024 16:52