-
Notifications
You must be signed in to change notification settings - Fork 12.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve PartialEq for slices #26884
Improve PartialEq for slices #26884
Conversation
Exploiting the fact that getting the length of the slices is known, we can use a counted loop instead of iterators, which means that we only need a single counter, instead of having to increment and check one pointer for each iterator. Benchmarks comparing vectors with 100,000 elements: Before: ``` running 8 tests test eq1_u8 ... bench: 66,757 ns/iter (+/- 113) test eq2_u16 ... bench: 111,267 ns/iter (+/- 149) test eq3_u32 ... bench: 126,282 ns/iter (+/- 111) test eq4_u64 ... bench: 126,418 ns/iter (+/- 155) test ne1_u8 ... bench: 88,990 ns/iter (+/- 161) test ne2_u16 ... bench: 89,126 ns/iter (+/- 265) test ne3_u32 ... bench: 96,901 ns/iter (+/- 92) test ne4_u64 ... bench: 96,750 ns/iter (+/- 137) ``` After: ``` running 8 tests test eq1_u8 ... bench: 46,413 ns/iter (+/- 521) test eq2_u16 ... bench: 46,500 ns/iter (+/- 74) test eq3_u32 ... bench: 50,059 ns/iter (+/- 92) test eq4_u64 ... bench: 54,001 ns/iter (+/- 92) test ne1_u8 ... bench: 47,595 ns/iter (+/- 53) test ne2_u16 ... bench: 47,521 ns/iter (+/- 59) test ne3_u32 ... bench: 44,889 ns/iter (+/- 74) test ne4_u64 ... bench: 47,775 ns/iter (+/- 68) ```
r? @pcwalton (rust_highfive has picked a reviewer for you, use r? to override) |
Exploiting the fact that getting the length of the slices is known, we can use a counted loop instead of iterators, which means that we only need a single counter, instead of having to increment and check one pointer for each iterator. Benchmarks comparing vectors with 100,000 elements: Before: ``` running 8 tests test eq1_u8 ... bench: 66,757 ns/iter (+/- 113) test eq2_u16 ... bench: 111,267 ns/iter (+/- 149) test eq3_u32 ... bench: 126,282 ns/iter (+/- 111) test eq4_u64 ... bench: 126,418 ns/iter (+/- 155) test ne1_u8 ... bench: 88,990 ns/iter (+/- 161) test ne2_u16 ... bench: 89,126 ns/iter (+/- 265) test ne3_u32 ... bench: 96,901 ns/iter (+/- 92) test ne4_u64 ... bench: 96,750 ns/iter (+/- 137) ``` After: ``` running 8 tests test eq1_u8 ... bench: 46,413 ns/iter (+/- 521) test eq2_u16 ... bench: 46,500 ns/iter (+/- 74) test eq3_u32 ... bench: 50,059 ns/iter (+/- 92) test eq4_u64 ... bench: 54,001 ns/iter (+/- 92) test ne1_u8 ... bench: 47,595 ns/iter (+/- 53) test ne2_u16 ... bench: 47,521 ns/iter (+/- 59) test ne3_u32 ... bench: 44,889 ns/iter (+/- 74) test ne4_u64 ... bench: 47,775 ns/iter (+/- 68) ```
That's interesting. Now I am just a little bit confused. |
@lilianmoraru In this case, the bounds checks get eliminated by the optimizer, because it can easily prove that it would never trigger, so there's no advantage for iterators here. But using the zipped iterators, you get two pointers that both need to be incremented and checked, while the counted loop needs to increment and check just a single integer. Different problem ;-) |
Reusing the same idea as in rust-lang#26884, we can exploit the fact that the length of slices is known, hence we can use a counted loop instead of iterators, which means that we only need a single counter, instead of having to increment and check one pointer for each iterator. Using the generic implementation of the boolean comparison operators (`lt`, `le`, `gt`, `ge`) provides further speedup for simple types. This happens because the loop scans elements checking for equality and dispatches to element comparison or length comparison depending on the result of the prefix comparison. ``` test u8_cmp ... bench: 14,043 ns/iter (+/- 1,732) test u8_lt ... bench: 16,156 ns/iter (+/- 1,864) test u8_partial_cmp ... bench: 16,250 ns/iter (+/- 2,608) test u16_cmp ... bench: 15,764 ns/iter (+/- 1,420) test u16_lt ... bench: 19,833 ns/iter (+/- 2,826) test u16_partial_cmp ... bench: 19,811 ns/iter (+/- 2,240) test u32_cmp ... bench: 15,792 ns/iter (+/- 3,409) test u32_lt ... bench: 18,577 ns/iter (+/- 2,075) test u32_partial_cmp ... bench: 18,603 ns/iter (+/- 5,666) test u64_cmp ... bench: 16,337 ns/iter (+/- 2,511) test u64_lt ... bench: 18,074 ns/iter (+/- 7,914) test u64_partial_cmp ... bench: 17,909 ns/iter (+/- 1,105) ``` ``` test u8_cmp ... bench: 6,511 ns/iter (+/- 982) test u8_lt ... bench: 6,671 ns/iter (+/- 919) test u8_partial_cmp ... bench: 7,118 ns/iter (+/- 1,623) test u16_cmp ... bench: 6,689 ns/iter (+/- 921) test u16_lt ... bench: 6,712 ns/iter (+/- 947) test u16_partial_cmp ... bench: 6,725 ns/iter (+/- 780) test u32_cmp ... bench: 7,704 ns/iter (+/- 1,294) test u32_lt ... bench: 7,611 ns/iter (+/- 3,062) test u32_partial_cmp ... bench: 7,640 ns/iter (+/- 1,149) test u64_cmp ... bench: 7,517 ns/iter (+/- 2,164) test u64_lt ... bench: 7,579 ns/iter (+/- 1,048) test u64_partial_cmp ... bench: 7,629 ns/iter (+/- 1,195) ```
This branch improves the performance of Ord and PartialOrd methods for slices compared to the iter-based implementation. Based on the approach used in #26884.
} | ||
} | ||
|
||
true | ||
} | ||
fn ne(&self, other: &[B]) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this has already been merged and looks fantastic but I, as a Rust n00b, had a question as to why this isn't just
fn ne(&self, other: &[B]) -> bool {
!self.eq(other);
}
Is it a performance issue to avoid the duplication?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or even just drop this and use the default implementation of ne()
from the trait definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be interested to know this as well. Could it be because there is call to ne
on line 1480, which may not be the same as !eq
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation for PartialEq
states: "Any manual implementation of ne
must respect the rule that eq
is a strict inverse of ne
; that is, !(a == b) if and only if a != b."
Under this assumption, ne
can always re-use the implementation from the trait definition. I would expect the compiler to do a good job optimising it as it can fold the negation either on the possible outcomes or on the ne
usage. Unless some benchmarking suggests otherwise, I would assume that avoiding the duplication is preferred.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose we'll have to wait to hear back from @dotdash then :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't remember exactly why I didn't change that, but probably I was just too lazy to come up with enough useful testcases (involving at least various calling- and inlining-scenarios) to make sure that relying on the simple negation implementation doesn't affect performance negatively. Benchmarking welcome!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll see if I can find some time this weekend to test it out.
Exploiting the fact that getting the length of the slices is known, we
can use a counted loop instead of iterators, which means that we only
need a single counter, instead of having to increment and check one
pointer for each iterator.
Benchmarks comparing vectors with 100,000 elements:
Before:
After: