manually implement Hash for DefId #91660
Conversation
r? @jackh726 (rust-highfive has picked a reviewer for you, use r? to override)

@bors try @rust-timer queue

Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf

⌛ Trying commit ad29818c9fa32552de852d1656757960a5487cf0 with merge abe401b21ff392a1e7153686e94b22d7949b9a0c...

☀️ Try build successful - checks-actions

Queued abe401b21ff392a1e7153686e94b22d7949b9a0c with parent ce0f7ba, future comparison URL.

Finished benchmarking commit (abe401b21ff392a1e7153686e94b22d7949b9a0c): comparison url. Summary: This change led to moderate relevant improvements 🎉 in compiler performance.

If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf. @bors rollup=never
Those results appear to be a mixed bag. I suppose this might be because the …
compiler/rustc_span/src/def_id.rs (outdated):

```rust
// hash one u64 instead of two u32s
impl Hash for DefId {
    fn hash<H: Hasher>(&self, h: &mut H) {
        ((u64::from(u32::from(self.krate)) << 32) | u64::from(u32::from(self.index))).hash(h)
    }
}
```
Suggested change:

```rust
// before:
((u64::from(u32::from(self.krate)) << 32) | u64::from(u32::from(self.index))).hash(h)
// after:
((u64::from(u32::from(self.index)) << 32) | u64::from(u32::from(self.krate))).hash(h)
```
It may be faster this way: https://rust.godbolt.org/z/EMeebaGb3
Yes but that would exacerbate the problem of def index bit distribution further (at least my theory is that we usually have more defs in a crate than we have crates within one build, but I may be wrong).
I'd rather reorder the fields in the type.
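For illustration, a minimal sketch of what such a field reordering could look like. The plain u32 fields are stand-ins for the real CrateNum/DefIndex newtypes, and the exact ordering chosen here is an assumption, not the code from this PR:

```rust
// Hypothetical sketch, not the actual rustc definition: reorder the
// struct fields instead of swapping the operands in the hash impl,
// so the hash can keep putting the higher-entropy `index` in the low bits.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct DefId {
    pub index: u32, // moved before `krate` in this sketch
    pub krate: u32,
}

fn main() {
    // behavior is unchanged; only the declaration order differs
    let d = DefId { index: 7, krate: 1 };
    assert_eq!((d.index, d.krate), (7, 1));
    println!("{:?}", d);
}
```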
This is a little-endian only improvement, right?
Yeah (probably 64-bit LE only to boot) -- if you throw a --target on the godbolt for a BE arch you can see what it looks like.

In practice the most popular architectures these days are LE (x86-64 and aarch64), but I guess if one were really motivated they could use #[cfg] to keep it optimal on both.
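A hedged sketch of what such a #[cfg] might look like: pick the concatenation order per target endianness so the combined u64 matches the struct's in-memory byte order on either target. The helper name and the exact conditions are assumptions for illustration, not code from this PR:

```rust
// Assumed helper, not from the PR: with a struct laid out as
// (krate, index), the first field occupies the LOW bytes of the
// equivalent u64 on little-endian targets and the HIGH bytes on
// big-endian ones, so the shift direction flips per target.
#[cfg(target_endian = "little")]
fn combine(krate: u32, index: u32) -> u64 {
    (u64::from(index) << 32) | u64::from(krate)
}

#[cfg(target_endian = "big")]
fn combine(krate: u32, index: u32) -> u64 {
    (u64::from(krate) << 32) | u64::from(index)
}

fn main() {
    // on either target, distinct field values give distinct results
    assert_ne!(combine(1, 2), combine(2, 1));
    assert_eq!(combine(0, 0), 0);
    println!("{:#x}", combine(1, 2));
}
```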
Thanks! I always find endianness tricky to think about, just checking my reasoning :)
Are they? The instruction count changes of significance are all small improvements.
I don't understand this. Can you elaborate?
I guess I was reading too much into the results. After looking into the hash function again, I think it's probably noise. I'd like to see if reordering the fields brings more benefit. What's your take? Should we try to optimize the …
I will experiment today with different variations. I have local benchmarking set up which is much faster than CI for doing multiple perf runs.
Ok, some very surprising results.

v1: Like the original in this PR (with slightly different syntax for the integer conversions):

Gave some tiny instruction count wins (I only measured …

v2: Alex's suggestion of swapping the order:
Gave gigantic slowdowns. Here's the first part of the results:
I did a Cachegrind diff and confirmed that it was due to huge numbers of collisions. E.g. this …

changed to this: …
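The elided profile snippets aside, the collision blow-up is easy to reproduce with a simplified model. Treat an Fx-style hash of a single u64 as one wrapping multiply (the multiplier constant and this one-multiply reduction are simplifying assumptions about the hasher). Carries in a multiply only propagate upward, so with the index in the high 32 bits (the v2 layout), every DefId within one crate produces identical low hash bits — and a table that derives its bucket index from the low bits puts them all in one bucket:

```rust
// Simplified, assumed model of an Fx-style hasher: hashing one u64
// from an empty state reduces to a single wrapping multiply.
const K: u64 = 0x517cc1b727220a95; // multiplier constant (assumption)

fn fx_like(x: u64) -> u64 {
    x.wrapping_mul(K)
}

fn main() {
    let krate = 5u64; // every DefId from one crate shares this value
    // v2 layout: index in the high 32 bits, krate in the low 32 bits
    let low_bits = |index: u64| fx_like((index << 32) | krate) & 0xffff_ffff;
    // changing only the high input bits can never change the low bits
    // of the product, so all indices share the same low hash bits
    let b0 = low_bits(0);
    assert!((0..10_000u64).all(|i| low_bits(i) == b0));
    println!("10000 consecutive indices share low hash bits {:#x}", b0);
}
```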
Wow.

v3: Like v1, but with the fields in …

But I get the same test failures seen on CI, which I don't understand. I get those same failures even if I write …

So now I'm really confused.
Perhaps those tests rely on the field order of DefId? I'm not sure how (I haven't had the time to look into them), but given your suggestion, I'll do that next.
Also, that's what I suspected regarding the drawback of the large shift. Perhaps shifting (or rotating?) by a smaller amount and doing xor instead of or might give us even better results? Just a hunch. That one could even work for 32-bit systems.
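A minimal sketch of that hunch; the function name and the rotation amount are made up for illustration, not taken from the PR:

```rust
// Hypothetical mixer, not from the PR: a smaller rotate plus xor
// spreads both fields across the word while staying entirely within
// 32-bit operations, so it would also work on 32-bit targets.
fn mix32(krate: u32, index: u32) -> u32 {
    index.rotate_left(16) ^ krate
}

fn main() {
    // both fields influence the result
    assert_eq!(mix32(0, 1), 1 << 16);
    assert_eq!(mix32(3, 0), 3);
    assert_ne!(mix32(1, 2), mix32(2, 1));
    println!("{:#x}", mix32(1, 2));
}
```

As the reply below notes, a fixed rotate-and-xor is really just a pre-hash in front of the real hash function rather than a hash in itself.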
By "large shift" do you mean the left shift by 32? It's the obvious way to combine two 32-bit values into a 64-bit value. If you're talking about doing something more complex like xor or rotate, then that's just a kind of pre-hash before the real hash, which doesn't seem worthwhile; might as well just hash the two fields in the normal way. I did try another thing: changing …
The folded multiply is clever: it folds the overflow bytes from the multiply back into the result. This avoids the … It doesn't go bad when the …
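A hedged reconstruction of the folded-multiply technique being described (this is the general idea as used by hashers such as aHash; the constant shown and where exactly it would sit in FxHasher's code path are assumptions):

```rust
// Widen to 128 bits, multiply, then fold the overflow (high) half
// back into the low half so no input bits are simply discarded.
fn folded_multiply(x: u64, k: u64) -> u64 {
    let wide = (x as u128) * (k as u128); // cannot overflow a u128
    (wide as u64) ^ ((wide >> 64) as u64)
}

fn main() {
    // small inputs produce no overflow half: behaves like a plain multiply
    assert_eq!(folded_multiply(1, 5), 5);
    // large inputs: the overflow half folds back in instead of vanishing
    assert_eq!(folded_multiply(u64::MAX, 2), u64::MAX);
    println!("{:#x}", folded_multiply(0xdead_beef, 0x517cc1b727220a95));
}
```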
Thinking some more: within the compiler …
Except the bottom 2-3 bits, which are always 0. (I'm not sure if this meaningfully impacts your point.) (Sorry for the confusion of commenting from two different accounts!)

The UI tests appear to check trait coherence and ambiguity. Probably a good idea to dig into that code, as it may be less robust than we'd like.

Having had a deeper look at the failing tests, the difference appears to hinge on the iteration order of the trait lookup, which is a HashMap, so it's kind of expected that the order would change when the hashing function does. I think updating the tests here should be fine.

Ok, so I guess we want to avoid doing such optimizations generically. The risk of getting an avalanche of collisions (pardon the pun) is very real. For this PR, before merging I think I should …
(force-pushed from fda4652 to 7b914cc)
This should be ready now. In practice it boils down to @nnethercote's v3 (which gave slightly better results in his benchmarks) on 64-bit systems, keeping the derived impls on all other pointer widths. Do we need another perf run or can we merge this, @jackh726?
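A hedged sketch of what that combination might look like. The u32 fields stand in for the real CrateNum/DefIndex newtypes, the cfg condition is an assumed way to express "64-bit systems", and the concatenation order follows v1 (which v3 is described as matching):

```rust
use std::hash::{Hash, Hasher};

// Stand-in struct; the real DefId uses CrateNum/DefIndex newtypes.
struct DefId {
    krate: u32,
    index: u32,
}

// On 64-bit targets, hash both halves as a single u64 (one hasher call).
#[cfg(target_pointer_width = "64")]
impl Hash for DefId {
    fn hash<H: Hasher>(&self, h: &mut H) {
        ((u64::from(self.krate) << 32) | u64::from(self.index)).hash(h)
    }
}

// Elsewhere, the derive-equivalent: hash the fields one after another.
#[cfg(not(target_pointer_width = "64"))]
impl Hash for DefId {
    fn hash<H: Hasher>(&self, h: &mut H) {
        self.krate.hash(h);
        self.index.hash(h);
    }
}

fn main() {
    use std::collections::hash_map::DefaultHasher;
    let d = DefId { krate: 3, index: 7 };
    let mut a = DefaultHasher::new();
    d.hash(&mut a);
    #[cfg(target_pointer_width = "64")]
    {
        // on 64-bit, the impl is equivalent to hashing one combined u64
        let mut b = DefaultHasher::new();
        ((3u64 << 32) | 7u64).hash(&mut b);
        assert_eq!(a.finish(), b.finish());
    }
    println!("ok");
}
```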
Extracting …
(force-pushed from 7b914cc to 52496b6)
@camelid, with the comments I added, this should hopefully now be sufficiently clear.

I'd suggest making the comments into doc comments so rustdoc picks them up. Also, what's the disadvantage of extracting local variables? I'm not sure why adding comments helps with that.

I even made some ASCII art to show where each bit ends up. So what's unclear? 😄 My reasoning for not making those doc comments is that they are for folks working on exactly that file, whereas docs are for people using the type. If I just use a …
Finished benchmarking commit (03e04a6f4e077238d1936bfd25c5b6450327e583): comparison url. Summary: This change led to large relevant improvements 🎉 in compiler performance.
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR led to changes in compiler perf. @bors rollup=never
Oh, I wasn't saying it's unclear, just that the code would be a little easier to read if local variables were extracted. Your comment is definitely helpful :)
This also reorders the fields to reduce the assembly operations for hashing and changes two UI tests that depended on the former ordering because of hashmap iteration order.
(force-pushed from 52496b6 to 635533b)
@bors r+

📌 Commit 635533b has been approved by …
Is it actually a good idea to change the order of the fields, relying on current behavior? As for the default …

We only rely on the field order for performance, not correctness. Nothing to worry about here.
☀️ Test successful - checks-actions
Finished benchmarking commit (a2d25b4): comparison url. Summary: This change led to large relevant improvements 🎉 in compiler performance.
If you disagree with this performance assessment, please file an issue in rust-lang/rustc-perf. @rustbot label: -perf-regression
Field order could be guaranteed by …

Probably not. Although at that point we may also want to … I'll create a follow-up PR.
This might speed up hashing for hashers that can work on individual u64s. Just as an experiment, suggested in a reddit thread on FxHasher. cc @nnethercote

Note that this should not be merged as is without cfg-ing the code path for 64 bits.