Hash.Execute() allocates a string which gets to the large object heap. #7086

AR-May · 2021-11-25T12:17:11Z

I noticed that Hash.Execute() sometimes allocates a big string.

It happens because we first join all the items into one string and then apply the hashing algorithm.
It would be nice to try hashing the items one by one, or to use a fixed-length buffer to break this big string into chunks.
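
To illustrate the idea, here is a minimal sketch in Python (not MSBuild's C# code). It assumes the task hashes the UTF-8 bytes of every item followed by a separator; the `SEPARATOR` value and both function names are hypothetical. It shows that feeding items to the hash one at a time gives the same SHA-1 digest as hashing the joined string, so the large intermediate string is not needed:

```python
import hashlib

SEPARATOR = "\n"  # hypothetical separator; not necessarily the one MSBuild uses


def hash_joined(items):
    """Baseline: build one big string, then hash it (the allocation to avoid)."""
    joined = "".join(item + SEPARATOR for item in items)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def hash_incremental(items):
    """Feed each item to the hash separately; no joined string is ever built."""
    h = hashlib.sha1()
    sep = SEPARATOR.encode("utf-8")
    for item in items:
        h.update(item.encode("utf-8"))
        h.update(sep)
    return h.hexdigest()


items = ["a.cs", "b.cs", "c.cs"]
# Same bytes in the same order, so the digest is identical.
assert hash_joined(items) == hash_incremental(items)
```

Because the digest does not change, hashes stored by earlier builds would stay valid under this kind of rewrite.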

Additional info:
There was an attempt to improve this function already: #5560

danmoseley · 2021-11-30T15:15:22Z

You might use HashCode.Combine. It's in-box for .NET Core 2.1+, and it's available in a NuGet package for .NET Framework, so no multi-targeting is necessary.

AR-May · 2021-11-30T18:01:56Z

Thank you for the suggestion. I looked at the HashCode struct.
I think the collision rate of the hashing algorithm in HashCode might not be low enough for our purposes. We need it to be low because, as far as I know, there is no further verification: if the hashes coincide, we assume that the objects (lists of files for compilation) coincide too. At the moment SHA-1 is used (we do not care whether it is cryptographic, only about its collision rate), and switching to HashCode may lead to regressions.
I also have not seen an option to set a random seed for HashCode. We need the algorithm to be stable between incremental builds.

danmoseley · 2021-11-30T18:36:59Z

> If hashes coincide we think that objects (list of files for compilation) coincide too.

If you're relying on different objects to have different hash codes, that's not hashing and hash codes should not be used for this. You need some kind of unique identifier.

If your goal is just to have a measurably small rate of collisions, hopefully a failure only results in lower efficiency rather than incorrect behavior. If that is your goal, then yes, you would want to use a sufficiently large cryptographic hash rather than a regular hash code, which has no particular guarantees and could choose to generate poorly distributed codes for efficiency reasons.

> I also have not seen an option to set a random seed for HashCode. We need the algorithm to be stable between incremental builds.

As far as I'm aware, all the hash code generation in the core libraries is stable except for String. We randomize the hash codes of strings by default, to make DoS attacks more difficult. Of course, it is possible to generate your own hash codes for strings if you need stable ones.
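
As an illustration of "generate your own hash codes for strings", here is a minimal sketch in Python (not .NET code) of a deterministic 32-bit FNV-1a hash over a string's UTF-8 bytes. FNV-1a is swapped in purely as an example of a stable, non-randomized hash; the function name is hypothetical, and nothing in this thread says MSBuild or the core libraries use it:

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5  # standard FNV-1a 32-bit offset basis
FNV_PRIME_32 = 0x01000193         # standard FNV-1a 32-bit prime


def stable_string_hash(text):
    """Deterministic 32-bit FNV-1a hash of the UTF-8 bytes of `text`.

    Unlike a per-process randomized string hash, the result depends only
    on the input, so it is the same across runs and machines.
    """
    h = FNV_OFFSET_BASIS_32
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # keep the value within 32 bits
    return h


# The same input always gives the same code.
assert stable_string_hash("a.cs") == stable_string_hash("a.cs")
```

A 32-bit code like this is stable but still collides far more often than SHA-1, so it would not address the collision-rate concern raised above.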

AR-May · 2021-12-01T19:43:10Z

Well, using a hash for this purpose is the current behavior, and the initial goal of this issue was just to remove unnecessary LOH allocations in the hash computation rather than to rethink the whole approach.

Thinking about how the hash function is used: on the one hand, I agree that this usage is uncommon and might indeed be dangerous. A collision, as far as I know, will result in a build error, so the behavior of MSBuild would be incorrect.
On the other hand, the probability of a collision is really very small with SHA-1. Unique identifiers, meanwhile, are sometimes big enough to end up on the Large Object Heap, and we do not want to store and work with them unless necessary.
If the build fails, it will not be a huge problem either - we will just need to do a rebuild instead of an incremental build. After that we will not get further failures, so the error will not be persistent.
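
As a rough back-of-the-envelope figure: assuming SHA-1 behaves like a uniformly random 160-bit function on non-adversarial input (an idealization, not something measured here), the chance that two different item lists get the same hash is about

$$P \approx 2^{-160} \approx 6.8 \times 10^{-49}$$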

cc @rainersigwald
What do you think about that? Also, am I right in my current understanding of how the Hash task is used?

rainersigwald · 2021-12-01T20:28:55Z

The cost of a collision here is silent incorrect underbuild, which is pretty bad as build errors go but was deemed to be acceptable for this case, especially since we shouldn't get any particularly adversarial input. As you say, the workaround is to do a full build.

We're looking at options to improve this, for example #7043. For now I think you're on a fine track @AR-May.

Fixes #7086

### Context
`Hash.Execute()` allocates a string which gets to the large object heap. This could be avoided without changing the resulting hash function.

### Changes Made
Hash function is rewritten.

### Testing
Unit tests & manual testing

AR-May added the performance label Nov 25, 2021

AR-May self-assigned this Nov 25, 2021

This was referenced Nov 25, 2021

ExpandItemVectorIntoString interns string which gets on the large object heap #2678

Closed

Epic: Memory optimizations #6940

Open

ladipro added the size:1 label Dec 6, 2021

AR-May mentioned this issue Dec 20, 2021

Remove unnecessary allocations in Hash task. #7162

Merged

ladipro closed this as completed in #7162 Jan 21, 2022

AR-May added the triaged label Feb 21, 2024
