Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hash.Execute() allocates a string which gets to the large object heap. #7086

Closed
Tracked by #6940
AR-May opened this issue Nov 25, 2021 · 5 comments · Fixed by #7162
Closed
Tracked by #6940

Hash.Execute() allocates a string which gets to the large object heap. #7086

AR-May opened this issue Nov 25, 2021 · 5 comments · Fixed by #7162

Comments

@AR-May
Copy link
Member

AR-May commented Nov 25, 2021

I noticed that Hash.Execute() sometimes allocates a big string.
image

It happens because we the first join all the items into one string and then apply hashing algorithm.
It would be nice to try hash it one by one or use a buffer with fixed length to break this big string into chunks.

Additional info:
There was an attempt to improve this function already: #5560

@danmoseley
Copy link
Member

You might use HashCode.Combine. It's in box for .NET Core 2.1+ but it's available in a NuGet package for .NET Framework so no multi targeting is necessary.

@AR-May
Copy link
Member Author

AR-May commented Nov 30, 2021

Thank you for the suggestion. I looked at HashCode struct.
I think that collision rate of hashing algorithm in HashCode might not suffice for our goals. We need it to be low, because, as far as I know, there is no further verification - if hashes coincide we think that objects (list of files for compilation) coincide too. At this moment sha1 is used (and we do not care about it (not) being cryptographic, rather caring about it's collision ratio) and switching to HashCode may lead to regressions.
I also have not seen an option to set a random seed for HashCode. We need the algorithm to be stable between incremental builds.

@danmoseley
Copy link
Member

If hashes coincide we think that objects (list of files for compilation) coincide too.

If you're relying on different objects to have different hash codes, that's not hashing and hash codes should not be used for this. You need some kind of unique identifier.

If your goal is to just have a measurably small rate of collisions, hopefully a failure just results in lower efficiency rather than incorrect behavior. If this is your goal yes you would want to use a sufficiently large cryptographic hash, not a regular hash code as it has no particular guarantees and could choose to generate poorly distributed codes for efficiency reasons.

I also have not seen an option to set a random seed for HashCode. We need the algorithm to be stable between incremental builds.

As far as I'm aware, all the hash code generation in the core libraries is stable except for String. We randomize the hash codes of string by default, to make DOS attacks more difficult. Of course it is possible to generate your own hashcodes for strings, if you need stable ones.

@AR-May
Copy link
Member Author

AR-May commented Dec 1, 2021

Well, using hash for such goals is current behavior and initial goal of this issue was just to remove unnecessary LOH allocations in hash computations rather than rethinking the whole approach.

Thinking of the usage of hash function, from one point of view, I agree that such usage of hash function is not quite common and indeed might be dangerous. Collision, as far as i know, will result in build error and the behavior of MSBuild thus would be not correct.
From another point of view, probability of the collision is really very small with sha1. Unique identifiers at the same time sometimes are big enough to get in Large Object Heap, we do not want to save and work with them unless necessary.
If the build fails it will not be a huge problem also - we will just need to use rebuild instead of incremental build. After that we will not get further failures, so the error will not be consistent.

cc @rainersigwald
What do you think about that? Also, am i right with my current understanding of the usage of hash task?

@rainersigwald
Copy link
Member

The cost of a collision here is silent incorrect underbuild, which is pretty bad as build errors go but was deemed to be acceptable for this case, especially since we shouldn't get any particularly adversarial input. As you say, the workaround is to do a full build.

We're looking at options to improve this, for example #7043. For now I think you're on a fine track @AR-May.

@ladipro ladipro added the size:1 label Dec 6, 2021
ladipro pushed a commit that referenced this issue Jan 21, 2022
Fixes #7086

### Context
`Hash.Execute()` allocates a string which gets to the large object heap. This could be avoided without changing the resulting hash function.

### Changes Made
Hash function is rewritten.

### Testing
Unit tests & manual testing
@AR-May AR-May added the triaged label Feb 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants