Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent StackoverflowException when hashing lists above ~30k elements #14516

Merged
merged 5 commits into from
Jan 6, 2023

Conversation

T-Gro
Copy link
Member

@T-Gro T-Gro commented Dec 28, 2022

This addresses #1838 and fixes .GetHashCode() for the built-in List<T> type.
List is internally implemented as a recursive DU, and existing codegen emits code which is not tail recursive. If the list gets bigger (~30K elements), StackoverflowException appears when hashing it.

While I understand the desire to find a general solution for all recursive DUs incl. custom ones, I consider a StackoverflowException on the main language's collection type more urgent.

This solution is therefore special-cased only for the list type.
@cartermp tried to address this via standard F# attributes in https://github.com/dotnet/fsharp/pull/9070/files , but it affected the API shape of List which is too dangerous of a change.

This PR solves this by:

  • Adding a new private hashcode method on list
  • Special-casing hashcode generation in the compiler, to understand the combination of List<T> and .CustomHashCode(comparer) member.
  • If that is the case, the generated .GetHashCode(comparer) simply calls into .CustomHashCode(comparer)

As of now, I would tend against exploring on how to offer this mechanism for other types (e.g. a new CustomHashCodeOnlyAttribute), and for sure not part of this PR because the SO for list is much more visible.
Reason is, the existing mechanisms for custom types offer a solution and enforce treating GetHashCode and Equals together => for custom types which looser backwards compatibility limits, I believe it is sufficient..

The implementation follows the one for Array, which hashes based on the first 18 elements.
Note that this is not necessary, I only did it for consistency with array.

@T-Gro
Copy link
Member Author

T-Gro commented Dec 28, 2022

Open questions:

  1. Should we hash based on first 18 elements to be consistent with array (hashing performance), or hash full contents (lower chance of collisions if difference comes later)

  2. Should any other built-in type receive similar treatment?

  3. How should this be released? The codegen can of course be added behind a language switch, but it is only relevant when Fsharp.Core dll is being built.
    Is there any benefit in linking this to a language version, when in reality it only affects people building Fsharp.Core from source?

@T-Gro T-Gro added Bug Area-Library Issues for FSharp.Core not covered elsewhere labels Dec 28, 2022
@T-Gro T-Gro self-assigned this Dec 28, 2022
@T-Gro T-Gro linked an issue Dec 28, 2022 that may be closed by this pull request
@T-Gro T-Gro marked this pull request as ready for review December 28, 2022 18:04
@T-Gro T-Gro requested a review from a team as a code owner December 28, 2022 18:04
@T-Gro T-Gro requested review from vzarytovskii and 0101 December 28, 2022 18:05
@0101
Copy link
Contributor

0101 commented Jan 2, 2023

  1. Should we hash based on first 18 elements to be consistent with array (hashing performance), or hash full contents (lower chance of collisions if difference comes later)

I'd probably stick with doing it the same as we do with array.

  1. Should any other built-in type receive similar treatment?

Can there be a SO with any other type? Maybe Set?

  1. How should this be released? The codegen can of course be added behind a language switch, but it is only relevant when Fsharp.Core dll is being built.
    Is there any benefit in linking this to a language version, when in reality it only affects people building Fsharp.Core from source?

Probably just bump FSharp.Core version.

@T-Gro
Copy link
Member Author

T-Gro commented Jan 2, 2023

  1. Should we hash based on first 18 elements to be consistent with array (hashing performance), or hash full contents (lower chance of collisions if difference comes later)

I'd probably stick with doing it the same as we do with array.

  1. Should any other built-in type receive similar treatment?

Can there be a SO with any other type? Maybe Set?

  1. How should this be released? The codegen can of course be added behind a language switch, but it is only relevant when Fsharp.Core dll is being built.
    Is there any benefit in linking this to a language version, when in reality it only affects people building Fsharp.Core from source?

Probably just bump FSharp.Core version.

Just checked:

  • Set works fine
  • Map works fine
  • Version bump happens out of the box for every release of the nuget

0101
0101 previously approved these changes Jan 2, 2023
KevinRansom
KevinRansom previously approved these changes Jan 3, 2023
@vzarytovskii
Copy link
Member

The problem that can theoretically arise is that if someone persists their hashes somewhere, it would be a breaking change for them.

@vzarytovskii vzarytovskii requested a review from dsyme January 3, 2023 20:08
@T-Gro
Copy link
Member Author

T-Gro commented Jan 3, 2023

The problem that can theoretically arise is that if someone persists their hashes somewhere, it would be a breaking change for them.

The .NET guarantee is HashCode being stable only for the same process run.
So even two runs of the same program (same version etc.) can have different HashCode (and I think it is, e.g. string's hashcode is randomized) and must not be relied upon.

psfinaki
psfinaki previously approved these changes Jan 4, 2023
@T-Gro
Copy link
Member Author

T-Gro commented Jan 5, 2023

Update after chatting with @dsyme

I will change the code to hash based on the full contents, not just the first 18 elements.

Reason:
For uses cases hashing over 18 elements (and less where it started to crash, obviously) could see a performance degradation.

Devil's advocate:
A big amount of lists that have first 18 elements constant, and only vary in what follows.
With this change, they would all get the same hashcode and therefore hash to the same bucked in dictionary implementations.

@T-Gro T-Gro dismissed stale reviews from psfinaki, KevinRansom, and 0101 via d796a1c January 5, 2023 11:58
@dsyme
Copy link
Contributor

dsyme commented Jan 5, 2023

A big amount of lists that have first 18 elements constant, and only vary in what follows.

Plausible example - "I'm hashing the files in source repositories, which are represented as lists of lines. Now they all have the same hash code, because each has the same license at the top!"

:)

dsyme
dsyme previously requested changes Jan 5, 2023
Copy link
Contributor

@dsyme dsyme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to look at the IStructuralEquatable GetHashCode imlpementation too?

@dsyme dsyme dismissed their stale review January 5, 2023 13:10

OK, just needed to check things a little mote

@T-Gro
Copy link
Member Author

T-Gro commented Jan 5, 2023

Need to look at the IStructuralEquatable GetHashCode imlpementation too?

It calls to the same method in the end.
(the CustomHashCode accepts comparator as argument, so both regular .HashCode() as well as .HashCode(comparator) for the purpose of IStructuralEquatable are supported.)

src/FSharp.Core/prim-types.fs Outdated Show resolved Hide resolved
Copy link
Contributor

@dsyme dsyme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change requested see comment

@T-Gro
Copy link
Member Author

T-Gro commented Jan 5, 2023

Need to look at the IStructuralEquatable GetHashCode imlpementation too?

It calls to the same method in the end. (the CustomHashCode accepts comparator as argument, so both regular .HashCode() as well as .HashCode(comparator) for the purpose of IStructuralEquatable are supported.)

Proof:
image

@dsyme
Copy link
Contributor

dsyme commented Jan 5, 2023

@T-Gro Separately - regarding our earlier discussion on Teams - if people want to limit their hashing they should use this little known piece of magic in FSharp.Core https://github.com/dotnet/fsharp/blob/main/src/FSharp.Core/collections.fsi#L104C8-L124.

If we were to make things more consistent, then really we should hash all of arrays all of the time for unlimited hashing (though that's also a risky change to make and I don't recommend it). Likewise my understanding is .NET hashes all of strings, for example (as does both F# limited and unlimited hashing at the moment)

@dsyme
Copy link
Contributor

dsyme commented Jan 5, 2023

Proof

Yes, agreed, I looked through and checked that, thanks

This still needs changing I think: #14516 (review)

@T-Gro
Copy link
Member Author

T-Gro commented Jan 5, 2023

Proof

Yes, agreed, I looked through and checked that, thanks

This still needs changing I think: #14516 (review)

Done, ready for re-review by everyone.

@T-Gro T-Gro merged commit d9576ca into main Jan 6, 2023
@T-Gro T-Gro deleted the list-hashcode-stackoverflow branch January 6, 2023 09:56
@abelbraaksma
Copy link
Contributor

Wow, I’m so glad this bug got resolved!! Amazing work!

Id love to see this backported into 6.x, but as you’ve said before, that’s not standard policy. It was such a big bug, and SOE cannot be captured. Funny, in all these years I never knew about the LimitedStructural type.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Library Issues for FSharp.Core not covered elsewhere Bug release/17.6
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Stack overflow on calling GetHashCode on large list
7 participants