Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix sorting bugs (esp MissingOptimization) that come up when using SortingAlgorithms.TimSort #50171

Merged
merged 7 commits into from
Jun 16, 2023

Conversation

LilithHafner
Copy link
Member

@LilithHafner LilithHafner commented Jun 14, 2023

These bugs in Base led to a bug in DataFrames.jl. The bugs this fixes are regressions introduced in 1.9.

The systemic issue is that Base only uses and therefore only tests a subset of its sorting internals, while SortingAlgorithms.jl uses a larger subset, and SortingAlgorithms.jl has poor testing.

More specifically, SortingAlgorithms.TimSort is defined as Base.Sort.InitialOptimizations(TimSortAlg()), which results in some preprocessing taking place before the Base.sort!(v::AbstractVector, ::TimSortAlg, ::Ordering) method gets called. This preprocessing results in v sometimes being a Base.Sort.WithoutMissingVector. TimSort calls v[b] where b is a unit range (why? idk, it would probably be more performant not to but I haven't read the whole algorithm). This is a perfectly reasonable thing to do, but getindex(::Base.Sort.WithoutMissingOptimization, ::UnitRange) was buggy. This PR fixes that.

Also, there was a bug in the fastpath for filtering missings to the end under a Perm ordering. We (I) assumed that lo:hi == eachindex(v), which is not guaranteed. It's pretty hard to trigger this bug without calling a 5-arg sort! method, but TimSort does call such a method as its base case, so this branch is broken there too. Again, this PR removes the buggy assumption that lo:hi == eachindex(v).

Finally, the mechanism to implicitly convert the old 3- 5- or 6-argument sort! calling convention to the new 4-argument _sort! and back without triggering infinite loops when neither is defined was too strict about infinite loop detection. It prohibits anything that calls into sorting using the old convention, implicitly converts to the new convention, and then at any point later implicitly converts back to the old convention. This prohibits, for example, calling sort!(v, SortingAlgorithms.TimSort, Base.Order.Forward) because SortingAlgorithms.TimSort is defined as Base.Sort.InitialOptimizations(SortingAlgorithms.TimSortAlg()), so that old-style sort! call is implicitly converted to the new style, benefits from modern intermediary layers, and then implicitly converted back to the old style for TimSortAlg. This throws because of the conversion back and forth. Technically, this bug isn't visible in the public API because InitialOptimizations is internal, as is _sort! but it feels bad enough to be worth backporting a fix. The fix is to switch from tracking "has there been a legacy dispatch into this sorting call chain" to "what was the alg that was used at that entrypoint, if any". Then, we only throw if implicitly convert to the _sort! system and then implicitly convert back out without unwrapping a single layer of the algorithm, which is a much tighter bound than we had before.

Also makes a misleading error message not misleading, which I'd also call a bugfix in this case.

@LilithHafner LilithHafner added bugfix This change fixes an existing bug missing data Base.missing and related functionality sorting Put things in order backport 1.9 Change should be backported to release-1.9 labels Jun 14, 2023
@LilithHafner LilithHafner added this to the 1.10 milestone Jun 14, 2023
@LilithHafner LilithHafner changed the title Fix some MissingOptimization bugs that come up when using SortingAlgorithms.TimSort Fix sorting bugs (esp MissingOptimization) that come up when using SortingAlgorithms.TimSort Jun 15, 2023
@LilithHafner
Copy link
Member Author

@nanosoldier runtests(["SortingAlgorithms"])

@nanosoldier
Copy link
Collaborator

Your package evaluation job has completed - no new issues were detected.
A full report can be found here.

@LilithHafner
Copy link
Member Author

1 packages passed tests only on the current version.

SortingAlgorithms v1.1.1: ok vs. fail

@LilithHafner
Copy link
Member Author

@nanosoldier runtests(["DataFrames"])

@nanosoldier
Copy link
Collaborator

Your package evaluation job has completed - no new issues were detected.
A full report can be found here.

@LilithHafner LilithHafner merged commit ba251e8 into master Jun 16, 2023
@LilithHafner LilithHafner deleted the sort-MissingOptimization-bugfixes branch June 16, 2023 20:49
@KristofferC KristofferC mentioned this pull request Jun 26, 2023
36 tasks
KristofferC pushed a commit that referenced this pull request Jun 27, 2023
…SortingAlgorithms.TimSort (#50171)

(cherry picked from commit ba251e8)
KristofferC pushed a commit that referenced this pull request Jun 27, 2023
…SortingAlgorithms.TimSort (#50171)

(cherry picked from commit ba251e8)
@KristofferC KristofferC removed the backport 1.9 Change should be backported to release-1.9 label Jul 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix This change fixes an existing bug missing data Base.missing and related functionality sorting Put things in order
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants