Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding _by, by_key, largest variants of k_smallest #654

Merged
merged 1 commit into from
Feb 26, 2024

Conversation

ejmount
Copy link
Contributor

@ejmount ejmount commented Oct 19, 2022

This PR resolves #586 by adding the following public functions

k_smallest_by
k_smallest_by_key
k_largest
k_largest_by
k_largest_by_key

I used a custom binary heap to implement them (and rewrite k_smallest) so that the logic could be agnostic as to where the memory came from. This is because I was originally motivated to make this PR by implementing k_smallest for const sizes so as to not require an allocator. Using a custom heap implementation makes that extension straightforward, but I've not included it just now to make this first PR more manageable. Unfortunately, this requires an MSRV of 1.36, although the clippy config seems to indicate that's already the case.

(The heap type is not publicly exposed to crate users, it is only implementation detail.)

The quickcheck test for k_smallest is completely untouched and still passes, so I am confident the heap logic is correct. I've also added doctests to some but not all of the new functions, which also pass, Let me know if you feel more documentation (in- or externally facing) is useful.

@ejmount ejmount marked this pull request as draft October 19, 2022 23:58
@ejmount ejmount marked this pull request as ready for review October 20, 2022 00:03
Cargo.toml Outdated Show resolved Hide resolved
src/k_smallest.rs Outdated Show resolved Hide resolved
@phimuemue
Copy link
Member

Hi there! Thanks for tackling this - I'm happy you went for it.

I hope I don't miss the forest for the trees, but I am pondering whether unsafe is really needed here. It seems that it is used here because MaxHeap manually tracks memory via storage: &mut [MaybeUninit<T>] and len: usize. But wouldn't a Vec be perfectly fine for this? (Rust's own BinaryHeap uses Vec, too.)

Moreover I think that we could benefit from inlining some functions:

  • push_pop: It checks self.storage.is_empty internally, but I think we could use an early out for this case at the call site.
  • fn extend<T: IntoIterator<Item = A>>(&mut self, iter: T): I think the users do not benefit if we implement this trait for an internal data structure. It deals with stuff that is not used in practice (such as respecting self.len, although the only call to it assures that self.len==0), and thus makes things a bit more complex than necessary.
  • total_sort/capped_heapsort: They transport information of internal data structures, and as it stands now, one has to go follow function calls to find out what these values actually mean. If everything was locally, this would be a bit easier to grasp.

@ejmount
Copy link
Contributor Author

ejmount commented Oct 21, 2022

I hope I don't miss the forest for the trees, but I am pondering whether unsafe is really needed here. It seems that it is used here because MaxHeap manually tracks memory via storage: &mut [MaybeUninit<T>] and len: usize. But wouldn't a Vec be perfectly fine for this? (Rust's own BinaryHeap uses Vec, too.)

It would do for the current implementation, but in the future I wanted to be able to do something like,

fn k_smallest_static<'a, T, I, F, const K: usize>(iter: I, order: F) -> impl Iterator<Item = T>
where
    T: 'a,
    I: Iterator<Item = T>,
    F: Fn(&T, &T) -> std::cmp::Ordering,
{
    let mut storage: [MaybeUninit<T>; K] = /* ... */;
    let num_elements = capped_heapsort(iter, &mut storage, order);
    storage
        .into_iter()
        .take(num_elements)
        .map(|e| unsafe { e.assume_init() })
}

and have it be usable without any dependency on alloc.

Moreover I think that we could benefit from inlining some functions:

I generally try to write code that's robust against the caller making mistakes, although I'd agree with you that that's created overhead here, so I've changed the flow of creating and consuming the heap with the main differences being

  • capped_heapsort short-circuits on zero-length storage, so all the checks for that case disappear from the heap itself
  • The extend logic is merged into the construction process, with no trait. This also means it is (locally) obvious that the heap will start empty, so the adjustments for self.len disappear too.
  • total_sort is inlined into capped_heapsort, which now returns a (local) Range for better context of what the return value signifies

@phimuemue
Copy link
Member

phimuemue commented Oct 22, 2022

Hi, thanks for the feedback.

It would do for the current implementation, but in the future I wanted to be able to do something like,

fn k_smallest_static<'a, T, I, F, const K: usize>(iter: I, order: F) -> impl Iterator<Item = T>

Noble motives, I like that (pretty much, actually). However, I'd still argue to simply go with Vec for now, because if we want to support static-variants of our algorithms, we probably also want other algorithms to work with Vec or (essentially) ArrayVec (e.g. min_set and its cousins). And in that case, it's probably much easier to go through the codebase, look for Vec, and replace those by a trait VecOrArrayVec instead of tying the static/dynamic decision to one particular algorithm (in this case: k_smallest). That is, i'd like to move the memory management issues out of the algorithms into the containers used by the algorithms, and make the algorithms accept different containers. @jswrenn Do you have thoughts about this?

Then, I'd imagine the algorithm unsafe-free and something like:

  • let container = iter.by_ref().take(k).collect()
  • if iterator already exhausted, sort container using a sorting function (possibly avoiding heapify if it helps speed)
  • if there's more to come from the iterator, push_pop the remaining elements, and sort afterwards (possibly exploiting heap-structure if it helps speed). (In this case, we know that all elements in the container are initialized.)

On top, I'd suggest more inlining:

  • capped_heapsort: It internally checks for storage.is_empty, which could be done in the caller (avoiding to call it entirely) by testing k==0. Also, for someone new, the meaning of the range returned from capped_heapsort is probably still non-obvious, (in addition of its begin (that is always 0) being ignored at the single call site).
  • from_iter: As it stands now, it seems to rely e.g. on heap.len being 0 - something that's obvious if it was inlined.
  • push_pop: Calls e.g. self.get(0), which lifts values into Option. However, push_pop is only called for non-empty storage, which would be more obvious if it was inlined.
  • To add to the previous point: Having a compare going through Options might be a good idea, but I was thinking if we could get rid of the detour through Option entirely (without too much hassle).

src/k_smallest.rs Outdated Show resolved Hide resolved
ejmount added a commit to ejmount/itertools that referenced this pull request Oct 22, 2022
Being generic over storage at the algorithm level is a
neater strategy (see rust-itertools#654)
@ejmount
Copy link
Contributor Author

ejmount commented Oct 22, 2022

Noble motives, I like that (pretty much, actually). However, I'd still argue to simply go with Vec for now, because if we want to support static-variants of our algorithms, we probably also want other algorithms to work with Vec or (essentially) ArrayVec (e.g. min_set and its cousins). And in that case, it's probably much easier to go through the codebase, look for Vec, and replace those by a trait VecOrArrayVec instead of tying the static/dynamic decision to one particular algorithm (in this case: k_smallest). That is, i'd like to move the memory management issues out of the algorithms into the containers used by the algorithms, and make the algorithms accept different containers.

I think this is a much better strategy than what I had in mind, so I've restructured the heap logic to use a plain Vec for now with an eye to replacing it with a trait in the future. That also neatly eliminated all of the unsafe calls.

if iterator already exhausted, sort container using a sorting function (possibly avoiding heapify if it helps speed)

While I don't have benchmarks, I can't imagine many circumstances where heapify would be slower, since it works in O(k) time rather than the O(k*ln k) required for a sort.

  • capped_heapsort: It internally checks for storage.is_empty, which could be done in the caller (avoiding to call it entirely) by testing k==0. Also, for someone new, the meaning of the range returned from capped_heapsort is probably still non-obvious, (in addition of its begin (that is always 0) being ignored at the single call site).

Most of the reason for that check being placed in capped_heapsort rather than earlier was because I was expecting that function to have a static sized caller as well, which is no longer needed. In the new version, capped_heapsort has disappeared entirely into the caller, and being able to rely on truncate means there's nothing to return even if it did still exist.

  • from_iter: As it stands now, it seems to rely e.g. on heap.len being 0 - something that's obvious if it was inlined.

I'm afraid I don't follow - from_iter knows that the heap begins empty because it is created there in the first place. This is hopefully clearer in the new version that uses collect as you were suggesting earlier.

  • push_pop: Calls e.g. self.get(0), which lifts values into Option. However, push_pop is only called for non-empty storage, which would be more obvious if it was inlined.
  • To add to the previous point: Having a compare going through Options might be a good idea, but I was thinking if we could get rid of the detour through Option entirely (without too much hassle).

compare goes through Option primarily for the benefit of sift_down, where the difference between elements not existing and already being in the right order is immaterial. It does have the benefit that push_pop works on empty storage, but as you say, that never happens in practice, so I'm not sure if it's worth bringing attention to.

Copy link
Member

@phimuemue phimuemue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, looks much better already.

I hope I'm not bothering you too much, but please do the following:

  • Inline all auxiliary functions that are called exactly once (in particular from_iter, heapify, push_pop (which should not use compare, but just use the comparator directly without going through Option), unwrap_sorted, pop). Many of them manipulate internal data structures, and if they're inlined it's easier to grasp what they can and cannot assume about the current state.
  • After doing the above, I think we do not need MaxHeap::len anymore, and use a local variable instead. Maybe more possibilities for simplification arise.
  • Make MaxHeap::storage a &mut [T] (that can be backed by a Vec, ArrayVec, Array, and possibly others). I think we want to have contiguous memory, but that's about it. If everything is inlined, we may even be able to avoid struct MaxHeap altogether, being able to simply deal with local variables.

Cargo.toml Outdated Show resolved Hide resolved
src/k_smallest.rs Outdated Show resolved Hide resolved
src/k_smallest.rs Outdated Show resolved Hide resolved
src/k_smallest.rs Outdated Show resolved Hide resolved
src/k_smallest.rs Outdated Show resolved Hide resolved
@phimuemue
Copy link
Member

Probably this should also include tests (maybe via quickcheck, comparing k_smallest... against collecting into a vector, sorting, and then picking the first/last k elements.

ejmount added a commit to ejmount/itertools that referenced this pull request Oct 23, 2022
Being generic over storage at the algorithm level is a
neater strategy (see rust-itertools#654)
src/k_smallest.rs Outdated Show resolved Hide resolved
@ejmount
Copy link
Contributor Author

ejmount commented Oct 27, 2022

I've been making updates to this as time allows, and I've just pushed the following changes

  • the calculation for child node indices is now correct for 0
  • sift_down is now iterative
  • All uses of Option have been removed and all the bounds checking logic is now explicit
  • all of the single-call auxillary heap functions have been inlined, and now work on locals rather than an explicit heap type. len actually disappeared entirely because its part of the slice being passed.

I've not added any new tests, but will see about doing so when I get more time.

Re; keyed sorts, I'm afraid I don't quite understand the part of your comment about the Vec<T> being fine. Just to make sure we're all on the same page, my understanding is that there are two ways to do this with a single allocation:

  1. time/eval efficiently:
    fn k_smallest_by_key<T, I, F, K>(iter: I, k: usize, key: F) -> impl Iterator<Item=T>
    where
        I: Iterator<Item = T>,
        F: Fn(&T) -> K,
        K: Ord,
    {
        let iter = iter.map(|v| Pair(key(&v), v));
        k_smallest_general(iter, k, Ord::cmp).map(|Pair(_, t)| t)
    }
  2. space efficiently:
    fn k_smallest_by_key<T, I, F, K>(iter: I, k: usize, key: F) -> VecIntoIter<T>
    where
        I: Iterator<Item = T>,
        F: Fn(&T) -> K,
        K: Ord,
    {
        k_smallest_general(iter, k, |a,b| key(&a).cmp(key(&b)))
    }

There is, AFAIK, no way to do this time efficiently using only a Vec<T>

If we only have one of them, my inclination would be that it should be the first one, because I expect the use cases where callers care about the extra space required would be rare enough that it wouldn't be a burden to use the 2nd version explicitly. I also don't see the return type change being too much of a problem, because I expect that almost all uses of the function will want to iterate over the results rather than collect them. (And that k is unlikely to be large enough that it's a major problem even if you did want to collect them)

Additionally, in the standard library, sort_by_cached_key came much later than sort_by_key, which (given the amount of overhead involved in new features) suggests to me that sort_by_key not caching originally wasn't a design choice that was considered and rejected but rather missed entirely.

However, the call is up to you, and I'd appreciate if you could specify whether you want the first, second or both options included.

@phimuemue
Copy link
Member

Re; keyed sorts, I'm afraid I don't quite understand the part of your comment about the Vec being fine. Just to make sure we're all on the same page, my understanding is that there are two ways to do this with a single allocation:

Well described. Please go for the space efficient solution.

@ejmount
Copy link
Contributor Author

ejmount commented Oct 28, 2022

Done. With regard to tests, would you be able to be more specific about what you think should be covered by quickcheck? I assume you don't mean all six variations separately, since they're mostly implemented in terms of each other.

@phimuemue
Copy link
Member

phimuemue commented Oct 28, 2022 via email

@ejmount
Copy link
Contributor Author

ejmount commented Oct 28, 2022

I've just added a quickcheck that tests all 6 variations against fully sorting the input data, let me know if you think there should be anything else added

@ejmount
Copy link
Contributor Author

ejmount commented Oct 28, 2022

The build failed earlier because of an accidental reference to std::vec::Vec;, which is now corrected

@TheDudeFromCI
Copy link

What else is this PR waiting on?

@phimuemue
Copy link
Member

Hi there, first of all: Thanks @ejmount for staying in touch.

I saw that you implement k_smallest in terms of your custom heap (i.e. it delegates to k_smallest_by). However, this potentially degrades performance, as Rust's BinaryHeap uses "holes" to improve performance.

Writing an own binary heap turns out to be tricky, so I'd lean to go with a safe implementation for now, sacrificing the holes for the more general variants, but possibly state this in the documentation. We can then try later to write a safe custom heap.

That is, before merging I'd suggest:

  • The existing k_smallest should use Rusts own BinaryHeap to ensure performance does not degrade.
  • New k_smallest_...-variants should for now use a safe custom binary heap (I don't feel competent to review an unsafe binary heap), but should state this in the docs.
  • If possible, the existing k_smallest and the new variants should share code, possibly abstracting the heap type via traits.

Before continuing on this, I'd ask @jswrenn and/or @scottmcm for their opinion on this.

@ejmount
Copy link
Contributor Author

ejmount commented Feb 19, 2023

This came back to my attention recently and I've made the changes you've suggested for k_smallest, (although not k_largest, it having no existing callers) but I couldn't see any particularly good way to share functionality between that and the other variants, given that k_smallest_general is very self-contained.

AFAICT, the code as it stands is identical, in both functionalty and performance, from the POV of any existing users. Assuming my upate just now addresses your concerns in the previous post, are there any other blockers for this being merged? An earlier comment is still flagged as outstanding, but I don't think it's still applicable

@douweschulte
Copy link

Great work! I was just looking for this functionality as I need k_largest_by for an algorithm I am working on. I see the PR has become a bit stale, is there something I can do to help this PR over the finish line?

@Philippe-Cholet
Copy link
Member

@douweschulte I don't know the details about this one. It's definitely something I'd like to see merged at some point though.
@phimuemue Can you shed some light on those extensions of k_smallest?

@Philippe-Cholet
Copy link
Member

I would like to revive this (most-loved) PR, and update this myself if needed. Based on #654 (comment):

@jswrenn and/or @scottmcm What is your opinion on this?

@douweschulte
Copy link

I have some additional info now as well. Since the above comment I have copy pasted this PRs code into a couple of my own projects. I have only used the k_largest function but it has worked wonders!

Copy link
Member

@phimuemue phimuemue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this. I skimmed this. I'd like some changes, mainly to make the code more compact to avoid jumping back and forth while reading (local functions, narrowly scope some variables).

On top, I suggest to return Vec in all cases. (Seems canonical with what we already have.)

@Philippe-Cholet I'd like to hand this over to you, as I do not have enough capacity right now. Please let me know if that's not ok.

src/k_smallest.rs Outdated Show resolved Hide resolved
src/k_smallest.rs Outdated Show resolved Hide resolved
src/k_smallest.rs Outdated Show resolved Hide resolved
src/k_smallest.rs Outdated Show resolved Hide resolved
src/k_smallest.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated
Self: Sized,
Self::Item: Ord,
{
self.k_smallest_by(k, k_smallest::reverse_cmp(Self::Item::cmp))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this return self.k_smallest_by(k, |x| x)? (Would probably allow us to use reverse_cmp in only k_largest_by.)

@Philippe-Cholet Philippe-Cholet self-assigned this Feb 19, 2024
@Philippe-Cholet
Copy link
Member

Philippe-Cholet commented Feb 20, 2024

@phimuemue I assigned myself here so I agree to take this off your hands.
Agree to make the code more compact.
(about Vec, see other comment)

@ejmount Hello, I was not sure you would still be around so it's nice to see you here.
I'll be around as well and this is definitely top-prio for me.
Since the time you worked on this, the repo has been finally rustfmt'ed (and CI reject compiler and clippy warnings) so I think the current conflicts are just about formatting. Personally, I would first squash everything and rebase on master to start clean. Or we can end by that, as you wish.
The main interest would be to be able run the CI.

Copy link
Member

@Philippe-Cholet Philippe-Cholet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not adding much since phimuemue already did a bunch here.

I did not look the documentation of the new methods yet. I'll do it at the end.

src/k_smallest.rs Outdated Show resolved Hide resolved
src/k_smallest.rs Outdated Show resolved Hide resolved
src/k_smallest.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
ejmount added a commit to ejmount/itertools that referenced this pull request Feb 22, 2024
Being generic over storage at the algorithm level is a
neater strategy (see rust-itertools#654)
@ejmount
Copy link
Contributor Author

ejmount commented Feb 22, 2024

Hi @Philippe-Cholet

Since the time you worked on this, the repo has been finally rustfmt'ed (and CI reject compiler and clippy warnings) so I think the current conflicts are just about formatting. Personally, I would first squash everything and rebase on master to start clean. Or we can end by that, as you wish.

I've rebased all of this branch on top of the latest master, so that should clear up any formatting conflicts. (And that's very handy that I don't have to be careful not to check in unrelated formatting, since I run fmt by default)

I've also done my best to address all the review comments you and @phimuemue have left, let me know if anything still needs to be changed. I'm not sure if GH understands what just happened to the history, but here is the combined diff since phimuemue's last comment for convenience.

Copy link
Member

@Philippe-Cholet Philippe-Cholet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the changes here, I think we are good 🎉.
If it does not bother you, add some periods to sentences (I'm probably picky but it would help reading).

I'll update the documentation of Itertools methods later (like links, formatting preferences).
GH history seems fine to me. However, at the end, I would prefer if you squashed your commits.

src/lib.rs Outdated Show resolved Hide resolved
tests/test_std.rs Outdated Show resolved Hide resolved
src/k_smallest.rs Show resolved Hide resolved
src/k_smallest.rs Outdated Show resolved Hide resolved
src/k_smallest.rs Outdated Show resolved Hide resolved
// Also avoids unexpected behaviour with restartable iterators
iter.for_each(|val| {
// `for_each` is potentially more performant for deeply nested iterators, see its docs.
if is_less_than(&val, &mut storage[0]) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

&storage[0] is enough.
I thought clippy unnecessary_mut_passed would catch that, but no. Maybe because it's a closure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't even flag it if it the closure is changed to take &I::Item specifically. Wonder if that's a clippy bug

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code for the lint seems to deliberately ignore closures. I searched "closure unnecessary_mut_passed" but nothing helpful.

src/k_smallest.rs Show resolved Hide resolved
@Philippe-Cholet Philippe-Cholet added this to the next milestone Feb 23, 2024
Copy link

codecov bot commented Feb 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 94.61%. Comparing base (6814180) to head (2cd44fb).
Report is 18 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #654      +/-   ##
==========================================
+ Coverage   94.38%   94.61%   +0.22%     
==========================================
  Files          48       48              
  Lines        6665     6701      +36     
==========================================
+ Hits         6291     6340      +49     
+ Misses        374      361      -13     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ejmount
Copy link
Contributor Author

ejmount commented Feb 24, 2024

Github seems to want @phimuemue's original comment checked off before this can land - I'm not sure if there's anything I can do about that on my end given all their comments have been resolved in the meantime.

Copy link
Member

@Philippe-Cholet Philippe-Cholet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this 🎉

@phimuemue It seems a previous review of yours is blocking us.
I'll update documentation of Itertools methods after this (just in case you review).

Copy link
Member

@phimuemue phimuemue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Philippe-Cholet, thanks @ejmount. I have only capacity to skim this, but looks good to me.

OT: I saw you kept VecIntoIter - fine with me, although I wonder why k_smallest and min_set should return different types. - Maybe we revisit this somewhen in the future.

@Philippe-Cholet
Copy link
Member

Philippe-Cholet commented Feb 24, 2024

@phimuemue Prior to this and now, k_smallest returns VecIntoIter (change that would be a breaking change) so for me variants should do the same. And if the user collect to a vector immediately then there is no reallocation, so that's not an issue.
It's true that different return types for such similar things is a bit inconsistent.

EDIT: sorted (& Co.) also return VecIntoIter. min_set (& Co.) returning vectors are the exception, not the rule.

@Philippe-Cholet Philippe-Cholet added this pull request to the merge queue Feb 26, 2024
Merged via the queue into rust-itertools:master with commit 16ce601 Feb 26, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extend k_smallest with largest and key options
6 participants