Add helper method for taking the k smallest elements in an iterator #473

Merged 13 commits on Dec 14, 2020

Contributor

@nbraud nbraud commented Sep 1, 2020

No description provided.

Contributor Author

nbraud commented Sep 1, 2020

Many thanks to @Selicre and whoever else in @rustfurs provided feedback.

Contributor Author

nbraud commented Sep 1, 2020

I started adding another test, but I'll keep figuring out how to generate arbitrarish iterators for another day.

src/lib.rs (outdated)
Comment on lines +2303 to +2323
.into_sorted_vec()
.into_iter()

Would be cool to use into_iter_sorted, too bad that's unstable.

Contributor

Well that sounds like something I can push for: rust-lang/rust#76234

/// itertools::assert_equal(five_smallest, 0..5);
/// ```
#[cfg(feature = "use_std")]
fn k_smallest(self, k: usize) -> VecIntoIter<Self::Item>

Not sure what itertools' policy is, but maybe if the return type were newtype'd as KSmallestIter or something, it would allow changing it in the future to something more efficient?

Contributor Author

That's a good point, I'll address that tomorrow <3

Contributor Author

I forgot to mention, but that doesn't seem feasible while keeping the guarantee that .collect::<Vec<_>>() is “free” (in-place, O(1) time). The stdlib's vec iterator can do it, but does so using specialisation (which isn't stabilised yet).
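
To make the newtype suggestion concrete, here is a minimal sketch of what such a wrapper might look like. `KSmallestIter` is the name floated in the review, but the struct, its `new` constructor, and its layout are assumptions for illustration only, not code from this PR. The wrapper forwards to `std::vec::IntoIter`, so collecting it goes through the generic `Iterator` path rather than any `Vec`-specific fast path, which is the trade-off described above.

```rust
use std::vec::IntoIter;

/// Hypothetical newtype around the current return type (not itertools API).
pub struct KSmallestIter<T> {
    inner: IntoIter<T>,
}

impl<T> KSmallestIter<T> {
    /// Hypothetical constructor: wraps an already-sorted vector.
    pub fn new(sorted: Vec<T>) -> Self {
        KSmallestIter { inner: sorted.into_iter() }
    }
}

impl<T> Iterator for KSmallestIter<T> {
    type Item = T;

    // Delegate to the wrapped vec iterator.
    fn next(&mut self) -> Option<T> {
        self.inner.next()
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        self.inner.size_hint()
    }
}
```

The benefit of the wrapper is that the inner representation can change later without breaking callers; the cost is losing whatever the standard library does specifically for `vec::IntoIter`.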

src/k_smollest.rs (outdated)
/// less than k elements, the result is equivalent to `self.sorted()`.
///
/// This is guaranteed to use `k * sizeof(Self::Item) + O(1)` memory
/// and `O(n log k)` time, with `n` the number of elements in the input.
Contributor

@scottmcm scottmcm Sep 2, 2020

FWIW, this is not what I was expecting this to be. Instead, I was expecting a wrapper around partition_at_index so that it'd be O(n) space & time.

What's implemented is interesting, though -- the lower memory use and O(n log k) is an interesting alternative to the O(n + k log n) of BinaryHeap::from_iter(it).into_iter_sorted().take(k) -- but maybe take some time to figure out a way to communicate these differences in the name.

Contributor Author

@scottmcm O(n) time is impossible in the general case. Otherwise, it.k_smallest(n) would sort a sequence of length n in linear time.

Re: BinaryHeap::from_iter(it).into_iter_sorted().take(k), wouldn't this have strictly worse performance characteristics, regarding both time and space? I need to think some more about it once caffeinated.
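
For readers following the complexity argument, the bounds quoted from the doc comment (memory for `k` items, `O(n log k)` time) can be reached with a max-heap capped at `k` elements. The sketch below is an illustration under that assumption, not the code merged in this PR; `k_smallest_sketch` is a hypothetical name.

```rust
use std::collections::BinaryHeap;

/// Hypothetical helper: returns the `k` smallest items in ascending order.
fn k_smallest_sketch<T: Ord, I: Iterator<Item = T>>(mut iter: I, k: usize) -> Vec<T> {
    if k == 0 {
        return Vec::new();
    }
    // Seed a max-heap with the first k items (fewer if the input is short).
    let mut heap: BinaryHeap<T> = iter.by_ref().take(k).collect();
    for item in iter {
        // The heap's top is the largest of the current k candidates.
        if let Some(mut top) = heap.peek_mut() {
            if item < *top {
                // Replace the largest candidate; the heap is repaired
                // when `top` goes out of scope.
                *top = item;
            }
        }
    }
    // Sorting the k survivors costs O(k log k), within the O(n log k) bound.
    heap.into_sorted_vec()
}
```

With `k` equal to the input length this degenerates into a full sort, which is the point made above about linear time being out of reach for sorted output.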

Contributor

It's absolutely possible if this only needs to find the k smallest elements and not also sort those elements -- the fact that this is also sorting those elements wasn't obvious from the title. (And I would say, the extra k log k of sorting them is unfortunate if someone just wants the smallest k and doesn't need them sorted. It would be nice to be able to turn them back into a vec::IntoIter for the people who would be fine with heap order, like .k_smallest(10).sum().)

So there are plenty of options here:

| Algorithm | Space | Time |
|---|---|---|
| nth element (unsorted) | n | n |
| partial sort | n | n + k log k |
| full heap | n | n + k log n |
| k-size heap | k | n log k |

(Which does emphasize that BinaryHeap::into_iter_sorted should only be used for not-known-ahead-of-time k.)
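
As an illustration of the "full heap" option, the sketch below heapifies the whole input and pops `k` times, which matches the `n + k log n` time and `n` space listed for it. Because `BinaryHeap::into_iter_sorted` was unstable at the time of this thread, the sketch uses `std::cmp::Reverse` and repeated `pop` instead; `k_smallest_full_heap` is a hypothetical name, not an itertools or std function.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical helper: "full heap" strategy, output in ascending order.
fn k_smallest_full_heap<T: Ord>(items: Vec<T>, k: usize) -> Vec<T> {
    // Wrap items in `Reverse` so std's max-heap behaves as a min-heap.
    let wrapped: Vec<Reverse<T>> = items.into_iter().map(Reverse).collect();
    // Building a heap from a Vec is a linear-time heapify.
    let mut heap = BinaryHeap::from(wrapped);
    let mut out = Vec::with_capacity(k.min(heap.len()));
    // Each pop costs O(log n); stop after k items or when the heap is empty.
    while out.len() < k {
        match heap.pop() {
            Some(Reverse(x)) => out.push(x),
            None => break,
        }
    }
    out
}
```

This shape suits callers who do not know `k` up front, since they can keep popping until they have seen enough.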

Contributor Author

@nbraud nbraud Sep 2, 2020

> It's absolutely possible if this only needs to find the k smallest elements and not also sort those

Agreed; that just seemed like a worse API to be exposing, as:

  • an undefined output order, which often happens to be sorted or almost sorted (like heap order), can be confusing for the user (I myself thought for a moment that BinaryHeap::into_vec() sorted the result, and was surprised by that, as the examples I was testing with happened to produce a sorted result);
  • sorting the end result doesn't change the asymptotic complexity, for any heap-based implementation.

However, if we expose something like a fixed-size heap — and I agree it's a good idea, if only for the .extend() type of use-cases you suggested — users who do not require a sorted result could use that directly.

> So there are plenty of options here:

You seem to be including partition-based algorithms — AFAIK, the only way to get better time complexity than O(n log k) — and those do not work directly for iterators; in principle, it's always possible to first collect everything into a Vec, use a partition-based algorithm, then truncate the Vec down to size k, but in practice I would expect this to be much slower.
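
The collect, partition, truncate route described above could look roughly like the sketch below. It assumes a toolchain where `slice::select_nth_unstable` is available (the stabilized name of what this thread calls `partition_at_index`); `k_smallest_by_selection` is a hypothetical name. The result is deliberately left unsorted, corresponding to the "nth element (unsorted)" row of the table.

```rust
/// Hypothetical helper: collects everything, then selects the k smallest.
/// The returned items are in no particular order.
fn k_smallest_by_selection<T: Ord, I: Iterator<Item = T>>(iter: I, k: usize) -> Vec<T> {
    // O(n) extra space: the whole input is buffered first.
    let mut v: Vec<T> = iter.collect();
    if k < v.len() {
        // After this call, every element before index k is <= v[k],
        // so v[..k] holds the k smallest elements.
        v.select_nth_unstable(k);
        v.truncate(k);
    }
    // If k >= v.len(), every element is part of the answer already.
    v
}
```

Calling `sort_unstable()` on the returned vector would give the "partial sort" row, adding the `k log k` term.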

Member

I think we should be careful if we really want sorted output, as this forces users to pay runtime for sorting - even if they don't need it.

If API discoverability is an issue, we could still have smallest_k_sorted vs. smallest_k_unsorted as a last resort, clarifying the difference.

Contributor

scottmcm commented Sep 2, 2020

I was thinking about this some more, and since the logic in here is so cool, I was wondering if it would make sense to expose it in more ways. Notably, it seems like if there were a type for the logic, it could implement Extend so that once you have the smallest k of some set, you can .extend(more_elements) and it'd be the combined smallest k of both sets.

Dunno if that's something that'd be appropriate for itertools, though...
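
A rough sketch of the `Extend` idea follows. The type name `KSmallest`, its fields, and its methods are all hypothetical and are not part of itertools or of this PR; the sketch only shows that a bounded max-heap can absorb further batches while keeping the combined smallest `k`.

```rust
use std::collections::BinaryHeap;

/// Hypothetical fixed-capacity collection holding the k smallest items seen.
pub struct KSmallest<T: Ord> {
    k: usize,
    heap: BinaryHeap<T>, // max-heap: the top is the largest retained item
}

impl<T: Ord> KSmallest<T> {
    pub fn new(k: usize) -> Self {
        KSmallest { k, heap: BinaryHeap::with_capacity(k) }
    }

    /// Consumes the collection and returns its items in ascending order.
    pub fn into_sorted_vec(self) -> Vec<T> {
        self.heap.into_sorted_vec()
    }
}

impl<T: Ord> Extend<T> for KSmallest<T> {
    fn extend<I: IntoIterator<Item = T>>(&mut self, iter: I) {
        for item in iter {
            if self.heap.len() < self.k {
                // Still below capacity: keep everything.
                self.heap.push(item);
            } else if let Some(mut top) = self.heap.peek_mut() {
                // At capacity: keep the item only if it beats the current largest.
                if item < *top {
                    *top = item;
                }
            }
        }
    }
}
```

For example, extending a `KSmallest::new(3)` with `[9, 4, 7, 1]` and then with `[8, 2, 6]` leaves `[1, 2, 4]` once sorted.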

Contributor Author

nbraud commented Sep 2, 2020

@scottmcm Do you think I should move making a fixed-size heap wrapper to a separate PR, and make this one depend on it?

Contributor

scottmcm commented Sep 2, 2020

@nbraud I don't know -- there's no clear precedent for exposing non-iterator types in itertools. So maybe it's best to just leave it as you have it until an itertools maintainer comments.

@jswrenn jswrenn added this to the next milestone Sep 3, 2020
src/k_smollest.rs (outdated)
src/k_smollest.rs (outdated)
src/lib.rs (outdated)
@nbraud nbraud requested a review from jswrenn October 2, 2020 10:18
Contributor Author

nbraud commented Oct 8, 2020

Ping? I addressed all feedback a week ago (but I'm not sure GitHub notifies on new pushes).

Member

@jswrenn jswrenn left a comment

Thanks for this! Sorry for the delay.

bors r+

Contributor

bors bot commented Dec 14, 2020

Build succeeded:

@bors bors bot merged commit 00756e0 into rust-itertools:master Dec 14, 2020
6 participants