Tracking issue for `slice::partition_dedup/by/by_key` #54279

Kerollmops · 2018-09-16T19:54:41Z

Add three methods to the slice type, the partition_dedup, partition_dedup_by and partition_dedup_by_key. The two other methods are based on slice::partition_dedup_by.

Steps / History

Implement the feature as a PR (Introduce the partition_dedup/by/by_key methods for slices #54058)
Stabilization PR (see instructions on forge)

Unresolved Questions

Should the methods only return one slice instead of a tuple with two slices? (comment)
Annotate it with #[must_use]?
slice::sort_by_key has more restrictions than slice::sort_by #34162

The text was updated successfully, but these errors were encountered:

SimonSapin · 2020-01-09T09:42:40Z

This issue tracks:

impl<T> [T] {
    pub fn partition_dedup(&mut self) -> (&mut [T], &mut [T])
    where
        T: PartialEq,
    {
        self.partition_dedup_by(|a, b| a == b)
    }

    pub fn partition_dedup_by<F>(&mut self, mut same_bucket: F) -> (&mut [T], &mut [T])
    where
        F: FnMut(&mut T, &mut T) -> bool,
    {…}

    pub fn partition_dedup_by_key<K, F>(&mut self, mut key: F) -> (&mut [T], &mut [T])
    where
        F: FnMut(&mut T) -> K,
        K: PartialEq,
    {
        self.partition_dedup_by(|a, b| key(a) == key(b))
    }
}

The doc-comment of partition_dedup_by (the others are similar) is:

Moves all but the first of consecutive elements to the end of the slice satisfying a given equality relation.

Returns two slices. The first contains no consecutive repeated elements. The second contains all the duplicates in no specified order.

The same_bucket function is passed references to two elements from the slice and must determine if the elements compare equal. The elements are passed in opposite order from their order in the slice, so if same_bucket(a, b) returns true, a is moved at the end of the slice.

If the slice is sorted, the first returned slice contains no duplicates.

Examples
#![feature(slice_partition_dedup)]

let mut slice = ["foo", "Foo", "BAZ", "Bar", "bar", "baz", "BAZ"];

let (dedup, duplicates) = slice.partition_dedup_by(|a, b| a.eq_ignore_ascii_case(b));

assert_eq!(dedup, ["foo", "BAZ", "Bar", "baz"]);
assert_eq!(duplicates, ["bar", "Foo", "BAZ"]);

From the implementation PR:

The benefits of adding these methods is that it is now possible to:

deduplicate a slice without having to allocate and possibly clone elements on the heap, really useful for embedded stuff that can't allocate for example.

not loose duplicate elements, because, when using Vec::dedup, duplicates elements are dropped. These methods add more flexibillity to the user.

SimonSapin · 2020-01-09T09:55:06Z

These methods feel very niche to me (given we already have the Vec::dedup family of methods), similar to some things we currently have in https://crates.io/crates/itertools rather then std::iter. Still, multiple people were involved in the implementation PR and were evidently happy enough to land it. So I’d be inclined to propose FCP to stabilize.

@rust-lang/libs, any thoughts?

Lokathor · 2020-01-09T09:59:05Z

Note that Vec::dedup is actually implemented in terms of these methods, so it's very important to anyone trying to implement vec-like replacements that these become available on stable.

LukasKalbertodt · 2020-01-09T10:25:25Z

@leonardo-m brought up the question of whether the second element of the returned tuple is even used. If not, it would simplify the API to return only one slice instead of a tuple. So it would be really useful to know how people use this feature. If it's merely used as a way to dedup without allocation, then yes, the first slice is sufficient. All usages I was able to find via a quick GH search indeed only used the first returned slice.

But as I agree this is pretty niche, I think it's fine if the API is a bit more complicated. Niche APIs have a stronger focus on being powerful than being easy to use, I think.

My initial thought also was that this API is too niche and shouldn't be included, but apparently it is used a lot (even excluding Rust forks, there are quite a few projects using it on the first few pages of the search results). So I'd be fine with stabilizing this as soon as the return type question is resolved.

SimonSapin · 2020-01-09T11:50:29Z

As far as I can tell if the method returns one slice the other behavior can still be achieved by only keeping the length of that returned slice and then using split_at_mut on the original slice.

Kerollmops · 2020-01-09T12:12:54Z

I thought that it would be great to also add the #[must_use] attribute to these functions.

kennytm · 2020-01-09T12:13:46Z

The link above gives ~1500 results. Excluding the word "slice_internals" and "fmt_internals" and forcing the language to be Rust leaves only 12 results. So 99% of this "lot" are very likely just rustc clones.

A more accurate search would be:

https://github.com/search?l=Rust&q=partition_dedup+NOT+slice_internals+NOT+fmt_internals+NOT+test_binary_search_implementation_details&type=Code (12 results)
https://github.com/search?l=Rust&q=partition_dedup_by+NOT+slice_internals+NOT+fmt_internals+NOT+test_binary_search_implementation_details+NOT+vec_intoiter_debug&type=Code (10 results)
https://github.com/search?l=Rust&q=partition_dedup_by_key+NOT+slice_internals+NOT+fmt_internals+NOT+vec_intoiter_debug&type=Code (1 result)

Only 3 examples actually try to use the second element:

https://github.com/treuherz/AdventOfCode/blob/152da4d32c4d6a258ec669f2fc175db99a4e1df8/18/rs/src/bin/2.rs#L66-L84 uses the second part of partition_dedup() to locate the first duplicated element.

    let (_, dupes) = working.partition_dedup();
    if !dupes.is_empty() {
        assert_eq!(dupes.len(), 1);
        return dupes.first().unwrap().to_string();
    }

but I don't think it is a good example, as it's easier to just write

    if let Some(first_dupe) = working.windows(2).find(|(p, q)| p == q) {
        return first_dupe.to_string();
    }

https://github.com/sachaarbonel/poker.rs/blob/1f4ef45872307b1a3c40795ff5b125b6d82cd68f/src/bin.rs uses the second part to determine if the poker hand has pairs/triples/4 of a kind. But the message already shows how unreliable it is for the purpose.
```
let (dedup, _dup) = numbers.partition_dedup();
println!("dedup {:?} dup {:?}",dedup, _dup );
match _dup.len() {
     0 => println!("no pair"),
     1 => println!("one pair"),
     2 => println!("three of a kind or two pairs, do an other match"),
     3 => println!("four of a kind"),
     _ => println!("not handled"),
}
```
The actual implementation in https://github.com/sachaarbonel/poker.rs/blob/1f4ef45872307b1a3c40795ff5b125b6d82cd68f/src/hand/mod.rs uses two calls of partition_dedup to distinguish two-pair (AA22) vs triples (AAA) and four-of-a-kind (AAAA) vs full-house (AAA22).

IMO partition_dedup is far from useful as a poker hand classifier since it can't handle flushes or straights.

https://github.com/smmalis37/aoc2019/blob/c1015c7d2bba8075176650ac84f4f3c53b63e1d3/src/days/day10.rs#L86-L101 Probably the only legit use case though I can't quite figure out what it's trying to do.

loop {
    subangles.sort_unstable_by_key(|ad| ad.1);
    subangles.sort_by_key(|ad| ad.0);
    let (left, right) =
        subangles.partition_dedup_by(|&mut ad1, &mut ad2| float_equals(ad1.0, ad2.0));
    if seek < left.len() {
        let (angle, distance) = left[seek];
        let coord = angle_distance_to_coord(part1_coord, angle, distance);
        return (coord, coord.x * 100 + coord.y);
    } else {
        seek -= left.len();
        subangles = right;
    }
}

Conclusion: The second element is extremely niche, and encourages bad algorithm. As mentioned in #54279 (comment), the original slice is not destroyed, so the two-element version can be easily reimplemented by the one-element version using split_at_mut (heck, the current implementation also uses split_at_mut):

fn partition_dedup(&mut self) -> (&mut Self, &mut Self) {
    let p = self.dedup_in_place().len();
    self.split_at_mut(p)
}

There is no reason to keep the second element from the return type.

LukasKalbertodt · 2020-01-09T19:09:02Z

@kennytm Wow, thanks for the research! I clearly didn't put enough time into my comment and was indeed fooled by my search.

In case we change the return type to only return the first part of the slice, maybe we should also think about renaming the methods. Shepmaster suggested the current name partially because of the 2-tuple return type. I don't mind the current name (the algorithm still does some kind of partitioning), but another name might be more appropriate: the return type is different than Iterator::partition and as we seem to have established, almost no one is interested in the second bucket. What about dedup_in_place?

Lokathor · 2020-01-09T19:43:01Z

Note: The side effect of this method is that the input slice is swapped all around on success so that the initial span of the spice is the deduped portion. Thus, must_use would be inappropriate.

smmalis37 · 2020-02-08T10:11:29Z

Oh hey I've been mentioned. Yeah that use case is very much artificial, it was part of the Advent of Code, a series of puzzles. The puzzle for day 10 can be found at https://adventofcode.com/2019/day/10, although you'd need to solve part 1 before it reveals part 2 which this was part of. However as you said i could easily just use split_at_mut and be fine.

dtolnay · 2020-02-08T22:50:56Z

I would be inclined to remove these methods. This is an uncommon use case. The algorithm can be implemented in ~20 lines of safe code if you don't mind the bounds checks. I think the amount of benefit we'd deliver by stabilizing these is much less than what it costs to put these fairly complicated names and signatures in front of everyone in the documentation of slices.

smmalis37 · 2020-02-08T23:07:18Z

I'd vote against removing these methods. Even if their use is uncommon they are genuinely useful. They also fit with other dedup* methods that are already in std.

Lokathor · 2020-02-09T01:08:44Z

As I said, those are implemented in terms of these, so these cannot be removed entirely.

Users of Vec already see this complexity, in Stable, there's no reason that it can't be available to slices too.

dtolnay · 2020-02-09T19:26:02Z

The ones on Vec are simpler signatures, simpler names, and simpler behavior to understand, and serve a somewhat more common use case. So I don't find that those necessarily provide justification for the new slice methods.

Regarding "cannot be removed entirely": they can be removed entirely from the API of the standard library.

shepmaster · 2020-02-09T21:30:24Z

I would be inclined to remove these methods.

Are you talking about only the "complicated" version:

fn partition_dedup(&mut self) -> (&mut [T], &mut [T])

or also the "simplified" version

fn partition_dedup(&mut self) -> &mut [T]

Lokathor · 2020-02-09T22:11:27Z

I would like to see them put on normal slices so that other crates with vec-like APIs can simply use the slice versions.

But i'm fine with the simple version, since that's what Vec offers.

dtolnay · 2020-02-09T22:17:31Z

If the intended audience is crates that provide a vec-like API, I really don't think that justifies adding these to slice. It's an uncommon use case and they can write the 20 lines themselves. I don't think providing these methods makes a dent in the actual hard parts of designing and implementing a vec-like type downstream.

Lokathor · 2020-02-09T23:15:59Z

Or just anyone using slices who wants to be able to do this.

Unless your argument is that adding this in the first place to Vec was itself a mistake, then the same methods should be on slices too (the simplified versions that is).

Joey9801 · 2020-02-10T22:41:10Z

FWIW I found a partition_dedup method useful just now when completing an advent-of-code problem, for stably rearranging of a slice like [a, a, a, b, b, c, c, c] into [a, b, c, a, b, c, a, c]. (I'm sure there's a word for that rearrangement, but can't think of it right now...)

I know it's not a significant use case, but it was convenient to have it available to me (albeit only in nightly), and I did make use of both returned slices. It allowed me to express the rearrangement in only couple of lines of relatively simple code that was O(N) time / constant memory overhead.

Ignore me, I failed to read the documentation properly:

The second contains all the duplicates in no specified order.

Indeed, the relative order of the duplicates is not preserved, and I just got lucky on the AoC problem :(

Kerollmops · 2020-04-24T15:21:47Z

Hey @SimonSapin,

Do you think that if I update the function name to be dedup_in_place and the signature to return a single mutable slice it could be possible to merge this PR soon or later?

MarinPostma · 2020-04-24T16:34:36Z

I think dedup_in_place should return a tuple. The reason is that you'd expect it to work like it's Vec counterpart, whereas it doesn't, and it would lead to potentially hard to debug bugs:

//this is ok to do with vec:
let vec = vec![1, 1, 2, 3];
vec.dedup()
// can use vec, no problem
// same case with slices
let mut slice = [1, 1, 2, 3];
slice.dedup_in_place();
// uses slice, oops
// or worst:
let mut slice = [1, 1, 2, 3];
let new = slice.dedup_in_place();
//  one may use slice, ignore new and refer to “removed” elements

returning a tuple will prevent the later, making it #[must_use] shall prevent the former

kennytm · 2020-04-24T21:03:25Z

let mut slice = [1, 1, 2, 3];
let new = slice.dedup_in_place();
//  one may use slice, ignore new and refer to “removed” elements

You can't ignore new without also ignoring the unused_variables warning.

MarinPostma · 2020-04-25T00:08:16Z

first case holds

kennytm · 2020-04-25T00:52:03Z

which suggests #[must_use] is enough.

and returning a tuple prevents none of the issues raised in your sample code anyway.

MarinPostma · 2020-04-25T01:16:55Z

Fair enough. I am just concerned with the potential misuse of this function, but #[must_use] should be enough.

Robbepop · 2020-08-28T14:29:47Z

One of my projects could definitely profit from having these functions stabilized.
To answer the above questions from my point of view as potential user:

Should the methods only return one slice instead of a tuple with two slices? (comment)

For my personal use case I could use the second returned value but infering all information I need can also be done by only the first without overhead.
The main use of these functions is for constrained or performance critical code. If we can infer all information from just the deduplicated slice without noticable overhead we do not need the second slice returned imo.

Annotate it with #[must_use]?

As said earlier in this discussion I do not see a value in putting #[must_use] here since its side effects of scrambling the underlying slice in a certain way could already be useful to some users. Could we not stabilize it with #[must_use] and later remove it if we find use cases that would profit from having it removed? Should be a backwards compatible change, right? So if we stabilize this with #[must_use] we simply delay decisions for or against it. Does not need to be answered in order to stabilize this.

I see a clear value in having those definitions available since they are 20 lines of code that I do not have to write myself and that I can simply trust in my code base to be correctly implemented. If I were to write them myself I would maybe even end up with an unsafe implementation (after all it needs to be efficient) and beyond those 20 lines of mentioned code there is sitting technical doubt and thus a lot of test code.

Stargateur · 2021-08-23T09:29:47Z

If we choice to only return the dedup part, I believe we should rename dedup_in_place or something like that

clarfonthey · 2022-04-16T15:49:17Z

So, I was very confused that this method is the only one with "partition" in the name for slice when it only does deduplication, only to find that the actual partition method is called partition_in_place and is on Iterator (#62543).

The naming of these methods is a bit confusing, and I would be in favour of removing the "partition" terminology to help avoid that confusion.

emmyoh · 2023-01-11T21:03:44Z

Wanted to chime in and say that, for my purposes, having it return two vectors has been very useful (both the deduplicated result, and the items that were removed). If the opinion goes in the opposite direction (to only return the deduplicated result, not the removed duplicates), how would I go about getting a vector of removed duplicates? Sorry if that's a silly question.

safinaskar · 2023-02-09T12:20:47Z

@Kerollmops , it seems I found bug.

I usually use partition_dedup_by_key to merely assert that (presumably sorted) vector doesn't have duplicates.

I. e. my code usually looks like this: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=ee5704261d8bcb084b30e8a7de2b28ff . So far, so good.

But if I change type of key to String, problems appear. Now partition_dedup_by_key(|x|x.key) doesn't compile, and partition_dedup_by_key(|x|&x.key) doesn't compile, either. Here is two playgrounds:

https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=fee945c939074d4d1373f3a52e1b2c8c
https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=9a3ff4934bd45b910b3e925fbfdac875

But surprisingly partition_dedup_by works! Look: https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=8556cc139a6f75c575dd0d5300211120

So, it seems that wrapper partition_dedup_by_key has wrong prototype. Same applies to sort_by_key

Stargateur · 2023-02-09T13:53:10Z

@safinaskar it's by design not a bug, are you asking these method should return a reference like F: FnMut(&mut T) -> &K, ? I don't think it's possible. If we remove this &mut is should be possible.

PS: F: FnMut(&mut T) -> K, Thus the fact it's give you a mutable reference to T is weird for me.

TrolledWoods · 2023-03-27T19:58:02Z

A bit of a pedantic observation, but the sort_by_key method has T: FnMut(&T) -> K, should that be matched or should &mut T be handed out because it can be?

That would still not solve the "bug" of course, and I don't think that bug can really be solved. Sure, you could make the signature F: FnMut(&mut T) -> &K (since they are separate elements even a mutable reference should be fine for this), but this would essentially require that K is a field of T, which makes the method very limited. And it wouldn't match the signature for the other _by_key methods.

safinaskar · 2023-04-06T13:22:09Z

Let me say another way. In my code I often want to dedup vectors of structs, which have String fields. Most of the time I just need to verify that there is no duplicates (i. e. what I really need in most of cases is is_dedupped).

Unfortunately the only way I can do this is to clone() this strings. This is slow. See here:

https://play.rust-lang.org/?version=nightly&mode=debug&edition=2021&gist=7d753d1967e7413d1fa38d3727acab85

So I want some API that works in this case without cloning. Or at least which give ability to verify is_dedupped

Stargateur · 2023-04-06T18:01:45Z

@safinaskar

#![feature(slice_partition_dedup)]
fn main() {
    struct Foo {
        f: String,
    }
    let mut a: Vec<Foo> = vec![];
    a.partition_dedup_by(|a, b| a.f == b.f);
}

safinaskar · 2023-04-06T19:23:52Z

@Stargateur , I don't like this solution. I'm big fan of "don't repeat yourself" principle. But, yes, this will go as pragmatic solution

Jules-Bertholet · 2023-09-26T20:49:20Z

The issue with the by_key functions is tracked in #34162.

c410-f3r · 2024-01-01T14:06:07Z

What is the status of this feature?

varkor mentioned this issue Sep 16, 2018

Tracking issue for sorting by expensive-to-compute keys (feature slice_sort_by_cached_key) #34447

Closed

csmoe added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. labels Sep 17, 2018

jonas-schievink added the B-unstable Blocker: Implemented in the nightly compiler and unstable. label Nov 26, 2019

MarinPostma mentioned this issue May 3, 2020

Tracking Issue for slice::retain #71831

Closed

shepmaster mentioned this issue May 3, 2020

Slice retain #71832

Closed

KodrAus added A-slice Area: `[T]` Libs-Small Libs issues that are considered "small" or self-contained Libs-Tracked Libs issues that are tracked on the team's project board. labels Jul 29, 2020

clarfonthey mentioned this issue Apr 16, 2022

Tracking issue for Iterator::partition_in_place #62543

Open

safinaskar mentioned this issue Apr 6, 2023

Add dedup, dedup_by and dedup_by_key to the Iterator trait #83748

Closed

Tracking issue for slice::partition_dedup/by/by_key #54279

Tracking issue for slice::partition_dedup/by/by_key #54279

Comments

Kerollmops commented Sep 16, 2018 • edited by Dylan-DPC Loading

Steps / History

Unresolved Questions

SimonSapin commented Jan 9, 2020

Examples

SimonSapin commented Jan 9, 2020

Lokathor commented Jan 9, 2020

LukasKalbertodt commented Jan 9, 2020

SimonSapin commented Jan 9, 2020

Kerollmops commented Jan 9, 2020

kennytm commented Jan 9, 2020 • edited Loading

LukasKalbertodt commented Jan 9, 2020

Lokathor commented Jan 9, 2020 • edited Loading

smmalis37 commented Feb 8, 2020

dtolnay commented Feb 8, 2020

smmalis37 commented Feb 8, 2020

Lokathor commented Feb 9, 2020

dtolnay commented Feb 9, 2020

shepmaster commented Feb 9, 2020

Lokathor commented Feb 9, 2020

dtolnay commented Feb 9, 2020

Lokathor commented Feb 9, 2020

Joey9801 commented Feb 10, 2020 • edited Loading

Kerollmops commented Apr 24, 2020

MarinPostma commented Apr 24, 2020 • edited Loading

kennytm commented Apr 24, 2020

MarinPostma commented Apr 25, 2020

kennytm commented Apr 25, 2020 • edited Loading

MarinPostma commented Apr 25, 2020

Robbepop commented Aug 28, 2020 • edited Loading

Stargateur commented Aug 23, 2021 • edited Loading

clarfonthey commented Apr 16, 2022

emmyoh commented Jan 11, 2023

safinaskar commented Feb 9, 2023

Stargateur commented Feb 9, 2023 • edited Loading

TrolledWoods commented Mar 27, 2023 • edited Loading

safinaskar commented Apr 6, 2023

Stargateur commented Apr 6, 2023

safinaskar commented Apr 6, 2023

Jules-Bertholet commented Sep 26, 2023

c410-f3r commented Jan 1, 2024

Tracking issue for `slice::partition_dedup/by/by_key` #54279

Tracking issue for `slice::partition_dedup/by/by_key` #54279

Kerollmops commented Sep 16, 2018 •

edited by Dylan-DPC

Loading

kennytm commented Jan 9, 2020 •

edited

Loading

Lokathor commented Jan 9, 2020 •

edited

Loading

Joey9801 commented Feb 10, 2020 •

edited

Loading

MarinPostma commented Apr 24, 2020 •

edited

Loading

kennytm commented Apr 25, 2020 •

edited

Loading

Robbepop commented Aug 28, 2020 •

edited

Loading

Stargateur commented Aug 23, 2021 •

edited

Loading

Stargateur commented Feb 9, 2023 •

edited

Loading

TrolledWoods commented Mar 27, 2023 •

edited

Loading