
Proposal: Linq extension HasDuplicates() and Duplicates() #30582

Open
dmcnaughton opened this issue Aug 13, 2019 · 9 comments
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.Linq wishlist Issue we would like to prioritize, but we can't commit we will get to it yet
Milestone

Comments

@dmcnaughton

A common scenario I have come across in my own projects, and seen others work around, is identifying whether a sequence contains duplicates, as well as retrieving the duplicate values and their counts.

I've written extensions for projects that use the optimized internal Set class: HasDuplicates() returns early as soon as the first duplicate object is hit, and an int field added to the internal Slot class tracks how many times each object has been added (or attempted to be added) to the Set.

This would improve performance over the most common way I have seen developers do this:

HasDuplicates
Current workaround:

public static bool HasDuplicates<TSource>(this IEnumerable<TSource> source, IEqualityComparer<TSource> comparer){
    // Note: this enumerates the source twice and never short-circuits.
    return source.Count() != source.Distinct(comparer).Count();
}

Optimized method (using Set):

public static bool HasDuplicates<TSource>(this IEnumerable<TSource> source, IEqualityComparer<TSource> comparer){
    var set = new Set<TSource>(comparer);
    foreach (var element in source)
    {
        if (!set.Add(element))
        {
            return true;
        }
    }
    return false;
}

Duplicates
Current workaround:

public static IEnumerable<KeyValuePair<TSource, int>> Duplicates<TSource>(this IEnumerable<TSource> source, IEqualityComparer<TSource> comparer){
    return source.GroupBy(element => element, comparer)
                 .Select(x => new KeyValuePair<TSource, int>(x.Key, x.Count()))
                 .Where(x => x.Value > 1);
}

Optimized method (using modified Set and Slot):

public static IEnumerable<KeyValuePair<TSource, int>> Duplicates<TSource>(this IEnumerable<TSource> source, IEqualityComparer<TSource> comparer){
    var set = new Set<TSource>(comparer);
    foreach (var element in source)
    {
        set.Add(element);
    }
    return set.ToEnumerableWithCount();
}

New method to add to Set

public IEnumerable<KeyValuePair<TElement, int>> ToEnumerableWithCount(){
    return _slots.Where(x => x._count > 1).Select(x => new KeyValuePair<TElement, int>(x._value, x._count));
}

A couple of other changes would also be needed in the Add and Remove methods on Set<TSource>, as well as adding an internal int _count field to Slot<TSource>.

@svick
Contributor

svick commented Aug 13, 2019

How much would these additions improve performance? Specifically, I would like to see some benchmarks comparing:

  1. Your one-liner "workarounds".
  2. Still fairly simple ~10 line methods using HashSet<T> and Dictionary<T, int>.
  3. The proposed additions.

Also, can you explain the logic behind the behavior of Duplicates? In what situations would you want counts of all items, except those that appear only once in the collection? That doesn't sound like something that would be commonly useful.
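For reference, the "fairly simple" middle ground in point 2 might look like the following sketch built on the public HashSet<T> and Dictionary<TKey, TValue> types. This is illustrative code, not code from the thread; the method names mirror the proposal.

```csharp
using System;
using System.Collections.Generic;

public static class DuplicateExtensions
{
    // Short-circuits as soon as HashSet<T>.Add reports an existing element.
    public static bool HasDuplicates<TSource>(
        this IEnumerable<TSource> source, IEqualityComparer<TSource> comparer = null)
    {
        var seen = new HashSet<TSource>(comparer);
        foreach (var element in source)
        {
            if (!seen.Add(element))
                return true;
        }
        return false;
    }

    // Counts every element, then reports only those seen more than once.
    public static IEnumerable<KeyValuePair<TSource, int>> Duplicates<TSource>(
        this IEnumerable<TSource> source, IEqualityComparer<TSource> comparer = null)
    {
        var counts = new Dictionary<TSource, int>(comparer);
        foreach (var element in source)
        {
            counts.TryGetValue(element, out int count);
            counts[element] = count + 1;
        }
        foreach (var pair in counts)
        {
            if (pair.Value > 1)
                yield return pair;
        }
    }
}
```

Benchmarking these against the one-liners and the proposed internal Set changes would answer the performance question directly.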

@Wraith2
Contributor

Wraith2 commented Aug 13, 2019

And why would this be better in corefx than in a library on Nuget?

@Thaina

Thaina commented Aug 14, 2019

might be related #23080 Linq Permutation Cross and Multiple variable for Zip and Join

var hasDup = list1.Cross(list2).Any((l,r) => l == r);
var dups = list1.Cross(list2).Where((l,r) => l == r);

@GSPP

GSPP commented Sep 11, 2019

I found the need for this fairly frequently. It happens a lot when ingesting data from external sources. It's sometimes necessary to validate that the data has some unique key. For example:

if (!incomingData.Select(x => x.SomeKey).IsDistinct())
    throw ...;

This is also useful in assertions, both in tests and in production code.

I'd call this method IsDistinct.

To me, performance is a secondary concern here. But performance can certainly be better by using a low-overhead internal set collection type. The public hash table based types have certain overheads to them in the name of good API design.

There are also the usual optimizations such as testing for sequence as ICollection<T> c && c.Count == 0 and such. User code can do that but it's nice to have a high-quality implementation built-in.
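A minimal sketch of the IsDistinct idea, including the ICollection<T> fast path alluded to above (generalized here to `Count <= 1`, since a single-element collection is also trivially distinct), might look like this. The name IsDistinct is this comment's suggestion, not an existing API:

```csharp
using System;
using System.Collections.Generic;

public static class DistinctExtensions
{
    // True when no element appears more than once under the given comparer.
    public static bool IsDistinct<T>(
        this IEnumerable<T> source, IEqualityComparer<T> comparer = null)
    {
        // Fast path: 0 or 1 elements cannot contain a duplicate.
        if (source is ICollection<T> c && c.Count <= 1)
            return true;

        var seen = new HashSet<T>(comparer);
        foreach (var element in source)
        {
            if (!seen.Add(element))
                return false; // second occurrence found, stop enumerating
        }
        return true;
    }
}
```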

Obtaining duplicates is also a common thing in my experience but I don't think this belongs in the framework. It is not common enough and the API shape would vary a lot based on use case.

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the Future milestone Feb 1, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@adamsitnik adamsitnik removed the untriaged New issue has not been triaged by the area owner label Sep 2, 2020
@NetMage

NetMage commented Sep 25, 2020

@GSPP Note: Based on testing in .Net Core 5.0 RC1, the internal Set class used by e.g. Distinct is slower than using HashSet I think due to optimizations ported from Dictionary. I opened an issue #42760 for this.

@En3Tho
Contributor

En3Tho commented Sep 27, 2020

@GSPP From the look of your example, you don't really need the Select/Distinct combo; Single or SingleOrDefault would do. That way you can throw as soon as a second element is present.

@eiriktsarpalis
Member

I've had to write similar code very frequently, although in most cases some kind of error reporting was required (e.g. needing to return the indices of the dupes as well). So I suspect a general-purpose duplicate detection method is probably not achievable, and many people would still have to roll their own implementation.

Another potential implementation could be the following:

public static IEnumerable<T> Duplicates<T>(this IEnumerable<T> source, IEqualityComparer<T>? comparer = null)
{
    var set = new HashSet<T>(comparer);
    foreach (var element in source)
    {
        if (!set.Add(element))
        {
            yield return element;
        }
    }    
}

It has the added benefit of being able to detect duplicates using source.Duplicates().Any() without needing to enumerate the entire source; however, you lose the frequency count information.

Apropos, duplicates in F# are typically detected using its CountBy function, which to my knowledge doesn't have an equivalent in LINQ. A naive implementation could look as follows:

public static IEnumerable<KeyValuePair<TKey, int>> CountBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> selector, IEqualityComparer<TKey>? keyComparer = null)
{
    var dict = new Dictionary<TKey, int>(keyComparer);
    foreach (var element in source)
    {
        var key = selector(element);
        bool found = dict.TryGetValue(key, out int count);
        dict[key] = found ? count + 1 : 1;
    }

    return dict;
}

Duplicates can then be calculated using source.CountBy(x => x).Where(kvp => kvp.Value > 1).
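To make that concrete, here is a self-contained usage sketch. The CountBy below is a copy of the naive implementation above so the example compiles on its own; the input data is invented for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CountByExample
{
    // Copy of the naive CountBy sketch, for a self-contained example.
    public static IEnumerable<KeyValuePair<TKey, int>> CountBy<TSource, TKey>(
        this IEnumerable<TSource> source, Func<TSource, TKey> selector,
        IEqualityComparer<TKey> keyComparer = null)
    {
        var dict = new Dictionary<TKey, int>(keyComparer);
        foreach (var element in source)
        {
            var key = selector(element);
            dict.TryGetValue(key, out int count);
            dict[key] = count + 1;
        }
        return dict;
    }

    public static void Main()
    {
        var input = new[] { "a", "b", "a", "c", "a", "b" };

        // Keep only keys that occur more than once; order for stable output.
        var duplicates = input
            .CountBy(x => x)
            .Where(kvp => kvp.Value > 1)
            .OrderByDescending(kvp => kvp.Value)
            .ToList();

        Console.WriteLine(string.Join(", ",
            duplicates.Select(kvp => $"{kvp.Key}:{kvp.Value}")));
        // prints "a:3, b:2"
    }
}
```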

@eiriktsarpalis eiriktsarpalis added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Nov 26, 2020
@eiriktsarpalis eiriktsarpalis removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Jan 13, 2021
@eiriktsarpalis eiriktsarpalis added api-needs-work API needs work before it is approved, it is NOT ready for implementation wishlist Issue we would like to prioritize, but we can't commit we will get to it yet and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Nov 1, 2021
@NN---
Contributor

NN--- commented Nov 1, 2022

@eiriktsarpalis CountBy proposal: #77716

@ImoutoChan

I suggest naming these AllUnique() and AllUniqueBy() for better consistency with the existing API.
