
Proposal: Linq extension HasDuplicates() and Duplicates() #30582

Open
dmcnaughton opened this issue Aug 13, 2019 · 9 comments
Labels
api-needs-work API needs work before it is approved, it is NOT ready for implementation area-System.Linq wishlist Issue we would like to prioritize, but we can't commit we will get to it yet
Milestone

Comments

@dmcnaughton

A common scenario I have come across in my own projects, and seen others work around, is identifying whether a sequence contains duplicates, as well as retrieving the duplicate values and their counts.

I've written extensions for projects that use the optimized internal Set class: HasDuplicates() returns early as soon as the first duplicate object is hit, and an int field added to the internal Slot class tracks how many times each object has been added (or attempted to be added) to the Set.

This would improve performance over the most common way I have seen developers do this:

HasDuplicates
Current workaround:

public static bool HasDuplicates<TSource>(this IEnumerable<TSource> source, IEqualityComparer<TSource> comparer){
    // Note: this enumerates the source twice and never short-circuits.
    return source.Count() != source.Distinct(comparer).Count();
}

Optimized method (using Set):

public static bool HasDuplicates<TSource>(this IEnumerable<TSource> source, IEqualityComparer<TSource> comparer){
    var set = new Set<TSource>(comparer);
    foreach (var element in source)
    {
        if (!set.Add(element))
        {
            return true;
        }
    }
    return false;
}

Duplicates
Current workaround:

public static IEnumerable<KeyValuePair<TSource, int>> Duplicates<TSource>(this IEnumerable<TSource> source, IEqualityComparer<TSource> comparer){
    return source.GroupBy(element => element, comparer)
                 .Select(x => new KeyValuePair<TSource, int>(x.Key, x.Count()))
                 .Where(x => x.Value > 1);
}

Optimized method (using modified Set and Slot):

public static IEnumerable<KeyValuePair<TSource, int>> Duplicates<TSource>(this IEnumerable<TSource> source, IEqualityComparer<TSource> comparer){
    var set = new Set<TSource>(comparer);
    foreach (var element in source)
    {
        set.Add(element);
    }
    return set.ToEnumerableWithCount();
}

New method to add to Set

public IEnumerable<KeyValuePair<TElement, int>> ToEnumerableWithCount(){
    return _slots.Where(x => x._count > 1).Select(x => new KeyValuePair<TElement, int>(x._value, x._count));
}

A couple of other changes would also be needed in the Add and Remove methods on Set<TSource>, as well as adding an internal int _count field to Slot<TSource>.

@svick
Contributor

svick commented Aug 13, 2019

How much would these additions improve performance? Specifically, I would like to see some benchmarks comparing:

  1. Your one-liner "workarounds".
  2. Still fairly simple ~10 line methods using HashSet<T> and Dictionary<T, int>.
  3. The proposed additions.

Also, can you explain the logic behind the behavior of Duplicates? In what situations would you want counts of all items, except those that appear only once in the collection? That doesn't sound like something that would be commonly useful.
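For reference, the "fairly simple" middle ground in point 2 might look like the following sketch built on the public HashSet<T> and Dictionary<TKey, TValue> types. This is illustrative code, not code from the thread; the method names mirror the proposal.

```csharp
using System;
using System.Collections.Generic;

public static class DuplicateExtensions
{
    // Short-circuits as soon as HashSet<T>.Add reports an existing element.
    public static bool HasDuplicates<TSource>(
        this IEnumerable<TSource> source, IEqualityComparer<TSource> comparer = null)
    {
        var seen = new HashSet<TSource>(comparer);
        foreach (var element in source)
        {
            if (!seen.Add(element))
                return true;
        }
        return false;
    }

    // Counts every element, then reports only those seen more than once.
    public static IEnumerable<KeyValuePair<TSource, int>> Duplicates<TSource>(
        this IEnumerable<TSource> source, IEqualityComparer<TSource> comparer = null)
    {
        var counts = new Dictionary<TSource, int>(comparer);
        foreach (var element in source)
        {
            counts.TryGetValue(element, out int count);
            counts[element] = count + 1;
        }
        foreach (var pair in counts)
        {
            if (pair.Value > 1)
                yield return pair;
        }
    }
}
```

Benchmarking these against the one-liners and the proposed internal Set changes would answer the performance question directly.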

@Wraith2
Contributor

Wraith2 commented Aug 13, 2019

And why would this be better in corefx than in a library on Nuget?

@Thaina

Thaina commented Aug 14, 2019

might be related #23080 Linq Permutation Cross and Multiple variable for Zip and Join

var hasDup = list1.Cross(list2).Any((l,r) => l == r);
var dups = list1.Cross(list2).Where((l,r) => l == r);

@GSPP

GSPP commented Sep 11, 2019

I found the need for this fairly frequently. It happens a lot when ingesting data from external sources. It's sometimes necessary to validate that the data has some unique key. For example:

if (!incomingData.Select(x => x.SomeKey).IsDistinct())
    throw ...;

This is also useful in assertions, both in tests and in production code.

I'd call this method IsDistinct.

To me, performance is a secondary concern here. But performance can certainly be better by using a low-overhead internal set collection type. The public hash table based types have certain overheads to them in the name of good API design.

There are also the usual optimizations such as testing for sequence as ICollection<T> c && c.Count == 0 and such. User code can do that but it's nice to have a high-quality implementation built-in.
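A minimal sketch of the IsDistinct idea, including the ICollection<T> fast path alluded to above (generalized here to `Count <= 1`, since a single-element collection is also trivially distinct), might look like this. The name IsDistinct is this comment's suggestion, not an existing API:

```csharp
using System;
using System.Collections.Generic;

public static class DistinctExtensions
{
    // True when no element appears more than once under the given comparer.
    public static bool IsDistinct<T>(
        this IEnumerable<T> source, IEqualityComparer<T> comparer = null)
    {
        // Fast path: 0 or 1 elements cannot contain a duplicate.
        if (source is ICollection<T> c && c.Count <= 1)
            return true;

        var seen = new HashSet<T>(comparer);
        foreach (var element in source)
        {
            if (!seen.Add(element))
                return false; // second occurrence found, stop enumerating
        }
        return true;
    }
}
```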

Obtaining duplicates is also a common thing in my experience but I don't think this belongs in the framework. It is not common enough and the API shape would vary a lot based on use case.

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the Future milestone Feb 1, 2020
@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@adamsitnik adamsitnik removed the untriaged New issue has not been triaged by the area owner label Sep 2, 2020
@NetMage

NetMage commented Sep 25, 2020

@GSPP Note: Based on testing in .Net Core 5.0 RC1, the internal Set class used by e.g. Distinct is slower than using HashSet I think due to optimizations ported from Dictionary. I opened an issue #42760 for this.

@En3Tho
Contributor

En3Tho commented Sep 27, 2020

@GSPP From the look of your example, you don't really need the Select/Distinct combo; Single or SingleOrDefault would do. That way you can throw as soon as a second element is present.

@eiriktsarpalis
Member

I've had to write similar code very frequently, although in most cases some kind of error reporting was required (e.g. needing to return the indices of the dupes as well). So I suspect a general-purpose duplicate detection method is probably not achievable, and many people would still have to roll their own implementation.

Another potential implementation could be the following:

public static IEnumerable<T> Duplicates<T>(this IEnumerable<T> source, IEqualityComparer<T>? comparer = null)
{
    var set = new HashSet<T>(comparer);
    foreach (var element in source)
    {
        if (!set.Add(element))
        {
            yield return element;
        }
    }    
}

It has the added benefit of being able to detect duplicates using source.Duplicates().Any() without needing to enumerate the entire source; however, you lose the frequency count information.

Apropos, duplicates in F# are typically detected using its CountBy function, which to my knowledge doesn't have an equivalent in LINQ. A naive implementation could look as follows:

public static IEnumerable<KeyValuePair<TKey, int>> CountBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> selector, IEqualityComparer<TKey>? keyComparer = null)
{
    var dict = new Dictionary<TKey, int>(keyComparer);
    foreach (var element in source)
    {
        var key = selector(element);
        bool found = dict.TryGetValue(key, out int count);
        dict[key] = found ? count + 1 : 1;
    }

    return dict;
}

Duplicates can then be calculated using source.CountBy(x => x).Where(kvp => kvp.Value > 1).
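To make that concrete, here is a self-contained usage sketch. The CountBy below is a copy of the naive implementation above so the example compiles on its own; the input data is invented for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CountByExample
{
    // Copy of the naive CountBy sketch, for a self-contained example.
    public static IEnumerable<KeyValuePair<TKey, int>> CountBy<TSource, TKey>(
        this IEnumerable<TSource> source, Func<TSource, TKey> selector,
        IEqualityComparer<TKey> keyComparer = null)
    {
        var dict = new Dictionary<TKey, int>(keyComparer);
        foreach (var element in source)
        {
            var key = selector(element);
            dict.TryGetValue(key, out int count);
            dict[key] = count + 1;
        }
        return dict;
    }

    public static void Main()
    {
        var input = new[] { "a", "b", "a", "c", "a", "b" };

        // Keep only keys that occur more than once; order for stable output.
        var duplicates = input
            .CountBy(x => x)
            .Where(kvp => kvp.Value > 1)
            .OrderByDescending(kvp => kvp.Value)
            .ToList();

        Console.WriteLine(string.Join(", ",
            duplicates.Select(kvp => $"{kvp.Key}:{kvp.Value}")));
        // prints "a:3, b:2"
    }
}
```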

@eiriktsarpalis eiriktsarpalis added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Nov 26, 2020
@eiriktsarpalis eiriktsarpalis removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Jan 13, 2021
@eiriktsarpalis eiriktsarpalis added api-needs-work API needs work before it is approved, it is NOT ready for implementation wishlist Issue we would like to prioritize, but we can't commit we will get to it yet and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Nov 1, 2021
@NN---
Contributor

NN--- commented Nov 1, 2022

@eiriktsarpalis CountBy proposal: #77716

@ImoutoChan

I suggest naming these AllUnique() and AllUniqueBy() for better consistency with the existing API.
