Proposal: Linq extension HasDuplicates() and Duplicates() #30582
Comments
How much would these additions improve performance? Specifically, I would like to see some benchmarks comparing:
Also, can you explain the logic behind the behavior of …?
And why would this be better in corefx than in a library on NuGet?
Might be related: #23080 (Linq Permutation Cross and Multiple variable for Zip and Join)
var hasDup = list1.Cross(list2).Any((l, r) => l == r);
var dups = list1.Cross(list2).Where((l, r) => l == r);
I've found the need for this fairly frequently. It happens a lot when ingesting data from external sources. It's sometimes necessary to validate that the data has some unique key. For example:
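(The original example didn't survive extraction; the following is a minimal stand-in sketch. The Record type, its Id key, and the ValidateUniqueIds helper are illustrative, not from the issue.)

using System.Collections.Generic;
using System.IO;
using System.Linq;

record Record(int Id, string Name);

static class IngestValidation
{
    // Rejects an imported batch whose rows do not have a unique Id.
    public static void ValidateUniqueIds(IEnumerable<Record> records)
    {
        var ids = records.Select(r => r.Id).ToList();
        if (ids.Distinct().Count() != ids.Count)
            throw new InvalidDataException("Duplicate Id values found in imported data.");
    }
}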
This is also useful in assertions, in test and in production code. I'd call this method ….
To me, performance is a secondary concern here. But performance can certainly be made better by using a low-overhead internal set collection type. The public hash-table-based types have certain overheads to them in the name of good API design. There are also the usual optimizations, such as testing for ….
Obtaining duplicates is also a common thing in my experience, but I don't think this belongs in the framework. It is not common enough, and the API shape would vary a lot based on use case.
@GSPP From the look of your example, you don't really need the Select / Distinct combo, just Single/SingleOrDefault. This way you can throw on the second element if it's already present.
I've had to write similar code very frequently, although in most cases some kind of error reporting was required (e.g. needing to return the indices of the dupes as well). So I suspect a general-purpose duplicate detection method is probably not achievable, and many people would still have to roll their own implementation. Another potential implementation could be the following:
public static IEnumerable<T> Duplicates<T>(this IEnumerable<T> source, IEqualityComparer<T>? comparer = null)
{
    // Track the elements seen so far; Add returns false once an element repeats.
    var set = new HashSet<T>(comparer);
    foreach (var element in source)
    {
        if (!set.Add(element))
        {
            yield return element;
        }
    }
}

It has the added benefit of being able to detect duplicates using ….

Apropos, duplicates in F# are typically detected using its countBy function:
public static IEnumerable<KeyValuePair<TKey, int>> CountBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> selector, IEqualityComparer<TKey>? keyComparer = null)
{
    var dict = new Dictionary<TKey, int>(keyComparer);
    foreach (var element in source)
    {
        var key = selector(element);
        bool found = dict.TryGetValue(key, out int count);
        dict[key] = found ? count + 1 : 1;
    }
    return dict;
}

Duplicates can then be calculated by filtering the result for keys whose count is greater than one.
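For illustration, a usage sketch along those lines (assuming the CountBy extension above is in scope as an extension method; the sample data is made up):

var words = new[] { "pear", "apple", "pear", "plum", "apple", "pear" };
var duplicates = words
    .CountBy(w => w)                  // occurrences per distinct word
    .Where(kvp => kvp.Value > 1);     // keep only words seen more than once
foreach (var (word, count) in duplicates)
    Console.WriteLine($"{word}: {count}");   // e.g. "pear: 3", "apple: 2"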
@eiriktsarpalis CountBy proposal: #77716
I suggest naming it AllUnique() and AllUniqueBy() for better consistency with the existing API.
A common scenario that I have come across in projects, and have seen others work around, is identifying whether a set has duplicates, as well as getting the duplicate values/counts.
I've written extensions for projects that utilize the optimized internal Set class, doing a fast return from HasDuplicates() when the first duplicate object is hit, as well as adding an int field to the internal Slot class that tracks the number of times an object has been added (or an add has been attempted) to the Set.
This would improve performance over the most common way I have seen developers do this:
HasDuplicates
Current workaround:
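(The commenter's snippet didn't survive extraction; as a sketch, the common pattern being referred to is presumably something along these lines.)

public static bool HasDuplicates<T>(this IEnumerable<T> source)
{
    // Typical workaround: compare the de-duplicated count with the full count.
    // This enumerates the source twice and never short-circuits on an early duplicate.
    return source.Distinct().Count() != source.Count();
}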
Optimized method (using Set):
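(The original code isn't preserved either; since the internal System.Linq Set type isn't public, the sketch below uses HashSet&lt;T&gt; as a stand-in to illustrate the fast-return behaviour described.)

public static bool HasDuplicates<T>(this IEnumerable<T> source, IEqualityComparer<T>? comparer = null)
{
    var seen = new HashSet<T>(comparer);
    foreach (var element in source)
    {
        // Add returns false on the first repeat, so we can return immediately.
        if (!seen.Add(element))
            return true;
    }
    return false;
}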
Duplicates
Current workaround:
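(Again, the original snippet is missing; a GroupBy-based workaround like the following is what is commonly seen.)

public static IEnumerable<T> Duplicates<T>(this IEnumerable<T> source)
{
    // Groups the entire sequence up front, then keeps only keys that occur more than once.
    return source.GroupBy(x => x)
                 .Where(g => g.Count() > 1)
                 .Select(g => g.Key);
}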
Optimized method (using modified Set and Slot):
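(As a stand-in for the modified internal Set and Slot, which aren't public, here is a single-pass sketch that uses a Dictionary&lt;T, int&gt; for the per-element counts; the DuplicateCounts name is illustrative, not from the comment.)

public static IEnumerable<KeyValuePair<T, int>> DuplicateCounts<T>(this IEnumerable<T> source, IEqualityComparer<T>? comparer = null)
    where T : notnull
{
    // One pass to count occurrences per element.
    var counts = new Dictionary<T, int>(comparer);
    foreach (var element in source)
    {
        counts.TryGetValue(element, out int count);
        counts[element] = count + 1;
    }

    // Report only the elements that occurred more than once, together with their counts.
    foreach (var pair in counts)
    {
        if (pair.Value > 1)
            yield return pair;
    }
}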
New method to add to Set
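(The actual change to the internal Set&lt;TSource&gt; isn't shown in the thread. Purely as an illustration of the idea, here is a minimal stand-in set whose slots carry the proposed count, plus the kind of new member being described; all names are assumptions, not the real System.Linq internals.)

internal sealed class CountingSet<TSource> where TSource : notnull
{
    private struct Slot
    {
        internal TSource _value;
        internal int _count;   // times Add has been called for this value
    }

    private readonly Dictionary<TSource, int> _index = new(); // value -> slot position
    private Slot[] _slots = new Slot[4];
    private int _used;

    // Returns false when the value was already present; bumps the slot count either way.
    internal bool Add(TSource value)
    {
        if (_index.TryGetValue(value, out int i))
        {
            _slots[i]._count++;
            return false;
        }
        if (_used == _slots.Length)
            Array.Resize(ref _slots, _used * 2);
        _slots[_used] = new Slot { _value = value, _count = 1 };
        _index[value] = _used++;
        return true;
    }

    // The "new method": yields every value that was added more than once, with its count.
    internal IEnumerable<KeyValuePair<TSource, int>> GetDuplicates()
    {
        for (int i = 0; i < _used; i++)
        {
            if (_slots[i]._count > 1)
                yield return new KeyValuePair<TSource, int>(_slots[i]._value, _slots[i]._count);
        }
    }
}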
There would be a couple of other changes to the Add and Remove methods on Set<TSource>, as well as adding the internal int _count field to Slot<TSource>.