Introduce Duplicates extension #1037
with unit test and documentations
@atifaziz I deleted my previous branch and created a new one. But for some reason the NullArgumentTest are giving me errors saying that keySelector and source do not have any null check when they do.
But for some reason the NullArgumentTest are giving me errors saying that keySelector and source do not have any null check when they do.
I believe this is because your argument validation isn't happening at the time Duplicates is called, but when the resulting sequence, which is lazy, is evaluated/iterated. There's discussion of this in the C# documentation on local functions.
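To illustrate the point about deferred validation, here is a minimal sketch, not the code from this PR. The method names `DuplicatesLazyCheck` and `DuplicatesEagerCheck` are hypothetical, and the duplicate-detection logic is only an assumption about what such an operator might do. The first method is an iterator, so its null check runs only on enumeration; the second validates eagerly and delegates to a local iterator function.

```csharp
using System;
using System.Collections.Generic;

static class DuplicatesValidationSketch
{
    // Hypothetical name. Because this method contains `yield`, the whole body,
    // including the null check, runs only once enumeration starts.
    public static IEnumerable<T> DuplicatesLazyCheck<T>(this IEnumerable<T> source)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));

        var seen = new HashSet<T>();
        var reported = new HashSet<T>();
        foreach (var item in source)
        {
            // Yield an item the first time it is seen again.
            if (!seen.Add(item) && reported.Add(item))
                yield return item;
        }
    }

    // Hypothetical name. This method is not an iterator itself, so the null
    // check runs at call time; the lazy part lives in the local function.
    public static IEnumerable<T> DuplicatesEagerCheck<T>(this IEnumerable<T> source)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));

        return Iterator();

        IEnumerable<T> Iterator()
        {
            var seen = new HashSet<T>();
            var reported = new HashSet<T>();
            foreach (var item in source)
            {
                if (!seen.Add(item) && reported.Add(item))
                    yield return item;
            }
        }
    }
}
```

Calling `DuplicatesLazyCheck` on a null sequence returns without throwing, and the `ArgumentNullException` surfaces only when the result is iterated. `DuplicatesEagerCheck` throws at the call site, which is the behavior a test that checks null arguments at call time would expect.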
It was indeed that, thanks for the clue :)
Codecov Report
All modified and coverable lines are covered by tests ✅

Coverage Diff (master vs. #1037):

             master    #1037      +/-
Coverage     92.50%   92.52%   +0.01%
Files           112      113       +1
Lines          3404     3413       +9
Branches       1056     1058       +2
Hits           3149     3158       +9
Misses          189      189
Partials         66       66
Good to see you got it working. I'm posting another interim review, which is better than waiting until a full one. I'll add more if I spot something and as I get time.
I would entirely drop the overloads taking a keySelector argument for the following reasons:
- Let's start with the simpler overloads and release them.
- One can always project the keys before combining with Duplicates to get the same result.
- The overloads can be added in a future version, if and when needed, but…
- …we might be pleasantly surprised that they're never needed (thinking YAGNI), and…
- …they can be non-trivial to implement and have surprisingly poor runtime characteristics in the best case.
Let me expand a little bit on the last point. Suppose you have a million records/objects and you want to know if there are any duplicate keys. If you project the keys from the records via Select and run them through Duplicates(), and they all happen to be unique, then in the best case you'll have committed the set of all keys to memory by virtue of storing them in a hash set; all this to yield nothing.
To do the same for Duplicates that takes a keySelector, you'll have to return the records whose keys are duplicated. In order to detect duplicates, you'll have to retain the initial record of each key. When you find that a key is duplicated, you'll have to yield that initial record and all subsequent records that belong to that duplicate key. If all records have unique keys, hopefully the best/optimistic case, you'll have committed all million records to memory. This is what I meant by surprisingly poor runtime characteristics in the best case. Usually, the key sizes will be smaller than a record size and so a simple Duplicates() over keys won't necessarily suffer from or exhibit the same memory profile.
We also have to ask the question, when would one ever want to run Duplicates over records based on some field of a record? Would it help to report which records have those duplicate keys? If you have many such records, then chances are you have duplicate records in the first place. It's a similar problem with DistinctBy. You generally run it on records that are fundamentally duplicated at the source and you're using a key to retain just one of the records of a duplicated key. This is why I think we can defer the unusual case to be considered at a later time that will hopefully never arrive.
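The memory argument above can be made concrete with a rough sketch. Everything here is an assumption for illustration: `DuplicatesOf` and `DuplicatesByKey` are hypothetical names, not MoreLINQ's API, and the semantics are one plausible reading of the description (the simple form yields each duplicated value once; the keyed form yields the first record of a duplicated key followed by every later record with that key).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class DuplicatesByKeySketch
{
    // Hypothetical simple form: retains only the values it has seen.
    // When run over projected keys, only keys are held in memory.
    public static IEnumerable<T> DuplicatesOf<T>(this IEnumerable<T> source)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        return Iterator();

        IEnumerable<T> Iterator()
        {
            var seen = new HashSet<T>();
            var reported = new HashSet<T>();
            foreach (var item in source)
                if (!seen.Add(item) && reported.Add(item))
                    yield return item;
        }
    }

    // Hypothetical keyed form: must retain the first record of every key,
    // because that record has to be yielded if the key shows up again.
    // With all-unique keys, every record ends up in the dictionary.
    public static IEnumerable<TSource> DuplicatesByKey<TSource, TKey>(
        this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
        where TKey : notnull
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (keySelector == null) throw new ArgumentNullException(nameof(keySelector));
        return Iterator();

        IEnumerable<TSource> Iterator()
        {
            // key -> (first record with that key, whether it was already yielded)
            var firstByKey = new Dictionary<TKey, (TSource First, bool Yielded)>();
            foreach (var record in source)
            {
                var key = keySelector(record);
                if (!firstByKey.TryGetValue(key, out var entry))
                {
                    firstByKey.Add(key, (record, false));
                    continue;
                }
                if (!entry.Yielded)
                {
                    yield return entry.First;              // the retained initial record
                    firstByKey[key] = (entry.First, true);
                }
                yield return record;                       // each subsequent record
            }
        }
    }
}
```

For example, given records `(1, "a")`, `(2, "b")`, `(1, "c")`, the expression `records.Select(r => r.Id).DuplicatesOf()` yields the key `1`, while `records.DuplicatesByKey(r => r.Id)` yields the records `(1, "a")` and `(1, "c")`. The first approach holds three integers while it runs; the second holds whole records, which is the difference in memory profile the comment describes.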
Couple of other things missing:
- Missing an entry in README.md after DistinctBy
- Missing an entry in MoreLinq/MoreLinq.csproj after DistinctBy
- Append your copyright notice to bld/Copyright.props
Thanks for the follow-up! I made another review pass. Let me know if you have questions or if something's unclear.
This is currently breaking the build.
I'll look at it.
This is a squashed merge of PR morelinq#1041. --------- Co-authored-by: Stuart Turner <[email protected]>
I've done my full review now. There are just some changes needed in the tests.
We're nearly there to get this merged soon, 🏁 so appreciate all your follow-up!
fix formatting Co-authored-by: Atif Aziz <[email protected]>
@julienasp LGTM! Thanks for working through the review, all the feedback, patience, your contribution and sticking around!
Thank you also @leandromoh, @Orace and @viceroypenguin for your comments and reviews on #1001 and here.
Thanks for the feedback, I learned a lot in the process. Thanks also for the detailed explanations!
Note by @atifaziz: This PR supersedes #1001 and addresses #125.