Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce Duplicates extension #1037

Merged
merged 52 commits into from
Nov 19, 2023

Conversation

julienasp
Copy link
Contributor

@julienasp julienasp commented Nov 14, 2023

with unit test and documentations


Note by @atifaziz: This PR supersedes #1001 and addresses #125.

with unit test and documentations
@julienasp
Copy link
Contributor Author

julienasp commented Nov 14, 2023

@atifaziz i deleted my previous branch, and created a new one.

But for some reason the NullArgumentTest are giving me errors saying that keySelector and source do not have any null check when they do.

and this test works
image

@julienasp julienasp mentioned this pull request Nov 14, 2023
Copy link
Member

@atifaziz atifaziz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But for some reason the NullArgumentTest are giving me errors saying that keySelector and source do not have any null check when they do.

I believe this is because your argument validation isn't happening at the time Duplicates is called, but when the resulting sequence, which is lazy, is evaluated/iterated. There's discussion of this in the C# documentation on local functions.

@julienasp
Copy link
Contributor Author

But for some reason the NullArgumentTest are giving me errors saying that keySelector and source do not have any null check when they do.

I believe this is because your argument validation isn't happening at the time Duplicates is called, but when the resulting sequence, which is lazy, is evaluated/iterated. There's discussion of this in the C# documentation on local functions.

it was indeed that, thanks for the clue :)

@julienasp julienasp requested a review from atifaziz November 14, 2023 13:48
Copy link

codecov bot commented Nov 14, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (a0f72ab) 92.50% compared to head (5095c6f) 92.52%.
Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1037      +/-   ##
==========================================
+ Coverage   92.50%   92.52%   +0.01%     
==========================================
  Files         112      113       +1     
  Lines        3404     3413       +9     
  Branches     1056     1058       +2     
==========================================
+ Hits         3149     3158       +9     
  Misses        189      189              
  Partials       66       66              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@atifaziz atifaziz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to see you got it working. I'm posting another interim review, which is better than waiting until a full one. I'll add more if I spot something and as I get time.

I would entirely drop the overloads taking a keySelector argument for the following reasons:

  • Let's start with the simpler overloads and release them.
  • One can always project the keys before combining with Duplicates to get the same result.
  • The overloads can be added in a future version, if and when needed, but…
  • …we might be pleasantly surprised that they're never needed (thinking YAGNI), and
  • …they can be non-trivial to implement and have surprisingly poor runtime characteristics in the best case.

Let me expand a little bit on the last point. Suppose you have a million records/objects and you want to know if there are any duplicate keys. If you project the keys from the records via Select and run them through Duplicates(), and they all happen to be unique, then in the best case you'll have committed the set of all keys to memory by virtue of storing them in a hash set; all this to yield nothing. To do the same for Duplicates that takes a keySelector, you'll have to return the records whose keys are duplicated. In order to detect duplicates, you'll have to retain the initial record of each key. When you find that a key is duplicated, you'll have to yield that initial record and all subsequent records that belong to that duplicate key. If all records have unique keys, hopefully the best/optimistic case, you'll have committed all million records to memory. This is what I meant by surprisingly poor runtime characteristics in the best case. Usually, the key sizes will be smaller than a record size and so a simple Duplicates() over keys won't necessarily suffer from or exhibit the same memory profile. We also have to ask the question, when would one want to ever run Duplicates over records based on some field of a record? Would it help to report which records have those duplicate keys? If you have many of such records, then chances are you have duplicate records in the first place. It's a similar problem with DistinctBy. You generally run it on records that are fundamentally duplicated at the source and you're using a key to retain just one of the records of a duplicated key. This is why I think we can defer the unusual case to be considered at a later time that will hopefully never arrive.

Couple of other things missing:

MoreLinq/Duplicates.cs Outdated Show resolved Hide resolved
MoreLinq/Duplicates.cs Outdated Show resolved Hide resolved
MoreLinq/Duplicates.cs Outdated Show resolved Hide resolved
MoreLinq/Duplicates.cs Outdated Show resolved Hide resolved
MoreLinq/Duplicates.cs Outdated Show resolved Hide resolved
MoreLinq/Duplicates.cs Show resolved Hide resolved
MoreLinq.Test/DuplicatesTest.cs Show resolved Hide resolved
MoreLinq.Test/DuplicatesTest.cs Outdated Show resolved Hide resolved
MoreLinq.Test/DuplicatesTest.cs Outdated Show resolved Hide resolved
@julienasp julienasp requested a review from atifaziz November 15, 2023 00:50
Copy link
Member

@atifaziz atifaziz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the follow-up! I made another review pass. Let me know if you have questions or if something's unclear.

MoreLinq.Test/DuplicatesTest.cs Show resolved Hide resolved
MoreLinq/Duplicates.cs Show resolved Hide resolved
MoreLinq/Duplicates.cs Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
MoreLinq/Duplicates.cs Outdated Show resolved Hide resolved
MoreLinq/Duplicates.cs Outdated Show resolved Hide resolved
MoreLinq/Duplicates.cs Outdated Show resolved Hide resolved
MoreLinq.Test/DuplicatesTest.cs Outdated Show resolved Hide resolved
MoreLinq.Test/DuplicatesTest.cs Outdated Show resolved Hide resolved
MoreLinq.Test/DuplicatesTest.cs Show resolved Hide resolved
Copy link
Member

@atifaziz atifaziz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is currently breaking the build.

@julienasp
Copy link
Contributor Author

This is currently breaking the build.

i'll look at it

@julienasp julienasp requested a review from atifaziz November 18, 2023 19:17
Copy link
Member

@atifaziz atifaziz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done my full review now. There are just some changes needed in the tests.

We're nearly there to get this merged soon, 🏁 so appreciate all your follow-up!

MoreLinq/Duplicates.cs Outdated Show resolved Hide resolved
MoreLinq.Test/DuplicatesTest.cs Outdated Show resolved Hide resolved
MoreLinq.Test/DuplicatesTest.cs Outdated Show resolved Hide resolved
MoreLinq.Test/DuplicatesTest.cs Outdated Show resolved Hide resolved
MoreLinq.Test/DuplicatesTest.cs Outdated Show resolved Hide resolved
MoreLinq.Test/DuplicatesTest.cs Outdated Show resolved Hide resolved
MoreLinq.Test/DuplicatesTest.cs Outdated Show resolved Hide resolved
MoreLinq.Test/DuplicatesTest.cs Outdated Show resolved Hide resolved
MoreLinq/Duplicates.cs Outdated Show resolved Hide resolved
MoreLinq/Duplicates.cs Outdated Show resolved Hide resolved
@atifaziz atifaziz changed the title Introduce duplicates extension Introduce Duplicates extension Nov 19, 2023
@atifaziz atifaziz added this to the 4.1.0 milestone Nov 19, 2023
Copy link
Member

@atifaziz atifaziz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@julienasp LGTM! Thanks for working through the review, all the feedback, patience, your contribution and sticking around!

@atifaziz
Copy link
Member

Thank you also @leandromoh, @Orace and @viceroypenguin for your comments and reviews on #1001 and here.

@julienasp
Copy link
Contributor Author

@julienasp LGTM! Thanks for working through the review, all the feedback, patience, your contribution and sticking around!

thanks for the feedback, i learned alot in the process. Thanks also for the detailed explanations!

@atifaziz atifaziz merged commit 3a3c256 into morelinq:master Nov 19, 2023
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants