-
Notifications
You must be signed in to change notification settings - Fork 416
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implemented CountBy (Issue 104) #207
Conversation
/// Applies a key-generating function to each element of a sequence and returns a sequence of | ||
/// unique keys and their number of occurrences in the original sequence. | ||
/// </summary> | ||
/// <typeparam name="TSource">Type of the source sequence</typeparam> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not the type of the source sequence but the type of the elements of the source sequence. Ditto for the other overload.
Also, all XML summary comment are full sentences so consider terminating them with a period or full stop (.
).
|
||
/// <summary> | ||
/// Applies a key-generating function to each element of a sequence and returns a sequence of | ||
/// unique keys and their number of occurrences in the original sequence. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not mention how this overload is any different that the other & therefore could look odd on a page of summaries. It should mention the additional argument. Considering adding:
An additional argument specifies a comparer to use for testing equivalence of keys.
|
||
foreach (var item in source) | ||
{ | ||
TKey key = keySelector(item); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use var
instead TKey
.
{ | ||
TKey key = keySelector(item); | ||
|
||
if (dic.ContainsKey(key)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider collapsing this if
into the simpler form:
dic[key] = dic.Contains(key) ? dic[key] + 1 : 1;
[Test] | ||
public void CountBySimpleTest() | ||
{ | ||
IEnumerable<KeyValuePair<int, int>> result = new[] { 1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 1, 2 }.CountBy(c => c); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use var
here and elsewhere instead of repeating the full type.
{ | ||
IEnumerable<KeyValuePair<int, int>> result = new[] { 1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 1, 2 }.CountBy(c => c); | ||
|
||
IEnumerable<KeyValuePair<int, int>> expecteds = new Dictionary<int, int>() { { 1, 4 }, { 2, 3 }, { 3, 2 }, { 4, 1 }, { 5, 1 }, { 6, 1 } }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace expecteds
with expectations
because I think that's what you mean. Also, I would consider laying out the dictionary vertically for readability:
var expecations = new Dictionary<int, int>
{
[1] = 4,
[2] = 3,
[3] = 2,
[4] = 1,
[5] = 1,
[6] = 1,
};
|
||
IEnumerable<KeyValuePair<int, int>> expecteds = new Dictionary<int, int>() { { 1, 4 }, { 2, 3 }, { 3, 2 }, { 4, 1 }, { 5, 1 }, { 6, 1 } }; | ||
|
||
result.AssertSequenceEqual(expecteds); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using AssertSequenceEqual
with dictionaries is dangerous. AssertSequenceEqual
tests sequences & dictionaries do not guarantee enumeration in any particular or logical order. Here you are comparing a sequence (as logically returned by CountBy
) with a dictionary. If the test is passing, it's really by fluke & therefore a false positive. If you want to use dictionaries, you must sort entries by key before using AssertSequenceEqual
, e.g.:
results.OrderBy(e => e.Key).AssertSequenceEqual(expecteds).OrderBy(e => e.Key);
If you do the above, a comment in the code may be worthwhile.
It would be nice to attempt a version that, like |
@atifaziz Thanks for the review! All adjusts were done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please rebase this over master? It looks like you added this on top of the 2.0 beta 2, which has a different project structure. Apologies I didn't notice this earlier.
P.S. Let me know if this is a problem because you don't have the tooling for .xproj
.
} | ||
} | ||
|
||
return keys.Select(k => new KeyValuePair<TKey, int>(k, dic[k])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a test case to specifically test that keys are indeed returned in the order they were seen in the source? If we return dic
directly, we should be able to see the test fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done! check the CountByHasKeysOrderedLikeGroupBy test.
I replaced List by IList in the extension method too.
@@ -73,5 +73,9 @@ internal static IEnumerable<string> GenerateSplits(this string str, params char[ | |||
yield return split; | |||
} | |||
|
|||
internal static void Add<TKey, TValue>(this List<KeyValuePair<TKey, TValue>> list, TKey key, TValue value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are going to add this, might as well extend IList<KeyValuePair<TKey, TValue>>
(interface) as opposed to List<KeyValuePair<TKey, TValue>>
(implementation).
Ignore this part. I was looking at a stale branch that was probably rewritten. Sorry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a test missing to check that CountBy
is lazy.
} | ||
|
||
[Test] | ||
public void CountByHasKeysOrderedLikeGroupBy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I change the last line of CountByImpl
:
return keys.Select(k => new KeyValuePair<TKey, int>(k, dic[k]));
to simply return dic
instead:
return dic;
then this test (CountByHasKeysOrderedLikeGroupBy
) doesn't fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I change the last line of CountByImpl to simply return dic instead then this test (CountByHasKeysOrderedLikeGroupBy) doesn't fail.
In pratice, yes, it doesnt fail (because, currently, dictionary brings the results in order that keys were inserted), but as you once said yourself
dictionaries do not guarantee enumeration in any particular or logical order
So order in enumeration of dictionary can change anyday with a new implementation. To ensure CountBy always return in the order that keys were found the current last line is necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a test missing to check that CountBy is lazy.
What CountByIsLazy
test should exactly do? If I understood well, the implementation isn't lazy since I must iterate over all the IEnumerable before return any KeyValuePair.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CountBy
isn't lazy but it can and should be (like GroupBy
). The iterator doesn't need to run until someone iterates the results! You can make it lazy very easily by using yield
to return the results. So instead of the following as the last line of CountByImpl
:
return keys.Select(k => new KeyValuePair<TKey, int>(k, dic[k]));
Do instead:
foreach (var key in keys)
yield return new KeyValuePair<TKey, int>(key, dic[key]);
Now the compiler will re-write the code to run during iteration, rendering it lazy!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But It isn't the same thing since Select is lazy (do exactly the foreach yield
) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how could I write a test that check if the return of CountBy is lazy (for CountByIsLazy) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how could I write a test that check if the return of CountBy is lazy (for CountByIsLazy) ?
See some of the existing tests that test for laziness for inspiration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But It isn't the same thing since Select is lazy (do exactly the
foreach yield
) ?
Yes, Select
is lazy except you compute the result imperatively before you send them back via Select
. That computation is happening when CountBy
is called when it should happen when the enumerable returned by your method is enumerated by the caller (and which could be some time after).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks for the explanation. CountByIsLazy
was implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working through all the review. We're nearly there. 🏁
return CountByImpl(source, keySelector, comparer); | ||
} | ||
|
||
private static IEnumerable<KeyValuePair<TKey, int>> CountByImpl<TSource, TKey>(IEnumerable<TSource> source, Func<TSource, TKey> keySelector, IEqualityComparer<TKey> comparer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we have all the tests in place, it would be good to work in some optimisations. The current implementation does too many lookups:
- Key existence check:
dic.ContainsKey(key)
- Counter increment:
dic[key]++
- At the end when yielding back the results
Another optimisation could be is that as long as the current key is the same as the last, just increase the counter. This way, adjacent keys don't require containment tests on the dictionary.
I've taken a crack at implementing the optimizations I suggested. The tests are passing but feel free to review & let me know if you think we may need to add some tests to for corner cases where the optimization may fail. BenchmarksI measured the overall benefit of the optimizations. Here's the benchmark code (using BenchmarkDotNet) that runs a using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using MoreLinq;
public class Benchmark
{
static readonly int[] UnorderedData;
static readonly int[] OrderedData;
static Benchmark()
{
var xs = Enumerable.Range(1, 1000)
.SelectMany(x => Enumerable.Repeat(x, 10))
.ToArray();
OrderedData = xs;
UnorderedData = xs.OrderBy(_ => Guid.NewGuid()).ToArray();
}
[Benchmark]
public static void CountByOnOrderedKeys() =>
OrderedData.CountBy(e => e).Consume();
[Benchmark]
public static void CountByOnUnorderedKeys() =>
UnorderedData.CountBy(e => e).Consume();
} Benchmark Environment
Results without Optimizations
Results with Optimizations
ConclusionThe optimizations have improved the speed of |
// is told not to try and inline; done so assuming that the above | ||
// method could have been turned into a NOP (in theory). | ||
|
||
Null(ref dic); // dic = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be less magic and more readable to simply split the counting loop to a separate function.
For other readers (and myself, I had to google this), setting dic = null
is usually not useful, because the CLR knows that you are not using the variable anymore, and thus marks the referenced object as eligible for collection. However, the CLR cannot be sure when a generator or async method is used (I only found a passing reference by Stephen Cleary, is there a more detailed source?), and thus it waits until the method ends. I think this is because the generator method is compiled to a class, with all local parameters as fields on that class. If the instance of this generated class outlives the use of the fields in that class, the CLR cannot know this. However, this same analysis would suggest that it would be a bug in either the C# compiler or the JIT to turn dict = null
into a NOP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was too terse I think: by split the counting loop into a separate function, I mean that that function runs eagerly, not lazily, thus avoiding this problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be less magic and more readable to simply split the counting loop to a separate function.
That's an excellent suggestion! 👍 It doesn't take away the magic one would need to understand that is the raison d'être for split. Someone in the future could say, “This is stupid. Let me inline this for simplicity's sake!”
analysis would suggest that it would be a bug in either the C# compiler or the JIT to turn
dict = null
into a NOP.
Yeah it would be, especially if it's been lifted into a field but it depends on how clever the compiler can get about tying the local variable lifetimes to various stages of the state machine it generates. Technically, the dic
would only need to exist for the first state of the machine and therefore could be local to the case
of the switch
generated for the MoveNext
machine. Only key
, counts
and i
would need to extend over the rest of the states. However, I neither checked out the generated code (by the C# or JIT compiler) nor did I want to depend on it (today it's a switch
/case
, tomorrow it could be done in other ways). I guess I was trying to err on the safe side but scoping to another function altogether is cleaner.
@atifaziz you've done a great job with the optimization! I would begin it today but you were faster lol
instead of just |
@leandromoh It's not worth calling Since you're happy with the optimisations then I can merge this PR if you can just fix the type parameter reference doc issue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently missing to add method reference in
project.json
andREADME.md
files.
Can we address this now as the last item? Should be done then. Thanks!
Done! |
Merged into Thanks for adding this and working through all the reviews. Note that I squashed your commits in the final merge except the optimizations because I felt it would be good to keep a change history of those in case anything needs to be reverted or reviewed if ever a bug is reported. |
I know I said MoreLINQ is in lock down for 2.0 release but I'm being bad here and letting this one slip… 😇
|
@atifaziz that's a good new! When do you plan to publish the release 2.0 with the other methods? I open the link to see CountBy documentation and perceive that my name is not in the footer with others contributors. When do you plan to add it ? |
Soon-ish.
Sorry, missed that one but fixed now by 58d6c62. BTW, that's an aggregate of copyright notices, not contributors list. |
Implemented the F# countBy analogue (#104)
Currently missing to add method reference in "project.json" and "README.md" files.
I pretend to do it latter to avoid conflicts.