
Implemented CountBy (Issue 104) #207

Closed · wants to merge 15 commits

Conversation

leandromoh
Collaborator

Implemented the F# countBy analogue (#104)

The method reference still needs to be added to the "project.json" and "README.md" files.
I intend to do that later to avoid conflicts.

/// Applies a key-generating function to each element of a sequence and returns a sequence of
/// unique keys and their number of occurrences in the original sequence.
/// </summary>
/// <typeparam name="TSource">Type of the source sequence</typeparam>
Member

It's not the type of the source sequence but the type of the elements of the source sequence. Ditto for the other overload.

Also, all XML summary comments are full sentences, so consider terminating them with a period or full stop (.).


/// <summary>
/// Applies a key-generating function to each element of a sequence and returns a sequence of
/// unique keys and their number of occurrences in the original sequence.
Member

This does not mention how this overload is any different from the other & therefore could look odd on a page of summaries. It should mention the additional argument. Consider adding:

An additional argument specifies a comparer to use for testing equivalence of keys.


foreach (var item in source)
{
TKey key = keySelector(item);
Member

Use var instead of TKey.

{
TKey key = keySelector(item);

if (dic.ContainsKey(key))
Member

Consider collapsing this if into the simpler form:

dic[key] = dic.ContainsKey(key) ? dic[key] + 1 : 1;

[Test]
public void CountBySimpleTest()
{
IEnumerable<KeyValuePair<int, int>> result = new[] { 1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 1, 2 }.CountBy(c => c);
Member

Use var here and elsewhere instead of repeating the full type.

{
IEnumerable<KeyValuePair<int, int>> result = new[] { 1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 1, 2 }.CountBy(c => c);

IEnumerable<KeyValuePair<int, int>> expecteds = new Dictionary<int, int>() { { 1, 4 }, { 2, 3 }, { 3, 2 }, { 4, 1 }, { 5, 1 }, { 6, 1 } };
Member

Replace expecteds with expectations because I think that's what you mean. Also, I would consider laying out the dictionary vertically for readability:

var expectations = new Dictionary<int, int>
{
    [1] = 4,
    [2] = 3,
    [3] = 2,
    [4] = 1,
    [5] = 1,
    [6] = 1,
};


IEnumerable<KeyValuePair<int, int>> expecteds = new Dictionary<int, int>() { { 1, 4 }, { 2, 3 }, { 3, 2 }, { 4, 1 }, { 5, 1 }, { 6, 1 } };

result.AssertSequenceEqual(expecteds);
Member

Using AssertSequenceEqual with dictionaries is dangerous. AssertSequenceEqual tests sequences & dictionaries do not guarantee enumeration in any particular or logical order. Here you are comparing a sequence (as logically returned by CountBy) with a dictionary. If the test is passing, it's really by fluke & therefore a false positive. If you want to use dictionaries, you must sort entries by key before using AssertSequenceEqual, e.g.:

results.OrderBy(e => e.Key).AssertSequenceEqual(expecteds.OrderBy(e => e.Key));

If you do the above, a comment in the code may be worthwhile.

@atifaziz
Member

atifaziz commented Nov 1, 2016

It would be nice to attempt a version that, like GroupBy, maintains the order of the keys with respect to the source sequence. If it turns out too complicated then we can revert but it seems worth a try.
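One way to do that, as a minimal sketch (not the PR's actual code; names here are illustrative), is to remember keys in the order they are first seen alongside the count dictionary and project the results from that list:

static IEnumerable<KeyValuePair<TKey, int>> CountByOrdered<TSource, TKey>(
    IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)
{
    var dic = new Dictionary<TKey, int>(comparer);   // key -> count
    var keys = new List<TKey>();                      // keys in first-seen order

    foreach (var item in source)
    {
        var key = keySelector(item);
        if (dic.ContainsKey(key))
        {
            dic[key]++;
        }
        else
        {
            dic[key] = 1;
            keys.Add(key);   // record each key only once
        }
    }

    return keys.Select(k => new KeyValuePair<TKey, int>(k, dic[k]));
}

The final projection over keys (rather than returning dic directly) is what guarantees the first-seen order, since dictionaries make no ordering promises.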

@leandromoh
Collaborator Author

@atifaziz Thanks for the review! All adjustments have been made.

Member

@atifaziz left a comment

Could you please rebase this over master? It looks like you added this on top of the 2.0 beta 2, which has a different project structure. Apologies I didn't notice this earlier.

P.S. Let me know if this is a problem because you don't have the tooling for .xproj.

}
}

return keys.Select(k => new KeyValuePair<TKey, int>(k, dic[k]));
Member

Can we add a test case to specifically test that keys are indeed returned in the order they were seen in the source? If we return dic directly, we should be able to see the test fail.

Collaborator Author

@leandromoh Nov 2, 2016

Done! Check the CountByHasKeysOrderedLikeGroupBy test.
I replaced List with IList in the extension method too.
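A minimal sketch of what such an ordering test could look like (illustrative only, not necessarily the test added in this PR), comparing the key order of CountBy against that of GroupBy on the same input, using the test suite's AssertSequenceEqual helper:

[Test]
public void CountByHasKeysOrderedLikeGroupBy()
{
    var source = new[] { "a", "b", "a", "c", "b", "a" };

    // Both operators should surface keys in first-seen order: a, b, c.
    var countByKeys = source.CountBy(s => s).Select(kv => kv.Key);
    var groupByKeys = source.GroupBy(s => s).Select(g => g.Key);

    countByKeys.AssertSequenceEqual(groupByKeys.ToArray());
}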

@@ -73,5 +73,9 @@ internal static IEnumerable<string> GenerateSplits(this string str, params char[
yield return split;
}

internal static void Add<TKey, TValue>(this List<KeyValuePair<TKey, TValue>> list, TKey key, TValue value)
Member

If we are going to add this, might as well extend IList<KeyValuePair<TKey, TValue>> (interface) as opposed to List<KeyValuePair<TKey, TValue>> (implementation).
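What the suggested change might look like (the body here is assumed, since it is not shown in this excerpt); presumably the point of such an Add overload is that collection-initializer syntax like new List<KeyValuePair<int, int>> { { 1, 4 }, { 2, 3 } } then works in tests:

internal static void Add<TKey, TValue>(this IList<KeyValuePair<TKey, TValue>> list, TKey key, TValue value)
{
    // Resolves to ICollection<T>.Add on the list instance, not back to this
    // extension, so there is no recursion here.
    list.Add(new KeyValuePair<TKey, TValue>(key, value));
}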

@atifaziz
Member

atifaziz commented Nov 2, 2016

Could you please rebase this over master? It looks like you added this on top of the 2.0 beta 2, which has a different project structure.

Ignore this part. I was looking at a stale branch that was probably rewritten. Sorry.

Member

@atifaziz left a comment

There is a test missing to check that CountBy is lazy.

}

[Test]
public void CountByHasKeysOrderedLikeGroupBy()
Member

If I change the last line of CountByImpl:

             return keys.Select(k => new KeyValuePair<TKey, int>(k, dic[k]));

to simply return dic instead:

             return dic;

then this test (CountByHasKeysOrderedLikeGroupBy) doesn't fail.

Collaborator Author

If I change the last line of CountByImpl to simply return dic instead then this test (CountByHasKeysOrderedLikeGroupBy) doesn't fail.

In practice, yes, it doesn't fail (because, currently, the dictionary enumerates its entries in the order the keys were inserted), but as you once said yourself:

dictionaries do not guarantee enumeration in any particular or logical order

So the enumeration order of a dictionary could change any day with a new implementation. To ensure CountBy always returns keys in the order they were found, the current last line is necessary.

Collaborator Author

@leandromoh Nov 2, 2016

There is a test missing to check that CountBy is lazy.

What exactly should the CountByIsLazy test do? If I understood correctly, the implementation isn't lazy, since I must iterate over the whole IEnumerable before returning any KeyValuePair.

Member

CountBy isn't lazy but it can and should be (like GroupBy). The iterator doesn't need to run until someone iterates the results! You can make it lazy very easily by using yield to return the results. So instead of the following as the last line of CountByImpl:

            return keys.Select(k => new KeyValuePair<TKey, int>(k, dic[k]));

Do instead:

            foreach (var key in keys)
                yield return new KeyValuePair<TKey, int>(key, dic[key]);

Now the compiler will re-write the code to run during iteration, rendering it lazy!

Collaborator Author

@leandromoh Nov 2, 2016

But isn't it the same thing, since Select is lazy (it does exactly what the foreach/yield does)?

Collaborator Author

@leandromoh Nov 2, 2016

How could I write a test that checks whether the return of CountBy is lazy (for CountByIsLazy)?

Member

How could I write a test that checks whether the return of CountBy is lazy (for CountByIsLazy)?

See some of the existing tests that test for laziness for inspiration.
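A minimal sketch of such a test, assuming the test project's BreakingSequence helper (a sequence that throws if it is ever enumerated); the actual CountByIsLazy test in the PR may differ:

[Test]
public void CountByIsLazy()
{
    // BreakingSequence throws on enumeration, so this only passes if
    // CountBy defers all of its work until the result is enumerated.
    new BreakingSequence<string>().CountBy(s => s.Length);
}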

Member

But isn't it the same thing, since Select is lazy (it does exactly what the foreach/yield does)?

Yes, Select is lazy, except you compute the results imperatively before you send them back via Select. That computation happens when CountBy is called, when it should happen when the enumerable returned by your method is enumerated by the caller (which could be some time later).

Collaborator Author

I see, thanks for the explanation. CountByIsLazy was implemented.

Member

@atifaziz left a comment

Thanks for working through all the review. We're nearly there. 🏁

MoreLinq/CountBy.cs
return CountByImpl(source, keySelector, comparer);
}

private static IEnumerable<KeyValuePair<TKey, int>> CountByImpl<TSource, TKey>(IEnumerable<TSource> source, Func<TSource, TKey> keySelector, IEqualityComparer<TKey> comparer)
Member

Now that we have all the tests in place, it would be good to work in some optimisations. The current implementation does too many lookups:

  1. Key existence check: dic.ContainsKey(key)
  2. Counter increment: dic[key]++
  3. At the end when yielding back the results

Another optimisation could be that as long as the current key is the same as the last, we just increase the counter. This way, adjacent keys don't require containment tests on the dictionary.
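One way those two optimisations could look, as a hedged sketch (not necessarily the code that was eventually committed): map each key to a slot index with a single TryGetValue, keep the counts in a list addressed by that index, and short-circuit runs of equal keys against the previously seen key.

static IEnumerable<KeyValuePair<TKey, int>> CountByImpl<TSource, TKey>(
    IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)
{
    var indexByKey = new Dictionary<TKey, int>(comparer);  // key -> slot index
    var keys = new List<TKey>();                            // keys in first-seen order
    var counts = new List<int>();

    var havePrev = false;
    var prevKey = default(TKey);
    var prevIndex = 0;

    foreach (var item in source)
    {
        var key = keySelector(item);

        // Fast path: a run of equal adjacent keys never touches the dictionary.
        if (havePrev && indexByKey.Comparer.Equals(prevKey, key))
        {
            counts[prevIndex]++;
            continue;
        }

        int index;
        if (indexByKey.TryGetValue(key, out index))  // single look-up per new key
        {
            counts[index]++;
        }
        else
        {
            index = keys.Count;
            indexByKey.Add(key, index);
            keys.Add(key);
            counts.Add(1);
        }

        prevKey = key;
        prevIndex = index;
        havePrev = true;
    }

    for (var i = 0; i < keys.Count; i++)
        yield return new KeyValuePair<TKey, int>(keys[i], counts[i]);
}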

@atifaziz
Member

atifaziz commented Nov 3, 2016

I've taken a crack at implementing the optimizations I suggested. The tests are passing but feel free to review & let me know if you think we may need to add some tests for corner cases where the optimization may fail.

Benchmarks

I measured the overall benefit of the optimizations. Here's the benchmark code (using BenchmarkDotNet) that runs a CountBy on an ordered and unordered sequence of 10K integers (with duplicates) as keys:

using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using MoreLinq;

public class Benchmark
{
    static readonly int[] UnorderedData;
    static readonly int[] OrderedData;

    static Benchmark()
    {
        var xs = Enumerable.Range(1, 1000)
                           .SelectMany(x => Enumerable.Repeat(x, 10))
                           .ToArray();
        OrderedData = xs;
        UnorderedData = xs.OrderBy(_ => Guid.NewGuid()).ToArray();
    }

    [Benchmark]
    public static void CountByOnOrderedKeys() =>
        OrderedData.CountBy(e => e).Consume();

    [Benchmark]
    public static void CountByOnUnorderedKeys() =>
        UnorderedData.CountBy(e => e).Consume();
}

Benchmark Environment

Host Process Environment Information:
BenchmarkDotNet.Core=v0.9.9.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-6600U CPU 2.60GHz, ProcessorCount=4
Frequency=2742185 ticks, Resolution=364.6727 ns, Timer=TSC
CLR=MS.NET 4.0.30319.42000, Arch=32-bit RELEASE
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1080.0

Type=Benchmark  Mode=Throughput  

Results without Optimizations

Method                   Median        StdDev
CountByOnOrderedKeys     478.9549 us   45.6884 us
CountByOnUnorderedKeys   481.1692 us   29.9369 us

Results with Optimizations

Method                   Median        StdDev
CountByOnOrderedKeys     210.2153 us    3.8459 us
CountByOnUnorderedKeys   334.2708 us   17.2886 us

Conclusion

The optimizations have improved the speed of CountBy! 🙌 In the unordered case (CountByOnUnorderedKeys), it runs 30% faster due to fewer dictionary look-ups. In the ordered case (CountByOnOrderedKeys), the improvement is 2-fold, thanks to avoiding look-ups as long as the key is not changing.

// is told not to try and inline; done so assuming that the above
// method could have been turned into a NOP (in theory).

Null(ref dic); // dic = null;
Member

I think it would be less magic and more readable to simply split the counting loop into a separate function.

For other readers (and myself, I had to google this), setting dic = null is usually not useful, because the CLR knows that you are not using the variable anymore, and thus marks the referenced object as eligible for collection. However, the CLR cannot be sure when a generator or async method is used (I only found a passing reference by Stephen Cleary, is there a more detailed source?), and thus it waits until the method ends. I think this is because the generator method is compiled to a class, with all locals and parameters as fields on that class. If the instance of this generated class outlives the use of the fields in that class, the CLR cannot know this. However, this same analysis would suggest that it would be a bug in either the C# compiler or the JIT to turn dic = null into a NOP.

Member

I was too terse, I think: by splitting the counting loop into a separate function, I mean that that function runs eagerly, not lazily, thus avoiding this problem.
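Concretely, a sketch of that structure (illustrative, not the committed code): the iterator still defers everything until enumeration, but the dictionary lives only inside a separate non-iterator helper, so it is never hoisted into the iterator's state machine and no dic = null trick is needed.

static IEnumerable<KeyValuePair<TKey, int>> CountByImpl<TSource, TKey>(
    IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)
{
    // Runs on the first MoveNext, so CountBy stays lazy; only `keys` and
    // `counts` end up captured by this iterator's state machine.
    List<TKey> keys;
    List<int> counts;
    CountKeys(source, keySelector, comparer, out keys, out counts);

    for (var i = 0; i < keys.Count; i++)
        yield return new KeyValuePair<TKey, int>(keys[i], counts[i]);
}

static void CountKeys<TSource, TKey>(
    IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer,
    out List<TKey> keys,
    out List<int> counts)
{
    // An ordinary local: eligible for collection as soon as this method returns.
    var dic = new Dictionary<TKey, int>(comparer);  // key -> slot index
    keys = new List<TKey>();
    counts = new List<int>();

    foreach (var item in source)
    {
        var key = keySelector(item);
        int index;
        if (dic.TryGetValue(key, out index))
        {
            counts[index]++;
        }
        else
        {
            dic.Add(key, keys.Count);
            keys.Add(key);
            counts.Add(1);
        }
    }
}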

Member

I think it would be less magic and more readable to simply split the counting loop into a separate function.

That's an excellent suggestion! 👍 It doesn't take away the magic one would need to understand that is the raison d'être for split. Someone in the future could say, “This is stupid. Let me inline this for simplicity's sake!”

analysis would suggest that it would be a bug in either the C# compiler or the JIT to turn dic = null into a NOP.

Yeah it would be, especially if it's been lifted into a field but it depends on how clever the compiler can get about tying the local variable lifetimes to various stages of the state machine it generates. Technically, the dic would only need to exist for the first state of the machine and therefore could be local to the case of the switch generated for the MoveNext machine. Only key, counts and i would need to extend over the rest of the states. However, I neither checked out the generated code (by the C# or JIT compiler) nor did I want to depend on it (today it's a switch/case, tomorrow it could be done in other ways). I guess I was trying to err on the safe side but scoping to another function altogether is cleaner.

@leandromoh
Collaborator Author

@atifaziz you've done a great job with the optimization! I was going to start on it today but you were faster, lol.
I have only one doubt: why did you use

dic.Comparer.GetHashCode(prevKey) == dic.Comparer.GetHashCode(key) 
&& dic.Comparer.Equals(prevKey, key)

instead of just dic.Comparer.Equals(prevKey, key) ?

@atifaziz
Member

atifaziz commented Nov 5, 2016

@leandromoh It's not worth calling Equals for performance reasons if the hash code is already different because that's what a dictionary would do internally too. Comparing hash codes can be faster in many cases. For example, Equals for a string will check for every character but a hash code comparison is just an integer equality check and therefore much faster (same goes for structural value types that are compounds of primitives). If hash codes match then a full equality comparison of the actual content is needed because the hash is just a digest & can be the same for several values of a type.

Since you're happy with the optimisations, I can merge this PR if you just fix the type parameter reference doc issue.

Member

@atifaziz left a comment

The method reference still needs to be added to the project.json and README.md files.

Can we address this now as the last item? Should be done then. Thanks!

@leandromoh
Collaborator Author

Can we address this now as the last item? Should be done then. Thanks!

Done!

atifaziz added a commit that referenced this pull request Nov 11, 2016
@atifaziz
Member

Merged into master by a02b8f9.

Thanks for adding this and working through all the reviews.


Note that I squashed your commits in the final merge except the optimizations because I felt it would be good to keep a change history of those in case anything needs to be reverted or reviewed if ever a bug is reported.

@atifaziz closed this Nov 11, 2016
@atifaziz
Member

I know I said MoreLINQ is in lock down for 2.0 release but I'm being bad here and letting this one slip… 😇

CountBy is now published in 2.0 beta 6.

@atifaziz
Member

Documentation published.

@leandromoh
Collaborator Author

leandromoh commented Nov 12, 2016

@atifaziz that's good news! When do you plan to publish the 2.0 release with the other methods?

I opened the link to see the CountBy documentation and noticed that my name is not in the footer with the other contributors. When do you plan to add it?

@atifaziz
Member

When do you plan to publish the release 2.0 with the other methods?

Soon-ish.

my name is not in the footer with the other contributors.

Sorry, missed that one but it's fixed now by 58d6c62. BTW, that's an aggregate of copyright notices, not a contributors list.
