
Implemented CountBy (Issue 104) #207

Closed · wants to merge 15 commits

Conversation

leandromoh
Collaborator

Implemented the F# countBy analogue (#104)

The method reference still needs to be added to the "project.json" and "README.md" files.
I intend to do that later to avoid conflicts.

/// Applies a key-generating function to each element of a sequence and returns a sequence of
/// unique keys and their number of occurrences in the original sequence.
/// </summary>
/// <typeparam name="TSource">Type of the source sequence</typeparam>
Member

It's not the type of the source sequence but the type of the elements of the source sequence. Ditto for the other overload.

Also, all XML summary comments are full sentences, so consider terminating them with a period or full stop (.).


/// <summary>
/// Applies a key-generating function to each element of a sequence and returns a sequence of
/// unique keys and their number of occurrences in the original sequence.
Member

This does not mention how this overload is any different from the other & therefore could look odd on a page of summaries. It should mention the additional argument. Consider adding:

An additional argument specifies a comparer to use for testing equivalence of keys.


foreach (var item in source)
{
TKey key = keySelector(item);
Member

Use var instead of TKey.

{
TKey key = keySelector(item);

if (dic.ContainsKey(key))
Member

Consider collapsing this if into the simpler form:

dic[key] = dic.ContainsKey(key) ? dic[key] + 1 : 1;

[Test]
public void CountBySimpleTest()
{
IEnumerable<KeyValuePair<int, int>> result = new[] { 1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 1, 2 }.CountBy(c => c);
Member

Use var here and elsewhere instead of repeating the full type.

{
IEnumerable<KeyValuePair<int, int>> result = new[] { 1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 1, 2 }.CountBy(c => c);

IEnumerable<KeyValuePair<int, int>> expecteds = new Dictionary<int, int>() { { 1, 4 }, { 2, 3 }, { 3, 2 }, { 4, 1 }, { 5, 1 }, { 6, 1 } };
Member

Replace expecteds with expectations because I think that's what you mean. Also, I would consider laying out the dictionary vertically for readability:

var expectations = new Dictionary<int, int>
{
    [1] = 4,
    [2] = 3,
    [3] = 2,
    [4] = 1,
    [5] = 1,
    [6] = 1,
};


IEnumerable<KeyValuePair<int, int>> expecteds = new Dictionary<int, int>() { { 1, 4 }, { 2, 3 }, { 3, 2 }, { 4, 1 }, { 5, 1 }, { 6, 1 } };

result.AssertSequenceEqual(expecteds);
Member

Using AssertSequenceEqual with dictionaries is dangerous. AssertSequenceEqual tests sequences & dictionaries do not guarantee enumeration in any particular or logical order. Here you are comparing a sequence (as logically returned by CountBy) with a dictionary. If the test is passing, it's really by fluke & therefore a false positive. If you want to use dictionaries, you must sort entries by key before using AssertSequenceEqual, e.g.:

results.OrderBy(e => e.Key).AssertSequenceEqual(expecteds.OrderBy(e => e.Key));

If you do the above, a comment in the code may be worthwhile.

@atifaziz
Member

atifaziz commented Nov 1, 2016

It would be nice to attempt a version that, like GroupBy, maintains the order of the keys with respect to the source sequence. If it turns out too complicated then we can revert but it seems worth a try.
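One way to do that, as a minimal sketch (not the PR's actual code; names here are illustrative), is to remember keys in the order they are first seen alongside the count dictionary and project the results from that list:

static IEnumerable<KeyValuePair<TKey, int>> CountByOrdered<TSource, TKey>(
    IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)
{
    var dic = new Dictionary<TKey, int>(comparer);   // key -> count
    var keys = new List<TKey>();                      // keys in first-seen order

    foreach (var item in source)
    {
        var key = keySelector(item);
        if (dic.ContainsKey(key))
        {
            dic[key]++;
        }
        else
        {
            dic[key] = 1;
            keys.Add(key);   // record each key only once
        }
    }

    return keys.Select(k => new KeyValuePair<TKey, int>(k, dic[k]));
}

The final projection over keys (rather than returning dic directly) is what guarantees the first-seen order, since dictionaries make no ordering promises.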

@leandromoh
Collaborator Author

@atifaziz Thanks for the review! All adjustments have been made.

Member

@atifaziz left a comment

Could you please rebase this over master? It looks like you added this on top of the 2.0 beta 2, which has a different project structure. Apologies I didn't notice this earlier.

P.S. Let me know if this is a problem because you don't have the tooling for .xproj.

}
}

return keys.Select(k => new KeyValuePair<TKey, int>(k, dic[k]));
Member

Can we add a test case to specifically test that keys are indeed returned in the order they were seen in the source? If we return dic directly, we should be able to see the test fail.

Collaborator Author

@leandromoh Nov 2, 2016

Done! Check the CountByHasKeysOrderedLikeGroupBy test.
I replaced List with IList in the extension method too.
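A minimal sketch of what such an ordering test could look like (illustrative only, not necessarily the test added in this PR), comparing the key order of CountBy against that of GroupBy on the same input, using the test suite's AssertSequenceEqual helper:

[Test]
public void CountByHasKeysOrderedLikeGroupBy()
{
    var source = new[] { "a", "b", "a", "c", "b", "a" };

    // Both operators should surface keys in first-seen order: a, b, c.
    var countByKeys = source.CountBy(s => s).Select(kv => kv.Key);
    var groupByKeys = source.GroupBy(s => s).Select(g => g.Key);

    countByKeys.AssertSequenceEqual(groupByKeys.ToArray());
}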

@@ -73,5 +73,9 @@ internal static IEnumerable<string> GenerateSplits(this string str, params char[
yield return split;
}

internal static void Add<TKey, TValue>(this List<KeyValuePair<TKey, TValue>> list, TKey key, TValue value)
Member

If we are going to add this, might as well extend IList<KeyValuePair<TKey, TValue>> (interface) as opposed to List<KeyValuePair<TKey, TValue>> (implementation).
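What the suggested change might look like (the body here is assumed, since it is not shown in this excerpt); presumably the point of such an Add overload is that collection-initializer syntax like new List<KeyValuePair<int, int>> { { 1, 4 }, { 2, 3 } } then works in tests:

internal static void Add<TKey, TValue>(this IList<KeyValuePair<TKey, TValue>> list, TKey key, TValue value)
{
    // Resolves to ICollection<T>.Add on the list instance, not back to this
    // extension, so there is no recursion here.
    list.Add(new KeyValuePair<TKey, TValue>(key, value));
}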

@atifaziz
Member

atifaziz commented Nov 2, 2016

Could you please rebase this over master? It looks like you added this on top of the 2.0 beta 2, which has a different project structure.

Ignore this part. I was looking at a stale branch that was probably rewritten. Sorry.

Member

@atifaziz left a comment

There is a test missing to check that CountBy is lazy.

}

[Test]
public void CountByHasKeysOrderedLikeGroupBy()
Member

If I change the last line of CountByImpl:

             return keys.Select(k => new KeyValuePair<TKey, int>(k, dic[k]));

to simply return dic instead:

             return dic;

then this test (CountByHasKeysOrderedLikeGroupBy) doesn't fail.

Collaborator Author

If I change the last line of CountByImpl to simply return dic instead then this test (CountByHasKeysOrderedLikeGroupBy) doesn't fail.

In practice, yes, it doesn't fail (because, currently, the dictionary enumerates its entries in the order the keys were inserted), but as you once said yourself:

dictionaries do not guarantee enumeration in any particular or logical order

So the enumeration order of a dictionary could change any day with a new implementation. To ensure CountBy always returns keys in the order they were found, the current last line is necessary.

Collaborator Author

@leandromoh Nov 2, 2016

There is a test missing to check that CountBy is lazy.

What exactly should the CountByIsLazy test do? If I understood correctly, the implementation isn't lazy, since I must iterate over the whole IEnumerable before returning any KeyValuePair.

Member

CountBy isn't lazy but it can and should be (like GroupBy). The iterator doesn't need to run until someone iterates the results! You can make it lazy very easily by using yield to return the results. So instead of the following as the last line of CountByImpl:

            return keys.Select(k => new KeyValuePair<TKey, int>(k, dic[k]));

Do instead:

            foreach (var key in keys)
                yield return new KeyValuePair<TKey, int>(key, dic[key]);

Now the compiler will re-write the code to run during iteration, rendering it lazy!

Collaborator Author

@leandromoh Nov 2, 2016

But isn't it the same thing, since Select is lazy (it does exactly what the foreach/yield does)?

Collaborator Author

@leandromoh Nov 2, 2016

How could I write a test that checks whether the return of CountBy is lazy (for CountByIsLazy)?

Member

How could I write a test that checks whether the return of CountBy is lazy (for CountByIsLazy)?

See some of the existing tests that test for laziness for inspiration.
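A minimal sketch of such a test, assuming the test project's BreakingSequence helper (a sequence that throws if it is ever enumerated); the actual CountByIsLazy test in the PR may differ:

[Test]
public void CountByIsLazy()
{
    // BreakingSequence throws on enumeration, so this only passes if
    // CountBy defers all of its work until the result is enumerated.
    new BreakingSequence<string>().CountBy(s => s.Length);
}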

Member

But isn't it the same thing, since Select is lazy (it does exactly what the foreach/yield does)?

Yes, Select is lazy, except you compute the results imperatively before you send them back via Select. That computation happens when CountBy is called, when it should happen when the enumerable returned by your method is enumerated by the caller (which could be some time later).

Collaborator Author

I see, thanks for the explanation. CountByIsLazy was implemented.

Member

@atifaziz left a comment

Thanks for working through all the review. We're nearly there. 🏁

MoreLinq/CountBy.cs
return CountByImpl(source, keySelector, comparer);
}

private static IEnumerable<KeyValuePair<TKey, int>> CountByImpl<TSource, TKey>(IEnumerable<TSource> source, Func<TSource, TKey> keySelector, IEqualityComparer<TKey> comparer)
Member

Now that we have all the tests in place, it would be good to work in some optimisations. The current implementation does too many lookups:

  1. Key existence check: dic.ContainsKey(key)
  2. Counter increment: dic[key]++
  3. At the end when yielding back the results

Another optimisation could be that as long as the current key is the same as the last, we just increase the counter. This way, adjacent keys don't require containment tests on the dictionary.
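One way those two optimisations could look, as a hedged sketch (not necessarily the code that was eventually committed): map each key to a slot index with a single TryGetValue, keep the counts in a list addressed by that index, and short-circuit runs of equal keys against the previously seen key.

static IEnumerable<KeyValuePair<TKey, int>> CountByImpl<TSource, TKey>(
    IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)
{
    var indexByKey = new Dictionary<TKey, int>(comparer);  // key -> slot index
    var keys = new List<TKey>();                            // keys in first-seen order
    var counts = new List<int>();

    var havePrev = false;
    var prevKey = default(TKey);
    var prevIndex = 0;

    foreach (var item in source)
    {
        var key = keySelector(item);

        // Fast path: a run of equal adjacent keys never touches the dictionary.
        if (havePrev && indexByKey.Comparer.Equals(prevKey, key))
        {
            counts[prevIndex]++;
            continue;
        }

        int index;
        if (indexByKey.TryGetValue(key, out index))  // single look-up per new key
        {
            counts[index]++;
        }
        else
        {
            index = keys.Count;
            indexByKey.Add(key, index);
            keys.Add(key);
            counts.Add(1);
        }

        prevKey = key;
        prevIndex = index;
        havePrev = true;
    }

    for (var i = 0; i < keys.Count; i++)
        yield return new KeyValuePair<TKey, int>(keys[i], counts[i]);
}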

@atifaziz
Member

atifaziz commented Nov 3, 2016

I've taken a crack at implementing the optimizations I suggested. The tests are passing but feel free to review & let me know if you think we may need to add some tests for corner cases where the optimization may fail.

Benchmarks

I measured the overall benefit of the optimizations. Here's the benchmark code (using BenchmarkDotNet) that runs a CountBy on an ordered and unordered sequence of 10K integers (with duplicates) as keys:

using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using MoreLinq;

public class Benchmark
{
    static readonly int[] UnorderedData;
    static readonly int[] OrderedData;

    static Benchmark()
    {
        var xs = Enumerable.Range(1, 1000)
                           .SelectMany(x => Enumerable.Repeat(x, 10))
                           .ToArray();
        OrderedData = xs;
        UnorderedData = xs.OrderBy(_ => Guid.NewGuid()).ToArray();
    }

    [Benchmark]
    public static void CountByOnOrderedKeys() =>
        OrderedData.CountBy(e => e).Consume();

    [Benchmark]
    public static void CountByOnUnorderedKeys() =>
        UnorderedData.CountBy(e => e).Consume();
}

Benchmark Environment

Host Process Environment Information:
BenchmarkDotNet.Core=v0.9.9.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-6600U CPU 2.60GHz, ProcessorCount=4
Frequency=2742185 ticks, Resolution=364.6727 ns, Timer=TSC
CLR=MS.NET 4.0.30319.42000, Arch=32-bit RELEASE
GC=Concurrent Workstation
JitModules=clrjit-v4.6.1080.0

Type=Benchmark  Mode=Throughput  

Results without Optimizations

Method                   Median        StdDev
CountByOnOrderedKeys     478.9549 us   45.6884 us
CountByOnUnorderedKeys   481.1692 us   29.9369 us

Results with Optimizations

Method                   Median        StdDev
CountByOnOrderedKeys     210.2153 us    3.8459 us
CountByOnUnorderedKeys   334.2708 us   17.2886 us

Conclusion

The optimizations have improved the speed of CountBy! 🙌 In the unordered case (CountByOnUnorderedKeys), it runs 30% faster due to fewer dictionary look-ups. In the ordered case (CountByOnOrderedKeys), the improvement is 2-fold, thanks to avoiding look-ups as long as the key is not changing.

// is told not to try and inline; done so assuming that the above
// method could have been turned into a NOP (in theory).

Null(ref dic); // dic = null;
Member

I think it would be less magic and more readable to simply split the counting loop into a separate function.

For other readers (and myself, I had to google this), setting dic = null is usually not useful, because the CLR knows that you are not using the variable anymore, and thus marks the referenced object as eligible for collection. However, the CLR cannot be sure when a generator or async method is used (I only found a passing reference by Stephen Cleary, is there a more detailed source?), and thus it waits until the method ends. I think this is because the generator method is compiled to a class, with all locals and parameters as fields on that class. If the instance of this generated class outlives the use of the fields in that class, the CLR cannot know this. However, this same analysis would suggest that it would be a bug in either the C# compiler or the JIT to turn dic = null into a NOP.

Member

I was too terse, I think: by splitting the counting loop into a separate function, I mean that that function runs eagerly, not lazily, thus avoiding this problem.
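Concretely, a sketch of that structure (illustrative, not the committed code): the iterator still defers everything until enumeration, but the dictionary lives only inside a separate non-iterator helper, so it is never hoisted into the iterator's state machine and no dic = null trick is needed.

static IEnumerable<KeyValuePair<TKey, int>> CountByImpl<TSource, TKey>(
    IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)
{
    // Runs on the first MoveNext, so CountBy stays lazy; only `keys` and
    // `counts` end up captured by this iterator's state machine.
    List<TKey> keys;
    List<int> counts;
    CountKeys(source, keySelector, comparer, out keys, out counts);

    for (var i = 0; i < keys.Count; i++)
        yield return new KeyValuePair<TKey, int>(keys[i], counts[i]);
}

static void CountKeys<TSource, TKey>(
    IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer,
    out List<TKey> keys,
    out List<int> counts)
{
    // An ordinary local: eligible for collection as soon as this method returns.
    var dic = new Dictionary<TKey, int>(comparer);  // key -> slot index
    keys = new List<TKey>();
    counts = new List<int>();

    foreach (var item in source)
    {
        var key = keySelector(item);
        int index;
        if (dic.TryGetValue(key, out index))
        {
            counts[index]++;
        }
        else
        {
            dic.Add(key, keys.Count);
            keys.Add(key);
            counts.Add(1);
        }
    }
}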

Member

I think it would be less magic and more readable to simply split the counting loop into a separate function.

That's an excellent suggestion! 👍 It doesn't take away the magic one would need to understand that is the raison d'être for split. Someone in the future could say, “This is stupid. Let me inline this for simplicity's sake!”

analysis would suggest that it would be a bug in either the C# compiler or the JIT to turn dic = null into a NOP.

Yeah it would be, especially if it's been lifted into a field but it depends on how clever the compiler can get about tying the local variable lifetimes to various stages of the state machine it generates. Technically, the dic would only need to exist for the first state of the machine and therefore could be local to the case of the switch generated for the MoveNext machine. Only key, counts and i would need to extend over the rest of the states. However, I neither checked out the generated code (by the C# or JIT compiler) nor did I want to depend on it (today it's a switch/case, tomorrow it could be done in other ways). I guess I was trying to err on the safe side but scoping to another function altogether is cleaner.

@leandromoh
Collaborator Author

@atifaziz you've done a great job with the optimization! I was going to start on it today but you were faster, lol.
I have only one doubt: why did you use

dic.Comparer.GetHashCode(prevKey) == dic.Comparer.GetHashCode(key) 
&& dic.Comparer.Equals(prevKey, key)

instead of just dic.Comparer.Equals(prevKey, key) ?

@atifaziz
Member

atifaziz commented Nov 5, 2016

@leandromoh It's not worth calling Equals for performance reasons if the hash code is already different because that's what a dictionary would do internally too. Comparing hash codes can be faster in many cases. For example, Equals for a string will check for every character but a hash code comparison is just an integer equality check and therefore much faster (same goes for structural value types that are compounds of primitives). If hash codes match then a full equality comparison of the actual content is needed because the hash is just a digest & can be the same for several values of a type.

Since you're happy with the optimisations, I can merge this PR if you just fix the type parameter reference doc issue.

Member

@atifaziz left a comment

The method reference still needs to be added to the project.json and README.md files.

Can we address this now as the last item? Should be done then. Thanks!

@leandromoh
Collaborator Author

Can we address this now as the last item? Should be done then. Thanks!

Done!

atifaziz added a commit that referenced this pull request Nov 11, 2016
@atifaziz
Member

Merged into master by a02b8f9.

Thanks for adding this and working through all the reviews.


Note that I squashed your commits in the final merge except the optimizations because I felt it would be good to keep a change history of those in case anything needs to be reverted or reviewed if ever a bug is reported.

@atifaziz closed this Nov 11, 2016
@atifaziz
Member

I know I said MoreLINQ is in lock down for 2.0 release but I'm being bad here and letting this one slip… 😇

CountBy is now published in 2.0 beta 6.

@atifaziz
Member

Documentation published.

@leandromoh
Collaborator Author

leandromoh commented Nov 12, 2016

@atifaziz that's good news! When do you plan to publish the 2.0 release with the other methods?

I opened the link to see the CountBy documentation and noticed that my name is not in the footer with the other contributors. When do you plan to add it?

@atifaziz
Member

When do you plan to publish the release 2.0 with the other methods?

Soon-ish.

my name is not in the footer with the other contributors.

Sorry, missed that one but it's fixed now by 58d6c62. BTW, that's an aggregate of copyright notices, not a contributors list.
