Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implemented CountBy (Issue 104) #207

Closed
wants to merge 15 commits into from
109 changes: 109 additions & 0 deletions MoreLinq.Test/CountByTest.cs
Original file line number Diff line number Diff line change
@@ -0,0 +1,109 @@
#region License and Terms
// MoreLINQ - Extensions to LINQ to Objects
// Copyright (c) 2016 Leandro F. Vieira (leandromoh). All rights reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#endregion

using NUnit.Framework;
using System;
using System.Collections.Generic;
using System.Linq;
using LinqEnumerable = System.Linq.Enumerable;

namespace MoreLinq.Test
{
[TestFixture]
public class CountByTest
{
[Test]
[ExpectedException(typeof(ArgumentNullException))]
public void CountByWithNullSequence()
{
IEnumerable<int> sequence = null;
sequence.CountBy(x => x % 2 == 0);
}

[Test]
[ExpectedException(typeof(ArgumentNullException))]
public void CountByWithNullProjection()
{
Func<int, bool> projection = null;
Enumerable.Range(1, 10).CountBy(projection);
}

[Test]
public void CountBySimpleTest()
{
var result = new[] { 1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 1, 2 }.CountBy(c => c);

var expectations = new List<KeyValuePair<int, int>>()
{
{ 1, 4 },
{ 2, 3 },
{ 3, 2 },
{ 4, 1 },
{ 5, 1 },
{ 6, 1 },
};

result.AssertSequenceEqual(expectations);
}

[Test]
public void CountByEvenOddTest()
{
var result = Enumerable.Range(1, 100).CountBy(c => c % 2);

var expectations = new List<KeyValuePair<int, int>>()
{
{ 1, 50 },
{ 0, 50 },
};

result.AssertSequenceEqual(expectations);
}

[Test]
public void CountByWithEqualityComparer()
{
var result = new[] { "a", "B", "c", "A", "b", "A" }.CountBy(c => c, StringComparer.OrdinalIgnoreCase);

var expectations = new List<KeyValuePair<string, int>>()
{
{ "a", 3 },
{ "B", 2 },
{ "c", 1 },
};

result.AssertSequenceEqual(expectations);
}

[Test]
public void CountByHasKeysOrderedLikeGroupBy()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I change the last line of CountByImpl:

             return keys.Select(k => new KeyValuePair<TKey, int>(k, dic[k]));

to simply return dic instead:

             return dic;

then this test (CountByHasKeysOrderedLikeGroupBy) doesn't fail.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I change the last line of CountByImpl to simply return dic instead then this test (CountByHasKeysOrderedLikeGroupBy) doesn't fail.

In pratice, yes, it doesnt fail (because, currently, dictionary brings the results in order that keys were inserted), but as you once said yourself

dictionaries do not guarantee enumeration in any particular or logical order

So order in enumeration of dictionary can change anyday with a new implementation. To ensure CountBy always return in the order that keys were found the current last line is necessary.

Copy link
Collaborator Author

@leandromoh leandromoh Nov 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a test missing to check that CountBy is lazy.

What CountByIsLazy test should exactly do? If I understood well, the implementation isn't lazy since I must iterate over all the IEnumerable before return any KeyValuePair.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CountBy isn't lazy but it can and should be (like GroupBy). The iterator doesn't need to run until someone iterates the results! You can make it lazy very easily by using yield to return the results. So instead of the following as the last line of CountByImpl:

            return keys.Select(k => new KeyValuePair<TKey, int>(k, dic[k]));

Do instead:

            foreach (var key in keys)
                yield return new KeyValuePair<TKey, int>(key, dic[key]);

Now the compiler will re-write the code to run during iteration, rendering it lazy!

Copy link
Collaborator Author

@leandromoh leandromoh Nov 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But It isn't the same thing since Select is lazy (do exactly the foreach yield) ?

Copy link
Collaborator Author

@leandromoh leandromoh Nov 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how could I write a test that check if the return of CountBy is lazy (for CountByIsLazy) ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how could I write a test that check if the return of CountBy is lazy (for CountByIsLazy) ?

See some of the existing tests that test for laziness for inspiration.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But It isn't the same thing since Select is lazy (do exactly the foreach yield) ?

Yes, Select is lazy except you compute the result imperatively before you send them back via Select. That computation is happening when CountBy is called when it should happen when the enumerable returned by your method is enumerated by the caller (and which could be some time after).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for the explanation. CountByIsLazy was implemented.

{
var randomSequence = MoreEnumerable.Random(0, 100).Take(100).ToArray();

var countByKeys = randomSequence.CountBy(x => x).Select(x => x.Key);
var groupByKeys = randomSequence.GroupBy(x => x).Select(x => x.Key);

countByKeys.AssertSequenceEqual(groupByKeys);
}

[Test]
public void CountByIsLazy()
{
new BreakingSequence<string>().CountBy(x => x.Length);
}
}
}
4 changes: 4 additions & 0 deletions MoreLinq.Test/TestExtensions.cs
Original file line number Diff line number Diff line change
Expand Up @@ -73,5 +73,9 @@ internal static IEnumerable<string> GenerateSplits(this string str, params char[
yield return split;
}

internal static void Add<TKey, TValue>(this IList<KeyValuePair<TKey, TValue>> list, TKey key, TValue value)
{
list.Add(new KeyValuePair<TKey, TValue>(key, value));
}
}
}
115 changes: 115 additions & 0 deletions MoreLinq/CountBy.cs
Original file line number Diff line number Diff line change
@@ -0,0 +1,115 @@
#region License and Terms
// MoreLINQ - Extensions to LINQ to Objects
// Copyright (c) 2016 Leandro F. Vieira (leandromoh). All rights reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
#endregion

namespace MoreLinq
{
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;

static partial class MoreEnumerable
{
/// <summary>
/// Applies a key-generating function to each element of a sequence and returns a sequence of
/// unique keys and their number of occurrences in the original sequence.
/// </summary>
/// <typeparam name="TSource">Type of the elements of the source sequence.</typeparam>
/// <typeparam name="TKey">Type of the projected element.</typeparam>
/// <param name="source">Source sequence.</param>
/// <param name="keySelector">Function that transforms each item of source sequence into a key to be compared against the others.</param>
/// <returns>A sequence of unique keys and their number of occurrences in the original sequence.</returns>
public static IEnumerable<KeyValuePair<TKey, int>> CountBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
{
return source.CountBy(keySelector, null);
}

/// <summary>
/// Applies a key-generating function to each element of a sequence and returns a sequence of
/// unique keys and their number of occurrences in the original sequence.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not mention how this overload is any different that the other & therefore could look odd on a page of summaries. It should mention the additional argument. Considering adding:

An additional argument specifies a comparer to use for testing equivalence of keys.

/// An additional argument specifies a comparer to use for testing equivalence of keys.
/// </summary>
/// <typeparam name="TSource">Type of the elements of the source sequence.</typeparam>
/// <typeparam name="TKey">Type of the projected element.</typeparam>
/// <param name="source">Source sequence.</param>
/// <param name="keySelector">Function that transforms each item of source sequence into a key to be compared against the others.</param>
/// <param name="comparer">The equality comparer to use to determine whether or not keys are equal.
/// If null, the default equality comparer for <c>TSource</c> is used.</param>
atifaziz marked this conversation as resolved.
Show resolved Hide resolved
/// <returns>A sequence of unique keys and their number of occurrences in the original sequence.</returns>
public static IEnumerable<KeyValuePair<TKey, int>> CountBy<TSource, TKey>(this IEnumerable<TSource> source, Func<TSource, TKey> keySelector, IEqualityComparer<TKey> comparer)
{
if (source == null) throw new ArgumentNullException("source");
if (keySelector == null) throw new ArgumentNullException("keySelector");

return CountByImpl(source, keySelector, comparer);
}

private static IEnumerable<KeyValuePair<TKey, int>> CountByImpl<TSource, TKey>(IEnumerable<TSource> source, Func<TSource, TKey> keySelector, IEqualityComparer<TKey> comparer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have all the tests in place, it would be good to work in some optimisations. The current implementation does too many lookups:

  1. Key existence check: dic.ContainsKey(key)
  2. Counter increment: dic[key]++
  3. At the end when yielding back the results

Another optimisation could be is that as long as the current key is the same as the last, just increase the counter. This way, adjacent keys don't require containment tests on the dictionary.

{
var dic = new Dictionary<TKey, int>(comparer);
var keys = new List<TKey>();
var counts = new List<int>();
var havePrevKey = false;
var prevKey = default(TKey);
var index = 0;

foreach (var item in source)
{
var key = keySelector(item);

if (// key same as the previous? then re-use the index
(havePrevKey && dic.Comparer.GetHashCode(prevKey) == dic.Comparer.GetHashCode(key)
&& dic.Comparer.Equals(prevKey, key))
// otherwise try & find index of the key
|| dic.TryGetValue(key, out index))
{
counts[index]++;
}
else
{
dic[key] = keys.Count;
keys.Add(key);
counts.Add(1);
}

prevKey = key;
havePrevKey = true;
}

// The dictionary is no longer needed from this point forward so
// lose the reference and make it available as food for the GC.
// This optimization is designed to help cases where a slow running
// loop over the yielded pairs could span GC cycles. However,
// instead of doing simply the following:
//
// dic = null;
//
// the reference is nulled through a method that the JIT compiler
// is told not to try and inline; done so assuming that the above
// method could have been turned into a NOP (in theory).

Null(ref dic); // dic = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be less magic and more readable to simply split the counting loop to a separate function.

For other readers (and myself, I had to google this), setting dic = null is usually not useful, because the CLR knows that you are not using the variable anymore, and thus marks the referenced object as eligible for collection. However, the CLR cannot be sure when a generator or async method is used (I only found a passing reference by Stephen Cleary, is there a more detailed source?), and thus it waits until the method ends. I think this is because the generator method is compiled to a class, with all local parameters as fields on that class. If the instance of this generated class outlives the use of the fields in that class, the CLR cannot know this. However, this same analysis would suggest that it would be a bug in either the C# compiler or the JIT to turn dict = null into a NOP.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was too terse I think: by split the counting loop into a separate function, I mean that that function runs eagerly, not lazily, thus avoiding this problem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be less magic and more readable to simply split the counting loop to a separate function.

That's an excellent suggestion! 👍 It doesn't take away the magic one would need to understand that is the raison d'être for split. Someone in the future could say, “This is stupid. Let me inline this for simplicity's sake!”

analysis would suggest that it would be a bug in either the C# compiler or the JIT to turn dict = null into a NOP.

Yeah it would be, especially if it's been lifted into a field but it depends on how clever the compiler can get about tying the local variable lifetimes to various stages of the state machine it generates. Technically, the dic would only need to exist for the first state of the machine and therefore could be local to the case of the switch generated for the MoveNext machine. Only key, counts and i would need to extend over the rest of the states. However, I neither checked out the generated code (by the C# or JIT compiler) nor did I want to depend on it (today it's a switch/case, tomorrow it could be done in other ways). I guess I was trying to err on the safe side but scoping to another function altogether is cleaner.


for (var i = 0; i < keys.Count; i++)
yield return new KeyValuePair<TKey, int>(keys[i], counts[i]);
}

// ReSharper disable once RedundantAssignment
[MethodImpl(MethodImplOptions.NoInlining)]
static void Null<T>(ref T obj) where T : class { obj = null; }
}
}