
Add commonly required Enumerable methods (DistinctBy, ExceptBy, AsChunked, ...) #14753

Closed
GSPP opened this issue Jun 24, 2015 · 42 comments

Labels: api-needs-work (API needs work before it is approved; it is NOT ready for implementation), area-System.Linq
Milestone: 2.0.0

Comments

@GSPP

GSPP commented Jun 24, 2015

We have Distinct, Except, Intersect, Union, etc. Often, though, it is necessary to base the set operation on some key, in the way OrderBy already allows.

I request that the following methods be added:

  • DistinctBy (pick the first element deterministically for each key)
  • ExceptBy (return all items of the first sequence that do not have a matching key in the second)
  • IntersectBy (return the first item in a.Concat(b) for each unique key where the key appears in both collections)
  • UnionBy (return the first item in a.Concat(b) for each unique key)
  • MaxBy/MinBy

The set operations need to deal with duplicate keys in a sensible way. I have added to that list my idea for how to do that in a sane and well-defined way.
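
To make the requested semantics concrete, here is a minimal DistinctBy sketch (illustrative only, not a proposed implementation; the other ...By methods follow the same "first element per key wins" idea):

public static IEnumerable<TSource> DistinctBy<TSource, TKey>(
    this IEnumerable<TSource> source, Func<TSource, TKey> keySelector)
{
    // Yields the first element seen for each key, in source order,
    // which is the deterministic behaviour described above.
    var seen = new HashSet<TKey>();
    foreach (var item in source)
    {
        if (seen.Add(keySelector(item)))
            yield return item;
    }
}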


Another commonly required helper method is one that splits a sequence into chunks of a fixed size:

static IEnumerable<IEnumerable<T>> AsChunked<T>(this IEnumerable<T> sequence, int chunkSize)

This is useful for many things, for example:

  • Split a list into pages
  • Make AsParallel operate on coarser items by feeding it chunks
  • If you want to save a huge stream of items to files you can use this to split the sequence into one chunk per file
  • Request rows from the database in chunks of 1000 IDs

This requirement comes up on Stack Overflow a lot and there is no built-in way to do it.

The chunks should be drawn lazily. In fact the items of each chunk should be lazy as well. This allows for very big chunks such as 1 billion items per chunk.

Maybe we need two variants:

  1. IEnumerable<IEnumerable<T>> AsChunkedLazy
  2. IEnumerable<T[]> AsChunkedEager

The second one would be easier to handle for consumers. The first one requires more disciplined use because each chunk must be enumerated at most once and the chunks must be enumerated in order. This is required for an efficient streaming operation.
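
For reference, the eager variant is only a few lines; this is a sketch under the assumption that buffering a whole chunk is acceptable, not a proposed implementation:

static IEnumerable<T[]> AsChunkedEager<T>(this IEnumerable<T> sequence, int chunkSize)
{
    // Buffers chunkSize items at a time and yields each buffer as an array;
    // only the sequence of chunks itself is lazy.
    var buffer = new List<T>();
    foreach (var item in sequence)
    {
        buffer.Add(item);
        if (buffer.Count == chunkSize)
        {
            yield return buffer.ToArray();
            buffer.Clear();
        }
    }
    if (buffer.Count > 0)
        yield return buffer.ToArray();
}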

@omariom
Contributor

omariom commented Jun 24, 2015

👍 I have had to implement many of the ...By methods myself.
My only suggestion is to add them as overloads of the existing methods.
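
For example, the overload shapes might look like this (hypothetical signatures for illustration, not an approved API):

public static IEnumerable<TSource> Distinct<TSource, TKey>(
    this IEnumerable<TSource> source, Func<TSource, TKey> keySelector);

public static IEnumerable<TSource> Except<TSource, TKey>(
    this IEnumerable<TSource> first, IEnumerable<TSource> second, Func<TSource, TKey> keySelector);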

@eatdrinksleepcode

These seem like two distinct proposals that should be two distinct issues.

I agree with both proposals.


@Clockwork-Muse
Contributor

Distinct, Except, Intersect, and Union all have an overload that takes an IEqualityComparer<T> (your description of Intersect is also worryingly off...). Max and Min also have selector overloads, like OrderBy. How is that part of your proposal different?
By definition, a "set" operation deals with duplicate keys by removing them. Vanilla LINQ (AFAICT) does this by choosing the first "unique" element (Distinct is documented as non-deterministic, but the current implementation behaves otherwise; the others either don't mention it or explicitly state that the first one will be returned). PLINQ currently behaves non-deterministically even for ordered sources - I have an issue open for this, but don't know if it's officially considered a bug.
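
For illustration, this is the first-element-wins behaviour the current LINQ-to-Objects implementation exhibits (observed behaviour only; the documentation does not guarantee it):

var words = new[] { "One", "one", "ONE", "two" };
var distinct = words.Distinct(StringComparer.OrdinalIgnoreCase); // "One", "two" with the current implementation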

If you're chunking, you need to/should implement the Partitioner<TSource> abstract class, which would manage some of the overhead for you. PLINQ normally assigns enumerables one internally, although you can create and feed it one of your own. The built-in enumerable ones don't provide pre-set chunk sizes, most likely because the runtime would have to count them (PLINQ tends to load-balance things taken, so...).
You could probably use GroupBy to get (small) chunked groupings - this would be pretty trivial in regular sequential LINQ (and more difficult in PLINQ, because you don't have control over thread interleaving). Or call ToLookup or something.

@svick
Contributor

svick commented Jun 26, 2015

@Clockwork-Muse

I believe the set operations are for convenience, but it's a significant convenience. DistinctBy(x => x.Id) is much simpler than writing a class that implements IEqualityComparer<T> (though you can use a library to help you with that).

MinBy and MaxBy are different from Min and Max: the latter return the selected value, while the former return the whole item that contains it. So e.g. people.Min(p => p.Id) returns the id, while people.MinBy(p => p.Id) returns the whole person.

Partitioner is about splitting a collection into chunks for parallel processing. But I believe just splitting a collection into chunks of fixed size for other purposes is common enough and Partitioner doesn't really help you with that. Actually, it doesn't even fit: Partitioner.GetPartitions() takes partitionCount, while for chunking you want to set the size of a chunk, not their number.

And you can use GroupBy() for this purpose, but it requires three method calls (Select to get the index, GroupBy, Select again to get rid of the index) and I think that's too complicated.
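
For reference, the GroupBy version looks roughly like this (source and chunkSize are placeholders):

var chunks = source
    .Select((item, index) => new { item, index })
    .GroupBy(x => x.index / chunkSize)
    .Select(g => g.Select(x => x.item));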

@Clockwork-Muse
Contributor

@svick - Okay. Hopefully they've implemented a reasonable hash code for their IEquatable (assuming it's a custom type). Might be simplest to wrap it in a delegating comparer to feed the IDictionary in the current implementation.

@omariom - in such a case the set operations could be overloaded, but Max/Min can't be due to conflicts; what would the minimum of a type keyed by Id look like vs getting the minimum Id?

You're right about the use of "Chunk" vs "Partition", although we'd almost certainly want a custom partitioner for such a type too....
(For GroupBy I was thinking of something a little more hacky than that, but that's probably what you'd need for PLINQ)

@omariom
Contributor

omariom commented Jun 26, 2015

in such a case the set operations could be overloaded, but Max/Min can't be due to conflicts

I missed that. The "By" suffix is needed.

@GSPP
Author

GSPP commented Jun 26, 2015

@Clockwork-Muse IntersectBy should be fixed now.

Distinct claims to be non-deterministic in documentation

Yes, it returns the first item. For compat reasons this can never be changed now. It should simply be documented. I hereby request that as well.

GroupBy instead of chunking is super tedious, slow, non-order-preserving (at least without additional work) and non-streaming.

@Clockwork-Muse
Contributor

I've done (at least some) tests and stubbed out PLINQ's side. What have I missed?

From a behavior standpoint, the set operations can actually be implemented via a delegated equality comparer that you could then use like return Distinct(source, new DelegatedEqualityComparer(predicate)). I used it to verify the tests for the new feature, but don't know all the performance ramifications of this.
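
Such a delegating comparer is only a few lines; a sketch follows (the class name and shape are illustrative, not the actual test helper):

sealed class DelegatedEqualityComparer<TSource, TKey> : IEqualityComparer<TSource>
{
    private readonly Func<TSource, TKey> _keySelector;

    public DelegatedEqualityComparer(Func<TSource, TKey> keySelector)
    {
        _keySelector = keySelector;
    }

    // Two elements are equal when their selected keys are equal.
    public bool Equals(TSource x, TSource y) =>
        EqualityComparer<TKey>.Default.Equals(_keySelector(x), _keySelector(y));

    public int GetHashCode(TSource obj) =>
        EqualityComparer<TKey>.Default.GetHashCode(_keySelector(obj));
}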

MinBy/MaxBy ended up acting more like OrderBy et al - that is, null is a valid and (depending on the key type) the lowest value. This is in direct contrast to Max/Min currently, which only return null if the collection is empty.

I've stubbed out only one, completely lazy, version of Chunk, because I didn't know what the final direction on that was. Note that, when unordered, the behavior will likely be quite strange; at the moment the tests neither guarantee the number of chunks, nor the minimum number of elements in each chunk. The ordered version is expected to be able to guarantee both, but will be more expensive.... I can tighten that down, if necessary.

I'll continue to update the tests as I get related PRs in (I have some other, ongoing work). If I run out of other stuff to do, I might try to implement this stuff.

@GSPP
Author

GSPP commented Jul 9, 2015

I would object to the name Chunk. If someone asked you "What does a method named Chunk do?" you couldn't answer. Maybe you'd guess "it returns a chunk...?". The name is not descriptive enough. Here are some proposals in preference order.

  1. SplitToChunks
  2. AsChunked

I do not know of another name that I would find acceptable.

@Clockwork-Muse
Contributor

@GSPP - I don't care too much about the names, so it can/will certainly be changed. I'd like to wait until I have a more definite direction....?

@JonHanna
Contributor

JonHanna commented Jul 9, 2015

If these are to be added (and I'm not sure whether they should be) can we make it a requirement that they are supported for both IQueryable<T> and IEnumerable<T>? I've seen a few implementations of some of these that only support the latter, though it shouldn't be hard to have:

public static TSource MaxBy<TSource, TKey>(this IQueryable<TSource> source, Expression<Func<TSource, TKey>> keySelector)
{
  if (source == null) throw Error.ArgumentNull("source");
  if (keySelector == null) throw Error.ArgumentNull("keySelector");
  var sorted = source.OrderByDescending(keySelector);
  return default(TSource) == null ? sorted.FirstOrDefault() : sorted.First();
}

And so on. Therefore IQueryable<T> would retain parity with IEnumerable<T> without having to fall back to in-memory operations and without forcing providers to upgrade to match by supporting MaxBy and so on directly. (If someone could think of a way of allowing direct support in providers, without forcing it, and without depending upon catching the exception that return source.Provider.Execute<TSource>(Expression.Call(null, ((MethodInfo)MethodBase.GetCurrentMethod()).MakeGenericMethod(new Type[]{typeof(TSource), typeof(TKey)}), new Expression[]{source.Expression, Expression.Quote(keySelector)})); would throw if a provider didn't cater for MaxBy, so much the better.)

An alternative form of the above would have:

public static TSource MaxBy<TSource, TKey>(this IQueryable<TSource> source, Expression<Func<TSource, TKey>> keySelector)
{
  if (source == null) throw Error.ArgumentNull("source");
  if (source.Provider is EnumerableQuery<TSource>)
    return source.AsEnumerable().MaxBy(keySelector.Compile());
  if (keySelector == null) throw Error.ArgumentNull("keySelector");
  var sorted = source.OrderByDescending(keySelector);
  return default(TSource) == null ? sorted.FirstOrDefault() : sorted.First();
}

This avoids taking the more expensive approach for linq-to-objects of sorting in memory before taking the first if an in-memory collection is accessed as a queryable, at the cost of adding an extra check.

@GSPP
Author

GSPP commented Jul 9, 2015

I think all IQueryable expressions should go through the same process. No fallback methods. Represent queries with full fidelity and let the provider handle it. EF can easily be extended to implement the solution that you mentioned.

Query providers should throw exceptions for anything they do not understand.

@JonHanna
Contributor

In that case, I think the additions would be more trouble than they're worth.

Taking MaxBy as one instance.

Approach people take in the current situation:

Either use .OrderByDescending(keySelector).First() (barely a line of code) or, if they want the performance gain a tighter approach offers with IEnumerable<T>, write it themselves or use a third-party implementation.

Approach people will take if MaxBy is engineered to work with current query providers:

Use .MaxBy(keySelector).

Approach people will take if MaxBy is provided in the same way as other queryable methods:

Use MaxBy. Get error from provider. Go to stackoverflow.com and ask question about the error. Have question closed as duplicate. Read answers from duplicate question. Hopefully use .OrderByDescending(keySelector).First() but possibly use .AsEnumerable().MaxBy(keySelector). Or just use .AsEnumerable().MaxBy(keySelector) because .AsEnumerable() is the magic fixer. (Or gods help them, .ToList().MaxBy(keySelector) because educational examples of Linq overuse .ToList() and that's what people are learning from.)

(Of course, we could provide it just on enumerables, but then people would likely use it and effectively be doing AsEnumerable().MaxBy(keySelector).)

At the same time, I doubt there would be many implementations that wouldn't result in the same behaviour as if .OrderByDescending(keySelector).First() had been used, and if there were providers who could offer a much improved implementation could they not just do so for that pattern in the expression?

These are meant to be convenience methods. A method that throws on some providers and not others isn't very convenient. As it is, the lack of support for Last() and some other methods with some providers is a nuisance.

@JohanLarsson

What about an 'Extended LINQ' nuget package out-of-band?
I like this suggestion and think MinBy, MaxBy & DistinctBy are very useful and should not be rejected as code bloat.
Also, Rx & F# have them.

JonHanna referenced this issue in JonHanna/corefx Jul 12, 2015
Fixes #2146 if it is indeed considered desirable functionality.

Add MaxBy, MinBy, DistinctBy, ExceptBy, IntersectBy, UnionBy, and Chunk
methods to both Queryable (extension methods applied to IQueryable<T>)
and Enumerable (extension methods applied to IEnumerable<T>).

The methods are fully consistent with each other, giving the same result
as observed from the outside, but by different means. The Queryable
methods are all delivered in terms of existing methods, and so will be
supported by existing providers. In particular, care is taken to
ensure that new methods that don't use IComparer<T> don't call into
Queryable methods that do, as those are relatively less well supported.

The Queryable methods test for the query provider being EnumerableQuery and
in that case pass to the enumerable version, being better optimised for that
provider. Note that this introduces a project dependency of
System.Linq.Queryable.csproj and System.Linq.Queryable.Tests.csproj on
System.Linq.csproj
@JonHanna
Contributor

Well, there's an implementation. With the set of operations that this gives, the lack of a .Max<T>(IComparer<T>) is beginning to seem like a bit of an absence, since there's now a way to do MaxBy with a function or expression, but not according to the standard .NET way of defining an ordering, so I think I shall add that and the corresponding Min too.

@JonHanna
Contributor

Maybe we need two variants:

1. IEnumerable<IEnumerable<T>> AsChunkedLazy
2. IEnumerable<T[]> AsChunkedEager

The second one would be easier to handle for consumers. The first one requires more disciplined use because each chunk must be enumerated at most once and the chunks must be enumerated in order. This is required for an efficient streaming operation.

For the version called on IEnumerable<T> I used lazily-filled caching chunks: if you enumerate through them as they come, you get each new element as it becomes available from the source (but it is also stored), and if you move on to the next chunk, the previous chunk is filled first. Memory use in cases where the caching isn't necessary is reduced by the very nature of the chunking, so it's lazy when used lazily and eager when used eagerly. There's also an optimisation for when it's called on an IList<T>, where each chunk reads from the relevant segment of the source list as needed. The version called on IQueryable<T> will depend on the GroupBy implementation of the provider in question.
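
A rough sketch of that behaviour (not the actual branch; the names and structure here are illustrative only):

static IEnumerable<IEnumerable<T>> AsChunked<T>(this IEnumerable<T> source, int chunkSize)
{
    using (var e = source.GetEnumerator())
    {
        while (e.MoveNext())
        {
            var chunk = new CachingChunk<T>(e, chunkSize);
            yield return chunk;
            chunk.Fill(); // position the source at the start of the next chunk
        }
    }
}

sealed class CachingChunk<T> : IEnumerable<T>
{
    private readonly IEnumerator<T> _source;
    private readonly int _size;
    private readonly List<T> _cache;

    public CachingChunk(IEnumerator<T> source, int size)
    {
        _source = source;
        _size = size;
        _cache = new List<T> { source.Current }; // the first element has already been read
    }

    // Reads any remaining elements of this chunk from the source into the cache.
    public void Fill()
    {
        while (_cache.Count < _size && _source.MoveNext())
            _cache.Add(_source.Current);
    }

    public IEnumerator<T> GetEnumerator()
    {
        for (int i = 0; i < _size; i++)
        {
            if (i == _cache.Count)
            {
                if (!_source.MoveNext())
                    yield break;
                _cache.Add(_source.Current); // lazy: pull from the source only on demand
            }
            yield return _cache[i];
        }
    }

    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}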

@GSPP
Author

GSPP commented Jul 12, 2015

Good ideas.

Just one remark about:

Memory use in cases where the caching isn't necessary is reduced by the very nature of the chunking.

Extremely large chunks such as 1 billion elements should be fully supported in a streaming fashion. A possible use case would be to stream data into multiple big files each of which is supposed to stay below a certain size (such as 2GB).

Automatic materialization of chunks when used in a non-streaming way is a booby trap, but it might be worth it to unify the two proposed APIs into one. Nice idea. I'll probably add that to my own helper methods library and scrap the existing two methods :)

AsChunkedEager has an important use case: Improving efficiency of PLINQ.

var manySmallWorkItems = ...;
manySmallWorkItems
 .AsChunked(1000) //transition to chunks
 .AsParallel().Select(ProcessChunk) //low-overhead PLINQ
 .AsEnumerable().SelectMany(x => x); //flatten chunks

This cuts PLINQ overhead by 1000x and makes small work items more feasible.

I'm concerned that this would stop working if chunk materialization is lazy because concurrent materialization is not thread-safe. Maybe we need an enum AsChunkedOptions that controls laziness and thread-safety? This also could be used to disable support for IList should that ever be necessary.

@JonHanna
Contributor

The question isn't so much one of lazy vs eager (my approach is lazy after all, but its caching means that something that makes eager assumptions from the outside won't notice, and the extra work done isn't a lot) as caching vs non-caching (with caching being inherent in an eager approach, and optional in a lazy one).

I think the most important thing is that things "just work" by default, without one having to worry about whether one can repeat enumerating a chunk, etc. and with things like someSource.Chunk(10).Select(ch => ch.Sum()) working with all queryable sources appropriately. (Some cases could potentially end up with N+1 calls against a database, but that's an existing issue with grouping constructs in Linq with existing solutions, so I don't think it's a reason to say no to anything here).

I would say that if a lazy, non-caching approach was desired, then it should still check that it is starting at the correct point: Enumerable.Range(0, 1000).Chunk(10).Skip(1).Select(ch => ch.First()) should always start at 10, because the skipped chunk should represent the values 0 to 9 whatever the implementation. It would not, though, be hard to have a version of FinishUp that just did the required number of MoveNext() calls to make this happen without touching Current, never mind storing the result anywhere.

Such a pure streamed approach would have to work on IEnumerable<T> only, not IQueryable<T>. It perhaps belongs more in System.Collections.Generic than System.Linq, but since my branch is more (at the moment anyway) to have something more concrete for the discussion than anything else, I'll add in a ToChunkedStream().

@GSPP
Author

GSPP commented Jul 12, 2015

I just noticed that ToChunkedStream is a primitive that can be used to generate any "inner collection":

static IEnumerable<TCollection> AsChunked<TItem, TCollection>(
    this IEnumerable<TItem> items,
    Func<IEnumerable<TItem>, TCollection> collectionFactory)
    where TCollection : IEnumerable<TItem>
{
    return items.ToChunkedStream().Select(collectionFactory);
}

That way the caller can make himself a completely lazy stream (chunk => chunk), an eager stream (chunk => chunk.ToList()) or even arrays if he wants to (chunk => chunk.ToArray()).

Just an idea.

@JonHanna
Contributor

Not outside of linq-to-objects they can't. If we want this to be a Linq method then the IEnumerable case is just one possible provider. You can't convert chunk => chunk.ToList() into SQL.

@JonHanna
Contributor

Anyway. See JonHanna/corefx@fed4174 for something on the fully-streamed approach.

@JonHanna
Contributor

Back to the more general question. Now there's something to look at, does it belong in .NET core?

@JohanLarsson

In my experience MinBy, MaxBy & DistinctBy are more useful than GroupBy. I don't see any big downside; maintaining useful code is a very small issue. Also, I don't think expanding the API with these methods will make LINQ that much more overwhelming for beginners.
Perhaps an alternative would be to extract all of LINQ/Enumerable, including this, into a separate nuget package like Rx.
The upside of that is that releases could be more frequent; the downside is that updates of the framework will not fix existing code.

@GSPP
Author

GSPP commented Jul 13, 2015

I am very strongly in favor of having this built-in and available everywhere. On Stack Overflow we have a lot of questions that would make use of these features. We always have to answer with complicated workarounds or repeat the same code 100 times. There is demand for these things. Stack Overflow is a great place to find evidence for framework deficiencies. Maybe the team should answer 10 questions per day.

@JonHanna
Contributor

Looking at some of the SO questions you mention, I see that there are a few nugets already. (I even submitted a patch to one, morelinq, but had completely forgotten about it in the meantime because I never used it; I was hoping to add stuff of my own that I did use, so that I could use it instead.) I think this argues against a nuget: we'd just be competing with existing nugets, and while I might think them less than ideal in some regards (e.g. taking the enumerable-only approach I argued against above), they're there and they're working, and if we want a nuget we should probably contribute to one of those instead.

The existence of the nuget offers arguments both for and against inclusion here:

Against:

  1. There's already the nuget, use it.
  2. Additions here would cause clashes.

For:

  1. While the nuget forces MaxBy into memory instead of on the database, it's very much something that can be improved on. It also seems to have a bug where Enumerable.Empty<string>().MaxBy(s => s) throws while Enumerable.Empty<string>().Max() does not, and surely anything.MaxBy(s => s) should have the same behaviour as anything.Max().
  2. People authoring and using these nugets often seem to have the attitude that these are "missing" features. The authors and users would presumably approve.
  3. The clashes are easily dealt with by developers. In many cases just by removing the use of the nuget. Indeed, doing so could leave them with working uses of these updates OOTB.

On balance I'm becoming less sceptical now. I was pretty much 50/50 on whether they should be added to Enumerable and Queryable directly, but am beginning to think they'd be a good addition.

@Clockwork-Muse
Contributor

@hackcraft - You couldn't do Chunk performantly for PLINQ in such a case (there are ways to cheat for the other methods) - you can't instantiate ParallelQuery.

@JonHanna
Contributor

@Clockwork-Muse can you clarify? Is this referring to the use @GSPP says chunked would have for PLinq or something else?

@Clockwork-Muse
Contributor

@hackcraft - the problem is that you couldn't have a PLINQ Chunk extension method unless corefx provides it - ParallelQuery has one constructor that takes an internal-only type, so third parties can't provide custom implementations. You can cheat by providing wrappers for existing ones, so you could do most stuff that way, but Chunk would have to go the GroupBy route if you tried (can't be lazy).
You could feed one to PLINQ, or a child enumerable, but that's already possible. If we don't provide one ourselves, PLINQ and vanilla LINQ end up with different signatures.

@JohanLarsson

Perhaps StdDev makes sense; it comes up every now and then.

@JonHanna
Contributor

It's worth noting that while a fully optimal MaxBy(keySelector) would still be faster, .OrderByDescending(keySelector).First() now has the same O(n) performance as a separate MaxBy() rather than the O(n log n) it had previously. As such, adding a MaxBy() doesn't add as much as it would have, though it does still add something.

@JohanLarsson

@JonHanna MaxBy() reads better also.

@GSPP
Author

GSPP commented Jan 20, 2016

@JonHanna interesting point and valid. Do you have a rough estimate for how much slower that would be?

In any case MaxBy is far quicker to type and documents what you mean.

@GSPP
Author

GSPP commented Jan 20, 2016

Another thought: this design of MaxBy does not support multiple criteria, right? Sorting does. Maybe we should have MaxBy(Func, Func, ...) overloads and a MaxBy(Func[]) overload. Or adopt a design similar to ThenBy, although that causes allocations. Not sure what the right approach is.

In my private utility library I have chosen the overload approach.
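
The overload approach could look something like this (a hypothetical two-key sketch; later keys only break ties on earlier ones, analogous to OrderByDescending(...).ThenByDescending(...).First()):

public static TSource MaxBy<TSource, TKey1, TKey2>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey1> keySelector1,
    Func<TSource, TKey2> keySelector2)
{
    var comparer1 = Comparer<TKey1>.Default;
    var comparer2 = Comparer<TKey2>.Default;
    using (var e = source.GetEnumerator())
    {
        if (!e.MoveNext())
            throw new InvalidOperationException("Sequence contains no elements");
        var best = e.Current;
        while (e.MoveNext())
        {
            // The second key is only consulted when the first keys compare equal.
            int c = comparer1.Compare(keySelector1(e.Current), keySelector1(best));
            if (c > 0 || (c == 0 && comparer2.Compare(keySelector2(e.Current), keySelector2(best)) > 0))
                best = e.Current;
        }
        return best;
    }
}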

@JonHanna
Contributor

The main thing that would make a well-optimised MaxBy faster than OrderBy(…).First() is the fact that MaxBy doesn't support multiple criteria, so paths related to the possibility of ThenBy(…) won't be there. That would be a small O(n) addition to the time that MaxBy() could get rid of. Make multiple criteria a feature of MaxBy and you lose that part of the optimisation that offering MaxBy allows. There'd also be a small constant overhead in the extra method constructing the IOrderedEnumerable-implementing object that then has First() called on it.

@GSPP
Author

GSPP commented Jan 20, 2016

I imagine that producing the first element of a sort is still quite expensive despite being O(1). In fact the whole sequence must be materialized, which is O(N) space, whereas MaxBy is always O(1) space. Looking at EnumerableSorter seems to confirm this. QuickSelect looks like it should be quite fast, but for a cheap CompareKeys it still does more than a naive MaxBy would do.

MaxBy implementations should be specialized to the number of criteria for small numbers of criteria (1-3?). That potential disadvantage can be solved.

@JonHanna
Contributor

QuickSelect is only used for ElementAt/OrDefault. First/OrDefault and Last/OrDefault use a Θ(n) worst-case path that is O(1) in space. They're essentially MaxBy and MinBy. The only extra overhead is creating the comparer used.

@JonHanna
Contributor

(That extra overhead is proportional to the number of potential tiebreak cases introduced by ThenBy steps, which a multi-key MaxBy would add to MaxBy.)

@VSadov
Member

VSadov commented Nov 19, 2016

It is not the goal of Linq to support all possible scenarios. Any additional API introduces costs of ongoing support, and additional size.

As I see it, there are two parts to this proposal:

  1. XxxxBy methods. It seems like most of that functionality is already expressible via custom comparers.
    It is unclear whether the benefits of an additional way to do the same thing would be attractive enough to expand the library.
  2. An API to split a sequence into subsequences of some predefined size.
    Again - is that a common scenario to put in System.Linq?
    Why the same size? Should it fail if the minimum number of items is not available, or just return a shorter subsequence? The "lazy" part is not very clear either. Can it work at all if the source is not "seekable" like an array or a list?

"Chunking" seems to be a special case scenario, that would not come up very often and when it would come up, a simple solution may not be sufficient.

VSadov closed this as completed Nov 19, 2016
@svick
Contributor

svick commented Nov 19, 2016

@VSadov

I think that at least MinBy/MaxBy and chunking are fairly common.


You didn't address MinBy and MaxBy at all. What is the reason for rejecting those? Is it because OrderBy(…).First()/.Last() is now efficient? If that is the reason, my concern is that OrderBy(…).First() looks slow, while MinBy() looks like it's optimized exactly for this case. That increases the chances that people will use it instead of writing their own code, and it also makes the performance of the code easier to understand.


I don't think chunking is a special case. I think APIs (especially web APIs) that accept a limited number of inputs are fairly common [1] and chunking is the natural solution to that. I'm not sure what the right solution is regarding laziness, but even the simple solution of returning a lazy enumerable of eager enumerables should be enough for the common uses.

[1] Specifically, I encountered this when making a SQL Server query that exceeded some limit on the query size, and also when using the MediaWiki API to request information about multiple articles based on their ID, where the limit is 50 articles per query.

@jnm2
Contributor

jnm2 commented Nov 19, 2016

@VSadov I completely agree with @svick; I came here to argue the same. IComparer does not cut it.

@GSPP
Author

GSPP commented Nov 27, 2016

Chunking is so common. All business apps that I have worked on needed this. Processing data in batches is often required. Many Stack Overflow questions have a simple answer provided a chunking primitive is available.

I often see ad hoc chunking code and it's nasty. Most of the time, batching should look like:

foreach (var batch in items.AsChunked(1000)) ...

Or:

items
.AsChunked(1000)
.Select(batch => ...)

The way to query many rows by some key from an RDBMS is:

var keys = ...; //huge

keys
.AsChunked(1000)
.SelectMany(keyBatch => Load(keyBatch))

Trivial.

@Thaina

Thaina commented Feb 6, 2019

I think this really should be supported in the core library; it's really common.

Actually, the LINQ Min and Max themselves are the less common case, and we can replicate the behaviour of Min and Max with MinBy and MaxBy:

var minValue = items.Min((item) => item.value);
// same
var minValue = items.MinBy((item) => item.value).value;

It's a little more verbose for this specific task, but MinBy is more versatile and handles many more situations.

AsChunked is also too common to pull from NuGet. I just want to use this one function, and it has to come with the whole MoreLinq library. I think it would be used even more than SequenceEqual.

Please reopen this

msftgits transferred this issue from dotnet/corefx Jan 31, 2020
msftgits added this to the 2.0.0 milestone Jan 31, 2020
ghost locked as resolved and limited conversation to collaborators Jan 5, 2021