Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserve result order when doing DISTINCT #8606

Closed
roji opened this issue May 26, 2017 · 35 comments
Closed

Preserve result order when doing DISTINCT #8606

roji opened this issue May 26, 2017 · 35 comments
Labels
closed-out-of-scope This is not something that will be fixed/implemented and the issue is closed. good first issue This issue should be relatively straightforward to fix. help wanted This issue involves technologies where we are not experts. Expert help would be appreciated. type-enhancement

Comments

@roji
Copy link
Member

roji commented May 26, 2017

A while ago I filed #7512. While the test that triggered that issue was modified to pass (#7526), it seems like the root problem is still there (despite additional work in #7525).

Running OrderBy_coalesce_skip_take_distinct_take on Npgsql in 2.0.0-preview1 fails. The generated SQL is as follows:

SELECT DISTINCT "t".*
        FROM (
            SELECT "p"."ProductID", "p"."Discontinued", "p"."ProductName", "p"."UnitPrice", "p"."UnitsInStock"
            FROM "Products" AS "p"
            ORDER BY COALESCE("p"."UnitPrice", 0.0)
            LIMIT $1 OFFSET $2
        ) AS "t"
        LIMIT $3

As in #7512, the outer query contains no ORDER BY, and unlike SqlServer PostgreSQL does not seem to preserve the ordering imposed in the inner query - so output from the query is non-deterministic in order.

@smitpatel
Copy link
Member

Oops! Looks like the test which was modified returned back in code. Based on the logic, EF applied after Distinct, there are 2 options here

  1. Remove the test as the results are non-deterministic.
  2. Keep the test as it works when inner ordering is preserved (Linq/SqlServer/SQLite) does that. Providers which does not preserve inner ordering have to skip the test.

@anpete - Thoughts?

@roji
Copy link
Member Author

roji commented May 26, 2017

I have no problem skipping the test in Npgsql, but don't you guys have an issue with this varying behavior across providers, given the same LINQ? I mean, it's entirely possible to enforce deterministic results by adding an ORDER BY clause to the external query - this seems like the correct way to solve, even if it's more effort on your end.

(I'm not sure if this additional ORDER BY clause, which is useless on SqlServer, would impact performance in any way, if so you could allow providers to specify whether it's needed or not)

@smitpatel
Copy link
Member

https://github.com/aspnet/EntityFramework/blob/dev/src/EFCore.Specification.Tests/QueryTestBase.cs#L7586
There is already test which has order by after Distinct & before take. Therefore I suggested first option to remove the test.

The test in way verifying that we generate exact same SQL as linq was written (with potential of wrong results). Even Linq would give different results based on how Distinct is implemented. Luckily for default behavior of Linq it preserved ordering. I believe, Linq does not guarantee the order so no EF provider should try to work-around it. If user wrote such query then results will be non-deterministic. Optionally we can throw exception blocking such queries.

@roji
Copy link
Member Author

roji commented May 26, 2017

Ah, I wasn't aware of the lack of a guarantee in LINQ - is there any sort of specification? And what do you mean by the "default behavior of LINQ", is there not simply the LINQ to Objects as the sole implementation?

In general, it was my impression that the goal is to make EF Core behave as closely as possible to LINQ to Objects, which is why a lot of your tests simply compare the answers coming out of both systems. Saying that you leave this behavioral aspect weakens that. Of course, it may not always be possible for providers to mimic LINQ to Objects behavior, but in this case it does seem possible (albeit with effort on the core side).

@maumar
Copy link
Contributor

maumar commented May 26, 2017

Test infra for complex navigations tests adds client side ordering to make results deterministic (at least top level projection). We could add this to other test suites

@smitpatel
Copy link
Member

@maumar - Complex nav did that for deep comparison. But that is testing thing. Given Customer.Orders all orders will be there and the ordering is not guaranteed.

We could have done the same just for the case of Distinct but when you apply skip/take afterwards, the results returned are totally different. Even if you apply ordering while testing, it won't help.

@maumar
Copy link
Contributor

maumar commented May 26, 2017

@smitpatel deep comparison is one thing (for includes) but we also do generic orderby for all queries, unless they explicitly assert the order. Basically, we execute the test query, and the baseline query. And then apply ordering in memory to both. There is also asserter func to specify which column to order by, in some complex objects are returned

@smitpatel
Copy link
Member

No, we don't.
TestHelpers.AssertResults method does contains check in the absence of asserter & assertOrder.
When user writes a query without orderby the results can arrive in any order which depends on the provider (SqlServer returns rows sorted by PK by default but linq/ix-async can return in different order based on how they are implemented.) If user wants the results to be order, there has to be order by. Therefore we do contains check because it is possible for database to return result in different order than linq. It magically works for SqlServer but not for all providers.

@maumar
Copy link
Contributor

maumar commented May 26, 2017

ComplexNavigations uses different logic - we always provide default post-execution ordering - see AssertQuery methods on ComplexNavigationsQueryTestBase

@smitpatel
Copy link
Member

@maumar - Checked that class. We only order for AssertIncludeQuery (We can get away from it if we find the matching element in other set and then compare navigations). Normal AssertQuery still delegates to TestHelpers.AssertQuery.

That being said, if the query does not have ordering specified then we should not verify ordering. It is optional to put ordering for tests but contains is better concept due to the non-deterministic behavior.

@maumar
Copy link
Contributor

maumar commented May 26, 2017

I don't see how contains is better. Sorting in memory and then ordering is definitely faster than contains. Also, contains can give you false positives, e.g. expected: { 1, 2, 3, 4, 4 }, actual: {1, 2, 2, 3, 4 }.

The way complex navs tests are designed - if you want to verify order, you set the verifyOrdered flag to true, and then client ordering is skipped (so the original order of the results is verified). Otherwise, we apply client side ordering, after the query has been executed.

Default ordering is (e => e) which works out of the box if we return collection of scalars. If we return entities or complex types, you need to provide the ordering func in the AssertQuery explicitly (e.g. e => e.Id), otherwise test will throw (trying to order by collection of non-order-comparable objects).

ComplexNavs use AssertResults overload starts on line 390, which doesn't use Contains at all.

@smitpatel
Copy link
Member

Duplicate results is fair point. Also ordering provides some perf benefit for testing. Can you file a new issue to introduce ordering while verifying results? There would not be any impact on this issue regardless.

@maumar
Copy link
Contributor

maumar commented May 26, 2017

@smitpatel filed #8617

@divega
Copy link
Contributor

divega commented May 27, 2017

Talked to @smitpatel about this at length. I have tried to capture here the conclusions (from my point of view 😄):

  1. Given the current design, it should be fine to either remove this test or convert it into a more ad-hoc test that only asserts that the result contains 3 elements, but the value of the latter would be minimal.

  2. We probably won't do this immediately, so the recommendation for @roji and provider writers is to just skip it for now

Re the functional problem with non-deterministic results, which is caused by paging operations in absence of order:

  1. We had the idea that those could cause a warning. We remember discussing similar things already.

  2. I personally believe the argument made by @roji at Preserve result order when doing DISTINCT #8606 (comment) holds:

  • It is hard to find anything like a specification saying that Distinct() does not preserve the order (I only found a comment from @mattwar in a forum), but the implementation of Enumerable.Distinct() is unlikely to change.
  • It is nice to get closer to the LINQ to Objects behavior in other providers. We may not always do it because it isn't possible or because we believe that the case is uncommon and the priority isn't high.

I would like to decide in triage if this is something we care enough to have a warning or to have an issue in the backlog about trying to "fix" it.

@divega divega removed this from the 2.0.0 milestone May 27, 2017
@roji roji changed the title Still no ORDER BY in outer query Preserve result order when doing DISTINCT May 27, 2017
@roji
Copy link
Member Author

roji commented May 27, 2017

Great, sounds good. I'll disable affected tests for now, linking back to this issue. Making distinct order-preserving across providers is definitely not the most important thing to do in EF Core right now, but I'm glad you guys see it as a goal.

@mattwar
Copy link

mattwar commented May 30, 2017

The LINQ spec does not claim that the items are ordered the same after Distinct, but the Enumerable.Distinct does this anyway. The reason its not spec'd to be this way is the impractical cost it would push onto actual database queries.

@divega
Copy link
Contributor

divega commented May 30, 2017

@mattwar thanks for chimmimg in. Indeed, potential cost on database queries is another reason to avoid it.

@roji
Copy link
Member Author

roji commented May 30, 2017

I'm still not sure of the value of discussing the LINQ spec here - it seems to me that developer expectation is for behavior to correspond to Enumerable.Distinct. But it's true that adding ORDER BY to the DISTINCT outer query would probably have at least some performance impact, even if done only for PostgreSQL/Npgsql. I'm OK if you guys want to close this.

@divega
Copy link
Contributor

divega commented May 30, 2017

@roji I think we are in agreement about the developer expectation. What is somewhat relevant about the "spec" is that this was intentionally left up to providers by the original LINQ designers.

Personally I am leaning towards having a warning. We will discuss it next time we triage.

Just to add some more information, for this particular case, the Distinct() operation is superfluous and we could optimize it out: the projection includes the keys of the only query source (e.g. there are no SelectMany() or joins). Incidentally that would make the translation of this query simple and the order deterministic. I have created #8643 to track that.

@ajcvickers ajcvickers added this to the Backlog milestone May 31, 2017
@ajcvickers ajcvickers added help wanted This issue involves technologies where we are not experts. Expert help would be appreciated. type-enhancement labels May 31, 2017
@tuespetre
Copy link
Contributor

FWIW, the documentation for Enumerable explicitly states that the resulting sequence is unordered:

The Distinct<TSource>(IEnumerable<TSource>) method returns an unordered sequence that contains no duplicate values.

https://msdn.microsoft.com/en-us/library/bb348436(v=vs.110).aspx#Anchor_2

@roji
Copy link
Member Author

roji commented Jul 14, 2017

@tuespetre ok, thanks for that... I guess that can mean this issue can be closed, even though I still feel there's a "surprise effect" there. However, if the docs are explicit on this point it's probably not worth the effort needed to ensure order preservation. I guess the EF Core team should make the decision on this.

@ajcvickers
Copy link
Member

@divega Should we re-tirage?

@smitpatel
Copy link
Member

What was the decision made while putting this into backlog?

@ajcvickers
Copy link
Member

@smitpatel I think the warning the @divega refers to.

@divega
Copy link
Contributor

divega commented Jul 14, 2017

I think that was the conclusion from what I remember: warning for paging operations in absence of order.

@smitpatel
Copy link
Member

I think warning should be more than enough.

@roji - In general we try to be as close to L2O as possible. Especially when linq has stronger semantics or cost to achieve that is not high. e.g. string.StartsWith/EndsWith/Contains are not translated to just SQL LIKE they are translated differently so that it can be closer to linq. It was easy to do in the sense that it needed just bit different SQL to be generated. But on the other hand, case sensitivity is still different. Linq is case sensitive but SqlServer is not by default. Since trying to make it case sensitive is difficult task (setting collations on columns etc), we decided not to do it. It would causes same surprise effect in case of client eval. In our tests for SqlServer, we generate case insensitive l2O query specifically. In the end, if the user is relying on case sensitivity then they would also be aware of SqlServer being case insensitive.

This is bit hard to discover but similar example. There will be somewhat surprising effects, but if user requires particular page results to be generated there would be order by added. Issuing warning will avoid the discoverability issue and let user know that query is provider dependent.

@tuespetre
Copy link
Contributor

Paging? 🤔 I thought this was about Distinct 😆

@roji
Copy link
Member Author

roji commented Jul 14, 2017

@smitpatel I understand what you're saying and also agree in general. The problem in my mind is not surprise effect because of client eval, but rather what it would take to create a cross-database application with EF Core. The more areas which are left "undefined" such as case sensitivity - where each provider does their own thing - the more problematic this is going to be.

I actually don't think it applies very much to this issue... Yeah, a user coming from SqlServer to Npgsql will be surprised to see Distinct() not preserving order, but it's easily fixable by the user themselves adding an additional OrderBy() after the Distinct() (plus even the official docs says not to rely on order coming out). The equivalent for case sensitivity is unfortunately much less clear.

@roji
Copy link
Member Author

roji commented Jul 14, 2017

@tuespetre I think the point was that the non-preserving nature of Distinct() creates problems for paging (i.e. users paging through a big list in a webapp), since paging in general goes together with ordering.

The case sensitivity discussion is a continuation from #9126

@ajcvickers
Copy link
Member

@roji From the beginning of EF Core one of our principles has been that we would not attempt to create a full abstraction over the underlying database such that an application would work the same if moved from one database provider to another. There are two main reasons for this:

  • It's pretty much impossible to do anyway unless you either go for a lowest common denominator approach (which greatly limits usefulness) or jump through hoops to try to simulate things that are not supported in unnatural ways. Even then there always tend to be subtle differences, which means in practice if you want to work with multiple different databases you really need to test, test, test anyway.
  • We want to make sure that the native capabilities of underlying databases are exposed in an easy-to-use and sensible manner. So, for example, if I know that SQL Server has some feature that I want to use, then I should be able to use it easily without spending a lot of time figuring out how it has been abstracted.

So instead of trying to always behave the same for every provider, we instead try to create common patterns and concepts, both in API and behavior, that can be implemented by different providers in natural ways, but which may behave slightly differently in different providers because that makes best use of the provider's native functionality. A good example of this is key generation--integer keys are value generated by convention on any provider, but different providers use different mechanisms to implement this. These different mechanisms mean that the behavior can be different between different providers--e.g. temp keys or not--but at a high level EF, the provider, and the database will always work together to generate some key value that will be valid and fixed after SaveChanges.

As @smitpatel and @divega have been saying, we do try to make things consistent without doing things that are unnatural for the underlying database, but where there are native differences it is also in keeping with the spirit of EF Core to let those differences surface.

The bottom line is always that if you're going to use different databases, then they may behave differently, and testing is the only real way to be sure that the application works as expected.

@roji
Copy link
Member Author

roji commented Jul 14, 2017

@ajcvickers I understand and agree. However, compare key generation with case sensitivity: it's enough to have an int column named Id, and it will automatically be setup as autoincrement - be it identity in SQL Server, serial/sequence in PostgreSQL, etc. In other words, it's an abstraction that happens to work really well, since different low-level database-specific mechanisms produce the same end result and the user can ignore the plumbing. Case sensitivity is something that the user simply cannot ignore, as it affects their code directly, e.g. every time they do an equality check.

I really think it's a case-by-case thing. If making SQL Server case-sensitive is too complicated, then so be it - at the end of the day we're constrained by what our databases allow us to do. On the other hand, note how we managed to do things well with LIKE escaping - we provided an API which will (hopefully) work the same across databases, by being very clear about the expected behavior (no escape character by default), leading the Npgsql provider to do what is necessary to work that way. It wasn't a lot of hoops, so it made sense, and everyone benefit from uniform behavior and less surprises when moving across databases.

So I'd hope that providing uniform behavior across databases be a "soft" goal of EF Core, always as a function of effort needed to make it so.

@ajcvickers
Copy link
Member

@roji Agreed. So if we can be consistent without forcing a provider to do unnatural and/or expensive things, then we should. But it's also fine to say that the behavior may be different.

@ajcvickers
Copy link
Member

@roji We discussed this again and decided that the provider should override this test with a custom implementation, or possibly just keep it disabled if it is not providing (no pun intended) enough value.

@ajcvickers
Copy link
Member

@roji We would also likely consider a PR to do something cleanly in this repo.

@ajcvickers ajcvickers added the closed-out-of-scope This is not something that will be fixed/implemented and the issue is closed. label Apr 24, 2018
@ajcvickers ajcvickers removed this from the Backlog milestone Apr 24, 2018
@roji
Copy link
Member Author

roji commented Apr 27, 2018

No problem, thanks for revisiting this.

@bricelam bricelam added good first issue This issue should be relatively straightforward to fix. and removed good first issue This issue should be relatively straightforward to fix. labels May 31, 2019
@ajcvickers ajcvickers reopened this Oct 16, 2022
@ajcvickers ajcvickers closed this as not planned Won't fix, can't repro, duplicate, stale Oct 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
closed-out-of-scope This is not something that will be fixed/implemented and the issue is closed. good first issue This issue should be relatively straightforward to fix. help wanted This issue involves technologies where we are not experts. Expert help would be appreciated. type-enhancement
Projects
None yet
Development

No branches or pull requests

8 participants