-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preserve result order when doing DISTINCT #8606
Comments
Oops! Looks like the test which was modified returned back in code. Based on the logic, EF applied after Distinct, there are 2 options here
@anpete - Thoughts? |
I have no problem skipping the test in Npgsql, but don't you guys have an issue with this varying behavior across providers, given the same LINQ? I mean, it's entirely possible to enforce deterministic results by adding an ORDER BY clause to the external query - this seems like the correct way to solve, even if it's more effort on your end. (I'm not sure if this additional ORDER BY clause, which is useless on SqlServer, would impact performance in any way, if so you could allow providers to specify whether it's needed or not) |
https://github.com/aspnet/EntityFramework/blob/dev/src/EFCore.Specification.Tests/QueryTestBase.cs#L7586 The test in way verifying that we generate exact same SQL as linq was written (with potential of wrong results). Even Linq would give different results based on how Distinct is implemented. Luckily for default behavior of Linq it preserved ordering. I believe, Linq does not guarantee the order so no EF provider should try to work-around it. If user wrote such query then results will be non-deterministic. Optionally we can throw exception blocking such queries. |
Ah, I wasn't aware of the lack of a guarantee in LINQ - is there any sort of specification? And what do you mean by the "default behavior of LINQ", is there not simply the LINQ to Objects as the sole implementation? In general, it was my impression that the goal is to make EF Core behave as closely as possible to LINQ to Objects, which is why a lot of your tests simply compare the answers coming out of both systems. Saying that you leave this behavioral aspect weakens that. Of course, it may not always be possible for providers to mimic LINQ to Objects behavior, but in this case it does seem possible (albeit with effort on the core side). |
Test infra for complex navigations tests adds client side ordering to make results deterministic (at least top level projection). We could add this to other test suites |
@maumar - Complex nav did that for deep comparison. But that is testing thing. Given We could have done the same just for the case of |
@smitpatel deep comparison is one thing (for includes) but we also do generic orderby for all queries, unless they explicitly assert the order. Basically, we execute the test query, and the baseline query. And then apply ordering in memory to both. There is also asserter func to specify which column to order by, in some complex objects are returned |
No, we don't. |
ComplexNavigations uses different logic - we always provide default post-execution ordering - see AssertQuery methods on ComplexNavigationsQueryTestBase |
@maumar - Checked that class. We only order for AssertIncludeQuery (We can get away from it if we find the matching element in other set and then compare navigations). Normal AssertQuery still delegates to TestHelpers.AssertQuery. That being said, if the query does not have ordering specified then we should not verify ordering. It is optional to put ordering for tests but contains is better concept due to the non-deterministic behavior. |
I don't see how contains is better. Sorting in memory and then ordering is definitely faster than contains. Also, contains can give you false positives, e.g. expected: { 1, 2, 3, 4, 4 }, actual: {1, 2, 2, 3, 4 }. The way complex navs tests are designed - if you want to verify order, you set the verifyOrdered flag to true, and then client ordering is skipped (so the original order of the results is verified). Otherwise, we apply client side ordering, after the query has been executed. Default ordering is (e => e) which works out of the box if we return collection of scalars. If we return entities or complex types, you need to provide the ordering func in the AssertQuery explicitly (e.g. e => e.Id), otherwise test will throw (trying to order by collection of non-order-comparable objects). ComplexNavs use AssertResults overload starts on line 390, which doesn't use Contains at all. |
Duplicate results is fair point. Also ordering provides some perf benefit for testing. Can you file a new issue to introduce ordering while verifying results? There would not be any impact on this issue regardless. |
@smitpatel filed #8617 |
Talked to @smitpatel about this at length. I have tried to capture here the conclusions (from my point of view 😄):
Re the functional problem with non-deterministic results, which is caused by paging operations in absence of order:
I would like to decide in triage if this is something we care enough to have a warning or to have an issue in the backlog about trying to "fix" it. |
Great, sounds good. I'll disable affected tests for now, linking back to this issue. Making distinct order-preserving across providers is definitely not the most important thing to do in EF Core right now, but I'm glad you guys see it as a goal. |
The LINQ spec does not claim that the items are ordered the same after Distinct, but the Enumerable.Distinct does this anyway. The reason its not spec'd to be this way is the impractical cost it would push onto actual database queries. |
@mattwar thanks for chimmimg in. Indeed, potential cost on database queries is another reason to avoid it. |
I'm still not sure of the value of discussing the LINQ spec here - it seems to me that developer expectation is for behavior to correspond to Enumerable.Distinct. But it's true that adding ORDER BY to the DISTINCT outer query would probably have at least some performance impact, even if done only for PostgreSQL/Npgsql. I'm OK if you guys want to close this. |
@roji I think we are in agreement about the developer expectation. What is somewhat relevant about the "spec" is that this was intentionally left up to providers by the original LINQ designers. Personally I am leaning towards having a warning. We will discuss it next time we triage. Just to add some more information, for this particular case, the |
FWIW, the documentation for
https://msdn.microsoft.com/en-us/library/bb348436(v=vs.110).aspx#Anchor_2 |
@tuespetre ok, thanks for that... I guess that can mean this issue can be closed, even though I still feel there's a "surprise effect" there. However, if the docs are explicit on this point it's probably not worth the effort needed to ensure order preservation. I guess the EF Core team should make the decision on this. |
@divega Should we re-tirage? |
What was the decision made while putting this into backlog? |
@smitpatel I think the warning the @divega refers to. |
I think that was the conclusion from what I remember: warning for paging operations in absence of order. |
I think warning should be more than enough. @roji - In general we try to be as close to L2O as possible. Especially when linq has stronger semantics or cost to achieve that is not high. e.g. string.StartsWith/EndsWith/Contains are not translated to just SQL This is bit hard to discover but similar example. There will be somewhat surprising effects, but if user requires particular page results to be generated there would be order by added. Issuing warning will avoid the discoverability issue and let user know that query is provider dependent. |
Paging? 🤔 I thought this was about |
@smitpatel I understand what you're saying and also agree in general. The problem in my mind is not surprise effect because of client eval, but rather what it would take to create a cross-database application with EF Core. The more areas which are left "undefined" such as case sensitivity - where each provider does their own thing - the more problematic this is going to be. I actually don't think it applies very much to this issue... Yeah, a user coming from SqlServer to Npgsql will be surprised to see Distinct() not preserving order, but it's easily fixable by the user themselves adding an additional OrderBy() after the Distinct() (plus even the official docs says not to rely on order coming out). The equivalent for case sensitivity is unfortunately much less clear. |
@tuespetre I think the point was that the non-preserving nature of Distinct() creates problems for paging (i.e. users paging through a big list in a webapp), since paging in general goes together with ordering. The case sensitivity discussion is a continuation from #9126 |
@roji From the beginning of EF Core one of our principles has been that we would not attempt to create a full abstraction over the underlying database such that an application would work the same if moved from one database provider to another. There are two main reasons for this:
So instead of trying to always behave the same for every provider, we instead try to create common patterns and concepts, both in API and behavior, that can be implemented by different providers in natural ways, but which may behave slightly differently in different providers because that makes best use of the provider's native functionality. A good example of this is key generation--integer keys are value generated by convention on any provider, but different providers use different mechanisms to implement this. These different mechanisms mean that the behavior can be different between different providers--e.g. temp keys or not--but at a high level EF, the provider, and the database will always work together to generate some key value that will be valid and fixed after SaveChanges. As @smitpatel and @divega have been saying, we do try to make things consistent without doing things that are unnatural for the underlying database, but where there are native differences it is also in keeping with the spirit of EF Core to let those differences surface. The bottom line is always that if you're going to use different databases, then they may behave differently, and testing is the only real way to be sure that the application works as expected. |
@ajcvickers I understand and agree. However, compare key generation with case sensitivity: it's enough to have an int column named Id, and it will automatically be setup as autoincrement - be it identity in SQL Server, serial/sequence in PostgreSQL, etc. In other words, it's an abstraction that happens to work really well, since different low-level database-specific mechanisms produce the same end result and the user can ignore the plumbing. Case sensitivity is something that the user simply cannot ignore, as it affects their code directly, e.g. every time they do an equality check. I really think it's a case-by-case thing. If making SQL Server case-sensitive is too complicated, then so be it - at the end of the day we're constrained by what our databases allow us to do. On the other hand, note how we managed to do things well with LIKE escaping - we provided an API which will (hopefully) work the same across databases, by being very clear about the expected behavior (no escape character by default), leading the Npgsql provider to do what is necessary to work that way. It wasn't a lot of hoops, so it made sense, and everyone benefit from uniform behavior and less surprises when moving across databases. So I'd hope that providing uniform behavior across databases be a "soft" goal of EF Core, always as a function of effort needed to make it so. |
@roji Agreed. So if we can be consistent without forcing a provider to do unnatural and/or expensive things, then we should. But it's also fine to say that the behavior may be different. |
@roji We discussed this again and decided that the provider should override this test with a custom implementation, or possibly just keep it disabled if it is not providing (no pun intended) enough value. |
@roji We would also likely consider a PR to do something cleanly in this repo. |
No problem, thanks for revisiting this. |
A while ago I filed #7512. While the test that triggered that issue was modified to pass (#7526), it seems like the root problem is still there (despite additional work in #7525).
Running OrderBy_coalesce_skip_take_distinct_take on Npgsql in 2.0.0-preview1 fails. The generated SQL is as follows:
As in #7512, the outer query contains no ORDER BY, and unlike SqlServer PostgreSQL does not seem to preserve the ordering imposed in the inner query - so output from the query is non-deterministic in order.
The text was updated successfully, but these errors were encountered: