-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SQL translation should be more generic and operate on end results #12552
Comments
I agree. I think a ‘second pass’ at untranslatable expressions that can extract client-side evaluatable expressions into parameters is the way to go. I have mentioned this elsewhere before but I am not sure what the team thinks of it. |
This should not be a second pass, but rather a first pass, before calling EF. This can be done outside EF, or on top of EF. Please see the excellent expression transformation library, erecruit.Expr. This library is agnostic to EF, although there are some clever optimizations that handle problems with how EF caches queries with IEnumerable.Contains expressions. The library takes a divide-and-conquer approach, by implementing a few basic abstractions:
If you're not familiar with this library, it is fundamentally superior to Joseph Albahari's PredicateBuilder. Note that erecruit.Expr does not leak to the caller, while PredicateBuilder does: With PredicateBuilder, you must call AsExpandable() first, meaning the API designer must know ahead of time you want to use PredicateBuilder. Instead, erecruit.Expr lets you just call Expand() at the very end of your query chain. Also, the only penalty for calling Expand multiple times is performance. The only gotcha with using erecruit.Expr is it has a max recursion depth equal to the available call stack of the underlying .NET host platform. There are various reasons why doing this request internally is a mistake:
Please close. |
Duplicate of #12338 |
I would recommend reading this comment and later discussion. |
@jzabroski I do not mean a ‘second pass’ as in a pass over the entire expression tree. When EF hits a a selector/predicate/etc. it tries to rewrite certain expressions into SQL-translatable expressions. If it encounters, for instance, a member access or method call expression it cannot rewrite, it tends to return null and the translation of that subtree is aborted. What I am saying is that at that point, there is a ‘second chance’ at extracting that portion of the subtree into a parameterized expression rather than aborting translation. This is not specific to the DateTime/SQLite issue; it is a general solution to a general problem and the ‘base’ SQL query provider (EF Core’s relational provider in this case) is the most suitable place for such a mechanism. |
@tuespetre There's a couple reasons you're incorrect or coming up with the wrong solution.
A Roslyn code analyzer for EF could also potentially do sarg-ability analysis. I believe there are many ways people have been tripped up with the sarg-ability of LINQ queries in my career:
Then there is just "API Coverage" concerns:
|
@jzabroski it does not prevent caching. The subtree is pulled out into a lambda expression that receives the correct closure instance as a parameter at runtime, when it is executed to produce the value for that execution of the query. I have implemented it, tested it, and used it in production. It’s good 👍 |
At the risk of commenting on a decision that's been sealed beyond reconsideration, I feel like there is room for a healthy discussion here out of which some good can come. Sargability is extremely important to me and I've filed defect reports with several open source database engines and products to offer some (too-clever) workarounds to get certain datetime operations to use the index, so I am not blithely making suggestions that would break sargability here. As someone that has just begun using EFCore and did not use EF previously, my "fresh take" on it was that the default would be for all efforts to be focused on getting my C# LINQ expression translated to code that could most optimally run on the db server. The existence of the Anyone that's handwritten queries knows about optimizing the indexes and queries together to come up with fast, efficient queries and is familiar with the benefits of limiting the amount of rows carried back from the db server, etc. But what EFCore has done here is take those common concerns and give them the backseat to an even bigger problem we have to worry about now: whether or not the code will even be executed on the db in the first place. The benefits of C#-to-SQL translation are enormous and it's an extremely laudable effort, but I feel like it should be done in a way where precedence is given to in-db execution instead of to maximized lambda support at the cost of runtime warnings masking massive performance issues for cases that could be trivially supported as a matter of design decision. All that aside and with the caveat that I don't know how the code that does this currently looks like, I don't understand why SQL translation can't consider the return type of a function call not dependent on the value of a record column in its decision to execute-then-translate or translate-then-execute flow. For the case where the lambda invokes a C# function on the column/output of the database, it's a clearcut case of "the user wants to invoke local code to evaluate db values, without writing a for loop" which then proceeds to answer the question "is there an optimization we can perform to eliminate the need to bring this data back to the application server for local evaluation" in which case a db construct can be used instead of the local C# function call and the translation would be over. In case the answer is no, it's a shortcut to "unfortunately this needs to be evaluated locally" and again the translation would be over. But in the case that the db column/value is not an input to a subexpression in the lambda, the same steps can be applied only there's a recourse instead of giving up. You start with the same question, "is there a db-level construct that can optimize the entire expression" only when the result is no, the question becomes "can we reduce the c# expression into a different one that we could perhaps optimize?" which can be iteratively evaluated until that answer becomes "no." As concrete examples,
|
Reopening because I agree there are some points in this thread that can use more discussion. I will catch up with @smitpatel next week to make sure I understand his point of view. |
@mqudsi - Explain in detail the last case |
I think you're misunderstanding my proposal, @smitpatel. The same translation would be used, only further C# code would be executed prior to the execution of the translated db query. Here is my test project: For the following LINQ expression: var today = DateTime.UtcNow;
var likeToday = await db.Records
.OrderBy(x => x.Timestamp)
.Where(x => x.Timestamp.Day == today.Day)
.FirstOrDefaultAsync(); this is what EFCore currently maps it out to with the SQLite provider: SELECT "x"."Id", "x"."Timestamp"
FROM "Records" AS "x"
WHERE CAST(strftime('%d', "x"."Timestamp") AS INTEGER) = @__today_Day_0
ORDER BY "x"."Timestamp"
LIMIT 1 With a simple function returning a static DateTime Now()
{
return DateTime.UtcNow;
}
likeToday = await db.Records
.OrderBy(x => x.Timestamp)
.Where(x => x.Timestamp.Day == Now().Day)
.FirstOrDefaultAsync(); the following SQL is translated: SELECT "x"."Id", "x"."Timestamp"
FROM "Records" AS "x"
WHERE CAST(strftime('%d', "x"."Timestamp") AS INTEGER) = @__Now_Day_0
ORDER BY "x"."Timestamp"
LIMIT 1 which I was not expecting, as that behavior is actually what I am requesting. But perhaps the value was optimized by the compiler, so trying again: static DateTime Now2()
{
if (DateTime.UtcNow.Year == 2018)
{
return DateTime.UtcNow;
}
else
{
throw new Exception("It's the end of the world");
}
}
likeToday = await db.Records
.OrderBy(x => x.Timestamp)
.Where(x => x.Timestamp.Day == Now2().Day)
.FirstOrDefaultAsync(); which again generates the optimized SQL I was looking for: SELECT "x"."Id", "x"."Timestamp"
FROM "Records" AS "x"
WHERE CAST(strftime('%d', "x"."Timestamp") AS INTEGER) = @__Now2_Day_0
ORDER BY "x"."Timestamp"
LIMIT 1 So maybe it's not the function call in and of itself that is the problem but rather the fact that the lambda is really simple? //Not on the nightlies, so .AddDays() is not yet supported
likeToday = await db.Records
.OrderBy(x => x.Timestamp)
.Where(x => x.Timestamp.Day == DateTime.UtcNow.AddDays(1).AddDays(-1).Day)
.FirstOrDefaultAsync(); Which finally gives me the situation I was looking for:
and the generated SQL returns all results for local filtering: SELECT "x"."Id", "x"."Timestamp"
FROM "Records" AS "x"
ORDER BY "x"."Timestamp" So you're actually already 99% of the way there. You're already evaluating functions and boiling them down to their return values before translating (if the initial expression could not be mapped). The only thing missing is support for call chains. As it currently stands, My proposal calls for only a very modest change (and I don't mean that in a Swiftesque manner). Just as var foo = db.Foos.Where(x => x.Day == Foo().Day) is translated to //pretend this is quoted correctly:
var sql = "SELECT "x"."Id", "x"."Timestamp"
FROM "Records" AS "x"
WHERE CAST(strftime('%d', "x"."Timestamp") AS INTEGER) = @__Now2_Day_0
ORDER BY "x"."Timestamp"
LIMIT 1)";
using (var cmd = db.Database.Connection.CreateCommand(sql)) {
cmd.Parameters["@__Now2_Day_0"] = Now2().Day;
using (var reader = cmd.ExecuteReader()) {
....
}
....
} I would see var foo = db.Foos.Where(x => x.Day == DateTime.UtcNow.Day); translated to //pretend this is quoted correctly:
var sql = "SELECT "x"."Id", "x"."Timestamp"
FROM "Records" AS "x"
WHERE CAST(strftime('%d', "x"."Timestamp") AS INTEGER) = @__Now2_Day_0
ORDER BY "x"."Timestamp"
LIMIT 1)";
using (var cmd = db.Database.Connection.CreateCommand(sql)) {
cmd.Parameters["@__DateTime_UtcNow_Day"] = DateTime.UtcNow.Day;
using (var reader = cmd.ExecuteReader()) {
....
}
....
} |
I just realized I didn't address sargability; I was excited to find that substitution was already pretty powerful.
var date1 = new DateTime(Foo.Year, 1, 1);
var date2 = new DateTime(Foo.Year + 1, 1, 1); for any expression matching |
@mqudsi already tried something like this? static DateTime Now => DateTime.UtcNow;
likeToday = await db.Records
.OrderBy(x => x.Timestamp)
.Where(x => x.Timestamp.Day == Now.Day)
.FirstOrDefaultAsync(); |
Yes, creating a property works (see initial post). |
Yes! 👏 @mqudsi is doing a much better job of visually demonstrating what I was talking about: If a subtree cannot be translated into SQL, but it:
that subtree can be substituted with a parameter expression and lifted out into a block that is executed and applied to the corresponding |
@mqudsi, @tuespetre, I wanted to talk to @smitpatel because my initial reaction was that the behavior you described as the current way EF Core works was against some of our fundamental designs. But after talking to him I understand that things a bit more complicated 😄 Anyway, I want to thank you both for raising this. It helped us realize we needed to think harder. I will try to summarize the discussion @smitpatel and I had here so that you can understand what our thinking is and keep giving us feedback on it. We may later need to create new issues to track the specific improvements we would like to make to EF Core. In general, when we find a subexpression that is not store-correlated, we should be able to evaluate it on the client and pass the value in a parameter (i.e. funcletize it). From your comments, I believe we are all in agreement on this. So, why are we not doing it for this query? likeToday = await db.Records
.OrderBy(x => x.Timestamp)
.Where(x => x.Timestamp.Day == DateTime.Now.AddDays(1).AddDays(-1).Day)
.FirstOrDefaultAsync(); The answer is that the query uses DateTime.Now, which is in our list of non-deterministic expressions. What we believe we got rightWhen a provider connects to a server, it is important to use the current time from the server in queries for correctness and consistency. We want to evaluate such expressions n the server because if the time zone or current time of the server and the client are different we could return very inconsistent results. This applies to DateTime.Now and similar variations with UTC and DateTimeOffset. I am not sure if you will agree with this. From reading your comments it is not clear that this has been a concern for you, but it is for us. What we believe can be improvedCurrently, we achieve this goal by just having DataTime.Now (and the other variations) listed in EvaluatableExpressionFilter, which acts as a blacklist for funcletization. Unfortunately, for the example query, as soon as we realize that we cannot translate AddDays(), we switch the evaluation of the whole predicate to the client side. If the goal was to make sure we always evaluate DateTime.Now on the server, that is obviously not a great solution. We should instead:
Moreover, how many times should we invoke this? Right now
If we believed that DateTime.Now needs to be evaluated once per row to get correct results, the right soltuion would be in fact to issue N+1 queries. The only difference from what we are doign now is that the value of DateTime.Now would be obtained by executing But that isn't exactly the case. Although both DateTime.Now and Guid.NewGuid() are strictly speaking non-deterministic (they can return different result every time you call them), DateTime.Now will return the same value if you call it repeated times soon enough, and the server (in this case, SQL Server) can wisely avoid invoking it per row. For example look at the query plans for these two queries: -- This evaluates GETDATE() only once and the result becomes a constant
select r.*, GETDATE()
from Records as r
where r.OrderDate < GETDATE()
order by GETDATE();
-- This evaluates NEWID() once per row
select r.*, NEWID()
from Records as r
order by NEWID() I am trying to come up with a better explanation about the difference between these two. The best I have is that the non-determnism in GETDATE() is "soft", while the non-determinism in NEWID() is "hard" 😄. Furthermore, having Guid.NewGuid() in SummaryIf we separate how we deal with DateTime.Now (and friends) and how we deal with Guid.NewGuid() and Random.Next() on the other hand, we can probably come up with something like what you are asking for, with the only difference that we would make a single call to On sargability@mqudsi the approach to achieve sargability that you describe sounds quite clever. If this is something you think we could generally rewrite queries to (rather than having customers write the query like that), please create a separate issue. |
Great writeup.
It looks like the term SQL Server has used is “nondeterministic runtime constant” for |
Thanks! I will adopt this term. At least it sounds a bit better than soft non-determinism. |
I keep going back to a conversation I had with Smit back in February regarding group by constant behavior. Why do you hardcode stuff with a fixed strategy? Why does EF need a "THE PHILOSOPHY OF EF"? While I can appreciate conventions that avoid API consumers from making mistakes, there is also no room to say otherwise. As for how to separate things that operate per row vs. per batch, why not just call them as you described it.
In Voila, we have decoupled PHILOSOPHY of EF from the raw mathematics. Further, if you bite on my gambit so far, this raises an interesting API idea: Directly reifying the rewrite engine to the end user, so that they can choose Then reifying the batch zone could open up a whole slew of awesomeness to EF. For example, imagine defining a table #tmp0 (ID int primary key clustered) and seeding it with values, and doing a join on that table, instead of a typical IN clause translation. (This would allow EF to circumvent SQL Server's 2,100 parameter limit on stored procedures, because the temp table would hold the parameters, thereby eliminating the dreaded error Let me know if that makes sense. This is exactly how the query generator I wrote 10 years ago worked. Back then, I was greatly inspired by the academic language Maude, which implements a reflective rewriting logic as a language. Maude calls the ability to put constants like |
But that's only describing it's run-time mechanism of action. You appear to want to describe also the compile-time lexical scope of the action and whether each occurrence is a unique variable or unique assignable reference. |
Caveat: The property must be a non-Func. Any time the dependency is "deferred", it won't work. You have to capture the value: Works
Doesn't Work
This is rather annoying for integration tests, as you can see the problem especially if I capture DateTime.Now as Now() outside the query: EF won't see the DateTime.Now expression at all, and instead substitute in the client-side evaluated value. Perhaps this example best demonstrates the generality of my solution. |
@jzabroski I understand your points but there does need to be a “philosophy of EF”. Projects aren’t worth much without some kind of philosophy to them 😛 and even your points have a philosophy behind them. I agree that behaviors should be easy to modify in a query provider like this but that doesn’t mean the provider can’t have some (philosophically) sensible defaults that can be added to or removed from to taste. In EF Core there are a couple of really composable areas, namely in member and method handling. Past that I feel it’s a bit too rigid. Perhaps it could become less so in the future. All in all I think we are after the same pie in the sky here: a highly composable and useful query provider. |
@jzabroski I may be missing some details of what you are saying, but I believe I understand where you are going in general. I see the concrete goal in this case as coming up with a design that makes something like DateTime.Now behave consistently in the expected way regardless of whether it is being evaluated on the database server or on the client. The server already exists and it has its rules. Regardless of what we may want to do, the simplest way for it to be consistent is to try to build something that follows similar rules when the evaluation needs to happen in memory. Moreover, the whole problem space is very large, and the work we do involves a lot of divide-and-conquer and prioritizing stuff that we believe customers actually will use. Plus, as @tuespetre mentions, we need sensible defaults because only then you can use the product without deeply understanding all the details. In that sense, while we could build something that let's the user choose how many times a functions is evaluated at the query or invocation level (I am going to assume that is what you are proposing), I am somewhat skeptical about how much value that would yield. It seems for existing functions, it is reasonable for the expectation to be hard-coded (we just got it wrong in the original design). For new user defined or new function mappings, it certainly seems a good idea to have a way to configure it on a per function basis (cc @pmiddleton, @smitpatel). Re what you mention about creating temp tables and joining them with queries, it seems to me that for the scenarios in which this would actually help, e.g. there is a collection of values, TVPs on SQL Server would probably be a better option. For all other scenarios, if we just follow the same rules as the server, it seems that either inlining the function or using a simple parameter would be sufficient. |
@smitpatel could you please create new issues to cover what we discussed in triage? I think the main thing is separating the concerns currently conflated in the list in EvaluatableExpressionFilter. Part of the solution is already tracked at #11466. I wonder if we should have something about configuring this aspect functions when mapped. I remember we talked about having an attribute at some point, but I am not sure we ever implemented it. I'd like to hear what @pmiddleton thinks about it. Am I missing anything else? |
I have talked with @smitpatel on parts of this already. However Im currently at Disneyland and don't have time to write it up right now. I'll comment next week when I get home. It's funny how you never have any free time when you are on vacation :) |
...And I thought I was in the land of make believe. My own 2 cents:
Thanks for listening to all of us! |
Sorry for not getting a chance to chime in earlier, but I want to thank everyone for the vibrant, open, and healthy discussion here. For what it's worth, @divega's detailed response perfectly explained the situation and it now "makes sense" why things work the way they work. In particular, the bit about the @divega my first encounter with EFCore lambda translation had me anxious to see whether values such as
Thank you! It came about when I was trying to understand why WordPress queries were so inefficient out of the box despite plenty of indices sprinkled about and realized that the default permalink format involved a lookup against the year and month. The But to be honest, I'm not clear on how (blacklisting of |
Why would a TVP on SQL Server be a better option for the EFCore Query generator? With a TVP, you need to create a user-defined table type. Whose responsibility is that going to be, and why would EFCore want to take a dependency on whether the system catalog table sys.types has an appropriate TVP type mapping in place? The scenario I am suggesting here is where you have a lot of values. If anything, you want to create a temp table with a clustered index so that you can join quickly. Currently, this is what we do with EFCore: We use EntityFramework.Extended and bulk insert into a non-temp table the id's we need to join on. Alternatively, we use .Batch() extension method and then Union in-memory all the separate batches. Either one will avoid a run-time error, but the pseudo-temp table approach is the most performant. The downside is that it requires two columns, a uniqueidentifier column and the actual id int column to populate in the select, so that the sub-select becomes |
You're a very clever and curious engineer, as evidenced by your amazing WordPress anecdote... but there are indeed some basic academic concepts that can answer your question and perhaps raise your awareness to what your clever intuition and case analysis has revealed to you. You can think of any compiler as a rewrite engine. Compilers are constantly rewriting things. In rewriting logics, we think of the following properties:
The main reason I'm primarily in favor of using erecruit.Expr instead of complicating EntityFrameworkCore's internal query optimizer is that I can guarantee each optimization is limited to a specific "peep hole". However, at the same time, there are some optimizations I simply cannot do external to EF as they require some extensibility to EF itself. I'll update this very post with some persuasive examples, but for now I have to run out the door. |
@mqudsi @jzabroski I am still not sure if there is any direct connection between sargability and the original issue reported here. I believe @smitpatel was just curious to know if anything in @mqudsi proposal would make the query more sargable. Anyway, I would still like to encourage anyone that can propose sargability improvements we cann apply to EF Core query translation to do so in a separate issue. While I find this particular discussion very interesting, from the perspective of project management, it helps to keep issues that are narrow in focus and can easily be assigned to a person and a milestone for addressing. @jzabroski, TVPs solve the problem of streaming an arbitrary number of values up to the server. If you use a temp table, you still have the problem of populating it with the values. No doubts there are ways to do that, e.g. in chunks, but TVPs would likely be more efficient. An yes, it is unfortunate that TVPs require so much ceremony. I would much prefer to be able to declare the type inline or to be able to use existing tables a the template for TVPs. That said, if we enabled TVPs, some of the table types that we need (e.g. for inserts) would probably be generated and maintained by EF Core Migrations. For completely ad-hoc queries using Enumerable.Contains, we would probably need a table type with a single column of a specific scalar type. We could consider sending DDL to create the table type if it doesn't exist just before executing the query. All in all, I think the resulting experience could become quite good. |
IMHO based on my past experience dealing with time and time zones really sucks... like a lot :) I am in agreement that we should try to evaluate DateTime.Now/DateTime.UTCNow on the server whenever possible. However this is going to be a hard and messy issue to solve, because there are a bunch of edge cases. Take this query for example (it assumes the current released implementation of SqlLite without AddDays support) db.Records
.Where(x => x.Timestamp < DateTime.UtcNow)
.Where(x => x.Timestamp < DateTime.UtcNow.AddDays(-1)) This will end up using the current datetime from both the server and the client. The first where clause will be evaluated server side with a mapped call and the second will run client side with a local call. This is because the two lambdas are evaluated and translated in isolation. To solve this we will need to scan the expression tree while keeping context of what other sub-expressions have already run into, and then backtracking to perform substitutions if we run into a non-translatable method call. What is the rule in a N+1 query where the outer and inner both have DateTime.UtcNow calls but one is translatable and the other is not? How do we keep track of that decision as all the queries are built? How many more edge cases like this are lying out there that no one has thought of yet? Before we can even begin to tackle it we will need a metadata store for all mapped functions (see my comment in #12527) in order to have the data we need to make the evaluation decisions. A lot of the examples laid out in this thread are pretty contrived examples of taking things to the extreme (mine included). I think resources would be better spent trying to get more of the common DateTime methods translated for more providers. That will solve the majority of real use cases issues. |
Why not just add Roslyn analysis to warn engineers of mixed eval?
Probably also true to add a reliable Roslyn Analyzer. I've never written one but imagine with IoC it might be hard to statically infer much. In this example above, an analyzer can't suggest much without knowing the metadatastore. However, suppose the example was:
Then the analyzer could suggest writing the above as a between statement, in addition to eliminating mixed eval. Edit: I see that @ajcvickers has already scoped a Roslyn analyzer in #11659 , but the ticket doesn't explicitly scope out query analysis |
The list of which functions are mapped is not fully known until run time when the model is generated. This will preclude a Roslyn analyzer from being able to make any determinations. |
@pmiddleton Preclude means "prevent from happening; make impossible." You stated yourself that DateTime is such a concern and headache. Are you saying we can't write a Roslyn analyzer for:
This would limit the usefulness but why wouldn't that alone be worth it? For functions not analyzable due to run-time model generation, why not emit Information-level severity? Lastly, I must confess, I'm not an expert in Roslyn analyzers, but how does the basic Code Analysis in Visual Studio allow a CustomDictionary.xml? Couldn't EFCore follow a similar convention? Then architects could share their recommended "dictionaries" of "known server-side verbs" and "known server-side nouns" with others and override a plain default. EDIT BELOW THIS LINE 7/18/2018
|
We periodically close 'discussion' issues that have not been updated in a long period of time. We apologize if this causes any inconvenience. We ask that if you are still encountering an issue, please log a new issue with updated information and we will investigate. |
Following up on the discussion in #12517, I believe there's a fundamental issue with the way SQL translation is currently handled. If there is an expression that cannot be directly mapped to an SQL equivalent, but that expression evaluates to a data type that can be directly plugged in to an SQL query, then that expression should be evaluated and its result used instead of the SQL translation giving up and the query being evaluated locally rather than on the database server.
To illustrate with a concrete example:
In the latest published EFCore w/ SQLite provider releases, this cannot be evaluated in the database as
AddDays(xx)
does not have a translation implemented.However, simply doing the following makes SQL translation work and the query is fully evaluated on the DB server rather than the ASP.NET Core host:
More generally, if the translation of a given subexpression does not have a translation expressly provided by the provider, and that subexpression does not depend on the value of the record being evaluated (i.e. for a lambda
x => x == Foo(..)
,Foo
does not utilize the value ofx
) and the return type ofFoo(...)
can be used to directly evaluate the expression on the database server, then that subexpression should be evaluated and its result plugged in before translation is attempted once more.The text was updated successfully, but these errors were encountered: