-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use single SQL query instead of split queries #12098
Comments
I also tried it with a straight up LINQ query and hand-written projections and there is the same 2 queries generated. This isn't really an option for me though even if it did work, as I'm building a GraphQL service that passing required field names down from client and then I use the bookingDtos = db.Bookings
.AsNoTracking()
.Select(b => new BookingDto
{
Id = b.Id,
Name = b.Name,
BookingItems = b.BookingItems.Select(bi => new BookingItemDto
{
Id = bi.Id,
Name = bi.Name
}).ToList()
})
.ToArray();
|
Its not N+1, its just 2 queries. 2 queries instead of 1 is by design. What is your specific issue with that? |
Sorry, you're right. So just to clarify, you are saying it's by design to do 2 queries when it can be achieved with one using a JOIN? If that's the case then no worries, it's just surprising. |
@benmccallum - Yes it is by design. While it can be achieved with 1 query, it is not less likely to be performant. Single query for collection (include or join) causes dreaded cartesian product. |
Thanks heaps for taking the time to explain this @smitpatel. That makes total sense. Closing this out. |
there are 2 independent queries which means the second query does nothing with the first query, so still I don't understand how running 2 queries prevents data duplication! |
@elyasnew - They are not independent queries. 2nd query fetches the related data (for collection include) for the records in the first query. If you want all data in one query then you need to a join which will duplicate data in first query for every related record in 2nd query. |
Thanks for the clarification. I think it is really important to splitting the queries in some cases where you have too many duplicate data but can you make it optional in case some people doesn't want to use it that way ? Something like: |
@smitpatel I would like to reopen the discussion about this issue. I just wrote a single linq query and expected the database to spit out a single sql query. See the referenced issue for more information. But I was astonished that entity framework core did something totally different and even sorted out to join tuples in memory. This was unexpected. This said I do believe that when a developer writes one linq query he expresses the intention to run that linq query as a single sql query and writes it such that it can be run as a single sql query against the database. If the developer did not want to offload that operation to the database, he would write two linq queries and join the results in the application server. I do believe that the developer should have a choice. It is the developers responsibility to write linq quieries that implement the functional requirements in a sane way. The decision to generate two queries makes it impossible to implement streaming techniques of several hundredthousands of records with their related records while it silently increases the memory requirements of the machine that runs the application server. Can I somehow persuade the entity framework core team to rethink the original decision? I do not think that entity framework core team has to protect developers from writing bad linq queries. For instance, I would not mind if entity framework core threw an exception if it is unable generate a sensible sql query from a bad linq query. |
@smitpatel see related discussion in npgsql/efcore.pg#712 |
@dpsenner @roji Thanks, this is indeed an interesting discussion. I would like to point out that although EF Core creates a separate query whenever you project or include collections, for each of these cases there is generally an equivalent query (that is, a query that returns the exact same rows) that is translated by EF Core to a single query with JOINs. E.g.: // this will translate as three SQL queries
db.Blogs.Include(b=>b.Posts).ThenInclude(p=>p.Comments).Where(b=>b.Id == 1).ToList();
// this will translate as a single SQL query
db.Comments.Include(b=>b.Post).ThenInclude(b=>b.Blog).Where(b=>b.Post.Blog.BlogId == 1).ToList(); The difference between the two queries is that you are traversing the navigation properties in the opposite direction.
You are welcome to think that the distinction between the two queries (the direction in which you are traversing relationships) is arbitrary, and we could have used the same translation. However I would argue that we are simply using the order in which navigations are being traversed as a hint on how we translate the query. A useful way to think about how EF Core decides whether to translate to one or multiple SQL queries is that it is a heuristic that we apply to avoid the explosion of the cardinality of the root of the query that you are consuming. In a way, it is an attempt to achieve least astonishment, but focused on that specific aspect of the query. Also regarding the assertion that this doesn't allow streaming: for the first query, we actually still stream the results of Blog on the same query. We buffer the results from Posts and Comments, but if your database (and connection string) supports multiple active result sets (MARS), we only buffer them incrementally. E.g. we will buffer together all the posts for the first blog, then when you skip to the next blog, we will buffer all the posts for the second blog, etc. In any case, we are reopening this issue to discuss if there are other ways we can do this in the future. Obviously, it becomes much harder or impossible for you to write a query that gets resolved as single query if the reverse navigation properties are absent in the model, or if one of the leaf entities is an owned entity. |
@divega that's very useful information I didn't know - thanks! I'd suggest maybe adding this to the docs as I'm sure not people are aware of the distinction and the two SQL generation options. Maybe an "advanced query techniques" page would be appropriate for this kind of thing... At least IMO that's quite a satisfactory state of affairs... Even if the "default" mode (traversal from one to many) only buffers the one end, it's understandable that in some cases users may wish to avoid it, even at the expense of a cartesian product explosion. You seem to already provide a way to express it, and requiring that reverse navigation be present doesn't seem like a big problem to me. Owned entities may present a more serious problem though. One last somewhat-related note... Npgsql indeed does not support MARS - you can only have one reader open on a connection at a given point in time. However, PostgreSQL does support SQL server-side cursors, where you start a query with Just dumping a bit more context/options for us to think about... |
@roji agreed. Other two “crazy ideas” that come to mind re streaming and MARS:
|
I'm glad to see that ef core team is open to discuss this issue. @divega If a dbcontext begins to spawn connections I raise serious concerns regarding scalability. As for the two queries mentioned above, I have totally different expectations to how this should behave. // this will translate as three SQL queries
db.Blogs.Include(b=>b.Posts).ThenInclude(p=>p.Comments).Where(b=>b.Id == 1).ToList();
// this will translate as a single SQL query
db.Comments.Include(b=>b.Post).ThenInclude(b=>b.Blog).Where(b=>b.Post.Blog.BlogId == 1).ToList(); I expect both linq queries to map to one sql query each. If I did want to run this as separate queries I could implement it in the application in this way: // first query
var blogs = await dbContext.Blogs.Include(b => b.Posts).ToListAsync();
foreach (var blogPost in blogs.SelectMany(b => b.Posts)) {
// n queries
var blogPostComments = await dbContext.Comments.Where(c => c.PostId == blogPost.Id).toListAsync();
// do whatever needs to be done
} or even do fancy stuff with batching: // first query
var blogs = await .ToListAsync();
foreach (var blogs in dbContext.Blogs.InBatchesOf(100)) {
// fetch all blog posts for the set of blogs in one query and aggregate them into a lookup
var blogPostsLookup = dbContext.Posts.Where(p => blogs.Select(b => b.Id).ToArray().Contains(p.BlogId)).ToLookup(t => t.BlogId, t => t);
// then process all blog posts for each blog
foreach (var blog in blogs) {
foreach (var blogPost in blogPostsLookup[blog.Id]) {
// fancy stuff
}
}
} This said, linq to sql should map linq to sql. Not do fancy logic. |
Thanks for the tips about reversing the query traversal to make a single query. In my case, I'm translating graphql queries down to db queries via AutoMapper's For my situation to be optimized, I'd need to be able to tell EF that I'm only retrieving one blog somehow, or is it smart enough to recognise the query passes a primary key and/or |
Definitely an option! Although we'd have to watch out for transactions, which also affect reads...
Yeah - that resembles your 1st suggestion in a way: since every query can be dispatched to an arbitrary physical connection, there's no more constraint on having two queries on the same connection. The same caveat with transactions applies though. But multiplexing introduces a problem the other way - if you have multiple queries queued up for different producers on the same connection, then if the 2nd producer wants to process results but the 1st one still hasn't, then we have a problem... We can either delay the 2nd one (seems unacceptable) or buffer the 1st resultset in memory - in which we've killed streaming... I still have to think about multiplexing a bit more, but it definitely creates some issues in this area. |
@dpsenner, here's my take on your argument:
The problem, again, is that executing this kind of query in a single SQL query can be extremely bad for performance, because of the reasons that have already been explained above: duplication of principal entity fields, leading to tons of unnecessary data being transferred and thrown away. This simply doesn't seem like the right option for the default usecase in an average application. To summarize, when deciding what EF Core should do by default, the decision seems to be between:
To me, the decision between the two options isn't trivial. However, since the streaming issue isn't universal (MARS is supported in some providers and could be supported in others), and I don't really subscribe to the idea that one LINQ should provide one SQL, the 2nd really seems superior. One last point is whether the multiple queries produced in the 2nd option are batched. If they're not, we're adding more database network roundtrips, which isn't good - but this could be fixed in EF Core (#10878 tracks this). Finally, the idea suggested by @divega above could indeed mitigate the streaming concern when provider MARS isn't available, but I agree with your point that it hogs more connections and so could be problematic. |
I'm glad that you too share the opinion that entity framework core should not be allowed to spawn connections. As said, for me entity framework is an object relational mapping framework. As such it is responsible to map linq to a sql command and map the relational results to an object model. If a linq expression is badly formulated, it results in a NotImplementedException or NotSupportedException. This behavior is comparable to a badly formulated sql query. Nobody would expect the database engine to autocomplete missing aggregate functions in combination with group by. When faced with an exception, the developer will figure out a way to formulate a linq expression that gets the data from the database and mapped into the application server. I know very well that this is a very strict viewpoint that not many people share. But in this viewpoint entity framework core does no longer have to implement guesswork on linq query trees like the mentioned ordering of property traversals in the comment of @divega. Further, developers get the benefit of being able to implement sane and stable application servers. Of course some developers will write bad queries. So they will also be able to write code that crashes. Nobody expects from the compiler to protect them from doing so. |
I personally don't really agree with much of what you wrote (my views are my own):
I don't think there's anything in the idea of an ORM that requires a single LINQ query to be translated to a single SQL query - ORMs map objects to relational databases, and can do so in various ways. LINQ specifically is a pretty .NET-specific concept, so it seems odd to say that the concept of ORMs somehow necessarily implies something about how LINQ should be translated.
I don't really think the analogy holds well - there's a very big difference between an incorrect expression or query, and the question of how to translate a correct one. I think it's more appropriate to compare to compilers (or more accurately transpilers), which translate from one language to another. These tools routinely perform optimizations and other manipulations to make sure your code runs in the best way possible - that's one of their main job descriptions. Following your logic we may say that method inlining is a bad idea, because a C# method must always be translated to an IL and assembly method...
Agreed, but we're not talking about a bad query, but rather about what happens with the default, standard EF Core join construct. At the end of the day, if the proposal is that EF Core translate the standard join syntax ( |
Of course this should not happen.
Bad performance can also happen when a queried database column is missing an index or if the index has bad selectivity.
Entity framework core should behave consistently and not do unexpected things. To me it was and is still unexpected that one linq query runs as two sql queries because I, being a human, would write it as one sql query. Including hundreds of columns from different tables and joining all together will always end up with huge result sets. I would not expect entity framework core to do any magic on this. But is it really entity framework core's responsibility to deal with these |
No, I did not in this case as I just found out that it issues so many queries, but going across processes to the database more often than necessary, compiling of the query by the database... I'd definitely expect that taking more time then compiling one query and returning let's say 40 bytes of additional data per row. |
OK, so I've spent a couple of hours measuring the difference.... var bytForRoleId = (byte)forRole;
bytForRoleId = 0;
var w = new System.Diagnostics.Stopwatch();
w.Start();
var q1 = (
from r in DC.OrganizationUserRoles
where bytForRoleId == 0 || (bytForRoleId > 0 && r.OrgUserRoleId == bytForRoleId)
orderby r.OrgUserRoleId
select new OrgRolePermissionInfo() {
RoleId = r.OrgUserRoleId,
RoleName = r.OrgUserRoleName,
RoleDescr = r.OrgUserRoleDescr,
Permissions = (
from dp in r.RolePermissions
from p in DC.RolePermissions.Where(x => x.OrgId == forOrgId && x.OrgUserRoleId == dp.OrgUserRoleId && x.PermissionId == dp.PermissionId).DefaultIfEmpty()
where dp.OrgId == 0
select new OrgRolePermission() {
Permission = dp.Permission,
DefaultValue = dp.PermissionValue,
Value = p.PermissionId > 0 ? p.PermissionValue : dp.PermissionValue,
Inherited = p.PermissionId > 0 ? false : true,
Types = dp.Permission.PermissionApplicables.ToList()
}
).ToList()
}
).ToList();
w.Stop();
var ms1 = w.ElapsedTicks;
w.Restart();
var q2 = (
from a in DC.PermissionApplicables
from rp in DC.RolePermissions.Where(x => x.PermissionId == a.PermissionId).DefaultIfEmpty()
from p in DC.RolePermissions.Where(x => x.OrgId == forOrgId && x.OrgUserRoleId == rp.OrgUserRoleId && x.PermissionId == rp.PermissionId).DefaultIfEmpty()
from pp in DC.Permissions.Where(x => x.PermissionId == rp.PermissionId)
from r in DC.OrganizationUserRoles.Where(x => bytForRoleId == 0 || (bytForRoleId > 0 && x.OrgUserRoleId == bytForRoleId)).DefaultIfEmpty()
where rp.OrgId == 0 && (pp.PermissionId != 0 && pp.PermissionId != 3)
orderby r.OrgUserRoleId, a.PermissionId, a.CanWhat
select new Test {
RoleId = r.OrgUserRoleId,
RoleName = r.OrgUserRoleName,
RoleDescr = r.OrgUserRoleDescr,
Perm = pp,
DefaultValue = rp.PermissionValue,
Value = (p.PermissionId > 0 ? p.PermissionValue : rp.PermissionValue),
Inherited = p.PermissionId > 0 ? false : true,
PermId = a.PermissionId,
CanWhat = a.CanWhat,
UiText = a.UiText,
UiExplanation = a.UiExplanation,
Includes = a.Includes,
ApplicableToAppUser = a.ApplicableToAppUser
}
).ToList();
w.Stop();
var ms2 = w.ElapsedTicks;
w.Restart();
Test row;
byte iLastRoleId = 0;
byte iLastPermId = 0;
short iLastCanWhat = 0;
OrgRolePermissionInfo o = null;
var lst = new List<OrgRolePermissionInfo>();
OrgRolePermission rolePerm = null;
for (var i = 0; i < q2.Count; i++) {
row = q2[i];
if(row.RoleId != iLastRoleId) {
iLastRoleId = row.RoleId;
iLastPermId = 0;
iLastCanWhat = 0;
o = new OrgRolePermissionInfo() {
RoleId = row.RoleId,
RoleName = row.RoleName,
RoleDescr = row.RoleDescr,
Permissions = new List<OrgRolePermission>()
};
lst.Add(o);
}
if(row.PermId != iLastPermId) {
iLastPermId = row.PermId;
iLastCanWhat = 0;
rolePerm = new OrgRolePermission() {
Permission = row.Perm,
DefaultValue = row.DefaultValue,
Value = row.Value,
Inherited = row.Inherited,
Types = new List<PermissionApplicable>()
};
o.Permissions.Add(rolePerm);
}
if(row.CanWhat != iLastCanWhat) {
iLastCanWhat = row.CanWhat;
rolePerm.Types.Add(new PermissionApplicable() {
PermissionId = row.PermId,
ApplicableToAppUser = row.ApplicableToAppUser,
CanWhat = row.CanWhat,
Includes = row.Includes,
UiExplanation = row.UiExplanation,
UiText = row.UiText
});
}
}
w.Stop();
var ms3 = w.ElapsedTicks; |
Comments to the above code: I need to write queries like this all the time. Yet, according to @roji this is not a common case? Can I do it better? I'm not happy at all with a 5-fold decrease in performance or more with more return objects expected. With so many queries, the DB server will not be very scaleable. :-( |
@marekvse you may have missed it, but the decision has already been made to move to a single SQL query (with classical join) for all cases - see this comment above. FWIW whether a single query will perform better than one-query-per-joined-table depends on your specific query, various environmental concerns (e.g. network latency), etc. It's very easy to show scenarios in which the single query approach will be very bad for perf. However, it's true that single query has several non-perf advantages (always allows streaming regardless of MARS support, potential subtle transaction isolation issues..), and users should be able to get the previous behavior by coding for it explicitly. |
Yes, I indeed did miss that. Thank you @roji . I just updated from EF Core 2.2.0 to 2.2.4, but I get the same result - the same 50 queries instead of one for q1 query from the code above. Is there a setting I have to change or alter somehow the way the query is written? |
I see. Thanks @roji. Will be looking forward to it then :). |
Glad to hear this is being fixed in EF Core 3.0. |
Bugs fixed - Allow ToList on collection projection to be called with less derived type - Project out necessary identifiers for collection join when pushing down. (Except distinct case) - Lift ordering from inner collection to generated ordered result (This would be improved further by #16226) - Map Cast from Enumerable to Queryable - Translate AsQueryable as no-op Resolves #12098 Resolves #15043 Resolves #15611 Part of #15711
Bugs fixed - Allow ToList on collection projection to be called with less derived type - Project out necessary identifiers for collection join when pushing down. (Except distinct case) - Lift ordering from inner collection to generated ordered result (This would be improved further by #16226) - Map Cast from Enumerable to Queryable - Translate AsQueryable as no-op Resolves #12098 Resolves #15043 Resolves #15611 Part of #15711
Bugs fixed - Allow ToList on collection projection to be called with less derived type - Project out necessary identifiers for collection join when pushing down. (Except distinct case) - Lift ordering from inner collection to generated ordered result (This would be improved further by #16226) - Map Cast from Enumerable to Queryable - Translate AsQueryable as no-op Resolves #12098 Resolves #15043 Resolves #15611 Part of #15711
@smitpatel @ajcvickers There is an order by Id exists in the resulted query above although the linq doesn't contains any order by. I'm asking about this as it could cause a performance issue specially in case this "Id" is used only as a primary key, but not clustered index |
Note: EF Core 5.0 allows you to choose from split or single queries.https://docs.microsoft.com/en-us/ef/core/querying/single-split-queries |
I've been investigation an issue when using AutoMapper's Queryable projection extensions on top of EF Core 2.1-rc1-final. Full details including troubleshooting logs with the SQL statements are at the issue (AutoMapper/AutoMapper#2639) and EfAutomapperTroubleshooting.zip is a repro.
Basically I've got a
Booking
which has multipleBookingItem
s. I have a DTO for each type and have configured them to be used by AutoMapper. When I execute the query on EF6 the JOIN optimization occurs (it's even smart enough that I don't need to specific the.Include(b => b.BookingItems)
. On EF Core however, two DB queries fire (one for Booking and the other for BookingItems). Explicitly specifying theInclude
potentially would work, but it's ignored by the "IgnoredIncludes" cleverness, so I can't force that behaviour even if I wanted to.So there's two issues running on EF Core I'm trying to get past:
Obviously understanding that EF Core is not EF 6, but just wondering if this would ever be supported as it's a common scenario these days. The AutoMapper projection extensions are awesome. I'll try do this without AutoMapper for now and see if that can be my workaround for now.
Steps to reproduce
See linked issue logged at AutoMapper and the downloadable .sln zip in my later comment.
Further technical details
EF Core version: 2.1.0-rc1-final
Database Provider: Microsoft.EntityFrameworkCore.SqlServer
Operating system: Windows 10 Pro, Version: 1803, OS build: 17134.48
IDE: Visual Studio 2017 15.7.1
The text was updated successfully, but these errors were encountered: