Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop producing duplicate subqueries in SQL #32277

Open
Tracked by #34524
roji opened this issue Nov 12, 2023 · 0 comments
Open
Tracked by #34524

Stop producing duplicate subqueries in SQL #32277

roji opened this issue Nov 12, 2023 · 0 comments

Comments

@roji
Copy link
Member

roji commented Nov 12, 2023

Here's a thought dump on SQL duplication. There's no definite proposal yet below, this requires more thinking; ultimately, this may be more about how we translate GroupBy and handle pending selectors.


The following LINQ query (see #32234 (comment) for the full repro):

_ = await context.IterationValues
    .Where(r => ids.Contains(r.IterationId))
    .GroupBy(x => x.IterationId)
    .Select(x => new
    {
        x.Key,
        MaxTimestamp = x.Select(x => x.Iteration.Timestamp).Max(),
    })
    .OrderBy(x => x.MaxTimestamp)
    .ToListAsync();

... produces the following SQL:

SELECT [i].[IterationId] AS [Key], (
    SELECT MAX([i5].[Timestamp])
    FROM [IterationValues] AS [i4]
    INNER JOIN [Iteration] AS [i5] ON [i4].[IterationId] = [i5].[Id]
    WHERE [i4].[IterationId] IN (
        SELECT [i6].[value]
        FROM OPENJSON(@__ids_0) WITH ([value] int '$') AS [i6]
    ) AND [i].[IterationId] = [i4].[IterationId]) AS [MaxTimestamp]
FROM [IterationValues] AS [i]
WHERE [i].[IterationId] IN (
    SELECT [i0].[value]
    FROM OPENJSON(@__ids_0) WITH ([value] int '$') AS [i0]
)
GROUP BY [i].[IterationId]
ORDER BY (
    SELECT MAX([i5].[Timestamp])
    FROM [IterationValues] AS [i4]
    INNER JOIN [Iteration] AS [i5] ON [i4].[IterationId] = [i5].[Id]
    WHERE [i4].[IterationId] IN (
        SELECT [i3].[value]
        FROM OPENJSON(@__ids_0) WITH ([value] int '$') AS [i3]
    ) AND [i].[IterationId] = [i4].[IterationId])

There is considerable subquery duplication in the above SQL: the ordering and projection contains the exact same subquery, and the OPENJSON subquery contained within it is also duplicate in the WHERE clause. This is quite inefficient, causing the database to evaluate the subquery multiple times.

  • One eager approach here would be to perform a pushdown in translation, when an operator is being applied that would cause duplication. For example, with the above query, when the OrderBy is applied, we'd detect that the lambda references a SelectExpression that's also in the projection, do a pushdown, and then apply the ordering on top of that. This has the disadvantage of being probably too eager: if a further Select is then applied which removes the subquery from the projection, the pushdown is no longer necessary, but we already did it too early. Also, finding out whether the OrderBy references the projection isn't trivial (either an additional visitation on all lambdas just to detect this, or to change the way we translate lambdas, passing them more context).
  • Another approach would be to add a phase in post-processing, where we find all SelectExpressions which are referenced multiple times, and extract those out to CTEs; this would add one pass for discovery, and another one to apply the transformation if needed (to optimize this we may gather the discovery information as part of translation and pass it forward to post-processing). There are probably some other CTE-related post-processing needs, so this visitor probably makes sense as a new post-processing step.
    • For the ordering and projection duplication above, a CTE may be slightly less great than a simple subquery pushdown (even if perf-wise they should be the same). We could detect the possibility to push down instead of introduce a CTE (because the duplication is between the projection and another clause, i.e. ordering), but that may be too much trouble for what its worth.
    • Note that this second approach needs to happen right after translation, before type mapping inference visitation happens - or any other visit which may rewrite the tree. This is because such a visitor may cause the SelectExpression to get duplicated, and we lose the multiple referencing.

However, note that the WHERE clause itself also gets duplicated in the above ordering and projection. The ideal way to rewrite this query is to perform a pushdown right after the WHERE clause, so that the IN operator is only evaluated once. That does point towards approach (1), i.e. detecting that the projection contains a complex, potentially expensive expression which is better off being pushed down.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants