Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: UPDATE .. FROM distinct-on should apply before the LIMIT is applied #89823

Open
faizaanmadhani opened this issue Oct 12, 2022 · 5 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-queries SQL Queries Team

Comments

@faizaanmadhani
Copy link
Contributor

faizaanmadhani commented Oct 12, 2022

Describe the problem

When an ORDER BY clause is present in an UPDATE ... FROM statement, all the rows that are expected to be updated are not updated.

For example, in this logic test, it would be expected that the UPDATE statement would update both rows (1, 10) and (2, 20) to (1, 100) and (2, 100) respectively as both rows join on the first two rows of the VALUES v(i) table, however only one of the rows are updated.

statement ok
CREATE TABLE t (k int primary key, a int);
INSERT INTO t VALUES (1, 10), (2, 20);

statement ok
UPDATE t SET a = 100 FROM (VALUES (1),(2),(1)) v(i) WHERE k = v.i ORDER BY k LIMIT 2;

query II
SELECT * FROM t;
----
1  100
2  20

Additional data

Here is the query plan for the statement.

query T
EXPLAIN (OPT, VERBOSE) UPDATE t SET a = 100 FROM (VALUES (1),(2),(1)) v(i) WHERE k = v.i ORDER BY k LIMIT 2;
----
update t
 ├── columns: <none>
 ├── fetch columns: k:5 a:6
 ├── passthrough columns: column1:9
 ├── update-mapping:
 │    └── a_new:10 => a:2
 ├── cardinality: [0 - 0]
 ├── volatile, mutations
 ├── stats: [rows=0]
 ├── cost: 18.658453
 ├── distribution: test
 └── project
      ├── columns: a_new:10 k:5 a:6 column1:9
      ├── cardinality: [0 - 2]
      ├── stats: [rows=1.6151]
      ├── cost: 18.648453
      ├── key: (5)
      ├── fd: ()-->(10), (5)-->(6,9), (5)==(9), (9)==(5)
      ├── distribution: test
      ├── prune: (5,6,9,10)
      ├── distinct-on
      │    ├── columns: k:5 a:6 column1:9
      │    ├── grouping columns: k:5
      │    ├── internal-ordering: +(5|9)
      │    ├── cardinality: [0 - 2]
      │    ├── stats: [rows=1.6151, distinct(5)=1.6151, null(5)=0]
      │    ├── cost: 18.606151
      │    ├── key: (5)
      │    ├── fd: (5)-->(6,9), (5)==(9), (9)==(5)
      │    ├── distribution: test
      │    ├── top-k
      │    │    ├── columns: k:5 a:6 column1:9
      │    │    ├── internal-ordering: +(5|9)
      │    │    ├── k: 2
      │    │    ├── cardinality: [0 - 2]
      │    │    ├── stats: [rows=2, distinct(5)=1.6151, null(5)=0]
      │    │    ├── cost: 18.51
      │    │    ├── fd: (5)-->(6), (5)==(9), (9)==(5)
      │    │    ├── ordering: +(5|9) [actual: +5]
      │    │    ├── distribution: test
      │    │    ├── prune: (6)
      │    │    ├── interesting orderings: (+5)
      │    │    └── inner-join (lookup t)
      │    │         ├── columns: k:5 a:6 column1:9
      │    │         ├── key columns: [9] = [5]
      │    │         ├── lookup columns are key
      │    │         ├── cardinality: [0 - 3]
      │    │         ├── stats: [rows=3, distinct(5)=2, null(5)=0, distinct(9)=2, null(9)=0]
      │    │         ├── cost: 18.32
      │    │         ├── fd: (5)-->(6), (5)==(9), (9)==(5)
      │    │         ├── distribution: test
      │    │         ├── prune: (6)
      │    │         ├── interesting orderings: (+5)
      │    │         ├── values
      │    │         │    ├── columns: column1:9
      │    │         │    ├── cardinality: [3 - 3]
      │    │         │    ├── stats: [rows=3, distinct(9)=2, null(9)=0]
      │    │         │    ├── cost: 0.04
      │    │         │    ├── distribution: test
      │    │         │    ├── prune: (9)
      │    │         │    ├── (1,)
      │    │         │    ├── (2,)
      │    │         │    └── (1,)
      │    │         └── filters (true)
      │    └── aggregations
      │         ├── first-agg [as=a:6, outer=(6)]
      │         │    └── a:6
      │         └── first-agg [as=column1:9, outer=(9)]
      │              └── column1:9
      └── projections
           └── 100 [as=a_new:10]

Jira issue: CRDB-20457

@faizaanmadhani faizaanmadhani added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-queries SQL Queries Team labels Oct 12, 2022
@mgartner mgartner changed the title sql: UPDATE FROM ... ORDER BY ... returns unexpected output when an ORDER BY clause is present sql: UPDATE .. FROM distinct-on should apply before the LIMIT is applied Oct 12, 2022
@mgartner
Copy link
Collaborator

The problem here is that the distinct-on operator is applied after the limit operator. Postgres does not support ORDER BY .. LIMIT in UPDATE statements, so there is no pre-existing behavior we must follow. To me, the behavior illustrated above is unexpected, but I'm curious if anyone has an argument to keep the existing behavior, i.e., apply the limit before the distinct-on.

@msirek
Copy link
Contributor

msirek commented Oct 15, 2022

Maybe other database vendors disallow this syntax because it's confusing.
It's easy to misunderstand what the SQL is supposed to do. The UPDATE is just doing a join, like a SELECT would, but it's using the join to qualify rows to update.

An equivalent SELECT would be:
SELECT * FROM (VALUES (1),(2),(1)) v(i), t WHERE k = v.i ORDER BY k LIMIT 2;

Looking at this form, the ORDER BY clearly applies to the result of the join relation and the first two rows in the ordered join relation only contain the first row from table t. If the goal is to apply the LIMIT to distinct values of k instead of all join rows, this SQL would work using standard ANSI SQL:

UPDATE t SET a = 100 WHERE EXISTS(SELECT DISTINCT(k) FROM (VALUES (1),(2),(1)) v(i) WHERE k = v.i ORDER BY k LIMIT 2);

I wouldn't mind doing away with support for LIMIT and ORDER BY in an UPDATE FROM statement to force users to write SQL which is easier to understand. If we didn't have to worry about Postgres compatibility, we could get rid of UPDATE FROM altogether. I didn't see it mentioned in the ANSI SQL spec.

https://sqlserverfast.com/blog/hugo/2008/03/lets-deprecate-update-from/

@mgartner
Copy link
Collaborator

I don't think we'll be able to get rid of UPDATE .. FROM for PG compatibility and because it is sometimes a lot more ergonomic than using a subquery or another alternative. But I suspect that the confusing behavior between UDPATE .. FROM and ORDER BY .. LIMIT is one of the reasons that PG doesn't support ORDER BY .. LIMIT in UPDATE statements at all.

@mgartner mgartner moved this to Backlog (DO NOT ADD NEW ISSUES) in SQL Queries Jul 24, 2023
@msirek
Copy link
Contributor

msirek commented Sep 28, 2023

The only major database I could find which supports ORDER BY and LIMIT in an UPDATE statement is MySQL. And they don't support the FROM clause, meaning the ORDER BY can only be applied to the columns of the target table of the UPDATE. A statement like this fails because the ORDER BY column is not in scope:

UPDATE t1 set a = t1.a+1 WHERE t1.a IN (SELECT a FROM t2) ORDER BY t2.b LIMIT 1;

So, either we should remove support for ORDER BY ... LIMIT from UPDATE ... FROM, or, support it, but apply distinct-on before the LIMIT (as this issue requests), and further restrict the ORDER BY to reference only target table columns. Ordering by a FROM clause column doesn't seem to provide much value, and is non-standard, unless an example of another database vendor providing similar support can be found.

The original issue for supporting UPDATE ... FROM seems to be doing so for Postgres compatibility, so it may have just been an oversight to allow both UPDATE ... FROM and ORDER BY ... LIMIT in the same statement (as Postgres doesn't support ORDER BY ... LIMIT).

@mgartner
Copy link
Collaborator

mgartner commented Oct 5, 2023

I agree—we should disallow ORDER BY/LIMIT with UPDATE ... FROM. I have a vague memory of this also affecting DELETE ... USING, so we should check that too.

@mgartner mgartner moved this from Backlog (DO NOT ADD NEW ISSUES) to New Backlog in SQL Queries Apr 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-queries SQL Queries Team
Projects
Status: Backlog
Development

No branches or pull requests

3 participants