Skip to content

Commit

Permalink
sql: change UPSERT to not require pre-fetching everything upfront
Browse files Browse the repository at this point in the history
**tl;dr:**

- lifts limitation, slight perf boost.
- clarifies the semantics of RETURNING when the same PK is updated
  multiple times.
- Initially motivated by work on a separate patch to introduce
  semantically correct behavior for mutation statements.

Details follow.

Prior to this patch, UPSERT with secondary indexes and INSERT ON
UPDATE DO SET ... would proceed as follows:

- it would load all the inserted rows into memory.
- it would pre-determine in one scan which of these
  already have data in KV (a conflict).
- after this initialization, it would then proceed to resolve the
  upsert, by either inserting new rows (for each source row for which
  no conflict was seen) or updating an existing row (for each source
  row where there was a conflict).

This situation caused three unrelated problems.

1) The first problem is that if there are many rows, the KV batch could
   blow up and fail / make everything slow. In comparison, we like the
   DELETE logic that was recently modified to operate using multiple
   short(er) batches. The UPSERT code should be able to evolve to do the
   same.

2) The second problem is that this code is unable to process two
   subsequent upserts to the same PK. For example, suppose that a table
   has no data to start with, and the following statement is issued:

       INSERT INTO kv(k) VALUES (1), (1) ON CONFLICT(k) DO UPDATE SET v = 2

   The initial processing described above would determine that there is
   no conflict for neither values being inserted. Then after this
   initialization, processing would start. The first iteration would see
   there was no conflict, and insert a fresh row.  The second iteration
   would see there was no conflict (again, the conflict information was
   computed initially), and (try to) insert a fresh row and fail with an
   error (row already exists).

   Because the conflict detection is done initially, it becomes impossible
   to properly upsert multiple rows with the same PK.

   Even worse, if the conflicting index is a secondary index, it is
   possible for UPSERT to complete without error but without processing
   the upsert properly, which is a plain correctness bug.

   To avoid both this latter bug and the conflict error, the previous
   code was equipped with a detection mechanism to reject multiple
   updates to the same PK. This check was necessary for correctness, but
   it was unfortunate because it rejects upserts that seem useful
   from a UX perspective.

3) for consistency with the check required by problem 2, the "fast
   path" of UPSERT was equipped with a check for the same
   condition. This is also rather unfortunate because the fast path
   uses different code that is not otherwise subject to the
   correctness problem above and could already support multiple
   updates to the same PK outright. The check was there thus only for
   UX consistency, so the user sees the same error regardless of which
   code path is used.

This patch aims to resolve problem 1, and ends up solving problems 2
and 3 as a side effect. It changes the main processing to be like
this:

- like before, it loads the rows and pre-loads the existing
  row information during the initialization.
- however, during the main loop, any insert or update feeds back
  the new row into the existing row information, so that
  subsequent iterations can access the updated data.

Because this new processing enables incremental upsert resolution,
this resolves problem 1, because now the processing can operate
in smaller batches.

Also it does so without the risk for correctness problems by feeding
earlier updates into later updates. So the check for duplicates can be
removed and the 2nd problem above is resolved as well.

And then since the slow/general code doesn't need this check, the fast
path code doesn't need it either any more (it was only implemented on
both sides for consistency). So this patch removes the check on the
fast path too and problem 3 is resolved as well.

Very small performance boost at large row counts:

```
name                                old time/op    new time/op    delta
SQL/Cockroach/Upsert/count=1-16        833µs ± 9%     805µs ± 4%    ~     (p=0.518 n=5+9)
SQL/Cockroach/Upsert/count=10-16      1.09ms ± 1%    1.09ms ± 2%    ~     (p=1.000 n=5+10)
SQL/Cockroach/Upsert/count=100-16     3.99ms ± 3%    3.77ms ± 9%    ~     (p=0.099 n=5+10)
SQL/Cockroach/Upsert/count=1000-16    28.0ms ±12%    25.4ms ± 4%  -9.47%  (p=0.003 n=5+8)

name                                old alloc/op   new alloc/op   delta
SQL/Cockroach/Upsert/count=1-16        113kB ± 1%     112kB ± 1%    ~     (p=0.206 n=5+10)
SQL/Cockroach/Upsert/count=10-16       155kB ± 1%     155kB ± 1%    ~     (p=0.768 n=5+10)
SQL/Cockroach/Upsert/count=100-16      716kB ± 2%     716kB ± 2%    ~     (p=0.859 n=5+10)
SQL/Cockroach/Upsert/count=1000-16    6.31MB ± 1%    6.29MB ± 1%    ~     (p=0.594 n=5+10)

name                                old allocs/op  new allocs/op  delta
SQL/Cockroach/Upsert/count=1-16        1.02k ± 1%     0.99k ± 1%  -2.32%  (p=0.001 n=5+10)
SQL/Cockroach/Upsert/count=10-16       1.39k ± 3%     1.35k ± 5%    ~     (p=0.218 n=5+10)
SQL/Cockroach/Upsert/count=100-16      5.39k ± 4%     5.23k ±15%    ~     (p=0.129 n=5+10)
SQL/Cockroach/Upsert/count=1000-16     45.7k ± 6%     43.5k ±11%    ~     (p=0.129 n=5+10)
```

This patch also makes RETURNING on an upserting operation return all
the rows in the data source that have caused a database update. If
multiple rows in the data source cause an update to the same row in
the database, there will be a RETURNING result for each update
separately.

This _extends_ the behavior of PostgreSQL. Postgres merely
[specifies](https://www.postgresql.org/docs/10/static/sql-insert.html):

> The optional RETURNING clause causes INSERT to compute and return
> value(s) based on each row actually inserted (or updated, if an ON
> CONFLICT DO UPDATE clause was used). This is primarily useful for
> obtaining values that were supplied by defaults, such as a serial
> sequence number. However, any expression using the table's columns
> is allowed. The syntax of the RETURNING list is identical to that of
> the output list of SELECT. Only rows that were successfully inserted
> or updated will be returned. For example, if a row was locked but
> not updated because an ON CONFLICT DO UPDATE ... WHERE clause
> condition was not satisfied, the row will not be returned.

However Postgres does not support multiple updates to the same
row. The previous patch introduced that support; this patch clarifies
and strengthens the semantics of RETURNING in that case.

For example, starting with:

```sql
CREATE TABLE kv(k INT PRIMARY KEY, v INT);
```

The following results are returned:

```sql
> UPSERT INTO kv(k) VALUES (1), (1), (2) RETURNING k,v;
+---+------+
| k |  v   |
+---+------+
| 1 | NULL |
| 2 | NULL |
+---+------+
(1 row)
```

The data source specifies only k, so this query specifies `v` is to
be left unchanged. The first upsert of `(1)` creates a new row `(1,
NULL`); the second upsert of `(1)` doesn't change anything, so doesn't
produce a RETURNING row; the last upsert of `(2)` adds a new row so is
reported.

Another example:

```sql
> UPDATE kv SET v = 1 WHERE k = 1;
> INSERT INTO kv(k) VALUES (1), (5) ON CONFLICT(k) DO UPDATE SET v = kv.v + 1 RETURNING k,v;
+---+------+
| k |  v   |
+---+------+
| 1 |    2 |
| 5 | NULL |
+---+------+
(2 rows)
```

Both rows caused an update; RETURNING always reflects the values post-update.

A more subtle situation:

```sql
> UPDATE kv SET v = 1 WHERE k = 1;
> UPSERT INTO kv VALUES (1), (1) RETURNING k,v;
+---+------+
| k |  v   |
+---+------+
| 1 | NULL |
| 1 | NULL |
+---+------+
(2 rows)
```

The query does not specify the set of target columns, so it is an
alias for `UPSERT INTO kv(k,v) VALUES (1, DEFAULT), (1, DEFAULT)`.
The DEFAULT value for `v` is NULL. So this query overwrites
`v` for the first row and that row is returned. The second row is also
an overwrite, so it is also returned.

```sql
> INSERT INTO kv VALUES (1), (4) ON CONFLICT(k) DO NOTHING RETURNING k,v;
+---+------+
| k |  v   |
+---+------+
| 4 | NULL |
+---+------+
(1 row)
```

The query does not specify the set of target columns, so it is an
alias for `INSERT INTO kv(k,v) VALUES (1,DEFAULT), (4,DEFAULT) ON
CONFLICT (k) DO NOTHING`. The conflict on the first row prevents the
update, so the RETURNING results don't contain it.

Release note (sql change): lifted a limitation that prevented UPSERTs
or INSERT ON CONFLICT DO UPDATE over multiple values with the same
primary key.

Release note (sql change): The behavior of `UPSERT` and `INSERT ... ON
CONFLICT` when a `RETURNING` clause is present is made more consistent
when an update touches the same row twice or more. This is a
CockroachDB-specific extension.
  • Loading branch information
knz committed Mar 15, 2018
1 parent a651ee2 commit a2d9d3f
Show file tree
Hide file tree
Showing 3 changed files with 490 additions and 189 deletions.
59 changes: 47 additions & 12 deletions pkg/sql/logictest/testdata/logic_test/upsert
Original file line number Diff line number Diff line change
Expand Up @@ -40,23 +40,33 @@ INSERT INTO kv VALUES (4, 10) ON CONFLICT DO UPDATE SET v = kv.v + 20
statement error duplicate key value \(k\)=\(3\) violates unique constraint "primary"
INSERT INTO kv VALUES (2, 10) ON CONFLICT (k) DO UPDATE SET k = 3, v = 10

# Until #23660 is fixed.
statement error UPSERT/ON CONFLICT DO UPDATE command cannot affect row a second time
UPSERT INTO kv VALUES (10, 10), (10, 11)

statement ok
INSERT INTO kv VALUES (9, 9) ON CONFLICT (k) DO UPDATE SET (k, v) = (excluded.k + 2, excluded.v + 3)

statement ok
UPSERT INTO kv VALUES (10, 10), (10, 11)

query II
UPSERT INTO kv VALUES (11, 10), (11, 12) RETURNING k, v
----
11 10
11 12

statement ok
INSERT INTO kv VALUES (13, 13), (7, 8) ON CONFLICT (k) DO NOTHING

statement error there is no unique or exclusion constraint matching the ON CONFLICT specification
INSERT INTO kv VALUES (13, 13), (7, 8) ON CONFLICT DO NOTHING

# Until #23660 is fixed.
statement error UPSERT/ON CONFLICT DO UPDATE command cannot affect row a second time
statement ok
INSERT INTO kv VALUES (14, 14), (14, 15) ON CONFLICT (k) DO UPDATE SET v = excluded.v + 1

statement ok
INSERT INTO kv VALUES (15, 15), (15, 16) ON CONFLICT (k) DO UPDATE SET k = excluded.k * 10

statement ok
INSERT INTO kv VALUES (16, 16), (16, 17) ON CONFLICT (k) DO UPDATE SET k = excluded.k * 10, v = excluded.v

query II
SELECT * FROM kv ORDER BY (k, v)
----
Expand All @@ -66,15 +76,35 @@ SELECT * FROM kv ORDER BY (k, v)
4 24
6 6
7 7
10 11
11 12
13 13
14 16
150 15
160 17

query II
query II rowsort
UPSERT INTO kv(k) VALUES (6), (8) RETURNING k,v
----
8 NULL

query II rowsort
INSERT INTO kv VALUES (10, 10), (11, 11) ON CONFLICT (k) DO UPDATE SET v = excluded.v RETURNING *
----
10 10
11 11

query II rowsort
INSERT INTO kv VALUES (10, 2), (10, 3) ON CONFLICT (k) DO UPDATE SET v = excluded.v + kv.v RETURNING *
----
10 12
10 15

query II rowsort
INSERT INTO kv VALUES (10, 14), (15, 15) ON CONFLICT (k) DO NOTHING RETURNING *
----
15 15

statement ok
CREATE TABLE abc (
a INT,
Expand All @@ -97,6 +127,11 @@ INSERT INTO abc VALUES (1, 2, 3)
statement ok
INSERT INTO abc VALUES (1, 2, 3) ON CONFLICT (c) DO UPDATE SET a = 4

query III
SELECT * FROM abc
----
4 2 3

statement ok
INSERT INTO abc VALUES (1, 2, 3) ON CONFLICT (c) DO UPDATE SET b = 5

Expand Down Expand Up @@ -179,7 +214,7 @@ statement ok
INSERT INTO upsert_returning VALUES (1, 1, NULL)

# Handle INSERT ... ON CONFLICT ... RETURNING
query IIII
query IIII rowsort
INSERT INTO upsert_returning (a, c) VALUES (1, 1), (2, 2) ON CONFLICT (a) DO UPDATE SET c = excluded.c RETURNING *
----
1 1 1 -1
Expand All @@ -192,7 +227,7 @@ INSERT INTO upsert_returning (a, c) VALUES (1, 1), (3, 3) ON CONFLICT (a) DO NOT
3 NULL 3 -1

# Handle UPSERT ... RETURNING
query IIII
query IIII rowsort
UPSERT INTO upsert_returning (a, c) VALUES (1, 10), (3, 30) RETURNING *
----
1 1 10 -1
Expand All @@ -210,15 +245,15 @@ INSERT INTO upsert_returning (a, b) VALUES (1, 1) ON CONFLICT (a) DO UPDATE SET
3

# Handle expressions within returning clause
query I
query I rowsort
UPSERT INTO upsert_returning (a, b) VALUES (1, 2), (2, 3), (4, 3) RETURNING a+b+d
----
2
4
6

# Handle upsert fast path with autocommit
query IIII
query IIII rowsort
UPSERT INTO upsert_returning VALUES (1, 2, 3, 4), (5, 6, 7, 8) RETURNING *
----
1 2 3 4
Expand All @@ -228,7 +263,7 @@ UPSERT INTO upsert_returning VALUES (1, 2, 3, 4), (5, 6, 7, 8) RETURNING *
statement ok
BEGIN

query IIII
query IIII rowsort
upsert INTO upsert_returning VALUES (1, 5, 4, 3), (6, 5, 4, 3) RETURNING *
----
1 5 4 3
Expand Down
Loading

0 comments on commit a2d9d3f

Please sign in to comment.