
storage: query hangs when using INSERT FROM ... SELECT on same table #28842

Closed
solongordon opened this issue Aug 20, 2018 · 42 comments · Fixed by #42862
Labels
A-sql-execution Relating to SQL execution.
A-sql-mutations Mutation statements: UPDATE/INSERT/UPSERT/DELETE.
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
S-3 Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data)

Comments

@solongordon
Contributor

solongordon commented Aug 20, 2018

If you select from a table and insert the results into the same table, the query never returns, provided the result set is sufficiently large.

To reproduce:

CREATE TABLE t (x INT);
INSERT INTO t SELECT generate_series(1, 25000);
INSERT INTO t SELECT * FROM t;

The last query never completes. Based on the logs, it initially seemed like the insert was large enough to trigger a range split, which then caused an endless cycle of range splits; but the splits turned out to be a symptom, not the cause. See the discussion below.

This issue is present in both 2.0 and 2.1.

@solongordon solongordon added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Aug 20, 2018
@tbg tbg added the A-kv-client Relating to the KV client and the KV interface. label Aug 20, 2018
@tbg tbg changed the title core: query hangs when using INSERT FROM ... SELECT on same table storage: query hangs when using INSERT FROM ... SELECT on same table Aug 20, 2018
@tbg tbg added the S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. label Sep 20, 2018
@tbg tbg added this to the 2.1 milestone Sep 20, 2018
@nvanbenschoten nvanbenschoten self-assigned this Sep 25, 2018
@nvanbenschoten
Member

This is interesting. We're intentionally inducing a recursive insertion. Is that valid? Are statements allowed to see values written in the same statement? I wonder what Postgres does.

@nvanbenschoten
Member

I assume this only happens when the generate_series is large enough to cause the plan nodes to start chunking retrieval and insertion. Once that starts, later chunks will begin to observe the output of previous chunks and enter an infinite loop. (still speculation)
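
A quick way to probe this hypothesis (the threshold and table name small below are guesses for illustration, not measured values): a result set small enough to avoid chunking should complete fine.

CREATE TABLE small (x INT);
INSERT INTO small SELECT generate_series(1, 100);
INSERT INTO small SELECT * FROM small; -- returns promptly; 200 rows total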

@nvanbenschoten
Member

I wonder what Postgres does.

This seems to work fine in Postgres.

Are statements allowed to see values written in the same statement?

@knz could you weigh in here?

@knz
Contributor

knz commented Sep 25, 2018

Are statements allowed to see values written in the same statement?

yes

@knz
Contributor

knz commented Sep 25, 2018

(needs to be checked)

@nvanbenschoten
Member

Are statements allowed to see values written in the same statement?

yes

If that were the case, this query would be an infinite loop in Postgres as well. Instead, in Postgres it simply doubles the number of rows in the table, which implies that subqueries do not observe values written by the same statement.
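
For reference, the Postgres behavior reported in this thread on the original repro (the statement terminates and exactly doubles the table):

CREATE TABLE t (x INT);
INSERT INTO t SELECT generate_series(1, 25000);
INSERT INTO t SELECT * FROM t; -- completes
SELECT count(*) FROM t;        -- 50000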

@knz
Contributor

knz commented Sep 25, 2018

the issue really is to determine the consistency model. A single statement in pg that invokes a stored procedure will allow subsequent statements inside the stored procedure to see previous writes.

So the "step" of read-your-writes is not the top-level statement in pg's semantics, but something finer.

@nvanbenschoten
Member

So the "step" of read-your-writes is not the top-level statement in pg's semantics, but something finer.

Right, but the problem here is that we're not executing the subquery fully before reading our own writes from the top-level statement.

@knz
Contributor

knz commented Sep 25, 2018

is that the problem though? do you have evidence that pg loads all the rows of the subquery in a temp table or something similar before performing the insert?

@tbg tbg modified the milestones: 2.1, 2.2 Sep 25, 2018
@petermattis
Collaborator

Feels like there needs to be a synchronization point between the reads and the writes in a query like this. If we could buffer the read in a temp table we'd have behavior similar to postgres.

Also, reminds me of the Halloween Problem.
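
A manual version of that synchronization point, as a workaround sketch rather than the proposed fix (assuming CREATE TABLE ... AS is available; in Postgres one would use CREATE TEMP TABLE; buf is an illustrative name):

CREATE TABLE buf AS SELECT * FROM t;
INSERT INTO t SELECT * FROM buf;
DROP TABLE buf;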

@knz
Contributor

knz commented Sep 25, 2018

yep looks like a halloween situation :)

For the record, postgres does not use a temp table:

kena=> create table t(x int);
CREATE TABLE
kena=> explain(Verbose) insert into t table t;
                              QUERY PLAN
----------------------------------------------------------------------
 Insert on public.t  (cost=0.00..35.50 rows=2550 width=4)
   ->  Seq Scan on public.t t_1  (cost=0.00..35.50 rows=2550 width=4)
         Output: t_1.x
(3 rows)

@knz
Contributor

knz commented Sep 25, 2018

I am starting to get a feel for what postgres is doing, thanks to an error I was asked to look at two months ago:

kena=> create table u(x int primary key);
CREATE TABLE
kena=> insert into u(x) values (1), (1) on conflict(x) do update set x = excluded.x+1;
ERROR:  ON CONFLICT DO UPDATE command cannot affect row a second time
HINT:  Ensure that no rows proposed for insertion within the same command have duplicate constrained values.

My working hypothesis is that:

  1. a table scan in pg will skip over values inserted within the same statement.
  2. if pg detects that an insert/update is attempting to update a row that was already inserted earlier in the same statement, it produces the aforementioned error

The error is meant to guarantee clients do not mis-exploit the apparent "sequence point" behavior introduced by the mechanism in 1.
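
Hypothesis 1 is essentially the classic Halloween-problem guard. For example, in Postgres the following terminates even when an index on x could requalify updated rows:

CREATE INDEX ON t (x);
UPDATE t SET x = x + 1; -- each row is updated exactly once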

@knz
Contributor

knz commented Sep 25, 2018

(Regarding "skip over values inserted within the same statement" -- we can have a "statement count within a txn" on the intent and have the row reader / table reader skip over any rows with the same counter value as the statement performing the scan)

@knz
Contributor

knz commented Sep 25, 2018

The answer about what pg does is detailed here:

https://www.postgresql.org/docs/10/static/indexam.html

Postgres separates the construction of indexes from the writing of data, that is:

  • every row in the table has a TID/ROWID (there is no such thing as a "primary index" in pg's SQL/storage interface)

  • an INSERT/UPDATE will create the entries in storage with their TID/ROWID as they run

  • only at the end do they add the new TID/ROWID pairs into the indexes

So while the INSERT/UPDATE data source is running, it sees (via the index used for the query) only the rows that come from a previous statement.
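
In Postgres the TID is even visible from SQL, as the system column ctid; the heap row exists independently of any index:

SELECT ctid, x FROM t LIMIT 3;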

@bdarnell
Contributor

we can have a "statement count within a txn" on the intent and have the row reader / table reader skip over any rows with the same counter value as the statement performing the scan

Intents have a sequence counter (counting kv requests, not statements). Could that be part of the solution?

// A one-indexed sequence number which is increased on each request
// sent as part of the transaction. When set in the header of a batch
// of requests, the value will correspond to the sequence number of the
// last request. Used to prevent replay and out-of-order application
// protection (by means of a transaction retry).
int32 sequence = 7;

@knz
Contributor

knz commented Sep 25, 2018

If the intent is updated per access and not per statement, it's hard to determine whether a particular sequence number was generated by the current statement or by an earlier one.

We'd need to collect the current value of that field for every intent laid down. We may as well collect just the written PKs without the intent, which would tell us the same thing.

@knz
Contributor

knz commented Sep 25, 2018

From Ben:

FWIW the sql standard says that the SELECT query "is effectively evaluated before insertion of any rows into T". so INSERT INTO t SELECT * FROM t is well-defined and doubles the size of the table
although supporting self-referencing operations is optional ("feature F781")
section 14.11 of the "foundation" pdf in sql 2008
general rule 5

@knz
Contributor

knz commented Sep 25, 2018

The SQL standard and postgres are consistent on that point so we need to get there too.

There are two working directions:

  • extend the intent format to include the statement counter inside the txn, and compare that upon KV accesses:

    • during reads, skip over row if counter same as requesting stmt
    • during writes, error out with "duplicate write to same row"
  • inside the SQL mutation code, keep a set data structure (a hashmap?) of all the PKs that have been touched by the current mutation, and check it during mutations. This structure needs to spill to disk because it can grow large.

@knz
Contributor

knz commented Sep 25, 2018

The problem with the 2nd solution (SQL level) is that the SELECT will still see the values being inserted.

This can cause two problems:

  • if the SELECT "amplifies" what it reads (e.g. via some windowing) the mutation-level check will not be sufficient to prevent ballooning RAM usage

  • (more serious) if the SELECT clause reads "old values" under certain assumptions (e.g. that all the values are positive) and the statement then writes "new values" that violate those assumptions (e.g. negative values), the SELECT can start failing, or compute bogus results, when it sees the dirty read (uncommitted write).

Let us not underestimate the severity of this 2nd problem. I am not so concerned with cases where the SELECT fails with an error because some dirty read pops up an error-generating value (e.g. zero yields division by zero). It's much more dangerous when the dirty read pops up a value that, after a projection, silently turns into bogus inserted data. That is typically not client-detectable.
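
A sketch of that second hazard: here the SELECT applies sqrt, assuming every value it scans is positive, while the statement writes negatives. If the scan observed the statement's own (dirty) writes, sqrt would start failing -- and a milder projection would instead insert bogus rows silently.

INSERT INTO t SELECT (-sqrt(x))::INT FROM t;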

@nvanbenschoten
Member

Like Ben mentioned, we already have a sequence count on each request, but this is not exposed to users of client.Txn. We could expose this and allow clients to specify the sequence that a request be run at, but this alone still would not be very useful. To make this behave the way we'd want, we'd need to make reads at sequences beneath intents ignore those intents instead of throwing replay errors (which is related to idempotency).

@nvanbenschoten nvanbenschoten added A-sql-execution Relating to SQL execution. and removed A-kv-client Relating to the KV client and the KV interface. labels Sep 25, 2018
@knz knz added A-sql-mutations Mutation statements: UPDATE/INSERT/UPSERT/DELETE. S-3-erroneous-edge-case Database produces or stores erroneous data without visible error/warning, in rare edge cases. labels Sep 25, 2018
@andreimatei
Contributor

@knz is it true that, although reads and writes alternate when executing the query in question, there are no reads being done concurrently with writes? As in, when there's a read in progress, the writer is blocked on the respective tableReader (which is in turn blocked on a rowFetcher)?

@knz
Copy link
Contributor

knz commented Sep 26, 2018

It's currently true, yes, but it will not remain true once we start distributing the mutations.

@knz
Contributor

knz commented Sep 26, 2018

Independently of Andrei's question, a follow-up on yesterday's writeup:

In any case, the output of RETURNING must wait until the rows hit KV -- both for FK cascading updates/drops and for duplicate-row checks.

@andy-kimball
Contributor

I think multiple intents per key is the right long-term architectural solution. Allowing multiple intents would pay multiple dividends, like avoiding the Halloween problem without buffering, as well as enabling immutability and clean replay. AcidLib did this, and it worked really well. If we ever embark on big architectural changes to Core, this is something to keep in mind.

@nvanbenschoten
Member

@andy-kimball This is a situation where an intent history per key for only a single txn (see #5861 (comment)), is actually all that's needed.

@ridwanmsharif

In line with what @nvanbenschoten mentioned above, we've added intent history per key for transactions here: #32688

Transactions no longer replay when they encounter a higher sequence number; instead they compare the value they would write against the one recorded in the intent history, making transactions more idempotent: #33001

Reads will now be able to use this information to actually find the most appropriate value for a given key in: #33244

@knz
Contributor

knz commented Jan 3, 2019

Heads up: the KV work to help with this is being done in #33244; it must be complemented by SQL-level work too, which is tracked in #33473.

@jordanlewis jordanlewis added S-3 Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data) and removed S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. S-3-erroneous-edge-case Database produces or stores erroneous data without visible error/warning, in rare edge cases. labels Apr 30, 2019