
storage: query hangs when using INSERT FROM ... SELECT on same table #28842

Closed
solongordon opened this issue Aug 20, 2018 · 42 comments · Fixed by #42862
Labels
A-sql-execution Relating to SQL execution.
A-sql-mutations Mutation statements: UPDATE/INSERT/UPSERT/DELETE.
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
S-3 Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data)

Comments

@solongordon
Contributor

solongordon commented Aug 20, 2018

If you select from a table and insert the results into the same table, the query never returns, provided the result set is sufficiently large.

To reproduce:

CREATE TABLE t (x INT);
INSERT INTO t SELECT generate_series(1, 25000);
INSERT INTO t SELECT * FROM t;

The last query never completes. Based on the logs, it initially seemed like the insert was large enough to trigger a range split, which then caused an endless cycle of range splits; but the splits turned out to be a symptom, not the cause. See the discussion below.

This issue is present in both 2.0 and 2.1.

@solongordon solongordon added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Aug 20, 2018
@tbg tbg added the A-kv-client Relating to the KV client and the KV interface. label Aug 20, 2018
@tbg tbg changed the title core: query hangs when using INSERT FROM ... SELECT on same table storage: query hangs when using INSERT FROM ... SELECT on same table Aug 20, 2018
@tbg tbg added the S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. label Sep 20, 2018
@tbg tbg added this to the 2.1 milestone Sep 20, 2018
@nvanbenschoten nvanbenschoten self-assigned this Sep 25, 2018
@nvanbenschoten
Member

This is interesting. We're intentionally inducing a recursive insertion. Is that valid? Are statements allowed to see values written in the same statement? I wonder what Postgres does.

@nvanbenschoten
Member

I assume this only happens when the generate_series is large enough to cause the plan nodes to start chunking retrieval and insertion. Once that starts, later chunks will begin to observe the output of previous chunks and enter an infinite loop. (still speculation)
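
A quick way to probe this hypothesis (the threshold and table name small below are guesses for illustration, not measured values): a result set small enough to avoid chunking should complete fine.

CREATE TABLE small (x INT);
INSERT INTO small SELECT generate_series(1, 100);
INSERT INTO small SELECT * FROM small; -- returns promptly; 200 rows total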

@nvanbenschoten
Member

I wonder what Postgres does.

This seems to work fine in Postgres.

Are statements allowed to see values written in the same statement?

@knz could you weigh in here?

@knz
Contributor

knz commented Sep 25, 2018

Are statements allowed to see values written in the same statement?

yes

@knz
Contributor

knz commented Sep 25, 2018

(needs to be checked)

@nvanbenschoten
Member

Are statements allowed to see values written in the same statement?

yes

If that were the case, this query would be an infinite loop in Postgres as well. Instead, in Postgres it simply doubles the number of rows in the table, which implies that subqueries do not observe values written by the same statement.
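
For reference, the Postgres behavior reported in this thread on the original repro (the statement terminates and exactly doubles the table):

CREATE TABLE t (x INT);
INSERT INTO t SELECT generate_series(1, 25000);
INSERT INTO t SELECT * FROM t; -- completes
SELECT count(*) FROM t;        -- 50000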

@knz
Contributor

knz commented Sep 25, 2018

the issue really is to determine the consistency model. A single statement in pg that invokes a stored procedure will allow subsequent statements inside the stored procedure to see previous writes.

So the "step" of read-your-writes is not the top-level statement in pg's semantics, but something finer.

@nvanbenschoten
Member

So the "step" of read-your-writes is not the top-level statement in pg's semantics, but something finer.

Right, but the problem here is that we're not executing the subquery fully before reading our own writes from the top-level statement.

@knz
Contributor

knz commented Sep 25, 2018

is that the problem though? do you have evidence that pg loads all the rows of the subquery in a temp table or something similar before performing the insert?

@tbg tbg modified the milestones: 2.1, 2.2 Sep 25, 2018
@petermattis
Collaborator

Feels like there needs to be a synchronization point between the reads and the writes in a query like this. If we could buffer the read in a temp table we'd have behavior similar to postgres.

Also, reminds me of the Halloween Problem.
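
A manual version of that synchronization point, as a workaround sketch rather than the proposed fix (assuming CREATE TABLE ... AS is available; in Postgres one would use CREATE TEMP TABLE; buf is an illustrative name):

CREATE TABLE buf AS SELECT * FROM t;
INSERT INTO t SELECT * FROM buf;
DROP TABLE buf;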

@knz
Contributor

knz commented Sep 25, 2018

yep looks like a halloween situation :)

For the record, postgres does not use a temp table:

kena=> create table t(x int);
CREATE TABLE
kena=> explain(Verbose) insert into t table t;
                              QUERY PLAN
----------------------------------------------------------------------
 Insert on public.t  (cost=0.00..35.50 rows=2550 width=4)
   ->  Seq Scan on public.t t_1  (cost=0.00..35.50 rows=2550 width=4)
         Output: t_1.x
(3 rows)

@knz
Contributor

knz commented Sep 25, 2018

I am starting to get a feel for what postgres is doing, thanks to an error I was asked to look at two months ago:

kena=> create table u(x int primary key);
CREATE TABLE
kena=> insert into u(x) values (1), (1) on conflict(x) do update set x = excluded.x+1;
ERROR:  ON CONFLICT DO UPDATE command cannot affect row a second time
HINT:  Ensure that no rows proposed for insertion within the same command have duplicate constrained values.

My working hypothesis is that:

  1. a table scan in pg will skip over values inserted within the same statement.
  2. if pg detects that an insert/update is attempting to update a row that was already inserted earlier in the same statement, it produces the aforementioned error

The error is meant to guarantee clients do not mis-exploit the apparent "sequence point" behavior introduced by the mechanism in 1.
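
Hypothesis 1 is essentially the classic Halloween-problem guard. For example, in Postgres the following terminates even when an index on x could requalify updated rows:

CREATE INDEX ON t (x);
UPDATE t SET x = x + 1; -- each row is updated exactly once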

@knz
Contributor

knz commented Sep 25, 2018

(Regarding "skip over values inserted within the same statement" -- we can have a "statement count within a txn" on the intent and have the row reader / table reader skip over any rows with the same counter value as the statement performing the scan)

@knz
Contributor

knz commented Sep 25, 2018

The answer about what pg does is detailed here:

https://www.postgresql.org/docs/10/static/indexam.html

Postgres separates the construction of indexes from the writing of data, that is:

  • every row in the table has a TID/ROWID (there is no such thing as a "primary index" in pg's SQL/storage interface)

  • an INSERT/UPDATE will create the entries in storage with their TID/ROWID as they run

  • only at the end do they add the new TID/ROWID pairs into the indexes

So while the INSERT/UPDATE data source is running, it sees (via the index used for the query) only the rows that come from a previous statement.
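
In Postgres the TID is even visible from SQL, as the system column ctid; the heap row exists independently of any index:

SELECT ctid, x FROM t LIMIT 3;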

@bdarnell
Contributor

we can have a "statement count within a txn" on the intent and have the row reader / table reader skip over any rows with the same counter value as the statement performing the scan

Intents have a sequence counter (counting kv requests, not statements). Could that be part of the solution?

// A one-indexed sequence number which is increased on each request
// sent as part of the transaction. When set in the header of a batch
// of requests, the value will correspond to the sequence number of the
// last request. Used to prevent replay and out-of-order application
// protection (by means of a transaction retry).
int32 sequence = 7;

@knz
Contributor

knz commented Sep 25, 2018

If the intent is updated per access and not per statement, it's hard to determine whether a particular sequence number was generated by the current statement or by an earlier one.

We'd need to collect the current value of that field for every intent laid down. We may as well collect just the written PKs without the intent, which would tell us the same thing.

@knz
Contributor

knz commented Sep 25, 2018

From Ben:

FWIW the sql standard says that the SELECT query "is effectively evaluated before insertion of any rows into T". so INSERT INTO t SELECT * FROM t is well-defined and doubles the size of the table
although supporting self-referencing operations is optional ("feature F781")
section 14.11 of the "foundation" pdf in sql 2008
general rule 5

@knz
Contributor

knz commented Sep 25, 2018

The SQL standard and postgres are consistent on that point so we need to get there too.

There are two working directions:

  • extend the intent format to include the statement counter inside the txn, and compare that upon KV accesses:

    • during reads, skip over row if counter same as requesting stmt
    • during writes, error out with "duplicate write to same row"
  • inside the SQL mutation code, keep a set data structure (a hashmap?) of all the PKs that have been touched by the current mutation, and check it during mutations. This structure needs to spill to disk because it can grow large.

@knz
Contributor

knz commented Sep 25, 2018

The problem with the 2nd solution (SQL level) is that the SELECT will still see the values being inserted.

This can cause two problems:

  • if the SELECT "amplifies" what it reads (e.g. via some windowing) the mutation-level check will not be sufficient to prevent ballooning RAM usage

  • (more serious) if the SELECT clause reads "old values" under certain assumptions (e.g. that all the values are positive) and the statement then writes "new values" that violate those assumptions (e.g. negative values), the SELECT can start failing, or compute bogus results, when it sees the dirty read (uncommitted write).

Let us not underestimate the severity of this 2nd problem. I am not so concerned with cases where the SELECT fails with an error because some dirty read pops up an error-generating value (e.g. zero yields division by zero). It's much more dangerous when the dirty read pops up a value that, after a projection, silently turns into bogus inserted data. That is typically not client-detectable.
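
A sketch of that second hazard: here the SELECT applies sqrt, assuming every value it scans is positive, while the statement writes negatives. If the scan observed the statement's own (dirty) writes, sqrt would start failing -- and a milder projection would instead insert bogus rows silently.

INSERT INTO t SELECT (-sqrt(x))::INT FROM t;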

@nvanbenschoten
Member

Like Ben mentioned, we already have a sequence count on each request, but this is not exposed to users of client.Txn. We could expose this and allow clients to specify the sequence that a request be run at, but this alone still would not be very useful. To make this behave the way we'd want, we'd need to make reads at sequences beneath intents ignore those intents instead of throwing replay errors (which is related to idempotency).

@nvanbenschoten nvanbenschoten added A-sql-execution Relating to SQL execution. and removed A-kv-client Relating to the KV client and the KV interface. labels Sep 25, 2018
@knz knz added A-sql-mutations Mutation statements: UPDATE/INSERT/UPSERT/DELETE. S-3-erroneous-edge-case Database produces or stores erroneous data without visible error/warning, in rare edge cases. labels Sep 25, 2018
@andreimatei
Contributor

@knz is it true that, although reads and writes alternate when executing the query in question, there are no reads being done concurrently with writes? As in, when there's a read in progress, the writer is blocked on the respective tableReader (which is in turn blocked on a rowFetcher)?

@knz
Copy link
Contributor

knz commented Sep 26, 2018

It's currently true, yes, but it will not remain true once we start distributing the mutations.

@knz
Contributor

knz commented Sep 26, 2018

Independently of Andrei's question, a follow-up on yesterday's writeup:

In any case, the output of RETURNING must wait until the rows hit KV -- both for FK cascading updates/drops and for duplicate-row checks.

@andy-kimball
Contributor

I think multiple intents per key is the right long-term architectural solution. Allowing multiple intents would pay multiple dividends, like avoiding the Halloween problem without buffering, as well as enabling immutability and clean replay. AcidLib did this, and it worked really well. If we ever embark on big architectural changes to Core, this is something to keep in mind.

@nvanbenschoten
Member

@andy-kimball This is a situation where an intent history per key for only a single txn (see #5861 (comment)), is actually all that's needed.

@ridwanmsharif

In line with what @nvanbenschoten mentioned above, we've added intent history per key for transactions here: #32688

Transactions no longer replay when they encounter a higher sequence number; instead they compare the value they would write against the one recorded in the intent history, making transactions more idempotent: #33001

Reads will now be able to use this information to actually find the most appropriate value for a given key in: #33244

@knz
Contributor

knz commented Jan 3, 2019

Heads up: the KV work to help with this is being done in #33244; it must be complemented by SQL-level work too, which is tracked in #33473.

@jordanlewis jordanlewis added S-3 Medium-low impact: incurs increased costs for some users (incl lower avail, recoverable bad data) and removed S-2-temp-unavailability Temp crashes or other availability problems. Can be worked around or resolved by restarting. S-3-erroneous-edge-case Database produces or stores erroneous data without visible error/warning, in rare edge cases. labels Apr 30, 2019