storage: query hangs when using INSERT FROM ... SELECT on same table #28842
This is interesting. We're intentionally inducing a recursive insertion. Is that valid? Are statements allowed to see values written in the same statement? I wonder what Postgres does.
I assume this only happens when the […]
This seems to work fine in Postgres.
@knz could you weigh in here?
yes
(needs to be checked)
If that's the case then this should be an infinite loop. In Postgres this simply doubles the number of rows in the table, which implies that subqueries do not observe values written in the same statement.
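For concreteness, a minimal sketch of the Postgres behavior described above (table and column names are invented for the example):

```sql
CREATE TABLE t (x INT);
INSERT INTO t VALUES (1), (2);

-- The scan feeding the INSERT sees only the two pre-existing rows,
-- so this adds exactly two rows rather than looping.
INSERT INTO t SELECT x FROM t;

SELECT count(*) FROM t;  -- returns 4
```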
the issue really is to determine the consistency model. A single statement in pg that invokes a stored procedure will allow subsequent statements inside the stored procedure to see previous writes. So the "step" of read-your-writes is not the top-level statement in pg's semantics, but something finer.
Right, but the problem here is that we're not executing the subquery fully before reading our own writes from the top-level statement.
is that the problem though? do you have evidence that pg loads all the rows of the subquery in a temp table or something similar before performing the insert?
Feels like there needs to be a synchronization point between the reads and the writes in a query like this. If we could buffer the read in a temp table we'd have behavior similar to postgres. Also, reminds me of the Halloween Problem.
yep looks like a halloween situation :) For the record, postgres does not use a temp table
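As a refresher, the classic Halloween Problem shape (table and column names are invented for the example, not taken from this issue):

```sql
-- If the scan driving the UPDATE runs over an index on salary and can
-- see its own writes, a row whose raised salary still matches the
-- predicate is revisited and raised again until it crosses the
-- threshold, instead of being updated exactly once.
UPDATE employees SET salary = salary * 1.10 WHERE salary < 25000;
```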
I am starting to get a feel of what postgres is doing, thanks to an error I was asked to look at two months ago:
My working hypothesis is that:
The error is meant to guarantee clients do not mis-exploit the apparent "sequence point" behavior introduced by the mechanism in 1.
(Regarding "skip over values inserted within the same statement" -- we can have a "statement count within a txn" on the intent and have the row reader / table reader skip over any rows with the same counter value as the statement performing the scan)
The answer about what pg does is detailed here: https://www.postgresql.org/docs/10/static/indexam.html
Postgres separates the construction of indexes from the writing of data, that is:
So while the INSERT/UPDATE data source is running, it sees (via the index used for the query) only the rows that come from a previous statement.
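A small sketch of what that visibility rule implies (names invented for the example): if the scan could see same-statement writes, the statement below could keep chasing its own output; under the rule above it inserts exactly one row per pre-existing row matching the predicate.

```sql
-- New rows (x + 1) also satisfy x < 100, but they are invisible to the
-- ongoing scan, so the statement terminates.
INSERT INTO t SELECT x + 1 FROM t WHERE x < 100;
```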
Intents have a sequence counter (counting kv requests, not statements). Could that be part of the solution? (See cockroach/pkg/storage/engine/enginepb/mvcc3.proto, lines 59 to 64 at 15e4c8d.)
If the intent is updated per access and not per statement, it's hard to determine whether a particular sequence number was generated by the current statement or one before it. We'd need to collect the current value of that field for every intent laid. We may as well collect just the written PK without the intent, which would tell us the same.
From Ben:
The SQL standard and postgres are consistent on that point so we need to get there too. There are two working directions:
The problem with the 2nd solution (SQL level) is that the SELECT will still see the values being inserted. This can cause two problems:
Let us not underestimate the severity of this 2nd problem. I am not concerned with cases where the SELECT would fail with an error because some dirty read pops up an error-generating value (e.g. a zero yielding a division by zero). It's much more dangerous that the dirty read pops up a value that, after a projection, silently turns into bogus inserted data. That is typically not client-detectable.
Like Ben mentioned, we already have a sequence count on each request, but this is not exposed to users of […]
@knz is it true that, although reads and writes alternate when executing the query in question, there are no reads being done concurrently with writes? As in, when there's a read in progress, the writer is blocked on the respective […]
It's currently true, yes, but it will not remain true once we start distributing the mutations.
Independently from Andrei's question, a followup on yesterday's writeup: In any case the output of RETURNING must wait until the rows hit KV -- to check both for FK cascading updates/drops and duplicate row checks.
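As a small illustration of the duplicate-row half of that point (schema invented for the example): a constraint violation discovered at write time must surface as an error rather than a returned row, so RETURNING cannot emit anything before the KV write has been checked.

```sql
CREATE TABLE u (id INT PRIMARY KEY);
INSERT INTO u VALUES (1);

-- The selected id duplicates an existing primary key; the statement
-- fails as a whole, so RETURNING produces no rows.
INSERT INTO u (id) SELECT id FROM u RETURNING id;
```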
I think multiple intents per key is the right long-term architectural solution. I think there would be multiple dividends from allowing multiple intents, like avoiding the halloween problem without buffering, as well as enabling immutability and clean replay. AcidLib did this, and it worked really well. If we ever embark on big architectural changes to Core, this is something to keep in mind.
@andy-kimball This is a situation where an intent history per key for only a single txn (see #5861 (comment)) is actually all that's needed.
In line with what @nvanbenschoten mentioned above, we've added intent history per key for transactions in #32688. Transactions no longer replay when they encounter a higher sequence number; instead they assert the value they would write against the one written in the intent history, making transactions more idempotent: #33001. Reads will now be able to use this information to find the most appropriate value for a given key: #33244.
A forum user ran into this limitation recently: https://forum.cockroachlabs.com/t/performence-of-use-insert-select-statement-to-copy-a-table/3202
If you select from a table and insert the results into the same table, the query never returns, provided the result set is sufficiently large.
To reproduce:
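(A minimal sketch of the repro shape; the table and the number of repetitions are illustrative.)

```sql
CREATE TABLE test (x INT);
INSERT INTO test VALUES (1);

-- Repeat to grow the table until the result set is large enough.
INSERT INTO test SELECT x FROM test;
INSERT INTO test SELECT x FROM test;
-- ...
INSERT INTO test SELECT x FROM test;
```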
The last query never completes.
Based on the logs, it seemed like this happens when the insert is large enough to trigger a range split, which then causes an endless cycle of range splits. (Edit: the splits were a symptom, not the cause; see the discussion below.) This issue is present in both 2.0 and 2.1.