Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-20.1: sql: fix large UPSERTs #54418

Merged
merged 1 commit into from
Sep 16, 2020

Conversation

yuzefovich
Copy link
Member

@yuzefovich yuzefovich commented Sep 15, 2020

In #51944 I fixed a bug but introduced an even worse one. The original
bug was that if an UPSERT has RETURNING clause, we could have returned
incorrect values because internally resultCount and rowsUpserted
could be not synchronized. It was fixed by resetting resultCount,
however, the resetting was done unconditionally. This is incorrect
because resultCount variable is used by upsertNode.BatchedNext to
indicate whether there is more work to do (and if resultCount==0, then
we should finish). This bug would result in an UPSERT with or without
RETURNING clause of more than 10k rows actually process only 10k and
exit early. This is now fixed.

Relatedly, an UPSERT with RETURNING clause would incorrectly return no
rows when it was processing more than 10k rows.

Additionally, #51626 fixed a bug with pagination of UPSERTs which
exposed another bug when RETURNING clause is present - we were clearing
rowsUpserted in BatchedNext (as part of flushAndStartNewBatch
call), but that clear happens too early - we are accessing it after
BatchedNext returns with BatchedValues. This would lead to an index
out of bounds crush. Before #51626 there was no pagination done, so we
always had a single batch and flushAndStartNewBatch was never called
(to reset the row container too early). This is also now fixed. Note
that this second bug was impossible to run into because of the first one
(we would never reach this code).

Fixes: https://github.com/cockroachlabs/support/issues/598
Fixes: #54456.
Informs: #54465.

Release note (bug fix): CockroachDB in 20.1.4 and 20.1.5 releases could
finish UPSERT operation too early - namely, it would correctly insert
only up to 10000 rows and ignoring the rest. Furthermore, an UPSERT with
RETURNING clause in such scenario would return no rows (it would only
process 10k rows but return 0 rows).

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@yuzefovich yuzefovich changed the title sql: fix large UPSERTs release-20.1: sql: fix large UPSERTs Sep 15, 2020
@yuzefovich
Copy link
Member Author

Friendly ping - I'll be asking for an expedited release to include this fix.

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, I missed the original email.

Can you improve the comment for resultCount explaining the semantics more accurately? (e.g. not clear what "counted separately otherwise" means). And update the comments for batchedCount / batchedValues explaining what they return (the current comments refer to an interface that no longer exists)

I can't say I understand this, don't we want batchedCount() to return the number of rows in the last batch (at most 10,000)? E.g. won't upsertNode.BatchedNext now return true even if the last batch happens to be empty?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @RaduBerinde)

@yuzefovich
Copy link
Member Author

Can you improve the comment for resultCount explaining the semantics more accurately? (e.g. not clear what "counted separately otherwise" means).

I agree, it is confusing. I renamed the variable to be more explicit and refactored the code a bit, it should be more clear now.

And update the comments for batchedCount / batchedValues explaining what they return (the current comments refer to an interface that no longer exists)

Removed these methods entirely.

don't we want batchedCount() to return the number of rows in the last batch (at most 10,000)?

Good point, yes, I believe that is another bug that has been present for a long time. Fixed.

E.g. won't upsertNode.BatchedNext now return true even if the last batch happens to be empty?

Correct (this is a bug), but it was hidden because we're setting upsertNode.run.done to true whenever we reach the end - this bug would result in an extra call to BatchedNext that would short-circuit. Fixed.

@yuzefovich
Copy link
Member Author

Hm, I am confused how upsertNode.BatchedValues work - in BatchedNext we accumulate rows in rowsUpserted and then we're calling flushAndStartNewBatch which clears rowsUpserted; BatchedValues should be called after BatchedNext returns at which point the row container is empty. I feel like this could lead to a crash.

@yuzefovich
Copy link
Member Author

Yeah, it crashes on

CREATE TABLE t54456 (c INT PRIMARY KEY);
UPSERT INTO t54456 SELECT i FROM generate_series(1, 25000) AS i RETURNING c;

@yuzefovich
Copy link
Member Author

20.2.0-beta.1 also crashes but 20.1.3 doesn't - I guess my refactor introduced another problem too 🤦

@RaduBerinde
Copy link
Member

Maybe @knz can also take a look since he wrote some of the original code (I believe).

@yuzefovich
Copy link
Member Author

I think I figured it out - the second issue that UPSERTs with RETURNING clause don't work correctly when we have multiple batches has been hidden by the fact that the pagination had been broken (which was fixed by #51626), so prior to that we always had a single batch and would never run into the problem that was exposed once the pagination was fixed. RFAL.

@yuzefovich yuzefovich requested a review from knz September 16, 2020 18:58
Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @knz and @yuzefovich)


pkg/sql/tablewriter_upsert_opt.go, line 153 at r1 (raw file):

	tu.resultCount = 0
	if tu.collectRows {
		tu.rowsUpserted.Clear(ctx)

We did not need this Clear?


pkg/sql/upsert.go, line 189 at r1 (raw file):

func (n *upsertNode) BatchedValues(rowIdx int) tree.Datums {
	if !n.run.tw.collectRows {
		panic("return row requested but collect rows was not set")

errors.AssertionErrorf

@yuzefovich yuzefovich force-pushed the fix-upsert-20.1 branch 3 times, most recently from b9f1db6 to ccc244d Compare September 16, 2020 19:45
Copy link
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @knz and @RaduBerinde)


pkg/sql/tablewriter_upsert_opt.go, line 153 at r1 (raw file):

Previously, RaduBerinde wrote…

We did not need this Clear?

We need it, but at different time - it is done at the beginning of BatchedNext (similar to how other table writers do it).

In cockroachdb#51944 I fixed a bug but introduced an even worse one. The original
bug was that if an UPSERT has RETURNING clause, we could have returned
incorrect values because internally `resultCount` and `rowsUpserted`
could be not synchronized. It was fixed by resetting `resultCount`,
however, the resetting was done unconditionally. This is incorrect
because `resultCount` variable is used by `upsertNode.BatchedNext` to
indicate whether there is more work to do (and if `resultCount==0`, then
we should finish). This bug would result in an UPSERT with or without
RETURNING clause of more than 10k rows actually process only 10k and
exit early. This is now fixed.

Relatedly, an UPSERT with RETURNING clause would incorrectly return no
rows when it was processing more than 10k rows.

Additionally, cockroachdb#51626 fixed a bug with pagination of UPSERTs which
exposed another bug when RETURNING clause is present - we were clearing
`rowsUpserted` in `BatchedNext` (as part of `flushAndStartNewBatch`
call), but that clear happens too early - we are accessing it after
`BatchedNext` returns with `BatchedValues`. This would lead to an index
out of bounds crush. Before cockroachdb#51626 there was no pagination done, so we
always had a single batch and `flushAndStartNewBatch` was never called
(to reset the row container too early). This is also now fixed. Note
that this second bug was impossible to run into because of the first one
(we would never reach this code).

Release note (bug fix): CockroachDB in 20.1.4 and 20.1.5 releases could
finish UPSERT operation too early - namely, it would correctly insert
only up to 10000 rows and ignoring the rest. Furthermore, an UPSERT with
RETURNING clause in such scenario would return no rows (it would only
process 10k rows but return 0 rows).
Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't see anything wrong here! 💯

Reviewed 2 of 3 files at r1, 1 of 1 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @RaduBerinde)

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @yuzefovich)


pkg/sql/tablewriter_upsert_opt.go, line 153 at r1 (raw file):

Previously, yuzefovich wrote…

We need it, but at different time - it is done at the beginning of BatchedNext (similar to how other table writers do it).

Oh, sorry, missed that diff. LGTM

@yuzefovich
Copy link
Member Author

TFTRs!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants