colexecwindow: fix disk spilling in some cases #74491

yuzefovich · 2022-01-06T04:58:39Z

colserde: fix possible data corruption scenario during disk spilling

This commit fixes a possible data corruption (which would either result
in a silent wrong query result or an internal error) scenario that could
occur when the data is serialized/deserialized in the vectorized engine.
This would occur when the deserialized vectors are appended to, and it
was the most likely to occur with Bytes-like types (because their Sets
can behave like appends to a certain degree).

We need to deserialize the data in two paths - in the inbox after
reading from network and during the disk spilling. I believe that the
former is safe (since we don't modify those batches) and the latter is
mostly safe (since we tend to not modify the batches that we read from
disk). I think the only exception is the window functions.

Consider the following scenario: a batch with two Bytes vectors is
serialized. Say

the first vector is {data:[foo], offsets:[0, 3]}
the second vector is {data:[bar], offsets:[0, 3]}.
After serializing both of them we will have a flat buffer with something
like:
buf = {1foo031bar03} (ones represent the lengths of each vector).
Now, when the first vector is being deserialized, it's data slice will be
something like:
data = [foo031bar03], len(data) = 3, cap(data) > 3.
If we don't explicitly cap the slice and deserialize it into a Bytes
vector, then later when we append to that vector, we will overwrite the
data that is actually a part of the second serialized vector, thus,
corrupting it (or the next batch).

Release note (bug fix): Previously, CockroachDB could return incorrect
results or internal errors on queries with window functions returning
INT, FLOAT, BYTES, STRING, UUID, or JSON type when the disk spilling
occurred. The bug was introduced in 21.2.0 and is now fixed.

colexecwindow: make bytes-like output vector valid before spilling

bufferedWindowOp is special in a sense that its output vector is
appended by the vectorTypeEnforcer that is its input operator. That
output vector, thus, is part of the input batch, and the vector is
updated in incremental fashion (as the results are ready). It is also
possible that the input batch needs to be spilled to disk for the
operator to make progress. Previously, the output vector could be in an
invalid state (if it was bytes-like) because less elements were set on
the vector than the length of the batch. This is now fixed by making the
output vector valid before spilling.

Fixes: #70715.

Release note: None (because the previous commit contains very similar
info).

colexecwindow: fix min/max optimized window functions

This commit fixes a problem with MIN/MAX optimized window functions in
the vectorized engine. The problem was that we forgot to make an
explicit copy of the value before rewinding the spilling buffer (which
can make the previously retrieved value invalid).

Fixes: #74476.

Release note (bug fix): CockroachDB could previously incorrectly
calculate MIN/MAX when used as window functions in some cases after
spilling to disk. The bug was introduced in 21.2.0 and is now fixed.

distsql: force disk spilling in TestWindowFunctionsAgainstProcessor

Release note: None

cockroach-teamcity · 2022-01-06T04:58:46Z

This change is

This commit fixes a possible data corruption (which would either result in a silent wrong query result or an internal error) scenario that could occur when the data is serialized/deserialized in the vectorized engine. This would occur when the deserialized vectors are appended to, and it was the most likely to occur with Bytes-like types (because their `Set`s can behave like appends to a certain degree). We need to deserialize the data in two paths - in the inbox after reading from network and during the disk spilling. I believe that the former is safe (since we don't modify those batches) and the latter is mostly safe (since we tend to not modify the batches that we read from disk). I think the only exception is the window functions. Consider the following scenario: a batch with two Bytes vectors is serialized. Say - the first vector is `{data:[foo], offsets:[0, 3]}` - the second vector is `{data:[bar], offsets:[0, 3]}`. After serializing both of them we will have a flat buffer with something like: `buf = {1foo031bar03}` (ones represent the lengths of each vector). Now, when the first vector is being deserialized, it's data slice will be something like: `data` = `[foo031bar03]`, `len(data) = 3`, `cap(data) > 3`. If we don't explicitly cap the slice and deserialize it into a Bytes vector, then later when we append to that vector, we will overwrite the data that is actually a part of the second serialized vector, thus, corrupting it (or the next batch). Release note (bug fix): Previously, CockroachDB could return incorrect results or internal errors on queries with window functions returning INT, FLOAT, BYTES, STRING, UUID, or JSON type when the disk spilling occurred. The bug was introduced in 21.2.0 and is now fixed.

`bufferedWindowOp` is special in a sense that its output vector is appended by the `vectorTypeEnforcer` that is its input operator. That output vector, thus, is part of the input batch, and the vector is updated in incremental fashion (as the results are ready). It is also possible that the input batch needs to be spilled to disk for the operator to make progress. Previously, the output vector could be in an invalid state (if it was bytes-like) because less elements were set on the vector than the length of the batch. This is now fixed by making the output vector valid before spilling. Release note: None (because the previous commit contains very similar info).

This commit fixes a problem with MIN/MAX optimized window functions in the vectorized engine. The problem was that we forgot to make an explicit copy of the value before rewinding the spilling buffer (which can make the previously retrieved value invalid). Release note (bug fix): CockroachDB could previously incorrectly calculate MIN/MAX when used as window functions in some cases after spilling to disk. The bug was introduced in 21.2.0 and is now fixed.

Release note: None

rharding6373

Nice, thanks for the fixes!

Reviewed 2 of 2 files at r1, 1 of 1 files at r2, 3 of 3 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)

pkg/sql/colexec/colexecwindow/buffered_window.go, line 248 at r2 (raw file):

				// TODO(yuzefovich): it is quite unfortunate that the output
				// vector is being spilled to disk. Consider refactoring this.
				switch b.outputColFam {

Does the test in commit 4 demonstrate that this works as intended, or is another test needed?

pkg/sql/colexec/colexecwindow/min_max_removable_agg_tmpl.go, line 284 at r3 (raw file):

						}
						// Any values that could not fit in the queue would also have been
						// dominated by the current one, so reset omittedIndex.

For my education, what does dominated mean in this context?

pkg/sql/colexec/colexecwindow/min_max_removable_agg_tmpl.go, line 275 at r4 (raw file):

					// calls below might reuse the same underlying vector.
					var valCopy _GOTYPE
					execgen.COPYVAL(valCopy, val)

Does the test in commit 4 demonstrate that this works as intended, or is another test needed?

pkg/sql/distsql/columnar_operators_test.go, line 1224 at r4 (raw file):

							fmt.Printf("argIdxs: %v\n", argsIdxs)
							frame := windowerSpec.WindowFns[0].Frame
							fmt.Printf("frame mode: %v\n", frame.Mode)

nit: seems like these things don't change, so the print statements could be lifted out of the for loop.

yuzefovich

TFTR!

bors r+

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach and @rharding6373)

pkg/sql/colexec/colexecwindow/buffered_window.go, line 248 at r2 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

Does the test in commit 4 demonstrate that this works as intended, or is another test needed?

Yeah, the test in commit 4 was able to find all of the bugs fixed in the first three commits, so I think that test is sufficient.

pkg/sql/colexec/colexecwindow/min_max_removable_agg_tmpl.go, line 284 at r3 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

For my education, what does dominated mean in this context?

Here is an example: imagine we have the following vector {1, 3, 5, 2} and say we're computing min window function with the frame as ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING (which means that we're computing min over the current value, value right before it if present, and value right after it if present).

Here is how we'll compute min using the "removable" strategy. We're maintaining a priority queue of values that are currently within the frame (where we're keeping track of the value as well as the index - the index is needed so that we could remove it when the corresponding value goes outside of the window frame). Values in min queue are increasing in their value and in their index.

computing min for idx = 0:
- queue is empty, values 1 and 3 enter the queue (because they just entered the window frame), so we now have queue = (1, 0), (3, 1). min is 1.
min for idx = 1:
- window frame is idx = {0, 1, 2}, so nothing needs to be removed from the queue, 5 enters the queue, so we have queue = (1, 0), (3, 1), (5, 2). min is 1.
min for idx = 2:
- window frame is idx = {1, 2, 3}, so idx = 0 has just exited from the frame and we have to remote if from the queue, now queue = (3, 1), (5, 2). 2 just entered the frame, but it dominates both values in the queue (because it has higher priority - lower value in this case - and higher index - so it'll stay inside the window frame for longer), now queue = (2, 3). min is 2.
min for idx = 3:
- window frame is idx = {2, 3}, nothing needs to be removed from the queue and nothing is added, so min is 2, and we're done.

Hope this makes sense.

pkg/sql/colexec/colexecwindow/min_max_removable_agg_tmpl.go, line 275 at r4 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

Does the test in commit 4 demonstrate that this works as intended, or is another test needed?

Same as above, I think test in the commit 4 is sufficient.

pkg/sql/distsql/columnar_operators_test.go, line 1224 at r4 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

nit: seems like these things don't change, so the print statements could be lifted out of the for loop.

Note that this printing out happens only in case an unexpected error is encountered and we call t.Fatal below. We do not log all this information in the happy path, so I think the layout is as desirable.

yuzefovich

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach and @rharding6373)

pkg/sql/colexec/colexecwindow/buffered_window.go, line 248 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Yeah, the test in commit 4 was able to find all of the bugs fixed in the first three commits, so I think that test is sufficient.

BTW I stressed that test overnight, and it didn't find any problems with over 50k runs, so I have high confidence that we have flushed out all the edge cases.

craig · 2022-01-08T00:20:06Z

Build succeeded:

GitHub CI (Cockroach)

yuzefovich added the do-not-merge bors won't merge a PR with this label. label Jan 6, 2022

yuzefovich force-pushed the fix-wf-vec branch 2 times, most recently from 18a0143 to 9570486 Compare January 7, 2022 04:36

yuzefovich added 3 commits January 6, 2022 20:42

distsql: force disk spilling in TestWindowFunctionsAgainstProcessor

ef4f3eb

Release note: None

yuzefovich force-pushed the fix-wf-vec branch from 9570486 to ef4f3eb Compare January 7, 2022 04:43

yuzefovich changed the title ~~WIP on fixing vectorized window functions when spilled to disk~~ colexecwindow: fix disk spilling in some cases Jan 7, 2022

yuzefovich removed the do-not-merge bors won't merge a PR with this label. label Jan 7, 2022

yuzefovich marked this pull request as ready for review January 7, 2022 04:45

yuzefovich requested a review from a team as a code owner January 7, 2022 04:45

yuzefovich requested review from rharding6373 and cucaroach January 7, 2022 04:45

yuzefovich added the backport-21.2.x label Jan 7, 2022

rharding6373 approved these changes Jan 7, 2022

View reviewed changes

yuzefovich commented Jan 7, 2022

View reviewed changes

craig bot merged commit eddfd51 into cockroachdb:master Jan 8, 2022

blathers-crl bot mentioned this pull request Jan 8, 2022

release-21.2: colexecwindow: fix disk spilling in some cases #74589

Merged

yuzefovich deleted the fix-wf-vec branch January 8, 2022 00:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

colexecwindow: fix disk spilling in some cases #74491

colexecwindow: fix disk spilling in some cases #74491

yuzefovich commented Jan 6, 2022 •

edited

Loading

cockroach-teamcity commented Jan 6, 2022

rharding6373 left a comment

yuzefovich left a comment

yuzefovich left a comment

craig bot commented Jan 8, 2022

colexecwindow: fix disk spilling in some cases #74491

colexecwindow: fix disk spilling in some cases #74491

Conversation

yuzefovich commented Jan 6, 2022 • edited Loading

cockroach-teamcity commented Jan 6, 2022

rharding6373 left a comment

Choose a reason for hiding this comment

yuzefovich left a comment

Choose a reason for hiding this comment

yuzefovich left a comment

Choose a reason for hiding this comment

craig bot commented Jan 8, 2022

yuzefovich commented Jan 6, 2022 •

edited

Loading