colserde: fix the edge case with nulls handling #62642
Conversation
To provide a bit more context, the scenario of hitting the out-of-bounds error is actually because we call
I think the code linked above is no longer relevant, so I'll open up a separate commit for it that won't be backported. But I believe this change should be backported even to 20.2.
Actually, I'm no longer convinced that this fix is complete, and I'm starting to think that #62644 is actually needed for correctness. Consider the following scenario:
Typing this out made me realize that our no-copy conversion from arrow to coldata format for data where we just call
@asubiotto I'm curious to hear your thoughts about these scenarios; possibly I'm missing something.
Hm, I tried to repro the problems described in the previous comment in a unit test and didn't succeed. It seems like I'm just missing something. Still, #62644 makes the most sense to me.
I don't have much context on this part of the code anymore, so I don't have anything intelligent to say about the scenario you mentioned. But my intuition is that you're talking about in-memory state. When you rewind a queue, you invalidate all that and read from disk again. Have you figured out exactly why your unit test case is passing?
Reviewed 6 of 6 files at r1, 2 of 2 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @rytaft)
Reviewed 6 of 6 files at r1, 2 of 2 files at r2.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @yuzefovich)
pkg/col/colserde/arrowbatchconverter.go, line 489 at r2 (raw file):
```go
// The current null bitmap doesn't have enough space, so we need to
// allocate a new one.
nulls := coldata.NewNulls(batchLength)
```
would be good to add an explanation here similar to what's in the commit message.
> But my intuition is that you're talking about in-memory state. When you rewind a queue, you invalidate all that and read from disk again. Have you figured out exactly why your unit test case is passing?
Yeah, I think I understand why the unit test I was making wasn't triggering the scenario I described above, and I think your comment about the in-memory state is on point. What I was missing there was the fact that the on-disk state is never modified, and because `Dequeue` invalidates the result of the previous call (to be more precise, the caller of `Dequeue` always passes in the same batch), the modification of the in-memory state does happen, but only to the scratch batch - the caller always copies over the state according to its needs.
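To sketch the pattern I mean with toy stand-ins (`batch` and `diskQueue` here are hypothetical, not the actual coldata/colcontainer types, and the `Dequeue` signature is simplified):

```go
package main

import "fmt"

// batch is a toy stand-in for coldata.Batch: a single int64 column.
type batch struct{ col []int64 }

// diskQueue is a toy stand-in for a disk-backed queue. Dequeue overwrites
// scratch with the next stored batch, invalidating whatever scratch held
// before; the backing store (data) itself is never modified.
type diskQueue struct {
	data [][]int64
	idx  int
}

func (q *diskQueue) Dequeue(scratch *batch) bool {
	if q.idx >= len(q.data) {
		return false
	}
	scratch.col = append(scratch.col[:0], q.data[q.idx]...)
	q.idx++
	return true
}

func main() {
	q := &diskQueue{data: [][]int64{{1, 2}, {3, 4}}}

	// The caller always passes in the same scratch batch, so each call to
	// Dequeue invalidates the previous result; anything worth keeping has
	// to be copied out before the next call.
	scratch := &batch{}
	var kept [][]int64
	for q.Dequeue(scratch) {
		kept = append(kept, append([]int64(nil), scratch.col...))
	}
	fmt.Println(kept) // [[1 2] [3 4]]
}
```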
To be honest, I don't understand the reasoning behind #43964 as it relates to the current code, so I'd expect that removing this code
(cockroach/pkg/sql/colcontainer/diskqueue.go, lines 765 to 775 in 8b137b4)

```go
for i := range vecs {
	// When we deserialize a new memory region, we allocate a new null
	// bitmap for the batch which deserializer will write to. If we naively
	// allow the arrow batch converter to directly overwrite null bitmap of
	// each column, it could lead to memory corruption. Doing this avoids
	// reallocating a new scratchDecompressedReadBytes every time we perform
	// a read from the file and constrains the downside to allocating a new
	// null bitmap every couple of batches.
	nulls := coldata.NewNulls(coldata.BatchSize())
	vecs[i].SetNulls(&nulls)
}
```
would have no effect (with this PR), but things broke once I removed it.
Anyway, we can leave that issue aside for now.
TFTRs!
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale) (waiting on @asubiotto and @rytaft)
pkg/col/colserde/arrowbatchconverter.go, line 489 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
> would be good to add an explanation here similar to what's in the commit message.

There is a comment right above the `if`, and I added another comment to explain how the situation of insufficient capacity could occur.
Build succeeded.
When serializing the data of Bool, Bytes, Int, and Float types when they don't have any nulls in the vector, we don't explicitly specify the null bitmap. Previously, when deserializing such vectors with no nulls we would simply call `UnsetNulls` on the `coldata.Nulls` object that is currently present. However, it is possible that the already-present nulls object cannot support the desired batch length. This could lead to index out of bounds accesses. Note that in the vast majority of cases this likely doesn't happen in practice because we check `MaybeHasNulls`, and that would return `false`, making us omit the null checking code.

Fixes: #62636.

Release note (bug fix): Previously, CockroachDB could encounter an internal error in rare circumstances when executing queries via the vectorized engine that operate on columns of BOOL, BYTES, INT, and FLOAT types that have a mix of NULL and non-NULL values.
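To make the failure mode concrete, here is a minimal self-contained sketch. The `nulls` type and its methods are hypothetical stand-ins for `coldata.Nulls`; only the shape of the bug (reusing an undersized bitmap) and of the fix (allocating a fresh bitmap sized for the batch length) come from this PR:

```go
package main

import "fmt"

// nulls is a toy stand-in for coldata.Nulls: a validity bitmap with one bit
// per row, sized for the batch length it was created for. A set bit means
// the row is non-NULL.
type nulls struct{ bitmap []byte }

func newNulls(batchLength int) nulls {
	n := nulls{bitmap: make([]byte, (batchLength+7)/8)}
	n.unsetNulls()
	return n
}

// unsetNulls marks every row as non-NULL in place. Like the pre-fix code
// path, it never grows the bitmap.
func (n *nulls) unsetNulls() {
	for i := range n.bitmap {
		n.bitmap[i] = 0xFF
	}
}

// nullAt reads the validity bit for row i. If the bitmap is shorter than
// the batch, this is exactly the index-out-of-bounds access described above.
func (n nulls) nullAt(i int) bool {
	return n.bitmap[i/8]&(byte(1)<<(i%8)) == 0
}

func main() {
	// A nulls object left over from deserializing a smaller batch.
	n := newNulls(8)
	const batchLength = 1024

	// Pre-fix behavior: call unsetNulls and reuse n as-is, after which
	// n.nullAt(batchLength-1) panics with "index out of range".

	// Post-fix behavior: allocate a bitmap that fits the batch first,
	// mirroring `nulls := coldata.NewNulls(batchLength)` above.
	if len(n.bitmap)*8 < batchLength {
		n = newNulls(batchLength)
	}
	fmt.Println(n.nullAt(batchLength - 1)) // false: the row is non-NULL
}
```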