colserde: ArrowBatchConverter memory corruption scenario #43964
Azhng added a commit to Azhng/cockroach that referenced this issue on Apr 22, 2020:
Previously, the ArrowBatchConverter reset the input batch when performing the translation from Arrow to Batch. However, this created an issue: the Reset can potentially perform an allocation that is not tracked by the allocator. Now the ArrowBatchConverter no longer performs this reset and instead expects the input batch to have the correct schema and capacity.

Closes cockroachdb#43964

Release note: None
craig bot pushed a commit that referenced this issue on Apr 23, 2020:
47705: colserde: remove batch reset logic from ArrowBatchConverter r=yuzefovich,asubiotto a=Azhng

Previously, the ArrowBatchConverter reset the input batch when performing the translation from Arrow to Batch. However, this created an issue: the Reset can potentially perform an allocation that is not tracked by the allocator. Now the ArrowBatchConverter no longer performs this reset and instead expects the input batch to have the correct schema and capacity.

Release note: None

Closes #43964

Co-authored-by: Azhng <[email protected]>
yuzefovich added a commit to yuzefovich/cockroach that referenced this issue on Mar 26, 2021:
…tmap

Issue cockroachdb#43964 outlines a possible memory corruption scenario that can occur if we reset either the data or the nulls in the region to be serialized. Originally, to work around that, we created new vectors whenever a new region of memory was deserialized from. Later, we removed the vector resetting logic but kept the null resetting behavior in cockroachdb#47705. This commit finishes the job by never using the deserialized region directly and always performing a copy (for nulls). This behavior seems less error-prone to me (it would have prevented a recently discovered bug).

Release note: None
yuzefovich added a commit to yuzefovich/cockroach that referenced this issue on Mar 31, 2021:
…tmap

Issue cockroachdb#43964 outlines a possible memory corruption scenario that can occur if we reset either the data or the nulls in the region to be serialized. Originally, to work around that, we created new vectors whenever a new region of memory was deserialized from. Later, we removed the vector resetting logic but kept the null resetting behavior in cockroachdb#47705. This commit finishes the job by never using the deserialized region directly and always performing a copy (for nulls). This behavior seems less error-prone to me (it would have prevented a recently discovered bug).

Release note: None
craig bot pushed a commit that referenced this issue on Apr 1, 2021:
62581: sql: properly synchronize the internal executor iterator r=yuzefovich a=yuzefovich

**sql: fix an edge case in the internal executor**

The `SetColumns` contract allows the argument to be nil, yet the iterator of the streaming internal executor expects that the column schema, if set, is non-nil. I don't think this could happen in practice, but previously we could theoretically encounter an assertion error due to none of the four fields in the `ieIteratorResult` object being non-nil; this is now fixed.

Release note: None

**sql: add context to IncrementRowsAffected interface**

Release note: None

**sql: properly synchronize the internal executor iterator**

We recently merged an update that made the internal executor streaming. It is currently implemented with two goroutines (the iterator, "the consumer", and the connExecutor, "the producer") that communicate over a buffered channel, so both goroutines can run concurrently. However, a crucial point was overlooked: our `kv.Txn`s cannot be used concurrently. Imagine a plan that reads from two tables, each of which is populated via the internal executor: if we read from the first and then from the second concurrently, we get concurrent usage of the txn for that plan.

This commit fixes the problem by carving out a new abstraction to optionally synchronize the execution of the internal executor with its corresponding iterator. The abstraction comes in sync and async flavors. In the sync form, the ieResultChannel ensures that the reader and writer never operate concurrently and, additionally, provides a mechanism whereby the reader can ensure that the writer observes an error when its attempt to publish the previous row returns. This last point is critical to ensure that a transaction is not erroneously used after it has become unsafe.

The async flavor is still used by the internal executor when it doesn't return an iterator directly and instead executes the query to completion itself.

Fixes: #62415.

Release note: None (no stable release with this bug).

62644: colcontainer: use explicit copy for nulls instead of replacing the bitmap r=yuzefovich a=yuzefovich

Issue #43964 outlines a possible memory corruption scenario that can occur if we reset either the data or the nulls in the region to be deserialized. Originally, to work around that, we created new vectors whenever a new region of memory was deserialized from. Later, we removed the vector resetting logic but kept the null resetting behavior in #47705. This commit finishes the job by never using the deserialized region directly and always performing a copy (for nulls). This behavior seems less error-prone to me (it would have prevented a recently discovered bug).

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Consider a region of memory that holds data to deserialize `coldata.Batch`es from, and a single container batch that is deserialized into. Let's say that there are three batches to deserialize. Since our deserialization is no-copy, after deserializing the third batch the container batch will point to subsets of the memory region that we are deserializing from.

Now, if that memory region is reused to load the bytes that represent the next three batches to deserialize, the container batch will still point to the same region of memory, which now contains bytes that we would like to read from in the future. However, when deserializing the first batch, the container batch is `Reset` here:

cockroach/pkg/col/colserde/arrowbatchconverter.go, line 239 in 58a0536

In some cases (`Bytes` columns), that will actually zero out a region of memory. The container batch's column is then overwritten, and deserialization proceeds fine until we attempt to deserialize the third batch and encounter the corrupted memory that was overwritten while deserializing the first batch of the second pass.

This is only present in my prototype disk queue PR, where I read bytes from a file into the same memory region. In the case of serialization over the network, I think this doesn't happen, since new messages don't reuse buffers.

I'm still trying to wrap my head around how we should fix this. I've considered removing the `Reset` call, since we overwrite the data in the container batch anyway, but we'd still need a way to handle the case where there are no nulls: that is encoded using a null bitmap, which means we can't just call `UnsetNulls` on the container batch and instead need to instantiate a new null bitmap (which is probably not that expensive).