colserde: ArrowBatchConverter memory corruption scenario #43964
Azhng added a commit to Azhng/cockroach that referenced this issue on Apr 22, 2020:
Previously, the ArrowBatchConverter reset the input batch when performing the translation from Arrow to Batch. However, this created an issue: the Reset can potentially perform an allocation that is not tracked by the allocator. Now the ArrowBatchConverter no longer performs this reset and instead expects the input batch to have the correct schema and capacity.

Closes cockroachdb#43964

Release note: None
craig bot pushed a commit that referenced this issue on Apr 23, 2020:
47705: colserde: remove batch reset logic from ArrowBatchConverter r=yuzefovich,asubiotto a=Azhng

Previously, the ArrowBatchConverter reset the input batch when performing the translation from Arrow to Batch. However, this created an issue: the Reset can potentially perform an allocation that is not tracked by the allocator. Now the ArrowBatchConverter no longer performs this reset and instead expects the input batch to have the correct schema and capacity.

Release note: None

Closes #43964

Co-authored-by: Azhng <[email protected]>
yuzefovich added a commit to yuzefovich/cockroach that referenced this issue on Mar 26, 2021:
…tmap

Issue cockroachdb#43964 outlines a possible memory corruption scenario that can occur if we reset either the data or the nulls in the region to be serialized. Originally, to work around that, we created new vectors whenever a new region of memory was deserialized from. Later, we removed the vector resetting logic but kept the null resetting behavior in cockroachdb#47705. This commit finishes the job by never using the deserialized region directly and always performing a copy (for nulls). This behavior seems less error-prone to me (it would have prevented a recently discovered bug).

Release note: None
yuzefovich added a commit to yuzefovich/cockroach that referenced this issue on Mar 31, 2021:
…tmap

Issue cockroachdb#43964 outlines a possible memory corruption scenario that can occur if we reset either the data or the nulls in the region to be serialized. Originally, to work around that, we created new vectors whenever a new region of memory was deserialized from. Later, we removed the vector resetting logic but kept the null resetting behavior in cockroachdb#47705. This commit finishes the job by never using the deserialized region directly and always performing a copy (for nulls). This behavior seems less error-prone to me (it would have prevented a recently discovered bug).

Release note: None
craig bot pushed a commit that referenced this issue on Apr 1, 2021:
62581: sql: properly synchronize the internal executor iterator r=yuzefovich a=yuzefovich

**sql: fix an edge case in the internal executor**

The `SetColumns` contract allows the argument to be nil, yet the iterator of the streaming internal executor expects that the column schema, if set, is non-nil. I don't think this could happen in practice, but previously we could theoretically encounter an assertion error due to none of the four fields in the `ieIteratorResult` object being non-nil; this is now fixed.

Release note: None

**sql: add context to IncrementRowsAffected interface**

Release note: None

**sql: properly synchronize the internal executor iterator**

We recently merged an update that made the internal executor streaming. It is currently implemented with two goroutines (the iterator, "the consumer", and the connExecutor, "the producer") that communicate over a buffered channel, so both goroutines can run concurrently. However, a crucial point was overlooked: our `kv.Txn`s cannot be used concurrently. Imagine a plan that reads from two tables, each of which is populated via the internal executor: if we read from the first and then from the second concurrently, we get concurrent usage of the txn for that plan.

This commit fixes the problem by carving out a new abstraction to optionally synchronize the execution of the internal executor with its corresponding iterator. The abstraction comes in sync and async flavors. In the sync form, the ieResultChannel ensures that the reader and writer never operate concurrently and, additionally, provides a mechanism whereby the reader can ensure that the writer observes an error when its attempt to publish the previous row returns. This last point is critical to ensure that a transaction is not erroneously used after it has become unsafe.

The async flavor is still used by the internal executor when it doesn't return an iterator directly and instead executes the query to completion itself.

Fixes: #62415.

Release note: None (no stable release with this bug).

62644: colcontainer: use explicit copy for nulls instead of replacing the bitmap r=yuzefovich a=yuzefovich

Issue #43964 outlines a possible memory corruption scenario that can occur if we reset either the data or the nulls in the region to be deserialized. Originally, to work around that, we created new vectors whenever a new region of memory was deserialized from. Later, we removed the vector resetting logic but kept the null resetting behavior in #47705. This commit finishes the job by never using the deserialized region directly and always performing a copy (for nulls). This behavior seems less error-prone to me (it would have prevented a recently discovered bug).

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Consider a region of memory that holds data to deserialize `coldata.Batch`es from, and a single container batch that is deserialized into. Let's say that there are three batches to deserialize. Since our deserialization is no-copy, after deserializing the third batch the container batch will point to subsets of the memory region that we are deserializing from.

Now, if that memory region is reused to load the bytes that represent the next three batches to deserialize, the container batch will still point to the same region of memory, which now contains bytes that we would like to read from in the future. However, when deserializing the first batch, the container batch is `Reset` here:

cockroach/pkg/col/colserde/arrowbatchconverter.go, line 239 in 58a0536

In some cases (`Bytes` columns), that will actually zero out a region of memory. The container batch's column is then overwritten, and deserialization proceeds fine until we attempt to deserialize the third batch and encounter the corrupted memory that was overwritten while deserializing the first batch of the second pass.

This is only present in my prototype disk queue PR, where I read bytes from a file into the same memory region. In the case of serialization over the network, I think this doesn't happen, since new messages don't reuse buffers.

I'm still trying to wrap my head around how we should fix this. I've considered removing the `Reset` call, since we overwrite the data in the container batch anyway, but we'd still need a way to handle the case where there are no nulls: that is encoded using a null bitmap, which means we can't just call `UnsetNulls` on the container batch and instead need to instantiate a new null bitmap (which is probably not that expensive).