
colserde: ArrowBatchConverter memory corruption scenario #43964

Closed
asubiotto opened this issue Jan 14, 2020 · 0 comments · Fixed by #47705
asubiotto (Contributor) commented:

Consider a region of memory that holds data to deserialize coldata.Batches from, and a single container batch that is deserialized into. Say there are three batches to deserialize. Since our deserialization is zero-copy, after deserializing the third batch the container batch points to subsets of the memory region we are deserializing from.

Now, if that memory region is reused to load the bytes that represent the next three batches to deserialize, the container batch will still point to the same region of memory, which now contains bytes that we would like to read from in the future. However, when deserializing the first batch, the container batch is Reset.

For some column types (e.g. Bytes), that Reset will actually zero out a region of memory. The container batch's column is then overwritten, and deserialization proceeds fine until we attempt to deserialize the third batch and encounter the corrupted memory that was overwritten while deserializing the first batch of the second pass.

This only shows up in my prototype disk queue PR, where I read bytes from a file into the same memory region. In the case of serialization over the network, I don't think this happens, since new messages don't reuse buffers. I'm still trying to wrap my head around how we should fix this. I've considered removing the Reset call, since we overwrite data in the container batch anyway, but we'd still need a way to handle the case where there are no nulls: nulls are encoded using a null bitmap, so we can't just call UnsetNulls on the container batch and instead need to instantiate a new null bitmap (which is probably not that expensive).

@asubiotto asubiotto self-assigned this Jan 14, 2020
Azhng added a commit to Azhng/cockroach that referenced this issue Apr 22, 2020
Previously, the ArrowBatchConverter reset the input batch when performing
the translation from Arrow to Batch. However, this creates an issue: the
Reset can potentially perform an allocation that is not tracked by the
allocator.

Now ArrowBatchConverter no longer performs this reset and instead expects
the input batch to already have the correct schema and capacity.

Closes cockroachdb#43964

Release note: None
craig bot pushed a commit that referenced this issue Apr 23, 2020
47705: colserde: remove batch reset logic from ArrowBatchConverter r=yuzefovich,asubiotto a=Azhng

Previously, the ArrowBatchConverter reset the input batch when performing
the translation from Arrow to Batch. However, this creates an issue: the
Reset can potentially perform an allocation that is not tracked by the
allocator.

Now ArrowBatchConverter no longer performs this reset and instead expects
the input batch to already have the correct schema and capacity.

Release note: None

Closes #43964

Co-authored-by: Azhng <[email protected]>
@craig craig bot closed this as completed in b1769c5 Apr 23, 2020
yuzefovich added a commit to yuzefovich/cockroach that referenced this issue Mar 26, 2021
…tmap

Issue cockroachdb#43964 outlines a possible memory corruption scenario that can
occur if we reset either the data or the nulls in the region to be
deserialized. Originally, to work around that, we were creating new vectors
whenever a new region of memory was deserialized from. Later, we removed
the vector resetting logic but kept the null resetting behavior in cockroachdb#47705.
This commit finishes the job by never using the deserialized region
directly and always performing a copy (for nulls). This behavior seems
less error-prone to me (it would have prevented a recently discovered
bug).

Release note: None
yuzefovich added a commit to yuzefovich/cockroach that referenced this issue Mar 31, 2021
…tmap

Issue cockroachdb#43964 outlines a possible memory corruption scenario that can
occur if we reset either the data or the nulls in the region to be
deserialized. Originally, to work around that, we were creating new vectors
whenever a new region of memory was deserialized from. Later, we removed
the vector resetting logic but kept the null resetting behavior in cockroachdb#47705.
This commit finishes the job by never using the deserialized region
directly and always performing a copy (for nulls). This behavior seems
less error-prone to me (it would have prevented a recently discovered
bug).

Release note: None
craig bot pushed a commit that referenced this issue Apr 1, 2021
62581: sql: properly synchronize the internal executor iterator r=yuzefovich a=yuzefovich

**sql: fix an edge case in the internal executor**

The `SetColumns` contract allows the argument to be nil, yet the
iterator of the streaming internal executor expects that the column
schema, if set, is non-nil. I don't think this could happen in practice,
but theoretically we could previously encounter an assertion error due
to none of the four fields in the `ieIteratorResult` object being
non-nil; now this is fixed.

Release note: None

**sql: add context to IncrementRowsAffected interface**

Release note: None

**sql: properly synchronize the internal executor iterator**

We recently merged an update to the internal executor to make it
streaming. Currently it is implemented by having two goroutines (the
iterator, "the consumer", and the connExecutor, "the producer"). The
communication between them is done over a buffered channel. As a result,
both goroutines can run concurrently.

However, a crucial point was overlooked - our `kv.Txn`s cannot be used
concurrently. Imagine a plan that reads from two tables each of which is
populated via the internal executor: if we read from the first, and then
from the second concurrently, we will have the concurrent usage of the
txn for that plan.

This commit fixes the problem by carving out a new abstraction to
optionally synchronize the execution of the internal executor with its
corresponding iterator. The abstraction comes in sync and async flavors.
In the sync form, the ieResultChannel ensures that the reader and writer
never operate concurrently and, additionally, provides a mechanism
whereby the reader may ensure that the writer observes an error when its
attempt to publish the previous row returns. This last point is critical
to ensure that transactions are not erroneously used after doing so has
become unsafe.

The async flavor is still used by the internal executor when it doesn't
return an iterator directly and instead executes the query to completion
itself.

Fixes: #62415.

Release note: None (no stable release with this bug).

62644: colcontainer: use explicit copy for nulls instead of replacing the bitmap r=yuzefovich a=yuzefovich

Issue #43964 outlines a possible memory corruption scenario that can
occur if we reset either the data or the nulls in the region to be
deserialized. Originally, to work around that, we were creating new vectors
whenever a new region of memory was deserialized from. Later, we removed
the vector resetting logic but kept the null resetting behavior in #47705.
This commit finishes the job by never using the deserialized region
directly and always performing a copy (for nulls). This behavior seems
less error-prone to me (it would have prevented a recently discovered
bug).

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>