-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kv: READ_UNCOMMITTED scan observes WriteIntentError, hits assertion #47219
Labels
A-kv-transactions
Relating to MVCC and the transactional model.
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
release-blocker
Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Comments
nvanbenschoten
added
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
A-kv-transactions
Relating to MVCC and the transactional model.
release-blocker
Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
labels
Apr 8, 2020
This was referenced Apr 8, 2020
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Apr 9, 2020
Fixes cockroachdb#47219. This commit addresses the bug diagnosed and explained in cockroachdb#47219. In that issue, we saw an assertion failure all the way up in the concurrency manager because a READ_UNCOMMITTED scan was hitting a WriteIntentError, which should not be possible. The root cause of this issue was that READ_UNCOMMITTED scans were mixing prefix and non-prefix iterators pulled from a read-only engine between the time that they were collecting intent keys and they were returning to fetch the provisional values for those keys. This mixing of iterators did not guarantee that the two stages of the operation would observe a consistent snapshot of the underlying engine, and because the READ_UNCOMMITTED scans also did not acquire latches, writes were able to slip in and change the intent while the scan wasn't looking. This caused the scan to throw a WriteIntentError for the new intent transaction, which badly confused other parts of the system (rightfully so). This commit fixes this issue in a few different ways: 1. it ensures that we always use the same iterator type (prefix or non-prefix) when retrieving the provisional values for a collection of intents retrieved by an earlier scan during READ_UNCOMMITTED operations. 2. it adds an assertion inside of batcheval.CollectIntentRows that the function never returns a WriteIntentError. This would have caught the bug much more easily, especially back before we had the concurrency manager assertion and this bug could have materialized as stuck range lookups and potentially even deadlocked splits due to the dependency cycle between those two operations. 3. it documents the limited guarantees that read-only engines provide with respect to consistent engine snapshots across iterator instances. We'll want to backport this fix as far back as possible. It won't crash earlier releases of Cockroach, but as stated above, it might cause even more disastrous results. REMINDER: when backporting, remember to change the release note. Release notes (bug fix): a bug that could cause Cockroach processes to crash due to an assertion failure with the text "expected latches held, found none" has been fixed. Release justification: fixes a high-priority bug in existing functionality. The bug became louder (now crashes servers) due to recent changes that added new assertions into the code.
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Apr 9, 2020
Fixes cockroachdb#47219. This commit addresses the bug diagnosed and explained in cockroachdb#47219. In that issue, we saw an assertion failure all the way up in the concurrency manager because a READ_UNCOMMITTED scan was hitting a WriteIntentError, which should not be possible. The root cause of this issue was that READ_UNCOMMITTED scans were mixing prefix and non-prefix iterators pulled from a read-only engine between the time that they were collecting intent keys and they were returning to fetch the provisional values for those keys. This mixing of iterators did not guarantee that the two stages of the operation would observe a consistent snapshot of the underlying engine, and because the READ_UNCOMMITTED scans also did not acquire latches, writes were able to slip in and change the intent while the scan wasn't looking. This caused the scan to throw a WriteIntentError for the new intent transaction, which badly confused other parts of the system (rightfully so). This commit fixes this issue in a few different ways: 1. it ensures that we always use the same iterator type (prefix or non-prefix) when retrieving the provisional values for a collection of intents retrieved by an earlier scan during READ_UNCOMMITTED operations. 2. it adds an assertion inside of batcheval.CollectIntentRows that the function never returns a WriteIntentError. This would have caught the bug much more easily, especially back before we had the concurrency manager assertion and this bug could have materialized as stuck range lookups and potentially even deadlocked splits due to the dependency cycle between those two operations. 3. it documents the limited guarantees that read-only engines provide with respect to consistent engine snapshots across iterator instances. We'll want to backport this fix as far back as possible. It won't crash earlier releases of Cockroach, but as stated above, it might cause even more disastrous results. REMINDER: when backporting, remember to change the release note. Release notes (bug fix): a bug that could cause Cockroach processes to crash due to an assertion failure with the text "expected latches held, found none" has been fixed. Release justification: fixes a high-priority bug in existing functionality. The bug became louder (now crashes servers) due to recent changes that added new assertions into the code.
craig bot
pushed a commit
that referenced
this issue
Apr 9, 2020
44680: ui: Add Storybook to the project r=dhartunian a=koorosh Depends on: cockroachdb/yarn-vendored#16 - .storybook directory contains custom Storybook and Webpack configurations - Added Badge stories as a working example - "antd/es/tooltip/style/css" styles has been moved to the App component because they apply globally. - stick to the exact [email protected] version because of incompatibility with babel-polyfill versions in latest versions of storybook. 47247: kv: don't mix prefix and non-prefix iters when collecting intents r=nvanbenschoten a=nvanbenschoten Fixes #47219. This commit addresses the bug diagnosed and explained in #47219. In that issue, we saw an assertion failure all the way up in the concurrency manager because a `READ_UNCOMMITTED` scan was hitting a `WriteIntentError`, which should not be possible. The root cause of this issue was that `READ_UNCOMMITTED` scans were mixing prefix and non-prefix iterators pulled from a read-only engine between the time that they were collecting intent keys and they were returning to fetch the provisional values for those keys. This mixing of iterators did not guarantee that the two stages of the operation would observe a consistent snapshot of the underlying engine, and because the `READ_UNCOMMITTED` scans also did not acquire latches, writes were able to slip in and change the intent while the scan wasn't looking. This caused the scan to throw a `WriteIntentError` for the new intent transaction, which badly confused other parts of the system (rightfully so). This commit fixes this issue in a few different ways: 1. it ensures that we always use the same iterator type (prefix or non-prefix) when retrieving the provisional values for a collection of intents retrieved by an earlier scan during `READ_UNCOMMITTED` operations. 2. it adds an assertion inside of `batcheval.CollectIntentRows` that the function never returns a `WriteIntentError`. This would have caught the bug much more easily, especially back before we had the concurrency manager assertion and this bug could have materialized as stuck range lookups and potentially even deadlocked splits due to the dependency cycle between those two operations. 3. it documents the limited guarantees that read-only engines provide with respect to consistent engine snapshots across iterator instances. We'll want to backport this fix as far back as possible. It won't crash earlier releases of Cockroach, but as stated above, it might cause even more disastrous results. REMINDER: when backporting, remember to change the release note. Release notes (bug fix): a bug that could cause Cockroach processes to crash due to an assertion failure with the text "expected latches held, found none" has been fixed. Release justification: fixes a high-priority bug in existing functionality. The bug became louder (now crashes servers) due to recent changes that added new assertions into the code. Co-authored-by: Andrii Vorobiov <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]>
nvanbenschoten
added a commit
to nvanbenschoten/cockroach
that referenced
this issue
Apr 9, 2020
Fixes cockroachdb#47219. This commit addresses the bug diagnosed and explained in cockroachdb#47219. In that issue, we saw an assertion failure all the way up in the concurrency manager because a READ_UNCOMMITTED scan was hitting a WriteIntentError, which should not be possible. The root cause of this issue was that READ_UNCOMMITTED scans were mixing prefix and non-prefix iterators pulled from a read-only engine between the time that they were collecting intent keys and they were returning to fetch the provisional values for those keys. This mixing of iterators did not guarantee that the two stages of the operation would observe a consistent snapshot of the underlying engine, and because the READ_UNCOMMITTED scans also did not acquire latches, writes were able to slip in and change the intent while the scan wasn't looking. This caused the scan to throw a WriteIntentError for the new intent transaction, which badly confused other parts of the system (rightfully so). This commit fixes this issue in a few different ways: 1. it ensures that we always use the same iterator type (prefix or non-prefix) when retrieving the provisional values for a collection of intents retrieved by an earlier scan during READ_UNCOMMITTED operations. 2. it adds an assertion inside of batcheval.CollectIntentRows that the function never returns a WriteIntentError. This would have caught the bug much more easily, especially back before we had the concurrency manager assertion and this bug could have materialized as stuck range lookups and potentially even deadlocked splits due to the dependency cycle between those two operations. 3. it documents the limited guarantees that read-only engines provide with respect to consistent engine snapshots across iterator instances. We'll want to backport this fix as far back as possible. It won't crash earlier releases of Cockroach, but as stated above, it might cause even more disastrous results. REMINDER: when backporting, remember to change the release note. Release notes (bug fix): a bug that could cause Cockroach processes to crash due to an assertion failure with the text "expected latches held, found none" has been fixed. Release justification: fixes a high-priority bug in existing functionality. The bug became louder (now crashes servers) due to recent changes that added new assertions into the code.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
A-kv-transactions
Relating to MVCC and the transactional model.
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
release-blocker
Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
In #47120, we saw instances of the following assertion fire:
cockroach/pkg/kv/kvserver/replica_send.go
Lines 247 to 249 in 27debe8
At this point in the
executeBatchWithConcurrencyRetries
retry loop, we expect any request that hits aWriteIntentError
to be holding latches. The assertion was firing because a meta2 scan, which uses theREAD_UNCOMMITTED
read consistency level and therefore does not acquire latches, was hitting aWriteIntentError
. This should not be possible, as the MVCC layer is extra careful to never return this form of error to inconsistent reads.Further debugging revealed that the
WriteIntentError
was actually coming from theScanRequest
's call intoCollectIntentRows
.CollectIntentRows
is called to look up the provisional value associated with each intent observed by aREAD_UNCOMMITTED
scan. This is critical to avoid deadlock during range splits, where a range lookup may need to consider a meta2 intent to be the source of truth in terms of range addressing in order to resolve that intent.It's completely unexpected that
CollectIntentRows
would be returning aWriteIntentError
, because it usesMVCCGetAsTxn
to lookup the provisional values of intents that we just saw during the scan. However, in the failure, we saw that theWriteIntentError
was for a different transaction than the one responsible for the original intent. How could this be? Shouldn't we be working off of a consistent engine snapshot inMVCCScan
?Well, no. Read-only requests use a "read-only engine" returned from
Engine.NewReadOnly
. This engine by itself does not guarantee a consistent snapshot across all uses. The read-only engine contains two cached iterators, one for normal iteration and one for prefix iteration. Each iterator has an implicit snapshot of the DB, but those can be slightly different snapshots. In this case, theMVCCScan
used a non-prefix iterator and theMVCCGetAsTxn
used a prefix iterator. And since theREAD_UNCOMMITTED
scan wasn't holding latches, it had no guarantee of isolation within its keyspan across its two iterators.So to summarize what happened here:
READ_UNCOMMITTED
came in and did not acquire latchesCollectIntentRows
The text was updated successfully, but these errors were encountered: