Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: READ_UNCOMMITTED scan observes WriteIntentError, hits assertion #47219

Closed
nvanbenschoten opened this issue Apr 8, 2020 · 0 comments · Fixed by #47247
Closed

kv: READ_UNCOMMITTED scan observes WriteIntentError, hits assertion #47219

nvanbenschoten opened this issue Apr 8, 2020 · 0 comments · Fixed by #47247
Assignees
Labels
A-kv-transactions Relating to MVCC and the transactional model. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Comments

@nvanbenschoten
Copy link
Member

In #47120, we saw instances of the following assertion fire:

case *roachpb.WriteIntentError:
// Drop latches, but retain lock wait-queues.
g.AssertLatches()

At this point in the executeBatchWithConcurrencyRetries retry loop, we expect any request that hits a WriteIntentError to be holding latches. The assertion was firing because a meta2 scan, which uses the READ_UNCOMMITTED read consistency level and therefore does not acquire latches, was hitting a WriteIntentError. This should not be possible, as the MVCC layer is extra careful to never return this form of error to inconsistent reads.

Further debugging revealed that the WriteIntentError was actually coming from the ScanRequest's call into CollectIntentRows . CollectIntentRows is called to look up the provisional value associated with each intent observed by a READ_UNCOMMITTED scan. This is critical to avoid deadlock during range splits, where a range lookup may need to consider a meta2 intent to be the source of truth in terms of range addressing in order to resolve that intent.

It's completely unexpected that CollectIntentRows would be returning a WriteIntentError, because it uses MVCCGetAsTxn to lookup the provisional values of intents that we just saw during the scan. However, in the failure, we saw that the WriteIntentError was for a different transaction than the one responsible for the original intent. How could this be? Shouldn't we be working off of a consistent engine snapshot in MVCCScan?

Well, no. Read-only requests use a "read-only engine" returned from Engine.NewReadOnly. This engine by itself does not guarantee a consistent snapshot across all uses. The read-only engine contains two cached iterators, one for normal iteration and one for prefix iteration. Each iterator has an implicit snapshot of the DB, but those can be slightly different snapshots. In this case, the MVCCScan used a non-prefix iterator and the MVCCGetAsTxn used a prefix iterator. And since the READ_UNCOMMITTED scan wasn't holding latches, it had no guarantee of isolation within its keyspan across its two iterators.

So to summarize what happened here:

  • a meta2 READ_UNCOMMITTED came in and did not acquire latches
  • is scanned using a non-prefix iterator and hit an intent
  • it passed this intent to CollectIntentRows
  • this used a different prefix iterator and hit a different intent
  • some other write must have slipped in and resolved and replaced the first intent with the second intent between the two iterator creations
  • the scan returned a WriteIntentError to the concurrency manager
  • the assertion blew up
@nvanbenschoten nvanbenschoten added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-transactions Relating to MVCC and the transactional model. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Apr 8, 2020
@nvanbenschoten nvanbenschoten self-assigned this Apr 8, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 9, 2020
Fixes cockroachdb#47219.

This commit addresses the bug diagnosed and explained in cockroachdb#47219. In that
issue, we saw an assertion failure all the way up in the concurrency
manager because a READ_UNCOMMITTED scan was hitting a WriteIntentError,
which should not be possible. The root cause of this issue was that
READ_UNCOMMITTED scans were mixing prefix and non-prefix iterators
pulled from a read-only engine between the time that they were
collecting intent keys and they were returning to fetch the provisional
values for those keys. This mixing of iterators did not guarantee that
the two stages of the operation would observe a consistent snapshot of
the underlying engine, and because the READ_UNCOMMITTED scans also did
not acquire latches, writes were able to slip in and change the intent
while the scan wasn't looking. This caused the scan to throw a
WriteIntentError for the new intent transaction, which badly confused
other parts of the system (rightfully so).

This commit fixes this issue in a few different ways:
1. it ensures that we always use the same iterator type (prefix or non-prefix)
   when retrieving the provisional values for a collection of intents retrieved
   by an earlier scan during READ_UNCOMMITTED operations.
2. it adds an assertion inside of batcheval.CollectIntentRows that the
   function never returns a WriteIntentError. This would have caught the bug
   much more easily, especially back before we had the concurrency manager
   assertion and this bug could have materialized as stuck range lookups and
   potentially even deadlocked splits due to the dependency cycle between
   those two operations.
3. it documents the limited guarantees that read-only engines provide with
   respect to consistent engine snapshots across iterator instances.

We'll want to backport this fix as far back as possible. It won't crash
earlier releases of Cockroach, but as stated above, it might cause even
more disastrous results. REMINDER: when backporting, remember to change
the release note.

Release notes (bug fix): a bug that could cause Cockroach processes to
crash due to an assertion failure with the text "expected latches held,
found none" has been fixed.

Release justification: fixes a high-priority bug in existing
functionality. The bug became louder (now crashes servers) due to recent
changes that added new assertions into the code.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 9, 2020
Fixes cockroachdb#47219.

This commit addresses the bug diagnosed and explained in cockroachdb#47219. In that
issue, we saw an assertion failure all the way up in the concurrency
manager because a READ_UNCOMMITTED scan was hitting a WriteIntentError,
which should not be possible. The root cause of this issue was that
READ_UNCOMMITTED scans were mixing prefix and non-prefix iterators
pulled from a read-only engine between the time that they were
collecting intent keys and they were returning to fetch the provisional
values for those keys. This mixing of iterators did not guarantee that
the two stages of the operation would observe a consistent snapshot of
the underlying engine, and because the READ_UNCOMMITTED scans also did
not acquire latches, writes were able to slip in and change the intent
while the scan wasn't looking. This caused the scan to throw a
WriteIntentError for the new intent transaction, which badly confused
other parts of the system (rightfully so).

This commit fixes this issue in a few different ways:
1. it ensures that we always use the same iterator type (prefix or non-prefix)
   when retrieving the provisional values for a collection of intents retrieved
   by an earlier scan during READ_UNCOMMITTED operations.
2. it adds an assertion inside of batcheval.CollectIntentRows that the
   function never returns a WriteIntentError. This would have caught the bug
   much more easily, especially back before we had the concurrency manager
   assertion and this bug could have materialized as stuck range lookups and
   potentially even deadlocked splits due to the dependency cycle between
   those two operations.
3. it documents the limited guarantees that read-only engines provide with
   respect to consistent engine snapshots across iterator instances.

We'll want to backport this fix as far back as possible. It won't crash
earlier releases of Cockroach, but as stated above, it might cause even
more disastrous results. REMINDER: when backporting, remember to change
the release note.

Release notes (bug fix): a bug that could cause Cockroach processes to
crash due to an assertion failure with the text "expected latches held,
found none" has been fixed.

Release justification: fixes a high-priority bug in existing
functionality. The bug became louder (now crashes servers) due to recent
changes that added new assertions into the code.
craig bot pushed a commit that referenced this issue Apr 9, 2020
44680: ui: Add Storybook to the project r=dhartunian a=koorosh

Depends on: cockroachdb/yarn-vendored#16

- .storybook directory contains custom Storybook and Webpack configurations
- Added Badge stories as a working example
- "antd/es/tooltip/style/css" styles has been moved to the App component
because they apply globally.
- stick to the exact [email protected] version because of incompatibility with
babel-polyfill versions in latest versions of storybook.

47247: kv: don't mix prefix and non-prefix iters when collecting intents r=nvanbenschoten a=nvanbenschoten

Fixes #47219.

This commit addresses the bug diagnosed and explained in #47219. In that issue, we saw an assertion failure all the way up in the concurrency manager because a `READ_UNCOMMITTED` scan was hitting a `WriteIntentError`, which should not be possible. The root cause of this issue was that `READ_UNCOMMITTED` scans were mixing prefix and non-prefix iterators pulled from a read-only engine between the time that they were collecting intent keys and they were returning to fetch the provisional values for those keys. This mixing of iterators did not guarantee that the two stages of the operation would observe a consistent snapshot of the underlying engine, and because the `READ_UNCOMMITTED` scans also did not acquire latches, writes were able to slip in and change the intent while the scan wasn't looking. This caused the scan to throw a `WriteIntentError` for the new intent transaction, which badly confused other parts of the system (rightfully so).

This commit fixes this issue in a few different ways:
1. it ensures that we always use the same iterator type (prefix or non-prefix) when retrieving the provisional values for a collection of intents retrieved by an earlier scan during `READ_UNCOMMITTED` operations.
2. it adds an assertion inside of `batcheval.CollectIntentRows` that the function never returns a `WriteIntentError`. This would have caught the bug much more easily, especially back before we had the concurrency manager assertion and this bug could have materialized as stuck range lookups and potentially even deadlocked splits due to the dependency cycle between those two operations.
3. it documents the limited guarantees that read-only engines provide with respect to consistent engine snapshots across iterator instances.

We'll want to backport this fix as far back as possible. It won't crash earlier releases of Cockroach, but as stated above, it might cause even more disastrous results. REMINDER: when backporting, remember to change the release note.

Release notes (bug fix): a bug that could cause Cockroach processes to crash due to an assertion failure with the text "expected latches held, found none" has been fixed.

Release justification: fixes a high-priority bug in existing functionality. The bug became louder (now crashes servers) due to recent changes that added new assertions into the code.

Co-authored-by: Andrii Vorobiov <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in b9e925b Apr 9, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Apr 9, 2020
Fixes cockroachdb#47219.

This commit addresses the bug diagnosed and explained in cockroachdb#47219. In that
issue, we saw an assertion failure all the way up in the concurrency
manager because a READ_UNCOMMITTED scan was hitting a WriteIntentError,
which should not be possible. The root cause of this issue was that
READ_UNCOMMITTED scans were mixing prefix and non-prefix iterators
pulled from a read-only engine between the time that they were
collecting intent keys and they were returning to fetch the provisional
values for those keys. This mixing of iterators did not guarantee that
the two stages of the operation would observe a consistent snapshot of
the underlying engine, and because the READ_UNCOMMITTED scans also did
not acquire latches, writes were able to slip in and change the intent
while the scan wasn't looking. This caused the scan to throw a
WriteIntentError for the new intent transaction, which badly confused
other parts of the system (rightfully so).

This commit fixes this issue in a few different ways:
1. it ensures that we always use the same iterator type (prefix or non-prefix)
   when retrieving the provisional values for a collection of intents retrieved
   by an earlier scan during READ_UNCOMMITTED operations.
2. it adds an assertion inside of batcheval.CollectIntentRows that the
   function never returns a WriteIntentError. This would have caught the bug
   much more easily, especially back before we had the concurrency manager
   assertion and this bug could have materialized as stuck range lookups and
   potentially even deadlocked splits due to the dependency cycle between
   those two operations.
3. it documents the limited guarantees that read-only engines provide with
   respect to consistent engine snapshots across iterator instances.

We'll want to backport this fix as far back as possible. It won't crash
earlier releases of Cockroach, but as stated above, it might cause even
more disastrous results. REMINDER: when backporting, remember to change
the release note.

Release notes (bug fix): a bug that could cause Cockroach processes to
crash due to an assertion failure with the text "expected latches held,
found none" has been fixed.

Release justification: fixes a high-priority bug in existing
functionality. The bug became louder (now crashes servers) due to recent
changes that added new assertions into the code.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-transactions Relating to MVCC and the transactional model. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant