Apply suggestions from code review
Co-authored-by: Erik Grinaker <[email protected]>
lunevalex and erikgrinaker committed May 10, 2021
1 parent 7b1fd71 commit bf6df8f
Showing 2 changed files with 11 additions and 10 deletions.
2 changes: 1 addition & 1 deletion _data/advisories.yml
@@ -1,6 +1,6 @@
- advisory: 64325
summary: Race condition between reads and replica removal
-versions: 19.2 and later
+versions: 20.1 and later
date: May 3, 2021
- advisory: 63162
summary: Invalid incremental backups under certain circumstances
19 changes: 10 additions & 9 deletions advisories/a64325.md
@@ -1,6 +1,6 @@
---
title: Technical Advisory 64325
-summary: Race condition between long running reads and replica removal
+summary: Race condition between reads and replica removal
toc: true
---

@@ -10,32 +10,33 @@ Publication date: May 3, 2021

Cockroach Labs has discovered a race condition where a long-running read request submitted during replica removal may be evaluated on the removed replica, returning an empty result. Common causes of replica removal are range merges and replica rebalancing in the cluster. When a replica is removed, there is a small window where the data is already deleted but a read request will still see it as a valid replica. In this case, a read request will always return an empty result for the part of the query that relied on this range.

-This is a very rare scenario that can only be encountered, if a number of very unlikely conditions are met:
+This is a rare scenario that can only be encountered if a number of conditions are met:

- A read request checks that the node is holding the lease and begins processing
- The node loses the lease before the read request is able to lock the replica for reading
- The read request checks the state of the replica and makes sure it is alive
-- The new leaseholder decides to immediately removes the old leaseholder replica from the range, due to a rebalance or range merge.
+- The new leaseholder decides to immediately remove the old leaseholder replica from the range, due to a rebalance or range merge.
- The removal is processed before the read request starts reading the data from the replica
- The read request observes the replica in a post-deletion state, returning an empty result set.

Due to the nature of this race condition, we believe it is only likely to happen on an extremely overloaded node that is struggling to keep up with processing requests and experiences natural delays between each step in the processing of a read request.
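
A minimal Go sketch of this check-then-act shape (purely illustrative; the `replica` type and its `read`/`remove` methods are hypothetical stand-ins, not CockroachDB internals). The liveness check and the actual read are not covered by one critical section, so a removal can land in between:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// replica is a toy stand-in for a range replica: a liveness flag and some data.
type replica struct {
	mu    sync.RWMutex
	alive bool
	data  map[string]string
}

// read models the problematic ordering: liveness is checked first, and the
// data is only read afterwards, under a separate lock acquisition.
func (r *replica) read(key string) (string, bool) {
	r.mu.RLock()
	alive := r.alive
	r.mu.RUnlock()
	if !alive {
		return "", false // removal already observed; caller would retry elsewhere
	}
	// Race window: on a slow, overloaded node the removal can run right here.
	time.Sleep(10 * time.Millisecond)
	r.mu.RLock()
	defer r.mu.RUnlock()
	v, ok := r.data[key] // data may already be gone: empty result
	return v, ok
}

// remove models replica removal after a rebalance or range merge.
func (r *replica) remove() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.alive = false
	r.data = map[string]string{}
}

func main() {
	r := &replica{alive: true, data: map[string]string{"k": "v"}}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		v, ok := r.read("k")
		fmt.Println("read:", v, ok)
	}()
	go func() {
		defer wg.Done()
		r.remove()
	}()
	wg.Wait()
}
```

The wider the gap between the check and the read, the more likely the removal slips in, which is why an overloaded node makes the conditions above plausible.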

-This issue affects CockroachDB [versions](/docs/releases/) 19.2 and later.
+This issue affects CockroachDB [versions](/docs/releases/) 20.1 and later.

## Statement
-This is resolved in CockroachDB by PR [#64324], which fix the race condition between read requests and replica removal.
+This is resolved in CockroachDB by PR [#64324], which fixes the race condition between read requests and replica removal.

The fix has been applied to maintenance releases of CockroachDB v20.1, v20.2, and v21.1.

This public issue is tracked by [#64325].

## Mitigation

-Users of CockroachDB v19.2, v20.1, or v20.2 are invited to upgrade to v20.1.16, v20.2.9, or a later version.
-
-This issue may impact the backups of CockroachDB. You can take the following steps to validate backups to make sure they are correct and complete. TODO: @mattsherman.
+Users of CockroachDB v20.1 or v20.2 are encouraged to upgrade to v20.1.16, v20.2.9, or a later version.

## Impact

-This issue may impact all read operations, while the window of opportunity to see is very limited it may have happened to without your knowledge. As read operations include backup and incremental backup, if your cluster has experienced overloaded conditions we encourage you to verify your backups using the instructions provided above.
+This issue may impact all read operations; while the window of opportunity is very limited, it may have happened without your knowledge. Because this issue affects reads, it may impact backups. Our [support team](https://support.cockroachlabs.com/) has tools to detect and troubleshoot this condition. If you would like help to check this in your environment, please contact them.

[#64324]: https://github.com/cockroachdb/cockroach/pull/64324
[#64325]: https://github.com/cockroachdb/cockroach/issues/64325
