Technical advisory for race condition between long running reads and replica removal #10478
Conversation
Netlify preview built with commit bf6df8f.
Force-pushed from 3c04074 to a9068e0.
Small tweaks to the first paragraph in Description:

> Cockroach Labs has discovered a race condition, where a long running read request submitted during replica removal may be evaluated on the removed replica, returning an empty result. Common causes of replica removal are range merges and replica rebalancing in the cluster. When a replica is removed there is a small window, where the data is already deleted but a read request will still see it as a valid replica. In this case a read request will always return an empty result for the part of the query that relied on this range.

And to Impact:

> This issue may impact all read operations, while the window of opportunity to see is very limited it may have happened to without your knowledge. As read operations include backup and incremental backup, if your cluster has experienced overloaded conditions we encourage you to verify your backups using the instructions provided above.

Open question as discussed on Slack: does the read have to start before and finish after all the liveness, lease change, and merge operations have completed?
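For readers less familiar with this class of bug, here is a minimal Go sketch of the check-then-act race the description above refers to. This is not CockroachDB's code: the `replica` type and its methods are invented for illustration, and the real window involves leases, liveness records, and Raft membership changes rather than a single in-memory map.

```go
// Minimal sketch (not CockroachDB's implementation) of the race described
// above: a read captures a replica handle before removal, the replica's data
// is then deleted, and the read evaluates against the now-empty replica,
// silently returning an empty result instead of an error.
package main

import (
	"fmt"
	"sync"
)

type replica struct {
	mu      sync.RWMutex
	removed bool
	data    map[string]string // the replica's key-value data
}

// read evaluates a point lookup. In the buggy pattern there is no re-check
// that the replica is still a valid member of the range, so a read that
// raced with removal just sees an empty store.
func (r *replica) read(key string) (string, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	v, ok := r.data[key]
	return v, ok
}

// remove simulates replica removal: the data is destroyed, but callers that
// already hold a handle to the replica are never told it is gone.
func (r *replica) remove() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.removed = true
	r.data = map[string]string{}
}

func main() {
	r := &replica{data: map[string]string{"k": "v"}}

	// A long-running read captures the replica handle here...
	handle := r

	// ...replica removal completes while the read is still in flight...
	r.remove()

	// ...and the read now evaluates on the removed replica: an empty result,
	// no error. The fix is to check replica validity at evaluation time
	// (e.g. inspect handle.removed under the lock) and retry elsewhere.
	if _, ok := handle.read("k"); !ok {
		fmt.Println("read returned an empty result on a removed replica")
	}
}
```

In the sketch the window is made explicit by ordering the calls; in the real system the interleaving is up to the scheduler, which is why the description characterizes the window as small.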
Force-pushed from a9068e0 to 01b3c0a.
Force-pushed from 01b3c0a to f9d7fb7.
Force-pushed from f9d7fb7 to c72a32f.
@nickelnickel updated
LGTM pending the addition from @shermanCRL.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @lunevalex, @nickelnickel, and @shermanCRL)
advisories/a64325.md, line 29 at r3 (raw file):
> This is resolved in CockroachDB by PR [#64324], which fixes the race condition between read requests and replica removal. The fix has been applied to maintenance releases of CockroachDB v20.1, v20.2, and v21.1.
Have we considered mentioning the exact release? I feel like people would ask, "What do I need to upgrade to?"
> ## Statement
>
> This is resolved in CockroachDB by PR [#64324], which fixes the race condition between read requests and replica removal.
>
> The fix has been applied to maintenance releases of CockroachDB v20.1, v20.2, and v21.1.
Are these released? We should say what patch version. It’d be convenient to offer the releases page link here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not yet; we will request extraordinary releases for these next week.
Additionally, in the last tech advisory, I added callouts on specific version pages, per historical precedent. Have a look at this PR: #10277. Those updates don't need to block this PR, but you'll want to follow up.
advisories/a64325.md (outdated):

> Users of CockroachDB v19.2, v20.1, or v20.2 are invited to upgrade to v20.1.16, v20.2.9, or a later version.
>
> This issue may impact the backups of CockroachDB. You can take the following steps to validate backups to make sure they are correct and complete. TODO: @mattsherman.
@pbardea Would you fill this in? I assume it’s a summary of this: https://github.com/cockroachlabs/support/issues/940#issuecomment-828814334
It’s important to say what the query can or can’t detect.
--
Format-wise, because these instructions are long, maybe they don’t belong in this document.
Should they be added as a comment on the public issue? cockroachdb/cockroach#64325
Thoughts @lunevalex ?
Will follow up. There are a few queries that can help detect data missing from full backups; I've posted them in https://github.com/cockroachlabs/support/issues/940#issuecomment-830420092. They each have tradeoffs in terms of how many edge cases they consider and the number of false positives returned. They're all fairly clunky, especially to explain in general terms in a TA.
Perhaps the best solution here is to find a way to detect these issues with high probability but with minimal false positives reported?
Of note: I don't think this issue is detectable at all in incremental backups.
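To make the discussion concrete, here is a rough Go sketch of the kind of row-count comparison being described. It is not one of the queries from the support issue linked above; the connection string, table names, and backup end-time timestamp are all hypothetical, and as the comment notes, a matching count is necessary but not sufficient, since counts can agree while rows differ, and incremental backups are likely not checkable this way at all.

```go
// Illustrative sketch only -- not the validation queries from the support
// issue linked above. It compares a row count from a table restored out of a
// backup against the live table as of the backup's end time. All names, the
// connection string, and the timestamp are hypothetical, and a matching
// count is necessary but not sufficient for a complete backup.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Row count in the live table at the backup's (hypothetical) end time.
	// AS OF SYSTEM TIME reads historical data and only works while that
	// timestamp is still within the table's GC window.
	var liveCount int
	if err := db.QueryRow(
		`SELECT count(*) FROM mydb.mytable AS OF SYSTEM TIME '2021-05-01 00:00:00'`,
	).Scan(&liveCount); err != nil {
		log.Fatal(err)
	}

	// Row count in the same table restored from the backup into a scratch
	// database (this assumes the RESTORE has already been run separately).
	var restoredCount int
	if err := db.QueryRow(
		`SELECT count(*) FROM scratch.mytable`,
	).Scan(&restoredCount); err != nil {
		log.Fatal(err)
	}

	if liveCount != restoredCount {
		fmt.Printf("row count mismatch: live=%d restored=%d\n", liveCount, restoredCount)
	} else {
		fmt.Println("row counts match (necessary, but not sufficient, for a complete backup)")
	}
}
```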
It might be better not to include this in the advisory for now; it's too much detail. We'll use the above comment as a resource for the support team, and think about where a public version would make sense.
I think we can go with a link and keep the advisory simple.
Note that read/write requests have a similar bug that hasn't been fixed yet; PR cockroachdb/cockroach#64471 has been submitted for this.
Co-authored-by: Erik Grinaker <[email protected]>
It's mentioned in the mitigation section.
Force-pushed from e1c78c4 to fd6aea1.
advisories/a64325.md (outdated):

> ## Impact
>
> This issue may impact all read operations, while the window of opportunity to see is very limited it may have happened to without your knowledge. Because this issue affects reads, it may impact backups. Our [support team](https://support.cockroachlabs.com/) has tools to detect and troubleshoot where this might be true, please contact them.
LGTM
> ## Description
>
> Cockroach Labs has discovered a race condition, where a long running read request submitted during replica removal may be evaluated on the removed replica, returning an empty result. Common causes of replica removal are range merges and replica rebalancing in the cluster. When a replica is removed there is a small window, where the data is already deleted but a read request will still see it as a valid replica. In this case a read request will always return an empty result for the part of the query that relied on this range.
Suggested change (drop "long running"; the rest of the paragraph is unchanged):
- Before: "Cockroach Labs has discovered a race condition, where a long running read request submitted during replica removal may be evaluated on the removed replica, returning an empty result. …"
- After: "Cockroach Labs has discovered a race condition, where a read request submitted during replica removal may be evaluated on the removed replica, returning an empty result. …"
advisories/a64325.md (outdated):

> ## Mitigation
>
> Users of CockroachDB v19.2, v20.1, or v20.2 are invited to upgrade to v20.1.16, v20.2.9, or a later version.
Suggested change:
- Before: "Users of CockroachDB v19.2, v20.1, or v20.2 are invited to upgrade to v20.1.16, v20.2.9, or a later version."
- After: "Users of CockroachDB v20.1 or v20.2 are invited to upgrade to v20.1.16, v20.2.9, or a later version."
advisories/a64325.md (outdated):

> ## Impact
>
> This issue may impact all read operations, while the window of opportunity to see is very limited it may have happened to without your knowledge. Because this issue affects reads, it may impact backups. Our [support team](https://support.cockroachlabs.com/) has tools to detect and troubleshoot where this might be true, please contact them.
Suggested change (the rest of the paragraph is unchanged):
- Before: "This issue may impact all read operations, while the window of opportunity to see is very limited it may have happened to without your knowledge. …"
- After: "This issue may impact all read operations. While the window of opportunity to see it is very limited, it may have happened without your knowledge. …"
This looks good. My only suggestions are minor:
- "Users of CockroachDB v20.1 or v20.2 are invited to upgrade to v20.1.16, v20.2.9, or a later version." → "Users of CockroachDB v20.1 or v20.2 are encouraged to upgrade to v20.1.16, v20.2.9, or a later version."
- "Our support team has tools to detect and troubleshoot where this might be true, please contact them." → "Our support team has tools to detect and troubleshoot this condition. If you would like help to check this in your environment, please contact them."
Co-authored-by: Erik Grinaker <[email protected]>
Force-pushed from fd6aea1 to bf6df8f.