
Technical advisory for race condition between long running reads and replica removal #10478

Merged

lunevalex merged 3 commits into master from alex/read-corruption-advisory on May 10, 2021

Conversation

lunevalex
Contributor


@cockroach-teamcity
Member

This change is Reviewable

@github-actions

Files changed:

_data/advisories.yml
advisories/a64325.md

@netlify

netlify bot commented Apr 30, 2021

Netlify Preview

Built with commit bf6df8f

https://deploy-preview-10478--cockroachdb-docs.netlify.app

lunevalex force-pushed the alex/read-corruption-advisory branch from 3c04074 to a9068e0 on April 30, 2021 16:54
@github-actions

Files changed:

_data/advisories.yml
advisories/a64325.md

lunevalex requested a review from nickelnickel on April 30, 2021 17:50
@nickelnickel

Small tweaks to first paragraph in Description:

Cockroach Labs has discovered a race condition, where a long running read request submitted during replica removal may be evaluated on the removed replica, returning an empty result. Common causes of replica removal are range merges and replica rebalancing in the cluster. When a replica is removed there is a small window, where the data is already deleted but a read request will still see it as a valid replica. In this case a read request will always return an empty result for the part of the query that relied on this range.


And to Impact:

This issue may impact all read operations, while the window of opportunity to see is very limited it may have happened to without your knowledge. As read operations include backup and incremental backup, if your cluster has experienced overloaded conditions we encourage you to verify your backups using the instructions provided above.


Open question, as discussed on Slack, about the step "The replica is marked for removal and data is cleared":

Does the read have to start before, and finish after, all of the liveness, lease change, and merge operations have completed?

lunevalex force-pushed the alex/read-corruption-advisory branch from a9068e0 to 01b3c0a on April 30, 2021 18:25
@github-actions

Files changed:

_data/advisories.yml
advisories/a64325.md

lunevalex force-pushed the alex/read-corruption-advisory branch from 01b3c0a to f9d7fb7 on April 30, 2021 19:36
@github-actions

Files changed:

_data/advisories.yml
advisories/a64325.md

lunevalex force-pushed the alex/read-corruption-advisory branch from f9d7fb7 to c72a32f on April 30, 2021 19:38
@github-actions

Files changed:

_data/advisories.yml
advisories/a64325.md

@lunevalex
Contributor Author

@nickelnickel updated

@nickelnickel

LGTM pending the addition from @shermanCRL.

lunevalex requested a review from shermanCRL on April 30, 2021 19:52
Contributor

@mwang1026 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lunevalex, @nickelnickel, and @shermanCRL)


advisories/a64325.md, line 29 at r3 (raw file):

This is resolved in CockroachDB by PR [#64324], which fixes the race condition between read requests and replica removal.

The fix has been applied to maintenance releases of CockroachDB v20.1, v20.2, and v21.1.

Have we considered mentioning the exact release? I feel like people would ask "what do I need to upgrade to"?

## Statement
This is resolved in CockroachDB by PR [#64324], which fixes the race condition between read requests and replica removal.

The fix has been applied to maintenance releases of CockroachDB v20.1, v20.2, and v21.1.
Contributor


Are these released? We should say what patch version. It’d be convenient to offer the releases page link here too.

Contributor Author


Not yet; we will request extraordinary releases for these next week.

@shermanCRL
Contributor

Additionally, in the last tech advisory, I added callouts on specific version pages, per historical precedent. Have a look at this PR: #10277

Those updates don’t need to block this PR, but you’ll want to follow up.


Users of CockroachDB v19.2, v20.1, or v20.2 are invited to upgrade to v20.1.16, v20.2.9, or a later version.

This issue may impact the backups of CockroachDB. You can take the following steps to validate backups to make sure they are correct and complete. TODO: @mattsherman.
Contributor


@pbardea Would you fill this in? I assume it’s a summary of this: https://github.com/cockroachlabs/support/issues/940#issuecomment-828814334

It’s important to say what the query can or can’t detect.

--

Format-wise, because these instructions are long, maybe they don’t belong in this document.

Should they be added as a comment on the public issue? cockroachdb/cockroach#64325

Thoughts, @lunevalex?

Contributor


Will follow up. There are a few queries that can help detect rows missing from full backups. I've posted them in https://github.com/cockroachlabs/support/issues/940#issuecomment-830420092. They each have tradeoffs in terms of how many edge cases they consider and the number of false positives returned. They're all fairly clunky, especially to explain in general terms in a TA.

Perhaps the best solution here is to find a way to detect these issues with high probability but with minimal false positives reported?

Of note: I don't think this issue is detectable at all in incremental backups.
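To make the shape of such a check concrete, here is a minimal sketch of a row-count comparison, assuming a hypothetical table defaultdb.users, a placeholder backup location, and a backup end time that is still inside the table's garbage-collection window; it is not one of the queries from the support issue.

```sql
-- Hypothetical example only: table name, backup URI, and timestamp are placeholders.
-- 1. Restore the suspect full backup into a scratch database.
CREATE DATABASE backup_check;
RESTORE TABLE defaultdb.users
  FROM 'nodelocal://1/backups/2021-04-30'
  WITH into_db = 'backup_check';

-- 2. Count rows in the live table as of the backup's end time
--    (requires that timestamp to still be within the GC window).
SELECT count(*) FROM defaultdb.users
  AS OF SYSTEM TIME '2021-04-30 19:00:00';

-- 3. Count rows in the restored copy; a lower count than the live
--    table suggests the backup read may have returned an empty
--    result for one or more ranges.
SELECT count(*) FROM backup_check.users;
```

As the comment above notes, checks of this kind involve tradeoffs in the edge cases they cover and the false positives they report, and they do not help with incremental backups.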

Contributor


It might be a better decision not to include this in the advisory for now; too much detail. We’ll use the above comment as a resource for the support team, and think about where a public version would make sense.

Contributor Author


I think we can go with a link and keep the advisory simple.

Contributor

@erikgrinaker left a comment


Note that read/write requests have a similar bug that hasn't been fixed yet; PR cockroachdb/cockroach#64471 has been submitted for it.

Co-authored-by: Erik Grinaker <[email protected]>
@github-actions

github-actions bot commented May 2, 2021

Files changed:

_data/advisories.yml
advisories/a64325.md

2 similar comments

@lunevalex
Contributor Author

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @lunevalex, @nickelnickel, and @shermanCRL)

advisories/a64325.md, line 29 at r3 (raw file):

This is resolved in CockroachDB by PR [#64324], which fixes the race condition between read requests and replica removal.

The fix has been applied to maintenance releases of CockroachDB v20.1, v20.2, and v21.1.

Have we considered mentioning the exact release? I feel like people would ask "what do I need to upgrade to"?

It's mentioned in the mitigation section

lunevalex force-pushed the alex/read-corruption-advisory branch from e1c78c4 to fd6aea1 on May 3, 2021 20:26
@github-actions

github-actions bot commented May 3, 2021

Files changed:

_data/advisories.yml
advisories/a64325.md


## Impact

This issue may impact all read operations, while the window of opportunity to see is very limited it may have happened to without your knowledge. Because this issue affects reads, it may impact backups. Our [support team](https://support.cockroachlabs.com/) has tools to detect and troubleshoot where this might be true, please contact them.
Contributor


LGTM


## Description

Cockroach Labs has discovered a race condition, where a long running read request submitted during replica removal may be evaluated on the removed replica, returning an empty result. Common causes of replica removal are range merges and replica rebalancing in the cluster. When a replica is removed there is a small window, where the data is already deleted but a read request will still see it as a valid replica. In this case a read request will always return an empty result for the part of the query that relied on this range.
Contributor


Suggested change
Cockroach Labs has discovered a race condition, where a long running read request submitted during replica removal may be evaluated on the removed replica, returning an empty result. Common causes of replica removal are range merges and replica rebalancing in the cluster. When a replica is removed there is a small window, where the data is already deleted but a read request will still see it as a valid replica. In this case a read request will always return an empty result for the part of the query that relied on this range.
Cockroach Labs has discovered a race condition, where a read request submitted during replica removal may be evaluated on the removed replica, returning an empty result. Common causes of replica removal are range merges and replica rebalancing in the cluster. When a replica is removed there is a small window, where the data is already deleted but a read request will still see it as a valid replica. In this case a read request will always return an empty result for the part of the query that relied on this range.


## Mitigation

Users of CockroachDB v19.2, v20.1, or v20.2 are invited to upgrade to v20.1.16, v20.2.9, or a later version.
Contributor


Suggested change
Users of CockroachDB v19.2, v20.1, or v20.2 are invited to upgrade to v20.1.16, v20.2.9, or a later version.
Users of CockroachDB v20.1 or v20.2 are invited to upgrade to v20.1.16, v20.2.9, or a later version.
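For readers applying that mitigation, the running version can be confirmed before and after the upgrade with standard CockroachDB statements; a minimal sketch:

```sql
-- Build version of the node this session is connected to.
SELECT version();

-- Cluster-wide active version, which only advances once an upgrade is finalized.
SHOW CLUSTER SETTING version;
```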


## Impact

This issue may impact all read operations, while the window of opportunity to see is very limited it may have happened to without your knowledge. Because this issue affects reads, it may impact backups. Our [support team](https://support.cockroachlabs.com/) has tools to detect and troubleshoot where this might be true, please contact them.
Contributor


Suggested change
This issue may impact all read operations, while the window of opportunity to see is very limited it may have happened to without your knowledge. Because this issue affects reads, it may impact backups. Our [support team](https://support.cockroachlabs.com/) has tools to detect and troubleshoot where this might be true, please contact them.
This issue may impact all read operations. While the window of opportunity to see it is very limited, it may have happened without your knowledge. Because this issue affects reads, it may impact backups. Our [support team](https://support.cockroachlabs.com/) has tools to detect and troubleshoot where this might be true, please contact them.


@nickelnickel left a comment


This looks good. My only suggestions are minor:

  • "Users of CockroachDB v20.1 or v20.2 are invited to upgrade to v20.1.16, v20.2.9, or a later version."
    → "Users of CockroachDB v20.1 or v20.2 are encouraged to upgrade to v20.1.16, v20.2.9, or a later version."
  • "Our support team has tools to detect and troubleshoot where this might be true, please contact them."
    → "Our support team has tools to detect and troubleshoot this condition. If you would like help to check this in your environment, please contact them."

lunevalex force-pushed the alex/read-corruption-advisory branch from fd6aea1 to bf6df8f on May 10, 2021 20:55
@github-actions

Files changed:

_data/advisories.yml
advisories/a64325.md

lunevalex merged commit d449e36 into master on May 10, 2021
lnhsingh deleted the alex/read-corruption-advisory branch on June 23, 2022 14:02