-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: address migration concern with node liveness #54216
Merged
craig
merged 1 commit into
cockroachdb:master
from
irfansharif:200910.liveness-fallback
Sep 11, 2020
Merged
kvserver: address migration concern with node liveness #54216
craig
merged 1 commit into
cockroachdb:master
from
irfansharif:200910.liveness-fallback
Sep 11, 2020
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This was referenced Sep 10, 2020
irfansharif
force-pushed
the
200910.liveness-fallback
branch
from
September 10, 2020 22:16
805afef
to
6088b66
Compare
knz
approved these changes
Sep 11, 2020
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sad this is necessary, but LGTM
Reviewed 1 of 1 files at r1, 1 of 1 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
In cockroachdb#53842 we introduced a change to always persist a liveness record on start up. As part of that change, we refactored how the liveness heartbeat codepath dealt with missing liveness records: it knew to fetch it from KV given we were now maintaining the invariant that it would always be present. Except that wasn't necessarily true, as demonstrated by the following scenario: ``` // - v20.1 node gets added to v20.1 cluster, and is quickly removed // before being able to persist its liveness record. // - The cluster is upgraded to v20.2. // - The node from earlier is rolled into v20.2, and re-added to the // cluster. // - It's never able to successfully heartbeat (it didn't join // through the join rpc, bootstrap, or gossip). Welp. ``` Though admittedly unlikely, we should handle it all the same instead of simply erroring out. We'll just fall back to creating the liveness record in-place as we did in v20.1 code. We can remove this fallback in 21.1 code. Release note: None
irfansharif
force-pushed
the
200910.liveness-fallback
branch
from
September 11, 2020 14:29
6088b66
to
b394338
Compare
Begone missing liveness records! bors r+ |
Build succeeded: |
irfansharif
added a commit
to irfansharif/cockroach
that referenced
this pull request
Sep 18, 2020
Now that we have cockroachdb#53842, we maintain the invariant that there always exists a liveness record for any given node. We can now simplify our handling of liveness records internally: where previously we had code to handle the possibility of empty liveness records (we created a new one on the fly), we can change them to assertions to verify that's no longer possible. When retrieving the liveness record from our in-memory cache, it's possible for us to not find anything due to gossip delays. Instead of simply giving up then, now we can read the records directly from KV (and update our caches while in the area). This PR introduces this mechanism through usage of `getLivenessRecordFromKV`. Finally, as a bonus, this PR also surfaces a better error when trying to decommission non-existent nodes. We're able to do this because now we can always assume that a missing liveness record, as seen in the decommission codepath, implies that the user is trying to decommission a non-existent node. --- We don't intend to backport this to 20.2 due to the hazard described in \cockroachdb#54216. We want this PR to bake on master and (possibly) trip up the assertions added above if we've missed anything. They're the only ones checking for the invariant we've introduced around liveness records. That invariant will be depended on for long running migrations, so better to shake things out early. Release note: None
irfansharif
added a commit
to irfansharif/cockroach
that referenced
this pull request
Sep 24, 2020
Now that we have cockroachdb#53842, we maintain the invariant that there always exists a liveness record for any given node. We can now simplify our handling of liveness records internally: where previously we had code to handle the possibility of empty liveness records (we created a new one on the fly), we can change them to assertions to verify that's no longer possible. When retrieving the liveness record from our in-memory cache, it's possible for us to not find anything due to gossip delays. Instead of simply giving up then, now we can read the records directly from KV (and update our caches while in the area). This PR introduces this mechanism through usage of `getLivenessRecordFromKV`. Finally, as a bonus, this PR also surfaces a better error when trying to decommission non-existent nodes. We're able to do this because now we can always assume that a missing liveness record, as seen in the decommission codepath, implies that the user is trying to decommission a non-existent node. --- We don't intend to backport this to 20.2 due to the hazard described in \cockroachdb#54216. We want this PR to bake on master and (possibly) trip up the assertions added above if we've missed anything. They're the only ones checking for the invariant we've introduced around liveness records. That invariant will be depended on for long running migrations, so better to shake things out early. Release note: None
irfansharif
added a commit
to irfansharif/cockroach
that referenced
this pull request
Sep 24, 2020
Now that we have cockroachdb#53842, we maintain the invariant that there always exists a liveness record for any given node. We can now simplify our handling of liveness records internally: where previously we had code to handle the possibility of empty liveness records (we created a new one on the fly), we change them to assertions that verify that empty liveness records are no longer flying around in the system. When retrieving the liveness record from our in-memory cache, it was possible for us to not find anything due to gossip delays. Instead of simply giving up then, now we can read the records directly from KV (and evebtually update our caches to store this newly read record). This PR introduces this mechanism through usage of `getLivenessRecordFromKV`. We should note that the existing cache structure within NodeLiveness is a look-aside cache, and that's not changed. It would further simplify things if it was a look-through cache where the update happened while fetching any record and failing to find it, but we defer that to future work. A TODO outlining this will be introduced in a future commit. A note for ease of review: one structural change introduced in this diff is breaking down `ErrNoLivenessRecord` into `ErrMissingLivenessRecord` and `errLivenessRecordCacheMiss`. The former will be used in a future commit to generate better hints for users (it'll only ever surface when attempting to decommission/recommission non-existent nodes). The latter is used to represent cache misses. This too will be improved in a future commit, where instead of returning a specific error on cache access, we'll return a boolean instead. --- We don't intend to backport this to 20.2 due to the hazard described in \cockroachdb#54216. We want this PR to bake on master and (possibly) trip up the assertions added above if we've missed anything. They're the only ones checking for the invariant we've introduced around liveness records. That invariant will be depended on for long running migrations, so better to shake things out early. Release note: None
tbg
pushed a commit
to tbg/cockroach
that referenced
this pull request
Sep 29, 2020
Now that we have cockroachdb#53842, we maintain the invariant that there always exists a liveness record for any given node. We can now simplify our handling of liveness records internally: where previously we had code to handle the possibility of empty liveness records (we created a new one on the fly), we change them to assertions that verify that empty liveness records are no longer flying around in the system. When retrieving the liveness record from our in-memory cache, it was possible for us to not find anything due to gossip delays. Instead of simply giving up then, now we can read the records directly from KV (and evebtually update our caches to store this newly read record). This PR introduces this mechanism through usage of `getLivenessRecordFromKV`. We should note that the existing cache structure within NodeLiveness is a look-aside cache, and that's not changed. It would further simplify things if it was a look-through cache where the update happened while fetching any record and failing to find it, but we defer that to future work. A TODO outlining this will be introduced in a future commit. A note for ease of review: one structural change introduced in this diff is breaking down `ErrNoLivenessRecord` into `ErrMissingLivenessRecord` and `errLivenessRecordCacheMiss`. The former will be used in a future commit to generate better hints for users (it'll only ever surface when attempting to decommission/recommission non-existent nodes). The latter is used to represent cache misses. This too will be improved in a future commit, where instead of returning a specific error on cache access, we'll return a boolean instead. --- We don't intend to backport this to 20.2 due to the hazard described in \cockroachdb#54216. We want this PR to bake on master and (possibly) trip up the assertions added above if we've missed anything. They're the only ones checking for the invariant we've introduced around liveness records. That invariant will be depended on for long running migrations, so better to shake things out early. Release note: None
irfansharif
added a commit
to irfansharif/cockroach
that referenced
this pull request
Sep 29, 2020
Now that we have cockroachdb#53842, we maintain the invariant that there always exists a liveness record for any given node. We can now simplify our handling of liveness records internally: where previously we had code to handle the possibility of empty liveness records (we created a new one on the fly), we change them to assertions that verify that empty liveness records are no longer flying around in the system. When retrieving the liveness record from our in-memory cache, it was possible for us to not find anything due to gossip delays. Instead of simply giving up then, now we can read the records directly from KV (and evebtually update our caches to store this newly read record). This PR introduces this mechanism through usage of `getLivenessRecordFromKV`. We should note that the existing cache structure within NodeLiveness is a look-aside cache, and that's not changed. It would further simplify things if it was a look-through cache where the update happened while fetching any record and failing to find it, but we defer that to future work. A TODO outlining this will be introduced in a future commit. A note for ease of review: one structural change introduced in this diff is breaking down `ErrNoLivenessRecord` into `ErrMissingLivenessRecord` and `errLivenessRecordCacheMiss`. The former will be used in a future commit to generate better hints for users (it'll only ever surface when attempting to decommission/recommission non-existent nodes). The latter is used to represent cache misses. This too will be improved in a future commit, where instead of returning a specific error on cache access, we'll return a boolean instead. --- We don't intend to backport this to 20.2 due to the hazard described in \cockroachdb#54216. We want this PR to bake on master and (possibly) trip up the assertions added above if we've missed anything. They're the only ones checking for the invariant we've introduced around liveness records. That invariant will be depended on for long running migrations, so better to shake things out early. Release note: None
tbg
pushed a commit
to tbg/cockroach
that referenced
this pull request
Sep 30, 2020
Now that we have cockroachdb#53842, we maintain the invariant that there always exists a liveness record for any given node. We can now simplify our handling of liveness records internally: where previously we had code to handle the possibility of empty liveness records (we created a new one on the fly), we change them to assertions that verify that empty liveness records are no longer flying around in the system. When retrieving the liveness record from our in-memory cache, it was possible for us to not find anything due to gossip delays. Instead of simply giving up then, now we can read the records directly from KV (and evebtually update our caches to store this newly read record). This PR introduces this mechanism through usage of `getLivenessRecordFromKV`. We should note that the existing cache structure within NodeLiveness is a look-aside cache, and that's not changed. It would further simplify things if it was a look-through cache where the update happened while fetching any record and failing to find it, but we defer that to future work. A TODO outlining this will be introduced in a future commit. A note for ease of review: one structural change introduced in this diff is breaking down `ErrNoLivenessRecord` into `ErrMissingLivenessRecord` and `errLivenessRecordCacheMiss`. The former will be used in a future commit to generate better hints for users (it'll only ever surface when attempting to decommission/recommission non-existent nodes). The latter is used to represent cache misses. This too will be improved in a future commit, where instead of returning a specific error on cache access, we'll return a boolean instead. --- We don't intend to backport this to 20.2 due to the hazard described in \cockroachdb#54216. We want this PR to bake on master and (possibly) trip up the assertions added above if we've missed anything. They're the only ones checking for the invariant we've introduced around liveness records. That invariant will be depended on for long running migrations, so better to shake things out early. Release note: None
craig bot
pushed a commit
that referenced
this pull request
Sep 30, 2020
54544: kvserver: add assertions for invariants around liveness records r=irfansharif a=irfansharif Now that we have #53842, we maintain the invariant that there always exists a liveness record for any given node. We can now simplify our handling of liveness records internally: where previously we had code to handle the possibility of empty liveness records (we created a new one on the fly), we can change them to assertions to verify that's no longer possible. When retrieving the liveness record from our in-memory cache, it's possible for us to not find anything due to gossip delays. Instead of simply giving up then, now we can read the records directly from KV (and update our caches while in the area). This PR introduces this mechanism through usage of `getLivenessRecordFromKV`. Finally, as a bonus, this PR also surfaces a better error when trying to decommission non-existent nodes. We're able to do this because now we can always assume that a missing liveness record, as seen in the decommission codepath, implies that the user is trying to decommission a non-existent node. --- We don't intend to backport this to 20.2 due to the hazard described in \#54216. We want this PR to bake on master and (possibly) trip up the assertions added above if we've missed anything. They're the only ones checking for the invariant we've introduced around liveness records. That invariant will be depended on for long running migrations, so better to shake things out early. Release note: None 54812: docker: Base the docker image on RedHat UBI r=bdarnell,DuskEagle a=jlinder Before: The docker image was based on Debian 9.12 slim. Why: This change will help on-prem customers from a security and compliance perspective. It also aligns with our publishing images into the RedHat Marketplace. Now: Published docker images are based on the RedHat UBI 8 base image. Fixes: #49643 Release note (backward-incompatible change): CockroachDB Docker images are now based on the RedHat ubi8/ubi base image instead of Debian 9.12 slim. This will help on-prem customers from a security and compliance perspective. Co-authored-by: irfan sharif <[email protected]> Co-authored-by: James H. Linder <[email protected]>
jayshrivastava
pushed a commit
that referenced
this pull request
Oct 8, 2020
Now that we have #53842, we maintain the invariant that there always exists a liveness record for any given node. We can now simplify our handling of liveness records internally: where previously we had code to handle the possibility of empty liveness records (we created a new one on the fly), we change them to assertions that verify that empty liveness records are no longer flying around in the system. When retrieving the liveness record from our in-memory cache, it was possible for us to not find anything due to gossip delays. Instead of simply giving up then, now we can read the records directly from KV (and evebtually update our caches to store this newly read record). This PR introduces this mechanism through usage of `getLivenessRecordFromKV`. We should note that the existing cache structure within NodeLiveness is a look-aside cache, and that's not changed. It would further simplify things if it was a look-through cache where the update happened while fetching any record and failing to find it, but we defer that to future work. A TODO outlining this will be introduced in a future commit. A note for ease of review: one structural change introduced in this diff is breaking down `ErrNoLivenessRecord` into `ErrMissingLivenessRecord` and `errLivenessRecordCacheMiss`. The former will be used in a future commit to generate better hints for users (it'll only ever surface when attempting to decommission/recommission non-existent nodes). The latter is used to represent cache misses. This too will be improved in a future commit, where instead of returning a specific error on cache access, we'll return a boolean instead. --- We don't intend to backport this to 20.2 due to the hazard described in \#54216. We want this PR to bake on master and (possibly) trip up the assertions added above if we've missed anything. They're the only ones checking for the invariant we've introduced around liveness records. That invariant will be depended on for long running migrations, so better to shake things out early. Release note: None
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
In #53842 we introduced a change to always persist a liveness record on
start up. As part of that change, we refactored how the liveness
heartbeat codepath dealt with missing liveness records: it knew to fetch
it from KV given we were now maintaining the invariant that it would
always be present. Except that wasn't necessarily true, as demonstrated
by the following scenario:
Though admittedly unlikely, we should handle it all the same instead of
simply erroring out. We'll just fall back to creating the liveness
record in-place as we did in v20.1 code. We can remove this fallback in
21.1 code.
First commit is from #54224.
Release note: None