server: delete node status entries during decommissioning #56529
Conversation
Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR. Before a member of our team reviews your PR, I have some potential action items for you:
I have added a few people who may be able to assist in reviewing:
🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.
Thanks @erikgrinaker and sorry it's taking me so long to look at this. It turns out that there's actually a near-term desire to rely on these status keys internally a bit more. @nvanbenschoten and/or @aayushshah15, could you provide pointers on what's needed there?
That's not a bad idea, though in practice it's still going to be a little crufty (as will anything that requires sync between the update loop and the record). For example, we want a decommissioning node to keep writing these updates (as the UI uses them to display node info) and only stop on decommissioned, which is a very short time interval. Unless we introduce a waiting period, it's bound to accidentally violate the desired ordering. For an alternative solution, we could add an RPC endpoint that stops the node status loop.
No rush! I agree that there aren't really any obvious good ways to solve this. I think I'm leaning towards either check-on-read or a reconciliation loop (possibly piggybacking on some other cleanup loop), since those will always converge on a correct state, eventually. But happy to implement the RPC approach if that's what we'd prefer. A good solution would probably involve improving the capabilities of this data store somehow, not sure what they currently are and what plans (if any) we have for it - will read up on it a bit tomorrow.
Hi @erikgrinaker, this looks like a great start! Thanks for putting it together and sorry my review is coming a little late.
Performing a best-effort removal of the NodeStatusKey when decommissioning is the right first step here. I left a comment below about a potential change we could make to allow for manual retries in the presence of node failures or other interruptions. Given the limited set of places that we intend to use NodeStatusKeys for #51304 (comment), this may actually be sufficient for our use case.
Past that, it sounds like the biggest blocker towards clearer semantics is that there is no coordination with the MetricsRecorder, so any work we do to delete the node's status key may be undone due to a race. The most natural way to avoid these kinds of issues would be to replace the blind put in MetricsRecorder with a conditional put. If writes to the key from the startWriteNodeStatus loop (but not from initializeAdditionalStores) were to atomically check that some value already existed for the key, then I think we'd be in the clear. The immediate issue with this is that we don't have an Inline flag on ConditionalPutRequest. I don't think there's any strong reason for this though, other than that we have never needed one (@tbg does that sound right?). So we could consider:
- adding inline support to ConditionalPutRequest
- either querying or remembering the previous value for the node's NodeStatusKey in startWriteNodeStatus
- conditionally writing to the key and logging a warning on ConditionFailedErrors
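To make the third option concrete, here is a minimal sketch of what the recorder-side conditional write could look like. The `inlineKV` interface, `errConditionFailed`, and the function signature are stand-ins invented for illustration; they are not the actual MetricsRecorder code or the eventual roachpb types.

```go
package status

import (
	"context"
	"errors"
	"log"
)

// errConditionFailed stands in for roachpb.ConditionFailedError.
var errConditionFailed = errors.New("condition failed")

// inlineKV stands in for the KV client API proposed above: a conditional put
// that works on inline (non-versioned) values.
type inlineKV interface {
	// CPutInline writes value at key only if the current inline value equals
	// expected, and returns errConditionFailed on a mismatch (including when
	// the key no longer exists).
	CPutInline(ctx context.Context, key string, value, expected []byte) error
}

// writeNodeStatus mirrors the proposed startWriteNodeStatus behavior: it never
// blindly overwrites the status entry, and a lost race only logs a warning.
func writeNodeStatus(ctx context.Context, db inlineKV, key string, status, prev []byte) error {
	if err := db.CPutInline(ctx, key, status, prev); err != nil {
		if errors.Is(err, errConditionFailed) {
			// The entry changed or was deleted (decommissioning) concurrently;
			// warn and skip rather than resurrecting it.
			log.Printf("node status key %s changed concurrently; skipping update", key)
			return nil
		}
		return err
	}
	return nil
}
```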
Reviewed 2 of 2 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)
pkg/server/server.go, line 2092 at r1 (raw file):
	// the status entries if we crash or fail. Note that the status
	// entry is an inline value, thus this cannot be transactional.
	if targetStatus.Decommissioned() {

While this can't be transactional, I wonder if we can leverage the fact that decommissioning is idempotent to permit crashes/failures that prevent the NodeStatus key from being deleted to be retried. If we pulled this operation out of the `if statusChanged` block, would that be enough to support this form of manual reconciliation?
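A minimal sketch of the ordering being suggested, using stand-in types rather than the real pkg/server ones: the deletion runs for any decommissioned target status, so re-running the (idempotent) decommission retries it after a crash or failure.

```go
package decommission

import "context"

// membershipStatus and statusStore are illustrative stand-ins for the liveness
// and node-status plumbing in pkg/server; they are not the real types.
type membershipStatus int

const (
	active membershipStatus = iota
	decommissioning
	decommissioned
)

func (s membershipStatus) Decommissioned() bool { return s == decommissioned }

type statusStore interface {
	// DeleteNodeStatus best-effort deletes the inline node status entry.
	DeleteNodeStatus(ctx context.Context, nodeID int32) error
}

// finalizeDecommission runs the status-key cleanup whenever the target status
// is Decommissioned, not only when the liveness record just changed. Because
// decommissioning is idempotent, re-running the operation retries the deletion
// and repairs a stale entry left behind by an earlier failure.
func finalizeDecommission(
	ctx context.Context,
	store statusStore,
	nodeID int32,
	targetStatus membershipStatus,
	statusChanged bool,
) error {
	if statusChanged {
		// Event logging etc. for the liveness change would go here (elided).
	}
	if targetStatus.Decommissioned() {
		if err := store.DeleteNodeStatus(ctx, nodeID); err != nil {
			return err
		}
	}
	return nil
}
```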
Force-pushed from c9dd70c to 900edb3.
Thank you for updating your pull request. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. 🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.
I like this idea. Since there's no great way to ensure this automatically, having some way to manually repair this in the case of failure seems like a reasonable first step -- have updated the PR with this.
Awesome! I was thinking we could use some sort of optimistic concurrency control, but put the idea aside since I didn't think the KV store supported it (given that it also uses MVCC) -- great to learn that this is possible after all, thanks for the tip! I'll go ahead and implement this. One problem with this approach is that if the original write in initializeAdditionalStores fails (which would just be logged and ignored) then the status key will never be written. Maybe we should return an error for this?
Hm, I'm actually not sure if ConditionalPut can be implemented safely for inline values. Since inline values do not have a timestamp, and Pebble does not appear to support transactions, OCC, or locks, I don't think there's any way to atomically read and write the value. A concurrent writer (e.g. the decommissioner) could therefore modify it in the meanwhile. Or am I missing something?
One problem with this approach is that if the original write in initializeAdditionalStores fails (which would just be logged and ignored) then the status key will never be written. Maybe we should return an error for this?
Returning an error there seems reasonable to me. That way, we can guarantee that an initialized store will have a corresponding node status key.
Hm, I'm actually not sure if ConditionalPut can be implemented safely for inline values. Since inline values do not have a timestamp, and Pebble does not appear to support transactions, OCC, or locks, I don't think there's any way to atomically read and write the value. A concurrent writer (e.g. the decommissioner) could therefore modify it in the meanwhile. Or am I missing something?
Everything you said is true, but what you're missing is that CRDB has a lightweight latching mechanism that provides request-level isolation, even to inline and non-transaction operations. This is enough to synchronize with other inline ConditionalPut operations. To read up on this, see https://github.com/cockroachdb/cockroach/blob/13ce0e8db15efd201af782322669f0542c40b29a/pkg/kv/kvserver/concurrency/concurrency_control.go and https://github.com/cockroachdb/cockroach/blob/13ce0e8db15efd201af782322669f0542c40b29a/pkg/kv/kvserver/spanlatch/doc.go.
To use this, we'll need to configure the inline ConditionalPut to support latching just like we do with the inline Put operation. See cockroach/pkg/kv/kvserver/batcheval/cmd_put.go, lines 34 to 38 at 13ce0e8:

	if args.Inline {
		DefaultDeclareKeys(desc, header, req, latchSpans, lockSpans)
	} else {
		DefaultDeclareIsolatedKeys(desc, header, req, latchSpans, lockSpans)
	}
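For illustration, a sketch of how the ConditionalPut declare function could mirror that branch, as it would live in package batcheval alongside cmd_put.go. The surrounding signature, imports, and helper names are assumed to match the quoted cmd_put.go snippet rather than taken from the actual diff.

```go
// Sketch only: parameter types follow the cmd_put.go snippet above.
func declareKeysConditionalPut(
	desc *roachpb.RangeDescriptor,
	header roachpb.Header,
	req roachpb.Request,
	latchSpans, lockSpans *spanset.SpanSet,
) {
	args := req.(*roachpb.ConditionalPutRequest)
	if args.Inline {
		// Inline values are non-transactional; a plain latch still serializes
		// this CPut against other requests touching the same key.
		DefaultDeclareKeys(desc, header, req, latchSpans, lockSpans)
	} else {
		DefaultDeclareIsolatedKeys(desc, header, req, latchSpans, lockSpans)
	}
}
```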
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
Hi Erik! cockroach/pkg/server/serverpb/status.proto, line 994 at 9e4ec7d
Sure thing @andreimatei! Been busy with some other things lately, but will pick this up over the weekend.
Force-pushed from 900edb3 to 93cbbd7.
Thank you for updating your pull request. Before a member of our team reviews your PR, I have some potential action items for you:
🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is otan.
Force-pushed from 93cbbd7 to f694a7f.
@tbg @nvanbenschoten Thanks for the input, I think this should be ready for review now. We do a best-effort removal during decommissioning, on failure the operation must be rerun to remove the stale entry. The status recorder loop uses a conditional put to update the value only if it already exists, and we unconditionally write a status entry on node startup with a fatal error on failure. This may still leave a race condition when decommissioning a node that happens to be starting up (where it's decommissioned between checking the decommissioned status and writing the status entry), but I feel like that's going pretty far into edge case territory. In any case, it can be repaired using the same approach as for crashes, by rerunning the decommission operation. Let me know what you think.
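A rough sketch of the two write paths described here, using illustrative stand-in types (statusWriter and its methods are not the real KV client API): the recorder loop passes mustExist=true so it can never resurrect a deleted entry, while startup passes mustExist=false and treats failure as fatal.

```go
package status

import (
	"context"
	"fmt"
)

// statusWriter is a stand-in for the KV operations used by the node status
// recorder; names and signatures are illustrative only.
type statusWriter interface {
	// PutInline blindly writes the inline status entry.
	PutInline(ctx context.Context, key string, value []byte) error
	// CPutInlineIfExists writes only if some value already exists at key.
	CPutInlineIfExists(ctx context.Context, key string, value []byte) error
}

// writeNodeStatus models the two write paths: the periodic recorder loop
// passes mustExist=true so it cannot resurrect an entry that decommissioning
// deleted, while startup passes mustExist=false to create the entry.
func writeNodeStatus(
	ctx context.Context, w statusWriter, key string, value []byte, mustExist bool,
) error {
	if mustExist {
		return w.CPutInlineIfExists(ctx, key, value)
	}
	return w.PutInline(ctx, key, value)
}

// startNode shows the startup side: a failed initial write is a hard error,
// so an initialized node always has a status entry for later conditional puts.
func startNode(ctx context.Context, w statusWriter, key string, value []byte) error {
	if err := writeNodeStatus(ctx, w, key, value, false /* mustExist */); err != nil {
		return fmt.Errorf("failed to write initial node status entry: %w", err)
	}
	return nil
}
```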
Force-pushed from f694a7f to 481fb1b.
Force-pushed from 481fb1b to 2ec4751.
This looks great! Very nice job @erikgrinaker. There are just a few nits remaining before we can merge this, nothing serious.
Reviewed 10 of 10 files at r2, 10 of 10 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)
pkg/kv/batch.go, line 427 at r2 (raw file):
	}

	// CPutInline conditionally sets the value for a key if the existing value is

nit: drop this below CPutAllowingIfNotExists to keep the transaction versions of CPut grouped together.
pkg/kv/client_test.go, line 379 at r2 (raw file):
	value := []byte("value")

	// Should fail on non-existent key with expected value
nit: in accordance with the Go style guide, we generally try to use proper punctuation in comments whenever possible.
pkg/server/decommission_test.go, line 61 at r3 (raw file):
	ctx, livenesspb.MembershipStatus_DECOMMISSIONED, []roachpb.NodeID{decomNodeID}))

	// The node status entries should shortly be cleaned up.

I might be thinking about this incorrectly, but shouldn't this have been performed synchronously? In other words, why do we need the SucceedsSoon?
pkg/server/node.go, line 585 at r3 (raw file):
	// Write a new status summary after all stores have been initialized; this
	// helps the UI remain responsive when new nodes are added.
	if err := n.writeNodeStatus(ctx, 0 /* alertTTL */, false); err != nil {

nit: add a /* mustExist */ comment after the false. See the "Avoid Naked Parameters" section of our extended style guide for the reason why: https://wiki.crdb.io/wiki/spaces/CRDB/pages/181371303/Go+coding+guidelines. Same thing down below.
pkg/server/node.go, line 725 at r3 (raw file):
	//
	// The status key must already exist, to avoid race conditions
	// during node decommissioning.
Consider expanding this comment a bit to mention that this process may not even be the one performing the decommissioning.
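For example, the expanded comment could read roughly as follows (wording is illustrative, not taken from the actual diff):

```go
// The status key must already exist, to avoid race conditions during node
// decommissioning: the entry is deleted when a node is decommissioned,
// possibly by a different node than this one, and a conditional write here
// avoids resurrecting it after that deletion.
```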
Force-pushed from 2ec4751 to 9d20cf7.
Thanks for the review and suggestions @nvanbenschoten, I've updated the PR with fixes. The CI failure appears unrelated.
Yeah that seems excessive. You are calling that method, right? Otherwise, unexport it. If you're calling it, you could still rename it.
Reviewed 10 of 21 files at r8, 11 of 11 files at r9.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/server/status/recorder.go, line 511 at r7 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This doesn't seem like the right error to return (since we're not in any code path where the caller knows to do anything with a structured KV error). Don't you want something like:

	return errors.New("this node was decommissioned")
This is a good point. What do you think about catching ConditionFailedErrors coming from CPutInline and wrapping them with similar context about a concurrent update (if ActualValue != nil) and a concurrent decommission (if ActualValue == nil)?
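A small sketch of that error translation, with a stand-in for roachpb.ConditionFailedError; only the ActualValue field comes from the discussion above, the rest is illustrative.

```go
package status

import (
	"errors"
	"fmt"
)

// conditionFailedError stands in for roachpb.ConditionFailedError.
type conditionFailedError struct {
	// ActualValue is the value currently stored at the key; nil means the
	// key no longer exists.
	ActualValue []byte
}

func (e *conditionFailedError) Error() string { return "unexpected value" }

// interpretCPutInlineError converts a condition failure from the status
// CPutInline into a message the recorder's caller can act on.
func interpretCPutInlineError(err error) error {
	var cErr *conditionFailedError
	if !errors.As(err, &cErr) {
		return err
	}
	if cErr.ActualValue == nil {
		// The entry was deleted: this node was (concurrently) decommissioned.
		return errors.New("node status entry not found; this node may have been decommissioned")
	}
	// The entry exists but changed underneath us: a concurrent update.
	return fmt.Errorf("node status entry changed concurrently: %w", err)
}
```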
The KV server supports regular put operations for inline (non-versioned) values, but not conditional put operations, as there hasn't been a need for them yet. These are now needed for cockroachdb#51304, to avoid a race condition where a decommissioned node can accidentally resurrect its status entry. Even though inline values are not versioned, the KV server provides request-level key-scoped isolation, which allows the value to be atomically compared and swapped during a conditional put operation.

This commit adds the APIs `DB.CPutInline()` and `Batch.cPutInline()`, as well as the Protobuf field `ConditionalPutRequest.inline` and corresponding constructor `roachpb.NewConditionalPutInline()`, to perform conditional puts for inline values. It also adds the version gate `clusterversion.CPutInline`, but the client APIs do not check this automatically as they don't have access to cluster settings. Instead, `DB.CPutInline()` requires the given context to be wrapped with `kv.CtxForCPutInline()` to enable the feature, and `Batch.cPutInline()` is temporarily unexported.

Release note: none
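A hedged usage sketch of the APIs named in this commit message; the exact CPutInline parameter types, the expected-value encoding, and the keys.NodeStatusKey helper used here are assumptions rather than details taken from the diff.

```go
package example

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/keys"
	"github.com/cockroachdb/cockroach/pkg/kv"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/log"
)

// updateStatusEntry conditionally replaces a node's inline status entry.
// expValue is assumed to be the raw bytes of the previously read value; the
// exact CPutInline signature may differ from what is sketched here.
func updateStatusEntry(
	ctx context.Context, db *kv.DB, nodeID roachpb.NodeID, newStatus, expValue []byte,
) error {
	// Per the commit message, the feature is gated: the context must be
	// wrapped with kv.CtxForCPutInline to enable CPutInline.
	ctx = kv.CtxForCPutInline(ctx)
	key := keys.NodeStatusKey(nodeID)
	if err := db.CPutInline(ctx, key, newStatus, expValue); err != nil {
		log.Warningf(ctx, "failed to update node status entry %s: %v", key, err)
		return err
	}
	return nil
}
```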
Force-pushed from 98d6f49 to c4f6bf0.
Done. 👍
For sure -- should've done that when I changed the other error, but slipped my mind. Done.
thanks for pushing through the migration concerns.
I'll merge on green.
Reviewed 5 of 16 files at r10, 11 of 11 files at r11.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg)
Force-pushed from c4f6bf0 to 24d020d.
Bumped this again -- I misinterpreted what you meant earlier.
Decommissioning a node did not remove its node status entry in the `status-node-` keyspace. This caused the decommissioned node to remain visible via e.g. `Status/Nodes` gRPC calls, affecting other systems that did not check against the node's liveness entry.

This commit deletes a node's status entry when it is being decommissioned. Unfortunately, since the status entry is inline this cannot be transactional with the liveness entry, which can cause a stale status entry to be left behind if the operation should fail or the node should crash. In these cases, the decommission operation (which is idempotent) must be run again to remove it.

To avoid a race condition where the decommissioned node's status recorder loop resurrects the status entry after it has been removed, this also changes the recorder to only update the status entry if it already exists. The initial status entry is written during node startup, and the node will now fail to start if the entry cannot be written.

Release note (bug fix): remove a node's status entry when the node is decommissioned, to prevent it from appearing in API calls and UIs, and avoid it affecting node constraints such as localities and attributes for various operations.
Force-pushed from 24d020d to 090eef9.
bors r+
Build succeeded.
Previously, the decommissioned node lists considered node status entries to determine decommissioning and decommissioned status. This changed in cockroachdb#56529, resulting in empty lists. Now, the node's liveness entry is considered and these lists are correctly populated.

Release note (bug fix): The list of recently decommissioned nodes and the historical list of decommissioned nodes correctly display decommissioned nodes.
Decommissioning a node did not remove its node status entry in the
`status-node-` keyspace. This caused the decommissioned node to remain
visible via e.g. `Status/Nodes` gRPC calls, affecting other systems that
did not check against the node's liveness entry.
This commit deletes a node's status entry when it is being
decommissioned. Unfortunately, since the status entry is inline this
cannot be transactional with the liveness entry, which can cause a stale
status entry to be left behind if the operation should fail or the node
should crash. In these cases, the decommission operation (which is
idempotent) must be run again to remove it.
To avoid a race condition where the decommissioned node's status
recorder loop resurrects the status entry after it has been removed,
this also changes the recorder to only update the status entry if it
already exists. The initial status entry is written during node startup,
and the node will now fail to start if the entry cannot be written.
Release note (bug fix): remove a node's status entry when the node is
decommissioned, to prevent it from appearing in API calls and UIs, and
avoid it affecting node constraints such as localities and attributes
for various operations.
Resolves #51304.
This also adds support for conditional puts of inline values, as a separate commit -- see commit message for details.