Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

node wise recovery #13943

Merged
merged 14 commits into from
Nov 30, 2023
Merged

node wise recovery #13943

merged 14 commits into from
Nov 30, 2023

Conversation

bharathv
Copy link
Contributor

@bharathv bharathv commented Oct 5, 2023

PRD: link

Given an input set of nodes, generates a move plan that force reconfigures all partitions that lost majority (or all replicas) due to the unavailability of these nodes. This is intended to be used as an escape hatch to bulk move all such majority lost partitions when nodes are dead and are irrecoverable.

Adds the following admin end points. Input for both is array of node_ids.

  • GET /v1/partitions/majority_lost?defunct_nodes=<csv nodes> - Given an input set of defunct nodes, generates a list of partitions that should be force recovered (aka lost majority).

  • POST /v1/partitions/force_recover_from_nodes - Submits the plan generated from above step and the partitions are eventually force recovered.

Example invocations

curl -X GET http://localhost:9644/v1/partitions/majority_lost\?defunct_nodes\=0,1 

curl -X POST http://localhost:9644/v1/partitions/force_recover_from_nodes -H 'Content-Type: application/json' --data '{"defunct_nodes": [1], "partitions_to_force_recover": [{"ntp": {"ns": "kafka", "topic": "bar", "partition": 0}, "topic_revision": 26, "replicas": [{"node_id": 1, "core": 0}], "defunct_nodes": [1]}]}

partitions_to_force_recover part of the POST payload is the plan generated from GET. The plan is validated before generating the controller command that orchestrates the moves.

Monitoring: Balanacer status (v1/cluster/partition_balancer/status) now contains the following

  • partitions_pending_force_recovery_count - # of partitions that are yet to be force recovered.
  • partitions_pending_force_recovery_sample - A sample list of partitions pending force recovery (limit capped to 10)

Notes on defunct nodes:

  • This PR adds the concept of "defunct nodes", nodes that are not functional and explicitly marked defunct by the user.
  • Defunct nodes are generated from force_replicas_from_nodes commands, all the brokers marked as "defunct" in the request are marked defunct for future processing and this is irreversible for now.
  • defunct nodes are considered unavailable from RP's perspective. This means replicas from these nodes are moved away by the balancer eventually and any list of user approved force recoverable partitions in the request are force reconfigured.
  • liveness status of a broker (functional or defunct) can be checked in the /brokers end point.

Fixes: https://github.com/redpanda-data/planning/issues/84

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.2.x
  • v23.1.x
  • v22.3.x

Release Notes

Features

  • Adds node wise partition recovery functionality. Given an input set of node ids, force reconfigures all partitions that would lose majority if the input set of nodes are dead. Intended to bulk recover partitions that are stuck when majority of brokers hosting their replicas are dead and irrecoverable.

@vbotbuildovich
Copy link
Collaborator

@vbotbuildovich
Copy link
Collaborator

@twmb
Copy link
Contributor

twmb commented Nov 3, 2023

GET with a request body is questionable in HTTP, "A payload within a GET request message has no defined semantics; sending a payload body on a GET request might cause some existing implementations to reject the request."

Rather than trying to use methods to determine a dry run, what about using a query parameter?
Ref: https://stackoverflow.com/questions/58831345/how-would-you-design-a-rest-api-that-includes-the-simulation-of-a-post and the linked k8s doc, which uses a dryRun QP

Also,
"""
partitions with at least one surviving replica are reconfigured/downsized to surviving replica set. ex: if 1 of 3 replica remains, the new target assignment only consists of 1 replica
"""
Does this mean the RF for the partition is changed? I feel like we should separate changing the RF from forceful-move-away-from-node.

Also, does this move all partitions or only unhealthy ones?

@bharathv bharathv marked this pull request as draft November 16, 2023 09:39
@bharathv
Copy link
Contributor Author

Temporarily moving to draft while I polish and get tests passing with the new upcoming revision.

@bharathv
Copy link
Contributor Author

GET with a request body is questionable in HTTP, "A payload within a GET request message has no defined semantics; sending a payload body on a GET request might cause some existing implementations to reject the request."

Rather than trying to use methods to determine a dry run, what about using a query parameter?
Ref: https://stackoverflow.com/questions/58831345/how-would-you-design-a-rest-api-that-includes-the-simulation-of-a-post and the linked k8s doc, which uses a dryRun QP

Thanks, TIL.. thats a bummer.. I tend to agree with elastic's sentiment on this, let me figure out a way to encode the list of broker ids into a query parameter.

Does this mean the RF for the partition is changed? I feel like we should separate changing the RF from forceful-move-away-from-node.

Any particular reason? In a way we are preserving existing RF so the user doesn't have to run a separate command to up-replicate again.

Also, does this move all partitions or only unhealthy ones?

both (in the upcoming rev). The nodes in the list are marked "defunct" from RP perspective and the balancer tries to move replicas away gracefully (for those that have the majority) and forcefully reconfigure the rest (after a user approval).

@twmb
Copy link
Contributor

twmb commented Nov 17, 2023

In a way we are preserving existing RF
But doesn't the PR description say that if we can't preserve the original RF, then the RF is automatically downsized?

@bharathv bharathv force-pushed the node_wise_recovery branch 3 times, most recently from 941b306 to b49f996 Compare November 20, 2023 09:03
@bharathv bharathv marked this pull request as ready for review November 20, 2023 09:03
@bharathv
Copy link
Contributor Author

failures are unrelated (changed some RPC structs, need to update compat test, will do it along with next rev), ready for review.

@bharathv bharathv added this to the v23.3.1-rc2 milestone Nov 29, 2023
@bharathv bharathv force-pushed the node_wise_recovery branch 2 times, most recently from f75a059 to 54c6415 Compare November 29, 2023 06:05
Can only be submitted on the controller leader, would be simpler to
redirect rather than expecting the caller to submit to controller
leader.
Given an input set of dead nodes, generates a list of partition instances
that has majority replicas on these dead nodes. This is a read-only API
and makes no chanages to the metadata state.
example

curl -X GET http://localhost:9644/v1/partitions/majority_lost\?defunct_nodes\=0,1

Input query parameter a comma separated list of node id integers.
A broker is by default in functional state unless explicitly marked
defunct by user. Defunct state indicates that the broker is no longer
reachable and the data is irrecoverable. This is currently irreversible.
A defunct broker can only be decommissioned.
handle_ntp_move_begin_or_cancel
This ensures that replicas as drained from the unavailable nodes.
These ntps are force reconfigured as requested by the user.
Introduces a new partition variant called 'force_reassignable_partition'
that are dealt with in a separate pass.
Adds additional validation on the input set of nodes.
A Document is also a ValueType, passing a ValueType ensures that we can
validate any ValueType which includes even a subset of a json (to be
used in a later commit).
Includes the # of pending force reconfigurations and a sample of
10 ntps from that list in the balancer status page.
reusing feature flag from enhanced force reconfiguration.
@bharathv
Copy link
Contributor Author

Rebased and force pushed to fix conflicts.

const auto& assignments = (it->second).get_assignments();
const auto topic_revision = it->second.get_revision();
for (const auto& assignment : assignments) {
const auto& current = assignment.replicas;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plase add a it.check() here

@mmaslankaprv mmaslankaprv self-requested a review November 30, 2023 08:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants