node wise recovery #13943

bharathv · 2023-10-05T04:38:27Z

Given an input set of nodes, generates a move plan that force reconfigures all partitions that lost majority (or all replicas) due to the unavailability of these nodes. This is intended to be used as an escape hatch to bulk move all such majority lost partitions when nodes are dead and are irrecoverable.

Adds the following admin end points. Input for both is array of node_ids.

GET /v1/partitions/majority_lost?defunct_nodes=<csv nodes> - Given an input set of defunct nodes, generates a list of partitions that should be force recovered (aka lost majority).
POST /v1/partitions/force_recover_from_nodes - Submits the plan generated from above step and the partitions are eventually force recovered.

Example invocations

curl -X GET http://localhost:9644/v1/partitions/majority_lost\?defunct_nodes\=0,1 

curl -X POST http://localhost:9644/v1/partitions/force_recover_from_nodes -H 'Content-Type: application/json' --data '{"defunct_nodes": [1], "partitions_to_force_recover": [{"ntp": {"ns": "kafka", "topic": "bar", "partition": 0}, "topic_revision": 26, "replicas": [{"node_id": 1, "core": 0}], "defunct_nodes": [1]}]}

partitions_to_force_recover part of the POST payload is the plan generated from GET. The plan is validated before generating the controller command that orchestrates the moves.

Monitoring: Balanacer status (v1/cluster/partition_balancer/status) now contains the following

partitions_pending_force_recovery_count - # of partitions that are yet to be force recovered.
partitions_pending_force_recovery_sample - A sample list of partitions pending force recovery (limit capped to 10)

Notes on defunct nodes:

This PR adds the concept of "defunct nodes", nodes that are not functional and explicitly marked defunct by the user.
Defunct nodes are generated from force_replicas_from_nodes commands, all the brokers marked as "defunct" in the request are marked defunct for future processing and this is irreversible for now.
defunct nodes are considered unavailable from RP's perspective. This means replicas from these nodes are moved away by the balancer eventually and any list of user approved force recoverable partitions in the request are force reconfigured.
liveness status of a broker (functional or defunct) can be checked in the /brokers end point.

Fixes: https://github.com/redpanda-data/planning/issues/84

Backports Required

Release Notes

Features

Adds node wise partition recovery functionality. Given an input set of node ids, force reconfigures all partitions that would lose majority if the input set of nodes are dead. Intended to bulk recover partitions that are stuck when majority of brokers hosting their replicas are dead and irrecoverable.

vbotbuildovich · 2023-10-05T06:58:38Z

ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/38327#018afe5f-7762-4503-a150-c716cc166f9a

vbotbuildovich · 2023-10-05T07:46:07Z

ducktape was retried in job https://buildkite.com/redpanda/redpanda/builds/38327#018afe6e-270d-4d05-af8b-98baefa51760

src/v/cluster/cluster_utils.h

src/v/cluster/topics_frontend.cc

twmb · 2023-11-03T20:13:45Z

GET with a request body is questionable in HTTP, "A payload within a GET request message has no defined semantics; sending a payload body on a GET request might cause some existing implementations to reject the request."

Rather than trying to use methods to determine a dry run, what about using a query parameter?
Ref: https://stackoverflow.com/questions/58831345/how-would-you-design-a-rest-api-that-includes-the-simulation-of-a-post and the linked k8s doc, which uses a dryRun QP

Also,
"""
partitions with at least one surviving replica are reconfigured/downsized to surviving replica set. ex: if 1 of 3 replica remains, the new target assignment only consists of 1 replica
"""
Does this mean the RF for the partition is changed? I feel like we should separate changing the RF from forceful-move-away-from-node.

Also, does this move all partitions or only unhealthy ones?

bharathv · 2023-11-16T09:39:40Z

Temporarily moving to draft while I polish and get tests passing with the new upcoming revision.

bharathv · 2023-11-16T09:45:21Z

GET with a request body is questionable in HTTP, "A payload within a GET request message has no defined semantics; sending a payload body on a GET request might cause some existing implementations to reject the request."

Rather than trying to use methods to determine a dry run, what about using a query parameter?
Ref: https://stackoverflow.com/questions/58831345/how-would-you-design-a-rest-api-that-includes-the-simulation-of-a-post and the linked k8s doc, which uses a dryRun QP

Thanks, TIL.. thats a bummer.. I tend to agree with elastic's sentiment on this, let me figure out a way to encode the list of broker ids into a query parameter.

Does this mean the RF for the partition is changed? I feel like we should separate changing the RF from forceful-move-away-from-node.

Any particular reason? In a way we are preserving existing RF so the user doesn't have to run a separate command to up-replicate again.

Also, does this move all partitions or only unhealthy ones?

both (in the upcoming rev). The nodes in the list are marked "defunct" from RP perspective and the balancer tries to move replicas away gracefully (for those that have the majority) and forcefully reconfigure the rest (after a user approval).

twmb · 2023-11-17T00:04:17Z

In a way we are preserving existing RF
But doesn't the PR description say that if we can't preserve the original RF, then the RF is automatically downsized?

bharathv · 2023-11-21T05:23:28Z

failures are unrelated (changed some RPC structs, need to update compat test, will do it along with next rev), ready for review.

src/v/cluster/types.cc

src/v/redpanda/admin/api-doc/partition.json

src/v/cluster/cluster_utils.h

src/v/redpanda/admin/server.cc

src/v/cluster/topics_frontend.cc

Can only be submitted on the controller leader, would be simpler to redirect rather than expecting the caller to submit to controller leader.

Given an input set of dead nodes, generates a list of partition instances that has majority replicas on these dead nodes. This is a read-only API and makes no chanages to the metadata state.

example curl -X GET http://localhost:9644/v1/partitions/majority_lost\?defunct_nodes\=0,1 Input query parameter a comma separated list of node id integers.

A broker is by default in functional state unless explicitly marked defunct by user. Defunct state indicates that the broker is no longer reachable and the data is irrecoverable. This is currently irreversible. A defunct broker can only be decommissioned.

handle_ntp_move_begin_or_cancel

This ensures that replicas as drained from the unavailable nodes.

These ntps are force reconfigured as requested by the user. Introduces a new partition variant called 'force_reassignable_partition' that are dealt with in a separate pass.

Adds additional validation on the input set of nodes.

A Document is also a ValueType, passing a ValueType ensures that we can validate any ValueType which includes even a subset of a json (to be used in a later commit).

Includes the # of pending force reconfigurations and a sample of 10 ntps from that list in the balancer status page.

reusing feature flag from enhanced force reconfiguration.

bharathv · 2023-11-29T13:06:51Z

Rebased and force pushed to fix conflicts.

mmaslankaprv · 2023-11-30T08:18:05Z

src/v/cluster/topics_frontend.cc

+            const auto& assignments = (it->second).get_assignments();
+            const auto topic_revision = it->second.get_revision();
+            for (const auto& assignment : assignments) {
+                const auto& current = assignment.replicas;


plase add a it.check() here

github-actions bot added the area/redpanda label Oct 5, 2023

bharathv requested review from rystsov, ztlpn and mmaslankaprv October 5, 2023 04:38

mmaslankaprv reviewed Oct 5, 2023

View reviewed changes

src/v/cluster/cluster_utils.h Outdated Show resolved Hide resolved

mmaslankaprv reviewed Oct 5, 2023

View reviewed changes

src/v/cluster/topics_frontend.cc Outdated Show resolved Hide resolved

mmaslankaprv reviewed Oct 5, 2023

View reviewed changes

src/v/cluster/topics_frontend.cc Outdated Show resolved Hide resolved

mmaslankaprv reviewed Oct 5, 2023

View reviewed changes

src/v/cluster/topics_frontend.cc Outdated Show resolved Hide resolved

r-vasquez mentioned this pull request Nov 3, 2023

rpk: add unsafe partition recovery #14744

Closed

bharathv marked this pull request as draft November 16, 2023 09:39

bharathv force-pushed the node_wise_recovery branch from 1adbc49 to ebcfc9e Compare November 16, 2023 09:46

bharathv force-pushed the node_wise_recovery branch 3 times, most recently from 941b306 to b49f996 Compare November 20, 2023 09:03

bharathv requested a review from mmaslankaprv November 20, 2023 09:03

bharathv marked this pull request as ready for review November 20, 2023 09:03