Full recovery mode #14236

Merged: 16 commits, Oct 23, 2023

Contributor

@ztlpn ztlpn commented Oct 17, 2023

Add "full recovery mode":

  • don't load user partitions
  • disable balancers, as well as the pandaproxy and schema registry listeners
  • return errors from produce/consume-related kafka API handlers
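
To make the third point concrete, here is a minimal, self-contained sketch of gating data-path requests while letting metadata requests through. It is not the code from this PR: `api_kind`, `touches_user_data`, `admit`, and the simplified `request_context` are hypothetical stand-ins, and the use of `policy_violation` for produce as well as fetch is an assumption (the diff in this thread only shows it for fetch).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-ins; not Redpanda's actual types.
enum class error_code : std::int16_t {
    none = 0,
    policy_violation = 44, // Kafka protocol POLICY_VIOLATION
};

enum class api_kind { produce, fetch, metadata, delete_topics };

struct request_context {
    bool recovery_mode = false;
    bool recovery_mode_enabled() const { return recovery_mode; }
};

// True for data-path APIs that need user partitions to be loaded.
constexpr bool touches_user_data(api_kind k) {
    return k == api_kind::produce || k == api_kind::fetch;
}

// Reject data-path requests up front in recovery mode; admit everything else.
inline error_code admit(const request_context& rctx, api_kind k) {
    if (rctx.recovery_mode_enabled() && touches_user_data(k)) {
        return error_code::policy_violation;
    }
    return error_code::none;
}

// Usage: in recovery mode, fetch is rejected but delete_topics still passes.
inline void example() {
    request_context rctx;
    rctx.recovery_mode = true;
    assert(admit(rctx, api_kind::fetch) == error_code::policy_violation);
    assert(admit(rctx, api_kind::delete_topics) == error_code::none);
}
```

Checking the flag at the top of each handler keeps the rejection cheap and avoids touching partition state that was never loaded.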

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.2.x
  • v23.1.x
  • v22.3.x

Release Notes

Features

  • Add recovery mode - an option (for the purposes of disaster recovery) to start redpanda in "metadata-only" mode, skipping loading user partitions and allowing only metadata operations. Enabled by the recovery_mode_enabled node config property.

@vbotbuildovich
Collaborator

new failures detected in https://buildkite.com/redpanda/redpanda/builds/39141#018b3fb7-001b-4617-b94f-da3a4fe4e894: "rptest.tests.tiered_storage_model_test.TieredStorageTest.test_tiered_storage.cloud_storage_type=CloudStorageType.S3.test_case=.TS_Read==True.TS_TxRangeMaterialized==True.SpilloverManifestUploaded==True"

@ztlpn
Contributor Author

ztlpn commented Oct 18, 2023

The error doesn't look related to my changes; opened #14266.

Contributor

@bharathv bharathv left a comment

lgtm, all nits/clarification.

@@ -879,6 +879,10 @@ fetch_handler::handle(request_context rctx, ss::smp_service_group ssg) {
         octx.response.data.error_code = octx.session_ctx.error();
         return std::move(octx).send_response();
     }
+    if (octx.rctx.recovery_mode_enabled()) {
+        octx.response.data.error_code = error_code::policy_violation;
Contributor

Curious why policy_violation, and what are the client implications? Is it retryable, or does the client just give up?

Contributor Author

It is a non-recoverable error that semantically looks close to what we need here. Looks like some clients will still retry even in the presence of non-recoverable errors (e.g. rpk produce fails, but rpk consume retries indefinitely), but the hope is that these retries are less eager than for recoverable errors.
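
As a rough illustration of the client-side distinction described above, the sketch below classifies a few Kafka error codes by retriability and picks a retry delay. The numeric codes follow the Kafka protocol error table, where POLICY_VIOLATION (44) is not retriable. Everything else is an assumption for illustration: `retry_decision`, `decide`, the `keeps_polling` flag, and the delay values are hypothetical and do not describe how rpk or any particular client behaves.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Numeric values follow the Kafka protocol error table.
enum class error_code : std::int16_t {
    none = 0,
    not_leader_or_follower = 6, // retriable
    request_timed_out = 7,      // retriable
    policy_violation = 44,      // not retriable
};

constexpr bool is_retriable(error_code ec) {
    switch (ec) {
    case error_code::not_leader_or_follower:
    case error_code::request_timed_out:
        return true;
    default:
        return false;
    }
}

static_assert(!is_retriable(error_code::policy_violation),
              "POLICY_VIOLATION is a non-retriable error");

// Hypothetical client policy; the delays are arbitrary illustration values.
struct retry_decision {
    bool retry;
    std::chrono::milliseconds delay;
};

inline retry_decision decide(error_code ec, bool keeps_polling) {
    using std::chrono::milliseconds;
    if (ec == error_code::none) {
        return {false, milliseconds{0}}; // success, nothing to retry
    }
    if (is_retriable(ec)) {
        return {true, milliseconds{100}}; // eager retry
    }
    if (keeps_polling) {
        return {true, milliseconds{5000}}; // persistent client, slower retry
    }
    return {false, milliseconds{0}}; // give up and surface the error
}

// Usage: a one-shot producer gives up, a polling consumer backs off longer.
inline void example() {
    assert(!decide(error_code::policy_violation, false).retry);
    assert(decide(error_code::policy_violation, true).delay
           > decide(error_code::request_timed_out, true).delay);
}
```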

@ztlpn
Contributor Author

ztlpn commented Oct 20, 2023

changes in force-push: addressed review comments and added group describe test checks in recovery mode.

@emaxerrno
Contributor

@ztlpn this is coooool! should this come w/ a list of admin api endpoints that are avail ... perhaps admin api endpoints to trigger GC of segments or smth like that.

@mmaslankaprv
Member

/ci-repeat 1

@ztlpn ztlpn merged commit 38c923a into redpanda-data:dev Oct 23, 2023
25 checks passed
@ztlpn
Contributor Author

ztlpn commented Oct 23, 2023

@emaxerrno Metadata operations are still available, so most of the existing admin API should work without problems. Having additional endpoints for fixing problematic partitions makes sense but is a bit out of scope for this project (as a start, it would be great to at least have the ability to delete them, and recovery mode allows this).
