Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

config: enable crash_loop_limit by default #13431

Merged
merged 2 commits into from
Sep 19, 2023

Conversation

bharathv
Copy link
Contributor

@bharathv bharathv commented Sep 14, 2023

Defaults to 5. If a broker shuts down uncleanly 5 times back to back, it is considered to be in a crash loop.

Crash loop limit enforcement is disabled in developer mode which is also set by rpk's dev-container mode.
The idea is that crash_loop_limit tracking has value only when running at scale and disabling it prevents inadverdent lockups (and hence better UX) when running in container mode / CI etc.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.2.x
  • v23.1.x
  • v22.3.x

Release Notes

Features

  • crash_loop_limit now defaults to 5. If a broker uncleanly shutdowns for 5 times back to back, it is considered to be in a crash loop mode and Redpanda refuses to start up and may need a manual intervention. This enforcement is disabled in developer mode and rpk's dev-container mode.

dotnwat
dotnwat previously approved these changes Sep 14, 2023
Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Crash loop limit enforcement is disabled in developer mode which is also set by rpk's dev-container mode.
The idea is that crash_loop_limit tracking has value only when running at scale and disabling it prevents inadverdent lockups (and hence better UX) when running in container mode / CI etc.

Thanks for explaining this context.

Should this be backported? It kinda seems like yes since the same basic problem can happen to older systems as well. cc @piyushredpanda ?

@bharathv
Copy link
Contributor Author

Should this be backported? It kinda seems like yes since the same basic problem can happen to older systems as well

I think so too but Piyush had a different opinion. Piyush, can you please confirm?

Think the ducktape failures are related, please don't merge, I need to take a closer look.

@piyushredpanda
Copy link
Contributor

Yep, I think I'm good to backport actually (tagged y'all in parallel in the slack thread internally :) )

@dotnwat dotnwat self-requested a review September 14, 2023 22:23
@dotnwat dotnwat dismissed their stale review September 14, 2023 22:24

avoiding force merge accident until failures are triaged

This disables it also for dev-container mode in rpk

The idea is that crash_loop_limit tracking has value only when running
at scale and disabling it prevents inadverdent lockups (and hence better
UX) when running in container mode / CI etc.
Defaults to 5. If a broker shuts down uncleanly 5 times back to back,
it is considered to be in a crash loop.
@bharathv
Copy link
Contributor Author

Test failures:

Known issue: #11998

@bharathv
Copy link
Contributor Author

@dotnwat this is ready for review again.

Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm.

was wondering if the extra checks of the tracker file existing would be needed if we did

auto limit = developer_mode() ? std::nullopt : config::node().crash_loop_limit.value();

@piyushredpanda piyushredpanda merged commit 3ad3700 into redpanda-data:dev Sep 19, 2023
9 checks passed
@vbotbuildovich
Copy link
Collaborator

/backport v23.2.x

@vbotbuildovich
Copy link
Collaborator

/backport v23.1.x

@vbotbuildovich
Copy link
Collaborator

Failed to create a backport PR to v23.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-13431-v23.1.x-834 remotes/upstream/v23.1.x
git cherry-pick -x 43749ea62067cf6d72512d144b36794076622cbd 12b5afa1bdbf944d2f68d5d98865fb07760310f3

Workflow run logs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants