HBASE-24286: HMaster won't become healthy after cloning or crea… #2114
base: master
…ting a new cluster pointing at the same file system

HBase currently does not handle `Unknown Servers` automatically and requires users to run the hbck2 `scheduleRecoveries` command when they see unknown servers on the HBase report UI. This has become a blocker for HBase2 adoption, especially when a table wasn't disabled before shutting down an HBase cluster in the cloud or in any dynamic environment where hostnames may change frequently. Once the cluster restarts, hbase:meta still holds the old hostnames/IPs of the previous cluster, so those region servers become `Unknown Servers` and are never recycled.

Our fix here is to trigger a repair immediately after the CatalogJanitor finds any `Unknown Servers`, by submitting an HBCKServerCrashProcedure so that regions on an `Unknown Server` can be reassigned to other online servers.
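The detection step described above can be pictured as a set difference. The following is a minimal, self-contained sketch and not the code in this patch: `UnknownServerSketch`, `findUnknownServers`, and the plain-string server names are hypothetical stand-ins, and it assumes an unknown server is one that hbase:meta still references but that the master tracks as neither online nor dead.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model for illustration only; none of this is HBase API.
// Server names are plain strings of the form host,port,startcode.
final class UnknownServerSketch {

    // Returns the servers that are still referenced in meta but are tracked
    // as neither online nor dead (assumed definition of "unknown").
    static Set<String> findUnknownServers(Set<String> referencedInMeta,
                                          Set<String> online,
                                          Set<String> knownDead) {
        Set<String> unknown = new TreeSet<>(referencedInMeta);
        unknown.removeAll(online);
        unknown.removeAll(knownDead);
        return unknown;
    }
}
```

In the real change, each server found this way would be handed to an HBCKServerCrashProcedure so that its regions get reassigned; the sketch stops at detection.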
Nice test to accompany this.
While I think the CatalogJanitor approach is probably an effective solution, I wonder if there's a "faster" solution we could do.
The main question is, when we don't have ZooKeeper telling us that a RegionServer has died, how can we be certain that a RegionServer won't "come back"? If we get into a situation where data was still hosted on a RegionServer we thought was dead, we would double-assign the region and that'd be a big bug.
Any thoughts on how to try to minimize the chance of us incorrectly marking a RegionServer as dead?
@@ -173,6 +175,8 @@ int scan() throws IOException {
      this.lastReport = scanForReport();
      if (!this.lastReport.isEmpty()) {
        LOG.warn(this.lastReport.toString());
        // expires unknown servers
        repairUnknownServers();
I guess this is not an issue for master (which doesn't have the separate namespace table), but elsewhere do we still have `hbase.master.namespace.init.timeout` setting an upper bound on how long we wait for hbase:namespace to get assigned? My concern is that waiting for the CatalogJanitor to run will be a pretty "slow" solution (up to a 5 min wait), and the master may crash if the ns init timeout is 5 mins, the same as the catalog janitor's interval.
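The timing concern can be made concrete with a small comparison. This is only a sketch: the class and method names are hypothetical, the two five-minute values are the ones quoted in this comment, and it assumes the first repair pass can be delayed by up to one full janitor interval.

```java
import java.time.Duration;

// Hypothetical helper for illustration; not part of HBase.
final class JanitorTimingSketch {

    // True when the worst-case wait for the first repair pass (one full
    // janitor interval) reaches or exceeds the namespace init timeout.
    static boolean repairMayMissDeadline(Duration janitorInterval,
                                         Duration nsInitTimeout) {
        return janitorInterval.compareTo(nsInitTimeout) >= 0;
    }
}
```

With both values at five minutes the check returns true, which is the race described here; a janitor interval shorter than the timeout would leave some margin.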
🎊 +1 overall
This message was automatically generated.
💔 -1 overall
This message was automatically generated.
💔 -1 overall
This message was automatically generated.
I've done some testing recently with this patch so that I can start an HBase cluster on a pre-existing HBase root directory. Currently I have to use … Since some installations might not need this, I suggest hiding it behind a feature flag.
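A feature flag of the kind suggested here could look roughly like the sketch below. It is self-contained and uses `java.util.Properties` in place of the HBase configuration object; the key name is invented for illustration and is not a real HBase setting.

```java
import java.util.Properties;

// Hypothetical sketch; the class, method, and key are not HBase API.
final class RepairFlagSketch {

    // Invented key name, used only for this example.
    static final String KEY = "example.catalogjanitor.repair.unknown.servers";

    // Defaults to false so installations that do not opt in keep the
    // current behavior (manual hbck2 scheduleRecoveries).
    static boolean repairEnabled(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(KEY, "false"));
    }
}
```

The janitor would then call the repair step only when `repairEnabled` returns true. Defaulting to off is a design choice of this sketch, not something decided in the thread.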
Makes sense to me. I think that addresses some of the other concerns from @Apache9 (mentioning him to make sure that's OK with him). If @taklwu is OK with it (and can grant you edit perms), maybe you can update this PR with your changes? Or, close this and open a new one with your modifications.
@petersomogyi It's probably worth pinging on #2113 and subsequent PRs as that's where most of the conversation happened. I thought Stephen had offered to put this behind a config before, but maybe I'm mistaken. Either way, it's worth revisiting this issue IMO.