HBASE-24286: HMaster won't become healthy after cloning or crea… #2114

Open
wants to merge 1 commit into base: master

Conversation

taklwu
Contributor

@taklwu taklwu commented Jul 21, 2020

…ting a new cluster pointing at the same file system

HBase currently does not handle `Unknown Servers` automatically; users must
run hbck2 `scheduleRecoveries` whenever they see unknown servers on
the HBase report UI.

This has become a blocker for HBase2 adoption, especially when a table wasn't
disabled before shutting down an HBase cluster in the cloud or in any dynamic
environment where hostnames may change frequently. Once the cluster restarts,
hbase:meta still holds the old hostnames/IPs of the previous cluster,
so those region servers become `Unknown Servers` and are never recycled.

Our fix is to trigger a repair as soon as the CatalogJanitor
detects any `Unknown Servers`, by submitting an HBCKServerCrashProcedure
so that regions on an `Unknown Server` can be reassigned to other online
servers.
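
As a rough illustration of the idea described above (not the patch's actual code), the sketch below compares the servers that region locations still point at with the set of online servers and collects the ones that would need a recovery. `UnknownServerSketch`, `findUnknownServers`, the plain-string server names, and the sample host names are hypothetical stand-ins; the real change works on HBase's own types and, per the description, submits an HBCKServerCrashProcedure for the unknown servers.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class UnknownServerSketch {
  /**
   * Returns the servers that regions are still assigned to (according to a
   * hypothetical region -> server map read from hbase:meta) but that are
   * not in the set of online servers.
   */
  static Set<String> findUnknownServers(Map<String, String> regionToServer,
      Set<String> onlineServers) {
    Set<String> unknown = new LinkedHashSet<>();
    for (String server : regionToServer.values()) {
      if (!onlineServers.contains(server)) {
        unknown.add(server);
      }
    }
    return unknown;
  }

  public static void main(String[] args) {
    // Made-up sample data: one region still points at a host from the old cluster.
    Map<String, String> meta = Map.of(
        "region-a", "old-host-1,16020,1595000000000",
        "region-b", "new-host-1,16020,1595300000000");
    Set<String> online = Set.of("new-host-1,16020,1595300000000");
    // Each server printed here is one whose regions would need reassigning.
    for (String server : findUnknownServers(meta, online)) {
      System.out.println("would schedule recovery for " + server);
    }
  }
}
```

The sketch only checks the online set; a real implementation would presumably also have to account for servers that are already being processed as dead.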

Member

@joshelser joshelser left a comment


Nice test to accompany this.

While I think the CatalogJanitor approach is probably an effective solution, I wonder if there's a "faster" solution we could do.

The main question is, when we don't have ZooKeeper telling us that a RegionServer has died, how can we be certain that a RegionServer won't "come back"? If we get into a situation where data was still hosted on a RegionServer we thought was dead, we would double-assign the region and that'd be a big bug.

Any thoughts on how to try to minimize the chance of us incorrectly marking a RegionServer as dead?

@@ -173,6 +175,8 @@ int scan() throws IOException {
       this.lastReport = scanForReport();
       if (!this.lastReport.isEmpty()) {
         LOG.warn(this.lastReport.toString());
+        // expires unknown servers
+        repairUnknownServers();
Member


I guess this is not an issue for master (which doesn't have the separate namespace table), but elsewhere do we still have `hbase.master.namespace.init.timeout` setting an upper bound on how long we wait for hbase:namespace to get assigned? I'm thinking that waiting for the CatalogJanitor to run will be a pretty "slow" solution (up to a 5-minute wait), and we may have a master crash if the ns init timeout is 5 minutes, the same as the catalog janitor's interval.
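
The timing concern above can be made concrete with a small back-of-the-envelope check. This is only a sketch: `hbase.master.namespace.init.timeout` is the setting named in the comment, `hbase.catalogjanitor.interval` is assumed here to be the key that controls the janitor period, and the 5-minute defaults, the class `JanitorTimingSketch`, and the method `repairMayArriveTooLate` are illustrative assumptions rather than anything taken from this patch. `java.util.Properties` stands in for HBase's configuration object.

```java
import java.util.Properties;

public class JanitorTimingSketch {
  // Assumed default for both settings, matching the 5 minutes mentioned in the comment.
  static final long FIVE_MINUTES_MS = 5 * 60 * 1000L;

  /**
   * True when the worst-case wait for the next janitor run (one full interval)
   * is at least as long as the namespace init timeout.
   */
  static boolean repairMayArriveTooLate(Properties conf) {
    long janitorIntervalMs = Long.parseLong(conf.getProperty(
        "hbase.catalogjanitor.interval", String.valueOf(FIVE_MINUTES_MS)));
    long nsInitTimeoutMs = Long.parseLong(conf.getProperty(
        "hbase.master.namespace.init.timeout", String.valueOf(FIVE_MINUTES_MS)));
    return janitorIntervalMs >= nsInitTimeoutMs;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    System.out.println("defaults at risk: " + repairMayArriveTooLate(conf));
    // Shortening the janitor interval leaves room for the repair to run first.
    conf.setProperty("hbase.catalogjanitor.interval", "60000");
    System.out.println("1-minute interval at risk: " + repairMayArriveTooLate(conf));
  }
}
```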

@Apache-HBase

🎊 +1 overall

| Vote | Subsystem | Runtime | Comment |
|:----:|----------:|--------:|:--------|
| +0 🆗 | reexec | 0m 36s | Docker mode activated. |
| | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | No case conflicting files found. |
| +1 💚 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 💚 | @author | 0m 0s | The patch does not contain any @author tags. |
| | | | _ master Compile Tests _ |
| +1 💚 | mvninstall | 3m 50s | master passed |
| +1 💚 | checkstyle | 1m 6s | master passed |
| +1 💚 | spotbugs | 2m 5s | master passed |
| | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 3m 28s | the patch passed |
| +1 💚 | checkstyle | 1m 5s | the patch passed |
| +1 💚 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 💚 | hadoopcheck | 11m 20s | Patch does not cause any errors with Hadoop 3.1.2 3.2.1. |
| +1 💚 | spotbugs | 2m 48s | the patch passed |
| | | | _ Other Tests _ |
| +1 💚 | asflicense | 0m 14s | The patch does not generate ASF License warnings. |
| | | 35m 41s | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | Client=19.03.12 Server=19.03.12 base: https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-2114/1/artifact/yetus-general-check/output/Dockerfile |
| GITHUB PR | #2114 |
| JIRA Issue | HBASE-24286 |
| Optional Tests | dupname asflicense spotbugs hadoopcheck hbaseanti checkstyle |
| uname | Linux b0c22b5b487c 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / f35c5ea |
| Max. process+thread count | 94 (vs. ulimit of 12500) |
| modules | C: hbase-server U: hbase-server |
| Console output | https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-2114/1/console |
| versions | git=2.17.1 maven=(cecedd343002696d0abb50b32b541b8a6ba2883f) spotbugs=3.1.12 |
| Powered by | Apache Yetus 0.11.1 https://yetus.apache.org |

This message was automatically generated.

@Apache-HBase

💔 -1 overall

| Vote | Subsystem | Runtime | Comment |
|:----:|----------:|--------:|:--------|
| +0 🆗 | reexec | 0m 37s | Docker mode activated. |
| -0 ⚠️ | yetus | 0m 3s | Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck |
| | | | _ Prechecks _ |
| | | | _ master Compile Tests _ |
| +1 💚 | mvninstall | 3m 52s | master passed |
| +1 💚 | compile | 0m 58s | master passed |
| +1 💚 | shadedjars | 5m 35s | branch has no errors when building our shaded downstream artifacts. |
| +1 💚 | javadoc | 0m 37s | master passed |
| | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 3m 30s | the patch passed |
| +1 💚 | compile | 0m 55s | the patch passed |
| +1 💚 | javac | 0m 55s | the patch passed |
| +1 💚 | shadedjars | 5m 32s | patch has no errors when building our shaded downstream artifacts. |
| +1 💚 | javadoc | 0m 38s | the patch passed |
| | | | _ Other Tests _ |
| -1 ❌ | unit | 143m 31s | hbase-server in the patch failed. |
| | | 167m 59s | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | Client=19.03.12 Server=19.03.12 base: https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-2114/1/artifact/yetus-jdk8-hadoop3-check/output/Dockerfile |
| GITHUB PR | #2114 |
| JIRA Issue | HBASE-24286 |
| Optional Tests | javac javadoc unit shadedjars compile |
| uname | Linux 68b843a4f0a7 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / f35c5ea |
| Default Java | 1.8.0_232 |
| unit | https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-2114/1/artifact/yetus-jdk8-hadoop3-check/output/patch-unit-hbase-server.txt |
| Test Results | https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-2114/1/testReport/ |
| Max. process+thread count | 4659 (vs. ulimit of 12500) |
| modules | C: hbase-server U: hbase-server |
| Console output | https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-2114/1/console |
| versions | git=2.17.1 maven=(cecedd343002696d0abb50b32b541b8a6ba2883f) |
| Powered by | Apache Yetus 0.11.1 https://yetus.apache.org |

This message was automatically generated.

@Apache-HBase

💔 -1 overall

| Vote | Subsystem | Runtime | Comment |
|:----:|----------:|--------:|:--------|
| +0 🆗 | reexec | 1m 18s | Docker mode activated. |
| -0 ⚠️ | yetus | 0m 2s | Unprocessed flag(s): --brief-report-file --spotbugs-strict-precheck --whitespace-eol-ignore-list --whitespace-tabs-ignore-list --quick-hadoopcheck |
| | | | _ Prechecks _ |
| | | | _ master Compile Tests _ |
| +1 💚 | mvninstall | 4m 50s | master passed |
| +1 💚 | compile | 1m 19s | master passed |
| +1 💚 | shadedjars | 7m 6s | branch has no errors when building our shaded downstream artifacts. |
| -0 ⚠️ | javadoc | 0m 58s | hbase-server in master failed. |
| | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 5m 10s | the patch passed |
| +1 💚 | compile | 1m 29s | the patch passed |
| +1 💚 | javac | 1m 29s | the patch passed |
| +1 💚 | shadedjars | 6m 51s | patch has no errors when building our shaded downstream artifacts. |
| -0 ⚠️ | javadoc | 0m 44s | hbase-server in the patch failed. |
| | | | _ Other Tests _ |
| -1 ❌ | unit | 206m 34s | hbase-server in the patch failed. |
| | | 238m 7s | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | Client=19.03.12 Server=19.03.12 base: https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-2114/1/artifact/yetus-jdk11-hadoop3-check/output/Dockerfile |
| GITHUB PR | #2114 |
| JIRA Issue | HBASE-24286 |
| Optional Tests | javac javadoc unit shadedjars compile |
| uname | Linux 0742f5feebba 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/hbase-personality.sh |
| git revision | master / f35c5ea |
| Default Java | 2020-01-14 |
| javadoc | https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-2114/1/artifact/yetus-jdk11-hadoop3-check/output/branch-javadoc-hbase-server.txt |
| javadoc | https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-2114/1/artifact/yetus-jdk11-hadoop3-check/output/patch-javadoc-hbase-server.txt |
| unit | https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-2114/1/artifact/yetus-jdk11-hadoop3-check/output/patch-unit-hbase-server.txt |
| Test Results | https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-2114/1/testReport/ |
| Max. process+thread count | 3856 (vs. ulimit of 12500) |
| modules | C: hbase-server U: hbase-server |
| Console output | https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-2114/1/console |
| versions | git=2.17.1 maven=(cecedd343002696d0abb50b32b541b8a6ba2883f) |
| Powered by | Apache Yetus 0.11.1 https://yetus.apache.org |

This message was automatically generated.

@petersomogyi
Contributor

I recently tested this patch in order to start an HBase cluster on a pre-existing HBase root directory. Currently I have to use HBCK2 recoverUnknown or SCPs, but this patch automates the startup procedure. Based on my testing, the patch works well and HBase successfully reassigns the regions that are present in the hbase:meta table with different hostnames (a.k.a. unknown servers).

Since some installations might not need this behavior, I'm suggesting to hide this behind a feature flag.
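
A feature flag of this kind could look roughly like the sketch below. The key name `hbase.master.repair.unknown.servers.enabled`, the default of `false`, and the class and method names are all hypothetical, and `java.util.Properties` stands in for HBase's configuration object to keep the example self-contained.

```java
import java.util.Properties;

public class RepairFlagSketch {
  // Hypothetical key for illustration only; not an existing HBase setting.
  static final String REPAIR_ENABLED_KEY = "hbase.master.repair.unknown.servers.enabled";

  /** Repairs run only when the flag is switched on and there is something to repair. */
  static boolean shouldRepair(Properties conf, int unknownServerCount) {
    boolean enabled = Boolean.parseBoolean(conf.getProperty(REPAIR_ENABLED_KEY, "false"));
    return enabled && unknownServerCount > 0;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    System.out.println("flag unset: " + shouldRepair(conf, 3));
    conf.setProperty(REPAIR_ENABLED_KEY, "true");
    System.out.println("flag on: " + shouldRepair(conf, 3));
  }
}
```

Defaulting the flag to off keeps existing installations on the manual HBCK2 workflow unless they opt in; whether that is the right default is a question for the reviewers.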

@joshelser
Member

I'm suggesting to hide this behind a feature flag

Makes sense to me. I think that addresses some of the other concerns from @Apache9 (mentioning him to make sure that's OK with him).

If @taklwu is OK with it (and can grant you edit perms), maybe you can update this PR with your changes? Or, close this and open a new one with your modifications.

@z-york
Contributor

z-york commented Jun 21, 2021

@petersomogyi It's probably worth pinging on #2113 and subsequent PRs, as that's where most of the conversation happened. I thought Stephen had offered to put this behind a config before, but maybe I'm mistaken. Anyway, it's worth revisiting this issue IMO.
