HBASE-26468 Region Server doesn't exit cleanly in case it crashes. #3862
Conversation
This looks good actually; just wondering what the most destructive action non-daemon threads could take might be. Maybe nothing, because the regionserver/master has already shut down (or is in the process of shutting down)?
It can't do anything, since every interaction with the hmaster/regionserver will fail.
If control reaches the changed code, then the HMaster/HRegionServer thread has already died.
Yeah, that sounds right; the server would already be unresponsive. Also, nothing happens after the call to
+1, pending QA
+1, lgtm @shahrs87
The comment on the method says it will only call System.exit when a non-zero return code is returned from the run method?
@Apache9 good catch. Changed the comment. Please review again.
What I mean is that the method is designed to preserve the old behavior, so we need to make sure it is OK for us to change it. We could fix a bug introduced by the old behavior, but we may introduce more bugs if we change the behavior... Of course I'm not saying that we cannot apply the patch, just that we need to take more care. Thanks.
@Apache9 I agree with what you are saying. I don't see how this patch introduces any more bugs, and all the tests also succeeded. Do you see any concern with this patch?
The code is pretty old... This is the commit, a refactoring actually, and there is a comment. So I think not calling System.exit when ret is zero is intentional. A zero return value means a normal shutdown, so here we want all the threads to clean themselves up and then exit gracefully. The problem here is: why would a crashed region server return zero here? Why does it not return a non-zero value? Sorry for sticking on the behavior here; the modification is pretty simple but it is very fundamental... Thanks.
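For readers without the source handy, the pre-patch behavior under discussion in ServerCommandLine#doMain amounts to roughly the following; this is a paraphrased sketch, not the exact HBase code:

// Paraphrased sketch of the pre-patch ServerCommandLine#doMain behavior
// discussed above (not the exact HBase source).
public void doMain(String[] args) {
  try {
    int ret = ToolRunner.run(HBaseConfiguration.create(), this, args);
    if (ret != 0) {
      // Only a non-zero return code triggers an explicit exit. A zero
      // return (normal shutdown) relies on every non-daemon thread
      // finishing so the JVM can terminate on its own.
      System.exit(ret);
    }
  } catch (Exception e) {
    LOG.error("Failed to run", e);
    System.exit(-1);
  }
}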
@Apache9 The regionserver doesn't crash, it exits normally. It prints this log:
Also, I now see that @shahrs87 has updated the description of the Jira HBASE-26468 and explained what exactly goes wrong and why the server stays frozen if one of the non-daemon threads is not terminated.
I have added all the details in HBASE-26468 but am adding them here for context. Within the RS (at least in branch-1.6) there is a subtle difference between the RS being aborted and being stopped. In this particular case, the RS was stopped. In HRegionServerCommandLine, if a server is aborted, it throws an RTE which in turn is converted into System.exit(1). But if the RS is stopped, it returns a return code of 0. Sure, we could change this code to return -1 when the RS is stopped so that it behaves exactly like the aborted case, but there will be bugs in the future where the RS exits with return code 0 when it should have returned a non-zero exit code.
We can always wait a few seconds for threads to clean themselves up if the RS exits with a 0 return code, but we shouldn't wait indefinitely for them to do so.
Don't be sorry. It is a good discussion and we should always try to understand the implications before we make any fundamental changes.
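To make the proposal concrete, the idea is roughly the sketch below: after run() returns 0, poll the live threads for a bounded time and force the JVM down if non-daemon threads are still hanging around. The helper names, the 30-second bound, and the use of System.exit are illustrative assumptions, not the exact patch:

// Illustrative sketch of the bounded wait-then-force-quit idea described
// above. Names and the 30-second bound are assumptions, not the actual patch.
static void waitThenForceQuit() throws InterruptedException {
  for (int i = 0; i < 30; i++) {
    if (countOtherNonDaemonThreads() == 0) {
      return; // all non-daemon threads cleaned up; the JVM can exit normally
    }
    Thread.sleep(1000); // give the remaining threads a bounded time to finish
  }
  // Still not clean after the bounded wait, so terminate the JVM explicitly
  // (the real patch may use System.exit or Runtime#halt).
  System.exit(-1);
}

static int countOtherNonDaemonThreads() {
  int count = 0;
  for (Thread t : Thread.getAllStackTraces().keySet()) {
    // Exclude the current thread; only other live non-daemon threads
    // keep the JVM from exiting.
    if (t.getId() != Thread.currentThread().getId() && !t.isDaemon()) {
      count++;
    }
  }
  return count;
}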
Left a few comments; the approach looks good overall. Nice one! And thanks to you too @Apache9 for the suggestion.
hbase-server/src/main/java/org/apache/hadoop/hbase/util/ServerCommandLine.java
    Thread.sleep(1000);
  }
  if (forceStop) {
    LOG.error("Failed to stop all non-daemon threads, so force quitting");
WARN or FATAL log level might be more suitable
The process has already stopped/aborted, so I don't want to log it at FATAL; I thought ERROR was more appropriate. Kept it at ERROR. Please let me know if you feel very strongly against ERROR.
ERROR seems good 👍. How about "Failed to stop all non-daemon threads, so terminating the regionserver JVM"?
Removed the word "regionserver" from the above message, since this code is common to HMaster as well.
  // Exclude current thread
  if (t.getId() != Thread.currentThread().getId() && !t.isDaemon()) {
    nonDaemonThreadCount.getAndIncrement();
    LOG.info("Non daemon thread name: " + t.getName() + " still alive");
nit: use placeholder {}?
It would be good if we could print the stack trace here.
Done. Please review again.
Sample log lines after deploying this code to an internal test cluster:
  if (t.getId() != Thread.currentThread().getId() && !t.isDaemon()) {
    nonDaemonThreadCount.getAndIncrement();
    LOG.info("Non daemon thread {} is still alive", t.getName());
    LOG.info(printStackTrace(t));
This is the sample stack trace after deploying it to an internal cluster:
2021-11-26 19:21:16,747 INFO [main] util.Threads - Non daemon thread main.named-queue-events-pool-pool1-t1 is still alive
2021-11-26 19:21:16,748 INFO [main] util.Threads -
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2044)
com.lmax.disruptor.BlockingWaitStrategy.waitFor(BlockingWaitStrategy.java:45)
com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:124)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
There is no easy method to print a stack trace given a thread handle. Thread#dumpStack dumps the stack of the current thread, but that is not what we need.
Cc @virajjasani @Apache9
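For reference, the workaround is to walk Thread#getStackTrace for the target thread and render the frames yourself, along the lines of this sketch (the helper name is illustrative):

// Minimal sketch: render another thread's stack, since Thread#dumpStack only
// covers the current thread. The helper name is illustrative.
static String stackTraceOf(Thread t) {
  StringBuilder sb = new StringBuilder();
  for (StackTraceElement frame : t.getStackTrace()) {
    // StackTraceElement#toString yields "declaringClass.method(File:line)"
    sb.append("\n    ").append(frame);
  }
  return sb.toString();
}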
@virajjasani @Apache9 Can you please review the patch again? I tested the change on an internal test cluster and it works as expected. Thank you!
Left 2 minor nits, +1 overall
sb.append("\n"); | ||
sb.append(" " + frame.toString()); |
nit: sb.append("\n").append(" ").append(frame.toString())
public static String printStackTrace(Thread t) {
  StringBuilder sb = new StringBuilder();
  for (StackTraceElement frame : t.getStackTrace()) {
    sb.append("\n").append(" " + frame.toString());
  }
  return sb.toString();
}
nit: append(" ").append(frame.toString())
Kept multiple spaces so as to make it look like a stack trace.
Oh no, not that. I meant appending frame.toString() with .append() rather than String concatenation. But it's a minor nit only. Maybe you can make this change (or not) and also keep the backport PRs (#3865, #3866) up to date, and all can be merged together tomorrow.
(I will take care of the branch-2.4 backport from the branch-2 backport PR.)
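To spell out the nit for anyone reading along, the two variants produce identical output; the chained form just avoids building an intermediate String per frame. A small illustration, not code from the patch:

// Small illustration of the nit discussed above (not code from the patch).
StringBuilder sb = new StringBuilder();
for (StackTraceElement frame : Thread.currentThread().getStackTrace()) {
  // Variant in the patch: String concatenation inside a single append().
  // sb.append("\n").append("    " + frame.toString());
  // Reviewer's suggestion: chain append() calls instead of concatenating.
  sb.append("\n").append("    ").append(frame);
}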
@virajjasani Updated the current PR and the backports for branch-2 and branch-1.
(#3862) Signed-off-by: Duo Zhang <[email protected]> Signed-off-by: Geoffrey Jacoby <[email protected]> Signed-off-by: Viraj Jasani <[email protected]>