HBASE-22867 The ForkJoinPool in CleanerChore will spawn thousands of threads in our cluster with thousands table #513
Conversation
PR'd an initial patch, needs more consideration.
💔 -1 overall
This message was automatically generated.
task.fork();
}
allSubdirsDeleted = deleteAction(() -> getCleanResult(tasks), "subdirs");
subDirs.forEach(dir -> futures.add(pool.execute(new CleanerTask(dir, false))));
I think submitting the task to the same pool is not safe here. In ForkJoinPool it is safe to call join, because inside the join method the current thread will try to execute pending tasks if there are any, so in general we can always make progress and eventually finish all the tasks. But here the current thread will be blocked on the future.get, so it is possible that all the threads in the pool end up blocked on a future.get, causing a deadlock.
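For context, a minimal, hypothetical sketch (not HBase code; all names are illustrative) of the failure mode described above: every task submits its sub-task to the same fixed-size pool and then blocks on Future.get(). Unlike ForkJoinTask.join(), which lets a blocked worker steal and run pending tasks, Future.get() simply parks the thread, so once all workers are waiting, the queued sub-tasks can never run and the pool deadlocks.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration (not HBase code): every task submits a child task to the
// SAME fixed-size pool and then blocks on its Future. Once all worker threads are
// blocked in get(), no thread is left to run the queued child tasks.
public class SamePoolDeadlockSketch {
  public static void main(String[] args) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    Future<?> root = pool.submit(() -> recurse(pool, 0));
    root.get(); // with recursion depth >= pool size, this call never returns
  }

  private static int recurse(ExecutorService pool, int depth) {
    if (depth >= 2) {
      return depth;
    }
    Future<Integer> child = pool.submit(() -> recurse(pool, depth + 1));
    try {
      // Unlike ForkJoinTask.join(), Future.get() does not let the calling worker
      // "help" by executing queued tasks; the thread just parks here.
      return child.get();
    } catch (InterruptedException | ExecutionException e) {
      throw new RuntimeException(e);
    }
  }
}
```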
Ping @Apache9, I've rewritten the cleaner, please take a look again.
Seems there is still some problem, the UTs hang easily now. Let me check.
OK, I got the problem, I forgot to complete the future somewhere.. it's easy to fix.
Fixed all the UTs in CleanerChore, should be good now.
💔 -1 overall
This message was automatically generated.
In general I think this is a possible way to solve the problem: use a CompletableFuture to get the final result, and in the actions executed in the thread pool just return without waiting for the result of the CompletableFuture. But it seems we still use the ForkJoinPool? This may still lead to too many threads?
// Step.4: Once all sub-files & sub-directories are deleted, then can try to delete the
// current directory asynchronously.
CompletableFuture.allOf(futures.toArray(new CompletableFuture[futures.size()]))
  .whenComplete((voidObj, e) -> {
Better to use FutureUtils.addListener here. Although we do not run error-prone checks in pre-commit now, this will lead to a FutureReturnValueIgnored warning, and it is a real problem: if the code in whenComplete throws an exception, the current implementation will hide it and you can only see the program hang. Using FutureUtils.addListener can at least print out the exception stack trace.
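As a hedged illustration of the issue raised here (hypothetical code, not the patch itself): if the callback passed to whenComplete() throws, the exception lands only in the CompletableFuture returned by whenComplete(), which is discarded, so nothing is logged and the program appears to hang. A helper such as FutureUtils.addListener is suggested above precisely because it wraps the callback so the throwable is at least logged.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical illustration of the pitfall: the callback passed to whenComplete()
// throws, but because the CompletableFuture returned by whenComplete() is ignored,
// the error never surfaces anywhere -- the program just looks stuck.
public class WhenCompletePitfallSketch {
  public static void main(String[] args) {
    CompletableFuture<Void> result = new CompletableFuture<>();
    CompletableFuture.completedFuture(null).whenComplete((v, e) -> {
      if (computeSomething() < 0) { // imagine real post-processing logic here
        throw new IllegalStateException("boom"); // swallowed by the ignored future
      }
      result.complete(null);
    });
    // result is never completed and the exception is never printed.
  }

  private static int computeSomething() {
    return -1;
  }
}
```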
OK, let me update the patch.
@Apache9
💔 -1 overall
This message was automatically generated.
Ping @Apache9 @Reidddddd, any other concerns? I've made this work in our offline cluster (which encountered this problem before), and it seems to work pretty well now.
🎊 +1 overall
This message was automatically generated.
@@ -213,9 +213,14 @@ private void preRunCleaner() {

public Boolean runCleaner() {
Can this be boolean here? We will not return null from the future.get?
Yeah, it can be.
} catch (Exception ie) {
// Must handle the inner exception here, otherwise the result may get stuck if one
// sub-directory get some failure.
result.completeExceptionally(ie);
I think in the past we would return false instead of throwing the exception out?
Em, let me check whether it will impact the behavior.
I've checked the logic, there should be no difference here. Both the exception from a sub-directory and the boolean flag will skip the deletion of the current directory, so it should be OK.
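A minimal sketch of the behavior agreed on in this thread (names are illustrative, not the actual patch): the async traversal reports failure by completing the future exceptionally, and the synchronous runCleaner-style caller maps that exception to false, which matches the old return-false-on-failure contract.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch, not the actual patch: a failure in the traversal completes the
// future exceptionally, and the caller converts that into a plain `false`.
public class CleanerResultSketch {

  static CompletableFuture<Boolean> traverseAndDelete() {
    CompletableFuture<Boolean> result = new CompletableFuture<>();
    try {
      // ... delete sub-files / sub-directories, then the directory itself ...
      result.complete(true);
    } catch (Exception e) {
      // Without this the caller below would block forever on get().
      result.completeExceptionally(e);
    }
    return result;
  }

  static boolean runCleaner() {
    try {
      return traverseAndDelete().get();
    } catch (Exception e) {
      // Same outcome as before: a failure in any sub-directory means "not cleaned".
      return false;
    }
  }
}
```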
synchronized void execute(ForkJoinTask<?> task) {
pool.execute(task);
synchronized void runAsync(Runnable runnable) {
CompletableFuture.runAsync(runnable, pool);
Emmm, what is this used for? The method returns a CompletableFuture but we just ignore it? Then what is the difference compared to just calling pool.execute?
Fine, we can just use pool#execute now. I planned to use the CompletableFuture from runAsync to get the final result before.
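For illustration (a sketch using a plain JDK executor, not the actual patch): when the returned future is ignored, CompletableFuture.runAsync behaves much like pool.execute, except that an exception thrown by the task is captured in the discarded future instead of reaching the worker thread's uncaught-exception handler, which is one more reason to prefer execute here.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative only: with the returned future ignored, runAsync behaves like execute,
// except that an exception thrown by the task is captured in the discarded future
// instead of being reported by the worker thread's uncaught-exception handler.
public class RunAsyncVsExecuteSketch {
  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(1);
    pool.execute(() -> { throw new RuntimeException("visible via uncaught handler"); });
    CompletableFuture.runAsync(
        () -> { throw new RuntimeException("silently captured and dropped"); }, pool);
    pool.shutdown();
  }
}
```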
LOG.info("Update chore's pool size from {} to {}", pool.getParallelism(), size); | ||
pool = new ForkJoinPool(size); | ||
LOG.info("Update chore's pool size from {} to {}", pool.getPoolSize(), size); | ||
pool = initializePool(size); |
For ThreadPoolExecutor there is a setCorePoolSize method, so we do not need to recreate it. I think we can make it a final member.
OK, let me address this in the new patch, good suggestion.
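A hedged sketch of this suggestion (hypothetical field and method names): keep the ThreadPoolExecutor as a final member and resize it in place with setCorePoolSize/setMaximumPoolSize instead of recreating the pool.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the suggestion: keep the executor as a final field and
// resize it in place instead of throwing it away and creating a new one.
public class ResizablePoolSketch {
  private final ThreadPoolExecutor pool =
      new ThreadPoolExecutor(4, 4, 1, TimeUnit.MINUTES, new LinkedBlockingQueue<>());

  synchronized void updatePoolSize(int size) {
    // Order matters when growing vs. shrinking; raising the max first avoids an
    // IllegalArgumentException when the new size exceeds the current maximum.
    if (size > pool.getMaximumPoolSize()) {
      pool.setMaximumPoolSize(size);
      pool.setCorePoolSize(size);
    } else {
      pool.setCorePoolSize(size);
      pool.setMaximumPoolSize(size);
    }
  }
}
```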
🎊 +1 overall
This message was automatically generated.
1483af2 to a3b3826 (compare)
No big problems, only a few small questions. Will give a +1 once the QA comes back fine.
LOG.info("Cleaner pool size is {}", size); | ||
cleanerLatch = 0; | ||
} | ||
|
||
private static ThreadPoolExecutor initializePool(int size) { | ||
return new ThreadPoolExecutor(size, size, 500, TimeUnit.SECONDS, new LinkedBlockingQueue<>(), |
Usually it will be 1 minute? Why 500 seconds? And I think we can call allowCoreThreadTimeOut(true) here to release the threads if not needed?
OK, I did not consider the keepAliveTime much (can make it 1 minute if necessary). The allowCoreThreadTimeOut sounds good.
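A possible shape for the pool initialization discussed here (a sketch under the assumptions above, not the actual patch): a 1-minute keep-alive plus allowCoreThreadTimeOut(true), so even core threads are released once the cleaner has been idle.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the suggested setup (not the actual patch): a 1-minute
// keep-alive combined with allowCoreThreadTimeOut(true), so even the core threads
// are released once the cleaner has been idle for a while.
public class CleanerPoolSketch {
  static ThreadPoolExecutor initializePool(int size) {
    ThreadPoolExecutor pool =
        new ThreadPoolExecutor(size, size, 1, TimeUnit.MINUTES, new LinkedBlockingQueue<>());
    pool.allowCoreThreadTimeOut(true);
    return pool;
  }
}
```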
💔 -1 overall
This message was automatically generated.
The HBase QA failure seems unrelated, but will try the HBase QA again.
+1.
}
});
} catch (Exception e) {
LOG.warn("Failed to traverse and delete the path: {}", dir, e);
LOG.warn here may be very chatty. Debug or Trace is up to your choice.
OK, here it can be DEBUG, because we put the exception into the result and will get it in the runCleaner method. Let me fix this.
One trivial comment about logging, LGTM overall.
…threads in our cluster with thousands table
…threads in our cluster with thousands table (#513) Signed-off-by: Duo Zhang <[email protected]> Signed-off-by: Reid Chan <[email protected]>
💔 -1 overall
This message was automatically generated.
…threads in our cluster with thousands table (apache#513) Signed-off-by: Duo Zhang <[email protected]> Signed-off-by: Reid Chan <[email protected]> (cherry picked from commit 0690864) Change-Id: Ic4fcf6c8c5d585da2f65e881bee500149a3c1ace