-
Notifications
You must be signed in to change notification settings - Fork 24.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Die with dignity while merging #27265
Conversation
Do not be thrown off by the size of the diff here, most of this is moving code to a new class called |
If an out of memory error is thrown while merging, today we quietly rewrap it into a merge exception and the out of memory error is lost. Instead, we need to rethrow out of memory errors, and in fact any fatal error here, and let those go uncaught so that the node is torn down. This commit causes this to be the case.
@@ -1925,14 +1917,34 @@ public void onFailure(Exception e) { | |||
|
|||
@Override | |||
protected void doRun() throws Exception { | |||
MergePolicy.MergeException e = new MergePolicy.MergeException(exc, dir); | |||
failEngine("merge failed", e); | |||
maybeDie("fatal error while merging", exc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiosity, why maybeDie
in the generic thread instead of calling maybeDie
as the first thing in handleMergeException
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question, I should have left a comment explaining this. The reason is because that will be on the Lucene merge thread where I have no guarantees that the call stack does not contain catch throwable! By moving this to another thread that we have complete control over, we know what the stack contains and know this will go uncaught.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See also the Javadoc I had put on maybeDie
that explains that callers must ensure the call stack does not contain catch statements that would lead to the thrown error being caught and never reaching the uncaught exception handler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh okay, that makes sense. Thanks for the explanation!
What happens if there isn't enough memory to create the new thread? Would it be good to do it in both places just to make best-effort to be sure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unlikely that a new thread will be created here, instead that we will simply be reusing an existing one from the generic thread pool. However, it still raises the question of what happens if some allocation in the path fails leading to another out of memory exception being thrown. In that case, that one would go uncaught and still lead to the node being torn down, exactly what we are already trying to achieve.
@bleskes Would you please review? |
@@ -1916,7 +1908,6 @@ protected void doRun() throws Exception { | |||
|
|||
@Override | |||
protected void handleMergeException(final Directory dir, final Throwable exc) { | |||
logger.error("failed to merge", exc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: shall we keep this here, for fast visibility and also, just in case it happens during node shutdown..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I glanecd at the lucene code leading here and it seems all good. I do wonder if it's good to assert we can't find a Error in the cause chain of exc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the Lucene code as written right now, that's not possible. I'm really not sure if I want to add this assertion right now (look at how complicated it is in Netty4Utils#maybeError
where Netty can hide things we do not want hidden from us). I understand the point of the assertion is to ensure that what we think we see in the Lucene code always holds, but then it's not clear to me if we'd ever encounter a situation where an Error
occurs in testing that would trip one of these assertions anyway. So I see a cost to adding the assertion, and I'm not sure if I see benefit more than that cost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point about tests maybe not being enough. I was thinking of something simple like assert ExceptionsHelper.unwrap(exc, Error.class) == null : exc
. the maybeError also looks for suppressions which I'm fine with not doing. Alternatively we can convert maybError to a top level utility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will consider this in a follow-up.
If an out of memory error is thrown while merging, today we quietly rewrap it into a merge exception and the out of memory error is lost. Instead, we need to rethrow out of memory errors, and in fact any fatal error here, and let those go uncaught so that the node is torn down. This commit causes this to be the case. Relates #27265
If an out of memory error is thrown while merging, today we quietly rewrap it into a merge exception and the out of memory error is lost. Instead, we need to rethrow out of memory errors, and in fact any fatal error here, and let those go uncaught so that the node is torn down. This commit causes this to be the case. Relates #27265
If an out of memory error is thrown while merging, today we quietly rewrap it into a merge exception and the out of memory error is lost. Instead, we need to rethrow out of memory errors, and in fact any fatal error here, and let those go uncaught so that the node is torn down. This commit causes this to be the case. Relates #27265
If an out of memory error is thrown while merging, today we quietly rewrap it into a merge exception and the out of memory error is lost. Instead, we need to rethrow out of memory errors, and in fact any fatal error here, and let those go uncaught so that the node is torn down. This commit causes this to be the case.
Relates #19272