[Bugfix] Fix NPE in ReplicaShardAllocator (#13993) #14385
Merged
dblock merged 3 commits into opensearch-project:main from DaniilRoman:fix-npe-in-replicashardallocator on Jul 18, 2024

@DaniilRoman do we know why this assertion had to be removed? @dblock In general I am not a fan of removing assertions; they exist in the system for a reason, and we shouldn't let our guard down.

@Bukhtawar the assertion has been replaced by an explicit null check, returning `null` instead of throwing an exception. Either way the code will not proceed past this point. There is a previous `return null` at line 98, so it seems null is already an expected return value.

@dbwiddis If I am understanding this correctly, there are two things:
1. `cancelExistingRecoveryForBetterMatch` is expected to return a `Runnable` to denote some work to cancel an existing recovery. A `null` return value denotes that no work is supposed to be done for cancelling.
2. `primary` should be non-null, since the code path is supposed to kick in for an initialising replica copy, which can only be initialising once its corresponding `primary` has been assigned; see the decider.
So there are two different things here, and equating them is IMO incorrect. Removing the assertion essentially says that at this point of execution a primary shard can be null, when the system was designed to ensure that it can't?
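
The distinction drawn above can be sketched in a few lines of Java. This is a minimal illustration, not the code in `ReplicaShardAllocator`: `CancelSketch`, `cancelIfBetterMatch`, the `String` stand-in for a shard, and the choice of `IllegalStateException` are all hypothetical. It only shows a `null` return meaning "no cancellation work" kept separate from a missing primary, which is treated as a broken invariant.

```java
final class CancelSketch {
    /**
     * Hypothetical stand-in for a method like cancelExistingRecoveryForBetterMatch.
     * Returns a Runnable when there is cancellation work, or null when there is none.
     */
    static Runnable cancelIfBetterMatch(String primary, boolean betterMatchFound) {
        // Invariant (assumption for this sketch): a replica can only be
        // initializing once its primary is assigned, so a null primary is a bug.
        if (primary == null) {
            throw new IllegalStateException("initializing replica without an assigned primary");
        }
        if (!betterMatchFound) {
            // Legitimate outcome: nothing to cancel.
            return null;
        }
        return () -> System.out.println("cancelling recovery from " + primary);
    }

    public static void main(String[] args) {
        Runnable work = cancelIfBetterMatch("primary-0", true);
        if (work != null) {
            work.run();
        }
        System.out.println("no better match: " + cancelIfBetterMatch("primary-0", false));
    }
}
```

Whether the invariant is guarded by an `assert` (as the thread describes for the original code) or by an explicit exception as in this sketch is a separate design question.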

I haven't dug into the logic, but on the face of it, the previous situation is that the assertion causes an exception, and the exception isn't handled.
I've no objection to reverting this change if you think the exception should be caught and addressed elsewhere.

Unfortunately, assertions don't run outside tests, and the stack trace doesn't show an assertion tripping. Even if an assertion is failing, the right thing to do is investigate why it is failing rather than bury the problem by removing the assertion (invariant) that was added.

So I see, looking again... sorry. The first time I was just looking for a line number. But now I'm really confused. For an NPE to occur on line 108 on a null `primaryShard`, it would have had to pass the assertions on lines 106 and 107.
OpenSearch/server/src/main/java/org/opensearch/gateway/ReplicaShardAllocator.java
Lines 106 to 108 in 12115d1
Is this just undefined/unpredictable behavior during JVM shutdown (node drop)?
In any case, it seems a revert of this PR is the right action at this point.

Sorry, I don't think this is the correct understanding of how `assert` works. AssertionErrors are only configured for tests using the `-ea` JVM flag, and in this case I don't think an `AssertionError` was triggered. Please feel free to read up more on this here.
Thanks for your understanding.
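
To make the `-ea` point concrete, here is a small self-contained sketch. The class and method names (`AssertDemo`, `assertionsEnabled`, `failureKind`) are made up for illustration, and a `String` stands in for a shard object. Run without `-ea`, the `assert` is skipped and the null is only noticed at the dereference; run with `java -ea`, the `assert` fires first.

```java
final class AssertDemo {
    /** Returns true only when this class runs with assertions enabled (for example, java -ea). */
    static boolean assertionsEnabled() {
        boolean enabled = false;
        // The assignment inside the assert executes only if assertions are enabled.
        assert enabled = true;
        return enabled;
    }

    /** Reports which error a null argument produces under the current JVM settings. */
    static String failureKind(String primaryShard) {
        try {
            assert primaryShard != null : "primaryShard must not be null";
            // With assertions disabled, execution reaches this dereference.
            return "length=" + primaryShard.length();
        } catch (AssertionError | NullPointerException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println("assertions enabled: " + assertionsEnabled());
        System.out.println("null argument fails with: " + failureKind(null));
    }
}
```

The same source therefore fails in two different ways depending on how the JVM was started, which is consistent with a production stack trace showing an NPE rather than an `AssertionError`.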

Got it, and thanks for educating me.
So wouldn't an appropriate fix for the NPE then be to both put the `assert` back and add a null check to avoid the exception? That seems odd, but I've never really used assertions like this, so at this point I really don't know the best approach.

@dbwiddis The appropriate fix is to figure out why the invariant is being violated and fix that bug, i.e. if no primary exists, then there can be no recovery and this method should not be called. The previous behavior of having the code fail with an NPE is preferable to ignoring a situation that is not intended to be possible.
Slightly off topic, but in general I really dislike using the `assert` keyword. Instead of `assert primaryShard != null : "message"` I almost always prefer something like `Objects.requireNonNull(primaryShard, "message")` so that it deterministically fails in all situations. Only if the code is super performance critical and the assertion check is computationally intensive do I think the `assert` keyword makes sense.
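
A minimal sketch of the `Objects.requireNonNull` alternative mentioned above. `RequireNonNullDemo` and `describePrimary` are hypothetical names, and a `String` stands in for the shard; the only library call used is `java.util.Objects.requireNonNull(T, String)`, which throws a `NullPointerException` carrying the given message.

```java
import java.util.Objects;

final class RequireNonNullDemo {
    /** Hypothetical stand-in for a method that needs a non-null primary shard. */
    static String describePrimary(String primaryShard) {
        // Unlike assert, this check runs whether or not the JVM was started with -ea.
        Objects.requireNonNull(primaryShard, "primaryShard must not be null");
        return "primary=" + primaryShard;
    }

    public static void main(String[] args) {
        System.out.println(describePrimary("shard-0"));
        try {
            describePrimary(null);
        } catch (NullPointerException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The failure is still an NPE, but it is raised at the check with a descriptive message, in tests and in production alike.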

@andrross In general I agree.
In reality, I read the issue reporting this bug and it indicates it's happening during a node drop. Which means the JVM is shutting down for <insert reason here>. Which means undefined/unpredictable behavior.
TL;DR: I'm not sure there is a bug, and the symptom may be a result of a JVM being shut down, causing a node drop as a result of some totally unrelated thing.
There's not enough information in the issue to infer more.