
Avoid bubbling up failures from a shard that is recovering #42287

Merged (5 commits, May 22, 2019)

Conversation

ywelsch
Contributor

@ywelsch ywelsch commented May 21, 2019

A shard that is undergoing peer recovery is subject to logging warnings of the form

org.elasticsearch.action.FailedNodeException: Failed node [XYZ]
...
Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in ...

(see e.g. #30919)

These failures are harmless and expected while a peer recovery is ongoing (i.e. there is an IndexShard instance, but no proper IndexCommit just yet).
As these failures are currently bubbled up to the master, they cause unnecessary reroutes and confuse users by being logged as warnings.

Closes #40107
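The fix follows the pattern visible in the diff below: treat the expected "no segments yet" exception on a recovering shard as "no files to report" rather than a node-level failure. A minimal self-contained sketch of that idea, using hypothetical stand-in classes (not the actual Lucene/Elasticsearch types):

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for org.apache.lucene.index.IndexNotFoundException.
class IndexNotFoundException extends RuntimeException {
    IndexNotFoundException(String msg) { super(msg); }
}

// Hypothetical stand-in for the store-files response.
class StoreFilesMetaData {
    final String shardId;
    final List<String> files;
    StoreFilesMetaData(String shardId, List<String> files) {
        this.shardId = shardId;
        this.files = files;
    }
    boolean isEmpty() { return files.isEmpty(); }
}

public class RecoveringShardSketch {
    // Simulates reading the on-disk commit; a recovering shard has no
    // segments_N file yet, which Lucene surfaces as IndexNotFoundException.
    static List<String> readCommit(boolean recovering) {
        if (recovering) {
            throw new IndexNotFoundException("no segments* file found");
        }
        return List.of("segments_1", "_0.cfs");
    }

    // The fix: swallow the expected exception and answer with an empty
    // snapshot instead of propagating a failure up to the master.
    static StoreFilesMetaData listStoreFiles(String shardId, boolean recovering) {
        try {
            return new StoreFilesMetaData(shardId, readCommit(recovering));
        } catch (IndexNotFoundException e) {
            // trace-level log in the real code; no warning, no reroute
            return new StoreFilesMetaData(shardId, Collections.emptyList());
        }
    }

    public static void main(String[] args) {
        System.out.println(listStoreFiles("[idx][0]", true).isEmpty());
        System.out.println(listStoreFiles("[idx][0]", false).isEmpty());
    }
}
```

An empty snapshot is a valid answer to the master's shard-store fetch, so it triggers no retry loop, whereas the propagated exception previously did.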

@ywelsch ywelsch added >non-issue :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v8.0.0 v7.2.0 v6.7.3 v7.1.1 labels May 21, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

Contributor

@DaveCTurner DaveCTurner left a comment


Looks good (NB earlier versions of Lucene didn't throw an IndexNotFoundException here, so you may see FileNotFoundException in older versions). However, I think we should have a test for this change in behaviour.

@dnhatn
Member

dnhatn commented May 21, 2019

This PR should fix #40107 (hence we should treat this as a bug). I had a test for #40107, but I lost it. I will reconstruct and add that test here.

@ywelsch ywelsch added >bug and removed >non-issue labels May 21, 2019
@ywelsch
Contributor Author

ywelsch commented May 21, 2019

thanks @dnhatn

@ywelsch ywelsch requested a review from dnhatn May 21, 2019 17:13
Member

@dnhatn dnhatn left a comment


I left a comment but LGTM.

return storeFilesMetaData;
} catch (org.apache.lucene.index.IndexNotFoundException e) {
logger.trace(new ParameterizedMessage("[{}] node is missing index, responding with empty", shardId), e);
return new StoreFilesMetaData(shardId, Store.MetadataSnapshot.EMPTY);
Member

Should we only ignore the exception if shard is recovering?

Contributor

Would be nice to only have this try-catch in the no-engine case, since it should not happen if there is an engine. The same pattern is also done in 2 other places already, so could be nice to add a single method to handle this in the same way whenever needed.
I see no problems in doing this always in this specific case, so please regard this optional.
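The single shared method suggested here could look like the following sketch (hypothetical names; the real helper would live alongside the existing call sites and use the actual Lucene exception):

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical stand-in for org.apache.lucene.index.IndexNotFoundException.
class IndexNotFoundException extends RuntimeException {
    IndexNotFoundException(String msg) { super(msg); }
}

public class CommitAccess {
    // Shared helper: run a commit-reading action and map the expected
    // "no segments* file yet" failure to Optional.empty() in one place,
    // so the call sites don't each repeat the same try-catch.
    static <T> Optional<T> ignoringMissingCommit(Supplier<T> action) {
        try {
            return Optional.of(action.get());
        } catch (IndexNotFoundException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Optional<String> ok = ignoringMissingCommit(() -> "segments_1");
        Optional<String> missing = ignoringMissingCommit(() -> {
            throw new IndexNotFoundException("no segments* file found");
        });
        System.out.println(ok.orElse("<empty>"));
        System.out.println(missing.orElse("<empty>"));
    }
}
```

Centralizing the catch would also make it easy to restrict it to the no-engine case later, as the comment suggests.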

Contributor Author

I will leave this to a follow-up to explore. I'm generally not happy with how we access the store on a shard that is recovering. Also accessing last commit instead of safe commit feels wrong as well for some of the callers.

Contributor

@henningandersen henningandersen left a comment


LGTM.

assertAcked(client().admin().indices().prepareUpdateSettings(indexName).setSettings(Settings.builder()
.put(IndexMetaData.SETTING_NUMBER_OF_REPLICAS, 2)
.putNull("index.routing.allocation.include._name")));
try {
Contributor

I think this try should be moved to line 938 so that we also release the latch if this test fails in other unexpected ways.
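The concern here is the standard latch-release pattern: the countDown() must sit in a finally block that covers everything after the latch is created, so an unexpected test failure cannot leave waiters hanging. A generic sketch (not the test's actual code):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchSafety {
    // Generic illustration: release the latch even if the body throws,
    // otherwise any thread awaiting it blocks until its timeout expires.
    static void runGuarded(CountDownLatch latch, Runnable body) {
        try {
            body.run();
        } finally {
            latch.countDown(); // always release, even on unexpected failure
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        try {
            runGuarded(latch, () -> { throw new AssertionError("test failed early"); });
        } catch (AssertionError expected) {
            // the test failure still propagates to the test framework...
        }
        // ...but the latch was released, so waiters do not hang:
        System.out.println(latch.await(1, TimeUnit.SECONDS)); // true
    }
}
```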

Contributor Author

agree, fixed in 2a52487

@ywelsch ywelsch merged commit 3b67d87 into elastic:master May 22, 2019
ywelsch added a commit that referenced this pull request May 22, 2019
ywelsch added a commit that referenced this pull request May 22, 2019
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
@jala-dx

jala-dx commented Aug 8, 2019

Hi, I see from logs

Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in store(MMapDirectory@/elasticsearch/data/nodes/0/indices/ZIqXS2f-SWKLvR40ocIg4Q/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@2cc763ad): files: [recovery.cxml8-lOTIWRft_jB0iZ3A._j.dii, recovery.cxml8-lOTIWRft_jB0iZ3A._j.fdx, recovery.cxml8-lOTIWRft_jB0iZ3A._j_1.liv, recovery.cxml8-lOTIWRft_jB0iZ3A._p.si, recovery.cxml8-lOTIWRft_jB0iZ3A._s.si, recovery.cxml8-lOTIWRft_jB0iZ3A._u.si, write.lock]

Is the fix above needed to address this issue?

Thanks for the help.

Regards
Jala

@jala-dx

jala-dx commented Aug 8, 2019

Elasticsearch Version: 6.2.4

@DaveCTurner
Contributor

Yes @jala-dx. The message is misleading, in fact we sometimes expect there to be no segments* file, so Elasticsearch is wrong to warn you about this. The fix is not to log this message, which is implemented in versions ≥7.2.0 and also versions ≥6.8.1 and <7.0.0. The workaround in other versions is for you to ignore this message.

Labels
>bug :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v6.8.1 v7.2.0 v8.0.0-alpha1
Successfully merging this pull request may close these issues.

New master repeatedly reroute and fetch shard store of recovering replica
7 participants