
Avoid bubbling up failures from a shard that is recovering #42287

Merged (5 commits, May 22, 2019)

Conversation

ywelsch
Contributor

@ywelsch ywelsch commented May 21, 2019

A shard that is undergoing peer recovery is subject to logging warnings of the form

org.elasticsearch.action.FailedNodeException: Failed node [XYZ]
...
Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in ...

(see e.g. #30919)

These failures are harmless and expected while a peer recovery is ongoing (i.e. there is an IndexShard instance, but no proper IndexCommit just yet).
As these failures are currently bubbled up to the master, they cause unnecessary reroutes and confuse users by being logged as warnings.

Closes #40107
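The fix follows the pattern visible in the diff below: treat the expected "no segments yet" exception on a recovering shard as "no files to report" rather than a node-level failure. A minimal self-contained sketch of that idea, using hypothetical stand-in classes (not the actual Lucene/Elasticsearch types):

```java
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for org.apache.lucene.index.IndexNotFoundException.
class IndexNotFoundException extends RuntimeException {
    IndexNotFoundException(String msg) { super(msg); }
}

// Hypothetical stand-in for the store-files response.
class StoreFilesMetaData {
    final String shardId;
    final List<String> files;
    StoreFilesMetaData(String shardId, List<String> files) {
        this.shardId = shardId;
        this.files = files;
    }
    boolean isEmpty() { return files.isEmpty(); }
}

public class RecoveringShardSketch {
    // Simulates reading the on-disk commit; a recovering shard has no
    // segments_N file yet, which Lucene surfaces as IndexNotFoundException.
    static List<String> readCommit(boolean recovering) {
        if (recovering) {
            throw new IndexNotFoundException("no segments* file found");
        }
        return List.of("segments_1", "_0.cfs");
    }

    // The fix: swallow the expected exception and answer with an empty
    // snapshot instead of propagating a failure up to the master.
    static StoreFilesMetaData listStoreFiles(String shardId, boolean recovering) {
        try {
            return new StoreFilesMetaData(shardId, readCommit(recovering));
        } catch (IndexNotFoundException e) {
            // trace-level log in the real code; no warning, no reroute
            return new StoreFilesMetaData(shardId, Collections.emptyList());
        }
    }

    public static void main(String[] args) {
        System.out.println(listStoreFiles("[idx][0]", true).isEmpty());
        System.out.println(listStoreFiles("[idx][0]", false).isEmpty());
    }
}
```

An empty snapshot is a valid answer to the master's shard-store fetch, so it triggers no retry loop, whereas the propagated exception previously did.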

@ywelsch ywelsch added >non-issue :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v8.0.0 v7.2.0 v6.7.3 v7.1.1 labels May 21, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

Contributor

@DaveCTurner DaveCTurner left a comment


Looks good (NB earlier versions of Lucene didn't throw an IndexNotFoundException here, so you may see FileNotFoundException in older versions). However, I think we should have a test for this change in behaviour.

@dnhatn
Member

dnhatn commented May 21, 2019

This PR should fix #40107 (hence we should treat this as a bug). I had a test for #40107, but I lost it. I will reconstruct and add that test here.

@ywelsch ywelsch added >bug and removed >non-issue labels May 21, 2019
@ywelsch
Contributor Author

ywelsch commented May 21, 2019

thanks @dnhatn

@ywelsch ywelsch requested a review from dnhatn May 21, 2019 17:13
Member

@dnhatn dnhatn left a comment


I left a comment but LGTM.

return storeFilesMetaData;
} catch (org.apache.lucene.index.IndexNotFoundException e) {
logger.trace(new ParameterizedMessage("[{}] node is missing index, responding with empty", shardId), e);
return new StoreFilesMetaData(shardId, Store.MetadataSnapshot.EMPTY);
Member

Should we only ignore the exception if shard is recovering?

Contributor

Would be nice to only have this try-catch in the no-engine case, since it should not happen if there is an engine. The same pattern is also done in 2 other places already, so could be nice to add a single method to handle this in the same way whenever needed.
I see no problems in doing this always in this specific case, so please regard this optional.
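The single shared method suggested here could look like the following sketch (hypothetical names; the real helper would live alongside the existing call sites and use the actual Lucene exception):

```java
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical stand-in for org.apache.lucene.index.IndexNotFoundException.
class IndexNotFoundException extends RuntimeException {
    IndexNotFoundException(String msg) { super(msg); }
}

public class CommitAccess {
    // Shared helper: run a commit-reading action and map the expected
    // "no segments* file yet" failure to Optional.empty() in one place,
    // so the call sites don't each repeat the same try-catch.
    static <T> Optional<T> ignoringMissingCommit(Supplier<T> action) {
        try {
            return Optional.of(action.get());
        } catch (IndexNotFoundException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Optional<String> ok = ignoringMissingCommit(() -> "segments_1");
        Optional<String> missing = ignoringMissingCommit(() -> {
            throw new IndexNotFoundException("no segments* file found");
        });
        System.out.println(ok.orElse("<empty>"));
        System.out.println(missing.orElse("<empty>"));
    }
}
```

Centralizing the catch would also make it easy to restrict it to the no-engine case later, as the comment suggests.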

Contributor Author

I will leave this to a follow-up to explore. I'm generally not happy with how we access the store on a shard that is recovering. Also accessing last commit instead of safe commit feels wrong as well for some of the callers.

Contributor

@henningandersen henningandersen left a comment


LGTM.

assertAcked(client().admin().indices().prepareUpdateSettings(indexName).setSettings(Settings.builder()
.put(IndexMetaData.SETTING_NUMBER_OF_REPLICAS, 2)
.putNull("index.routing.allocation.include._name")));
try {
Contributor

I think this try should be moved to line 938 so that we also release the latch if this test fails in other unexpected ways.
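The concern here is the standard latch-release pattern: the countDown() must sit in a finally block that covers everything after the latch is created, so an unexpected test failure cannot leave waiters hanging. A generic sketch (not the test's actual code):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchSafety {
    // Generic illustration: release the latch even if the body throws,
    // otherwise any thread awaiting it blocks until its timeout expires.
    static void runGuarded(CountDownLatch latch, Runnable body) {
        try {
            body.run();
        } finally {
            latch.countDown(); // always release, even on unexpected failure
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(1);
        try {
            runGuarded(latch, () -> { throw new AssertionError("test failed early"); });
        } catch (AssertionError expected) {
            // the test failure still propagates to the test framework...
        }
        // ...but the latch was released, so waiters do not hang:
        System.out.println(latch.await(1, TimeUnit.SECONDS)); // true
    }
}
```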

Contributor Author

agree, fixed in 2a52487

@ywelsch ywelsch merged commit 3b67d87 into elastic:master May 22, 2019
ywelsch added a commit that referenced this pull request May 22, 2019
ywelsch added a commit that referenced this pull request May 22, 2019
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this pull request May 27, 2019
@jala-dx

jala-dx commented Aug 8, 2019

Hi, I see from logs

Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in store(MMapDirectory@/elasticsearch/data/nodes/0/indices/ZIqXS2f-SWKLvR40ocIg4Q/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@2cc763ad): files: [recovery.cxml8-lOTIWRft_jB0iZ3A._j.dii, recovery.cxml8-lOTIWRft_jB0iZ3A._j.fdx, recovery.cxml8-lOTIWRft_jB0iZ3A._j_1.liv, recovery.cxml8-lOTIWRft_jB0iZ3A._p.si, recovery.cxml8-lOTIWRft_jB0iZ3A._s.si, recovery.cxml8-lOTIWRft_jB0iZ3A._u.si, write.lock]

Is the fix above needed to address this issue?

Thanks for the help.

Regards
Jala

@jala-dx

jala-dx commented Aug 8, 2019

Elasticsearch Version: 6.2.4

@DaveCTurner
Contributor

Yes @jala-dx. The message is misleading, in fact we sometimes expect there to be no segments* file, so Elasticsearch is wrong to warn you about this. The fix is not to log this message, which is implemented in versions ≥7.2.0 and also versions ≥6.8.1 and <7.0.0. The workaround in other versions is for you to ignore this message.

Labels
>bug :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) v6.8.1 v7.2.0 v8.0.0-alpha1
Successfully merging this pull request may close these issues.

New master repeatedly reroute and fetch shard store of recovering replica
7 participants