Avoid bubbling up failures from a shard that is recovering #42287
Conversation
Pinging @elastic/es-distributed
Looks good (NB earlier versions of Lucene didn't throw an INFE here, so you may see FileNotFoundException in older versions). However, I think we should have a test for this change in behaviour.
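Purely as an illustration of what such a test might assert (this is a rough sketch, not the test added in this PR; indexName and the surrounding setup that blocks a replica's peer recovery are assumed from the test context):

    // Hypothetical assertion: with a replica stuck in peer recovery, listing the shard
    // stores should succeed and report no node-level failures, rather than surfacing a
    // FailedNodeException from the node that is still recovering.
    IndicesShardStoresResponse response = client().admin().indices()
            .prepareShardStores(indexName).get();
    assertThat(response.getFailures(), empty());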
thanks @dnhatn
I left a comment but LGTM.
        return storeFilesMetaData;
    } catch (org.apache.lucene.index.IndexNotFoundException e) {
        logger.trace(new ParameterizedMessage("[{}] node is missing index, responding with empty", shardId), e);
        return new StoreFilesMetaData(shardId, Store.MetadataSnapshot.EMPTY);
Should we only ignore the exception if the shard is recovering?
Would be nice to only have this try-catch in the no-engine case, since it should not happen if there is an engine. The same pattern is already used in two other places, so it could be nice to add a single method to handle this in the same way whenever needed.
I see no problems in doing this always in this specific case, so please regard this as optional.
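Purely to illustrate the shared-helper idea, a minimal sketch assuming a CheckedSupplier-based signature; the method name and shape are made up here, not taken from the codebase:

    // Hypothetical helper the three call sites could share: read store metadata and treat
    // a missing Lucene index (no segments_N file yet, e.g. during peer recovery) as an
    // empty snapshot instead of letting the exception bubble up to the master.
    private StoreFilesMetaData readStoreMetaDataOrEmpty(ShardId shardId,
            CheckedSupplier<StoreFilesMetaData, IOException> reader) throws IOException {
        try {
            return reader.get();
        } catch (org.apache.lucene.index.IndexNotFoundException e) {
            logger.trace(new ParameterizedMessage("[{}] node is missing index, responding with empty", shardId), e);
            return new StoreFilesMetaData(shardId, Store.MetadataSnapshot.EMPTY);
        }
    }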
I will leave this to a follow-up to explore. I'm generally not happy with how we access the store on a shard that is recovering, and accessing the last commit instead of the safe commit also feels wrong for some of the callers.
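For reference, a hypothetical sketch of the narrower variant raised above, which would only swallow the exception while the shard is still recovering; the recoveryState()-based check is an assumption for illustration and is not what this PR does:

    } catch (org.apache.lucene.index.IndexNotFoundException e) {
        if (indexShard != null && indexShard.recoveryState() != null
                && indexShard.recoveryState().getStage() != RecoveryState.Stage.DONE) {
            // The shard is still recovering, so a missing segments_N file is expected:
            // answer with an empty snapshot instead of failing the node-level request.
            return new StoreFilesMetaData(shardId, Store.MetadataSnapshot.EMPTY);
        }
        throw e;
    }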
LGTM.
    assertAcked(client().admin().indices().prepareUpdateSettings(indexName).setSettings(Settings.builder()
        .put(IndexMetaData.SETTING_NUMBER_OF_REPLICAS, 2)
        .putNull("index.routing.allocation.include._name")));
    try {
I think this try should be moved to line 938 so that we also release the latch if this test fails in other unexpected ways.
agree, fixed in 2a52487
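As a generic sketch of the pattern being suggested (variable names are placeholders, not the test's actual fields, and a CountDownLatch is assumed to gate the blocked recovery): widening the try block means the latch is released on any failure path, so an unrelated assertion error fails the test directly instead of leaving the blocked recovery to time out.

    CountDownLatch releaseRecovery = new CountDownLatch(1);
    try {
        // update settings, wait for the peer recovery to start, run the assertions ...
    } finally {
        // always unblock the recovery, even if an assertion above failed unexpectedly
        releaseRecovery.countDown();
    }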
A shard that is undergoing peer recovery is subject to logging warnings of the form

    org.elasticsearch.action.FailedNodeException: Failed node [XYZ] ...
    Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in ...

(see e.g. #30919)

These failures are actually harmless, and expected to happen while a peer recovery is ongoing (i.e. there is an IndexShard instance, but no proper IndexCommit just yet). As these failures are currently bubbled up to the master, they cause unnecessary reroutes and confusion amongst users due to being logged as warnings.

Closes #40107
Hi, I see the following in my logs:

    Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in store(MMapDirectory@/elasticsearch/data/nodes/0/indices/ZIqXS2f-SWKLvR40ocIg4Q/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@2cc763ad): files: [recovery.cxml8-lOTIWRft_jB0iZ3A._j.dii, recovery.cxml8-lOTIWRft_jB0iZ3A._j.fdx, recovery.cxml8-lOTIWRft_jB0iZ3A._j_1.liv, recovery.cxml8-lOTIWRft_jB0iZ3A._p.si, recovery.cxml8-lOTIWRft_jB0iZ3A._s.si, recovery.cxml8-lOTIWRft_jB0iZ3A._u.si, write.lock]

Is the fix above needed to address this issue? Thanks for the help. Regards
Elasticsearch Version: 6.2.4
Yes @jala-dx. The message is misleading; in fact we sometimes expect there to be no segments file while a peer recovery is ongoing.