Elasticsearch fails to start with error: "Failed to find metadata for index" on every restart #47276
Pinging @elastic/es-distributed
Hi @redbaron4. Thanks for reporting this. A few questions:
I have now managed to reproduce this with data-only nodes:
The same should not happen with master-eligible data nodes though. Can you clarify that point for us? If this only affects data-only nodes, we might be able to provide instructions on how to get the node running again without losing data.
@ywelsch Thanks for looking at this.
All data nodes (the kind which show the failure) have a config similar to the one given below.
The master-eligible nodes are:
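A minimal sketch of how these two node types are typically declared in a 7.x `elasticsearch.yml`; the settings below are illustrative assumptions, not the reporter's actual config:

```yaml
# Illustrative sketch only -- assumed values, not the reporter's real config.
# Data-only node (not master-eligible):
node.master: false
node.data: true

# Master-eligible data node:
# node.master: true
# node.data: true
```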
Ok, this confirms my findings. First of all, this is a bug related to how "closed replicated indices" (introduced in 7.2) interact with the index metadata storage mechanism, which has special handling for closed indices. On non-master-eligible data nodes, it's possible for the node's manifest file (which tracks the relevant metadata state that the node should persist) to become out of sync with what's actually stored on disk, leading to an inconsistency that is then detected at startup, preventing the node from starting up. In the meantime, the following workaround is applicable to get the node running again. This workaround should not lead to any data loss. However, great care must be taken before applying it, preferably backing up the data folder on the node before undertaking the following low-level surgery:
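The exact steps vary by installation; a hedged sketch of this kind of surgery, assuming a systemd-managed node with the default data path `/var/lib/elasticsearch` and a single node directory (`nodes/0`), might look like this:

```sh
# Hedged sketch only -- the paths and service name are assumptions; verify them first.
sudo systemctl stop elasticsearch                        # 1. stop the node
sudo cp -a /var/lib/elasticsearch /var/lib/es-backup     # 2. back up the data folder
# 3. the manifest file (manifest-N.st) under the node's _state folder is what
#    tracks which index metadata the node expects to find on disk:
ls /var/lib/elasticsearch/nodes/0/_state/
sudo rm /var/lib/elasticsearch/nodes/0/_state/manifest-*.st  # 4. remove the stale manifest
sudo systemctl start elasticsearch                       # 5. on restart, the node rebuilds
                                                         #    its view from the metadata on disk
```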
Thanks for the workaround. I'll try it the next time we face this situation. I almost had the impulse of removing the manifest file but did not do it. I tried to remove the offending index entries from the manifest file first, which led to consistency-check failures. So I restored the manifest file and desisted from any more tinkering with it :) I hope the bug gets fixed soon.
…-only node (#47285) Fixes a bug related to how "closed replicated indices" (introduced in 7.2) interact with the index metadata storage mechanism, which has special handling for closed indices (but incorrectly handles replicated closed indices). On non-master-eligible data nodes, it's possible for the node's manifest file (which tracks the relevant metadata state that the node should persist) to become out of sync with what's actually stored on disk, leading to an inconsistency that is then detected at startup, refusing for the node to start up. Closes #47276
Elasticsearch version: 7.3.2
Plugins installed: []
JVM version (java -version): 1.8.0
OS version: CentOS 7.4
We have been running an Elasticsearch cluster consisting of 5 nodes for quite some time now. After the upgrade to v7, we have noticed that a lot of the time our nodes refuse to start with an error:

nested: IOException[failed to find metadata for existing index XXX]

The first time I encountered this error, I searched the discuss board and found this post, which talks of stronger startup checks enforced by ES 7.x and points to the data directory getting corrupted due to external factors. Thinking it might be the same problem, I duly took the node offline and ran a disk check, which reported no errors. So I deleted the data directory, started the node, and that was that.
However, the next time I did a rolling upgrade of my cluster, a different node failed with a similar error (the index name was different). I followed the same emergency procedure (delete the data directory and restart the node) and the cluster was fixed.
Now after every rolling upgrade I seem to run into this error on at least one of my nodes. The index name always points to a closed index. The error occurs only on restart (never while Elasticsearch is running).
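Since the failing index is always a closed one, a quick way to see which indices are closed before restarting a node is the `_cat/indices` API (host and port below are assumptions, the defaults on a local node):

```sh
# List each index with its open/closed status (assumes the default HTTP port 9200)
curl -s 'http://localhost:9200/_cat/indices?v&h=index,status'
```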
I find it hard to believe that all 5 of my nodes have a disk problem, because I have run fsck every time this error has occurred and no errors have been reported.

Yesterday we had a power issue at the data center which led to all nodes getting power cycled. Upon restart, 4 out of 5 nodes failed to start with the same errors. On all 4 nodes the names of the indexes were different (the indexes in question were "closed"). I had no option but to delete all data on those 4 nodes (thus losing about 80% of our Elasticsearch data).
The errors seen were:
Is it possible that the data of closed indexes is not being persisted properly (leading to issues at restart)? Can this be mitigated somehow (maybe by rolling back to less strict consistency checks)?