Elasticsearch fails to start with error: "Failed to find metadata for index" on every restart #47276

Closed
redbaron4 opened this issue Sep 30, 2019 · 6 comments · Fixed by #47285
Labels
>bug :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection.

Comments

@redbaron4

redbaron4 commented Sep 30, 2019

Elasticsearch version: 7.3.2

Plugins installed: []

JVM version (java -version): 1.8.0

OS version: CentOS 7.4

We have been running an Elasticsearch cluster consisting of 5 nodes for quite some time now. After upgrading to v7, we have noticed that quite often our nodes refuse to start with
an error nested: IOException[failed to find metadata for existing index XXX].

The first time I encountered this error, I searched the Discuss board and found this, which talks of stronger startup checks enforced by ES 7.x and points to the data directory getting corrupted due to external factors. Thinking it might be the same problem, I duly took the node offline and ran a disk check, which reported no errors. So I deleted the data directory, started the node, and that was that.

However, the next time I did a rolling upgrade of my cluster, a different node failed with a similar error (the index name was different). I followed the same emergency procedure (delete the data directory and restart the node) and the cluster was fixed.

Now after every rolling upgrade I seem to run into this error on at least one of my nodes. The index name always points to a closed index. The error occurs only on restart (never while Elasticsearch is running).

I find it hard to believe that all 5 of my nodes have a disk problem because:

  • I have run fsck every time this error has occurred and no errors have been reported.
  • Elasticsearch runs without a problem for days on end (a disk error or other programs corrupting the data would cause a running Elasticsearch to crash, as happened on one of my nodes about a year back).

Yesterday we had a power issue at the data center which led to all nodes getting power cycled. Upon restart, 4 out of 5 nodes failed to start with the same error. On all 4 nodes the index names were different (the indices in question were "closed"). I had no option but to delete all data on those 4 nodes (thus losing about 80% of the Elasticsearch data).

The errors seen were

[2019-09-30T10:36:58,205][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [esnode3] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: ElasticsearchException[failed to bind service]; nested: IOException[failed to find metadata for existing index ssl-2019.09.20 [location: GmslGWkHTLGQowmMHFut7A, generation: 11]];
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:163) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:150) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124) ~[elasticsearch-cli-7.3.2.jar:7.3.2]
        at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-cli-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:115) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92) ~[elasticsearch-7.3.2.jar:7.3.2]
Caused by: org.elasticsearch.ElasticsearchException: failed to bind service
        at org.elasticsearch.node.Node.<init>(Node.java:617) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.node.Node.<init>(Node.java:258) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:349) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-7.3.2.jar:7.3.2]
        ... 6 more
Caused by: java.io.IOException: failed to find metadata for existing index ssl-2019.09.20 [location: GmslGWkHTLGQowmMHFut7A, generation: 11]
        at org.elasticsearch.gateway.MetaStateService.loadFullState(MetaStateService.java:99) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.gateway.GatewayMetaState.upgradeMetaData(GatewayMetaState.java:141) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.gateway.GatewayMetaState.<init>(GatewayMetaState.java:95) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.node.Node.<init>(Node.java:492) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.node.Node.<init>(Node.java:258) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:349) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-7.3.2.jar:7.3.2]
        ... 6 more
[2019-09-30T10:36:58,210][INFO ][o.e.x.m.p.NativeController] [esnode3] Native controller process has stopped - no new native processes can be started

and

[2019-09-30T10:39:59,737][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [esnode2] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: ElasticsearchException[failed to bind service]; nested: IOException[failed to find metadata for existing index dns-2019.09.22 [location: ZMenLry9Qxe5-2-XNrWj2A, generation: 15]];
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:163) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:150) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124) ~[elasticsearch-cli-7.3.2.jar:7.3.2]
        at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-cli-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:115) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:92) ~[elasticsearch-7.3.2.jar:7.3.2]
Caused by: org.elasticsearch.ElasticsearchException: failed to bind service
        at org.elasticsearch.node.Node.<init>(Node.java:617) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.node.Node.<init>(Node.java:258) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:349) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-7.3.2.jar:7.3.2]
        ... 6 more
Caused by: java.io.IOException: failed to find metadata for existing index dns-2019.09.22 [location: ZMenLry9Qxe5-2-XNrWj2A, generation: 15]
        at org.elasticsearch.gateway.MetaStateService.loadFullState(MetaStateService.java:99) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.gateway.GatewayMetaState.upgradeMetaData(GatewayMetaState.java:141) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.gateway.GatewayMetaState.<init>(GatewayMetaState.java:95) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.node.Node.<init>(Node.java:492) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.node.Node.<init>(Node.java:258) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:221) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:349) ~[elasticsearch-7.3.2.jar:7.3.2]
        at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:159) ~[elasticsearch-7.3.2.jar:7.3.2]
        ... 6 more

Is it possible that the data of closed indices is not being persisted properly (leading to issues at restart)? Can this be mitigated somehow (maybe by rolling back to less strict consistency checks)?

@redbaron4 redbaron4 changed the title from "Failed to find metadata for closed index on every restart" to "Elasticsearch fails to start with error: 'Failed to find metadata for index' on every restart" Sep 30, 2019
@ywelsch ywelsch added the :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. label Sep 30, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@ywelsch
Contributor

ywelsch commented Sep 30, 2019

Hi @redbaron4. Thanks for reporting this. A few questions:

  • Are these master-eligible nodes?
  • Are these closed replicated indices (i.e. are there allocated shards for these indices)? Have these indices been closed in 6.x or 7.x?

@ywelsch
Contributor

ywelsch commented Sep 30, 2019

I have now managed to reproduce this with data-only nodes:

public void testRelocatedClosedIndexIssue() throws Exception {
    final String indexName = "closed-index";
    List<String> dataNodes = internalCluster().startDataOnlyNodes(2);
    // allocate shard to first data node
    createIndex(indexName, Settings.builder()
        .put(IndexMetaData.SETTING_NUMBER_OF_SHARDS, 1)
        .put(IndexMetaData.SETTING_NUMBER_OF_REPLICAS, 0)
        .put("index.routing.allocation.include._name", String.join(",", dataNodes.get(0)))
        .build());
    indexRandom(randomBoolean(), randomBoolean(), randomBoolean(), IntStream.range(0, randomIntBetween(0, 50))
        .mapToObj(n -> client().prepareIndex(indexName, "_doc").setSource("num", n)).collect(toList()));
    assertAcked(client().admin().indices().prepareClose(indexName));
    // move single shard to second node
    client().admin().indices().prepareUpdateSettings(indexName).setSettings(Settings.builder()
        .put("index.routing.allocation.include._name", String.join(",", dataNodes.get(1)))).get();
    ensureGreen(indexName);
    // give first node a bit of time to clean up the index (including index metadata)
    Thread.sleep(100);

    internalCluster().fullRestart(); // fails as it can't find the index metadata on the first node
}

The same should not happen with master-eligible data nodes though. Can you clarify that point for us? If this only affects data-only nodes we might be able to provide instructions on how to get the node running again, without losing data.
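For reference, one quick way to check which nodes are master-eligible (a sketch, assuming the cluster's HTTP interface is reachable on the default port 9200 from one of the healthy nodes) is the cat nodes API; in 7.x, master-eligible nodes show an m in the node.role column, while data-only nodes show something like di and a - in the master column:

# list node roles; "m" in node.role means master-eligible
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,node.role,master'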

@redbaron4
Author

redbaron4 commented Sep 30, 2019

@ywelsch Thanks for looking at this

  • These are not master-eligible nodes. There are 3 other master-eligible nodes in the cluster (which don't store any data). While doing a rolling upgrade, the script processes all the master-eligible nodes first and then moves on to the data nodes.

All data nodes (the kind which show the failure) have a config similar to the one given below:

cluster.name: nemo-cluster 

node.name: esnode1

bootstrap.memory_lock: true
network.host: _site:ipv4_
discovery.zen.ping.unicast.hosts: 
    - 10.44.0.43
    - 10.44.0.44
    - 10.44.0.45
    
discovery.zen.minimum_master_nodes: 1

# Fix 9201 for intra cluster comm
transport.port: 9201

path.logs: /var/log/elasticsearch
path.data: /data/elasticsearch

# Settings related to node

node.master: false
node.data: true
node.ingest: true

# Monitoring settings (6.3+)
xpack.monitoring.enabled: true

The master-eligible nodes are 10.44.0.43, .44, and .45, which don't show any failure.

  • Closed indices are mostly created on 7.x. For the latest issue, the indices mentioned in the logs above were created using 7.x

@ywelsch
Contributor

ywelsch commented Sep 30, 2019

Ok, this confirms my findings. First of all, this is a bug related to how "closed replicated indices" (introduced in 7.2) interact with the index metadata storage mechanism, which has special handling for closed indices. On non-master-eligible data nodes, it's possible for the node's manifest file (which tracks the relevant metadata state that the node should persist) to become out of sync with what's actually stored on disk, leading to an inconsistency that is then detected at startup, preventing the node from starting up. We will immediately start working on a bug fix for this.

In the meantime, the following workaround can be applied to get the node running again. This workaround should not lead to any data loss. However, great care must be taken before applying it, preferably by backing up the data folder on the node before undertaking the following low-level surgery:

  • The first step is to make sure that the node is indeed a data-only node. Doing the following on a master-eligible node can and will put the whole cluster at risk of data loss.
  • For every path that is referenced in path.data, remove the file named manifest-N.st (where N is some number), which can be found under nodes/0/_state/ in the respective data path. After removing this file, the node should properly start up again (see the sketch below).
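As an illustration only, a minimal sketch of the workaround as shell commands. It assumes an RPM-style install (config under /etc/elasticsearch, service managed by systemd), a single data path of /data/elasticsearch as in the config above, and the default nodes/0 directory (one node per data path); adjust paths to your setup and keep the backup until the node has rejoined the cluster:

# 1. Confirm the node is data-only (node.master: false) before touching anything
grep -E '^node\.(master|data)' /etc/elasticsearch/elasticsearch.yml

# 2. Stop the node and back up its data path
systemctl stop elasticsearch
cp -a /data/elasticsearch /data/elasticsearch.bak

# 3. Remove the manifest file(s) under _state in every configured data path
find /data/elasticsearch/nodes/0/_state -maxdepth 1 -name 'manifest-*.st' -print -delete

# 4. Start the node again; it should now load the index metadata normally
systemctl start elasticsearch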

@redbaron4
Author

Thanks for the workaround. I'll try it the next time we face this situation. I was tempted to remove the manifest file but did not. I first tried to remove the offending index entries from the manifest file, which led to consistency-check failures, so I restored the manifest file and refrained from any further tinkering with it :)

I hope the bug gets fixed soon.

ywelsch added a commit that referenced this issue Sep 30, 2019
…-only node (#47285)

Fixes a bug related to how "closed replicated indices" (introduced in 7.2) interact with the index
metadata storage mechanism, which has special handling for closed indices (but incorrectly
handles replicated closed indices). On non-master-eligible data nodes, it's possible for the
node's manifest file (which tracks the relevant metadata state that the node should persist) to
become out of sync with what's actually stored on disk, leading to an inconsistency that is then
detected at startup, refusing for the node to start up.

Closes #47276
@ywelsch ywelsch added the >bug label Sep 30, 2019