elasticsearch-shard remove-corrupted-data doesn't work on missing metadata #47435

Closed · ct0br0 opened this issue Oct 2, 2019 · 16 comments
Labels: :Distributed Coordination/Cluster Coordination, feedback_needed

@ct0br0 commented Oct 2, 2019

elasticsearch-shard appears to be the tool for removing corrupted metadata.
This has happened to us several times since updating past 7.0.0.

Issue: a shard's directory structure and files are either deleted or never created, and elasticsearch-shard (remove-corrupted-data) cannot remove the shard from the metadata.

Steps:
1. Recreate the directory structure (elasticsearch-shard errors out with "directory must exist" if it does not exist).
2. Run elasticsearch-shard remove-corrupted-data (it hits a NullPointerException, because only the empty directories exist).

/usr/share/elasticsearch/bin $ ./elasticsearch-shard remove-corrupted-data --index dce_rpc-2019.08.28 --shard-id 24 -d /data/nsm/elasticsearch/nodes/0/indices/TUa5c332RFGKmM6yZSK-Rw/0/index
ERROR StatusLogger No Log4j 2 configuration file found. Using default configuration (logging only errors to the console), or user programmatically provided configurations. Set system property 'log4j2.debug' to show Log4j 2 internal initialization logging. See https://logging.apache.org/log4j/2.x/manual/configuration.html for instructions on how to configure Log4j 2

WARNING: Elasticsearch MUST be stopped before running this tool.

Please make a complete backup of your index before using this tool.


Exception in thread "main" java.lang.NullPointerException
at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommand.findAndProcessShardPath(RemoveCorruptedShardDataCommand.java:152)
at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommand.execute(RemoveCorruptedShardDataCommand.java:282)
at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86)
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124)
at org.elasticsearch.cli.MultiCommand.execute(MultiCommand.java:77)
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124)
at org.elasticsearch.cli.Command.main(Command.java:90)
at org.elasticsearch.index.shard.ShardToolCli.main(ShardToolCli.java:35)

/usr/share/elasticsearch/bin $ ls /data/nsm/elasticsearch/nodes/0/indices/TUa5c332RFGKmM6yZSK-Rw/0/
index _state translog

What I'd expect:
To be able to remove the bad shard, rather than having to rm -rf the entire node's data and rely on replicas.

Hopefully there's just an error in my steps.

Elasticsearch 7.3.0 (no plugins)
Oracle Linux 7.6
Network drives (vSAN) for Elasticsearch storage (though this also happens on physical boxes with Docker containers)

DaveCTurner added the :Distributed Coordination/Cluster Coordination label on Oct 2, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

@DaveCTurner (Contributor)

Sorry @ct0br0, I cannot reproduce this from the instructions you have given because they are too vague. What do you mean by "recreate the directory structure"? Can you share the sequence of specific commands you are running?

@ct0br0 (Author) commented Oct 2, 2019

That last part (the directory structure) I had to create myself, because it is either never created or has been deleted, while the metadata referencing it isn't deleted with it:

mkdir -p /data/nsm/elasticsearch/nodes/0/indices/TUa5c332RFGKmM6yZSK-Rw/0/{index,_state,translog}

@DaveCTurner (Contributor)

You should absolutely never modify the contents of the data path yourself. Can you go back a few steps and describe why you are doing this?

@ct0br0 (Author) commented Oct 2, 2019

If you were to take a directory that has shard data in it, nuke everything, and then re-create the empty directories, that is the state we end up in after planned or unplanned outages (even just stopping the Elasticsearch service can trigger it).

@ct0br0 (Author) commented Oct 2, 2019

I don't know how else to explain it: the data isn't there. Either it was never there and bad metadata was created, or it was in the middle of being deleted and the metadata wasn't updated. It usually happens when Elasticsearch stops, but it has also happened while running, and has been the reason Elasticsearch stopped before too. Physical drives, networked drives, it doesn't matter.

@DaveCTurner (Contributor)

I'm struggling to follow what you are trying to describe. Please slow down. Are you deleting things from the data path yourself or are you saying that this happens on its own? If it happens on its own then that's unexpected and we should address that. Can you share the logs from such a case?

@ct0br0 (Author) commented Oct 2, 2019

Yes, on its own. From the Elasticsearch log output:

java.io.IOException: failed to find metadata for existing index dce_rpc-2019.08.28 [location: TUa5c332RFGKmM6yZSK-Rw, generation: 34]
Caused by: java.io.IOException: failed to find metadata for existing index dce_rpc-2019.08.28 [location: TUa5c332RFGKmM6yZSK-Rw, generation: 34]
org.elasticsearch.bootstrap.StartupException: ElasticsearchException[failed to bind service]; nested: IOException[failed to find metadata for existing index dce_rpc-2019.08.28 [location: TUa5c332RFGKmM6yZSK-Rw, generation: 34]];
Caused by: java.io.IOException: failed to find metadata for existing index dce_rpc-2019.08.28 [location: TUa5c332RFGKmM6yZSK-Rw, generation: 34]

The fact that this is happening at all is most likely a separate issue from the behavior of elasticsearch-shard remove-corrupted-data.

@DaveCTurner (Contributor)

Ok, this means that some metadata is not where it should be. I wouldn't expect elasticsearch-shard remove-corrupted-data to help at all here, because this tool is only interested in corrupted data, not metadata.

Can you share a diagnostics bundle from your cluster please?

@ct0br0 (Author) commented Oct 2, 2019

It looks like you have to build it? If that's the case, I cannot run it in our environment.

Can this be a feature request then: add a remove-shard option? Or would that be a problem with how Elasticsearch works?

@DaveCTurner (Contributor)

OK, if you can't provide diagnostics then can you tell us a lot more about your cluster, about the node that's affected, and about the dce_rpc-2019.08.28 index? Things like the node's elasticsearch.yml and the outputs of GET _nodes, GET _nodes/stats, and GET dce_rpc-2019.08.28/_settings would all be useful.
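For reference, a minimal sketch of collecting those outputs with curl; the host, port, and user below are placeholder assumptions that need adjusting for this cluster (security and HTTPS are enabled, and -k matches the verification_mode: none setting shown later in the thread):

# assumed host, port, and user; curl prompts for the password
curl -k -u elastic 'https://localhost:9200/_nodes?pretty' > nodes.json
curl -k -u elastic 'https://localhost:9200/_nodes/stats?pretty' > nodes_stats.json
curl -k -u elastic 'https://localhost:9200/dce_rpc-2019.08.28/_settings?pretty' > dce_rpc_settings.json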

The fix for this isn't to add a tool to clean up some mess, it's to prevent the mess from happening in the first place. And for that we need to understand how it's happening.

@ct0br0 (Author) commented Oct 2, 2019

Sure thing. I'm not sure I can upload the node information with hostnames and IPs on here, though.

The cluster was 73 data nodes; a few went bad and we are currently at 69.
3 master nodes
2 coordinating nodes
12 Logstash instances

The dce_rpc index is from Bro/Zeek (https://github.com/zeek), fed in by Filebeat, but various other indexes have had this issue.

elasticsearch.yml


cluster.name: nsm
node.name: removed-data-node
path.data: "/data/nsm/elasticsearch"
path.logs: "/data/nsm/log/elasticsearch"
network.host: 0.0.0.0
http.port: '9200'
transport.tcp.port: '9300'
discovery.seed_hosts:
  - removed-docker-master1:9304
  - removed-docker-master2:9304
  - removed-docker-master3:9304
discovery.zen.minimum_master_nodes: '2'
cluster.join.timeout: '120s'
node.ml: false
node.data: true
node.ingest: true
node.master: false

xpack.security.http.ssl.key: /etc/elasticsearch/certs/host.key
xpack.security.http.ssl.certificate: /etc/elasticsearch/certs/host.pem
xpack.security.http.ssl.certificate_authorities: /etc/elasticsearch/certs/chain.pem
xpack.security.enabled: true
xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.verification_mode: none
xpack.security.transport.ssl.key: /etc/elasticsearch/certs/host.key
xpack.security.transport.ssl.certificate: /etc/elasticsearch/certs/host.pem
xpack.security.transport.ssl.certificate_authorities: /etc/elasticsearch/certs/chain.pem

xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: none
xpack.watcher.index.rest.direct_access: true
xpack.security.transport.ssl.supported_protocols: [ "TLSv1.2", "TLSv1.1"]
xpack.security.http.ssl.supported_protocols: [ "TLSv1.2", "TLSv1.1"]

dce_rpc-2019.08.28 settings:

"dce_rpc-2019.08.28" : {
"settings" : {
"index" : {
"verified_before_close" : "true",
"blocks" : {
"write" : "true"
},
"provided_name" : "dce_rpc-2019.08.28",
"frozen" : "true",
"creation_date" : "1566950340886",
"priority" : "0",
"number_of_replicas" : "5",
"uuid" : "TUa5c332RFGKmM6yZSK-Rw",
"version" : {
"created" : "7030099"
},
"lifecycle" : {
"name" : "suricata"
},
"routing" : {
"allocation" : {
"total_shards_per_node" : "1"
}
},
"search" : {
"throttled" : "true"
},
"number_of_shards" : "1"
}
}
}
}

@DaveCTurner (Contributor)

Thanks, that is very helpful. We think this could be another instance of #47276, because this index is frozen and was therefore (briefly) a closed replicated index. The fix is #47285 and, in the meantime, #47276 (comment) describes a possible workaround.

@ct0br0 (Author) commented Oct 2, 2019

We just have to remove the manifest*.st file?
Will try next time. Thanks, Dave!
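For reference, a minimal sketch of locating that file first, assuming it lives under the node-level _state directory of the data path used above (an assumption; the linked workaround comment in #47276 is authoritative). Stop Elasticsearch and take a full backup of the data path before removing anything:

# assumed location of the node-level manifest file; verify before deleting
find /data/nsm/elasticsearch/nodes/0/_state -maxdepth 1 -name 'manifest*.st'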

@ct0br0 (Author) commented Oct 7, 2019

Hey, wanted to let you know that works fantastically. Thanks for the workaround!

@ebadyano (Contributor)

Closing, as the user confirmed that the workaround worked for them.
