elasticsearch-shard remove-corrupted-data doesn't work on missing metadata #47435

Closed · ct0br0 opened this issue Oct 2, 2019 · 16 comments
Labels: :Distributed Coordination/Cluster Coordination, feedback_needed

@ct0br0 commented Oct 2, 2019

elasticsearch-shard appears to be the tool for removing corrupted metadata.
This has happened to us several times since updating past 7.0.0.

Issue: a shard's directory structure and files are either deleted or never created, and elasticsearch-shard (remove-corrupted-data) cannot remove the shard from the metadata.

Steps:
1. Recreate the directory structure (elasticsearch-shard errors out with "directory must exist" if it does not exist).
2. Run elasticsearch-shard remove-corrupted-data (it hits a NullPointerException, because only the empty directories exist).

/usr/share/elasticsearch/bin $ ./elasticsearch-shard remove-corrupted-data --index dce_rpc-2019.08.28 --shard-id 24 -d /data/nsm/elasticsearch/nodes/0/indices/TUa5c332RFGKmM6yZSK-Rw/0/index
ERROR StatusLogger No Log4j 2 configuration file found. Using default configuration (logging only errors to the console), or user programmatically provided configurations. Set system property 'log4j2.debug' to show Log4j 2 internal initialization logging. See https://logging.apache.org/log4j/2.x/manual/configuration.html for instructions on how to configure Log4j 2

WARNING: Elasticsearch MUST be stopped before running this tool.

Please make a complete backup of your index before using this tool.


Exception in thread "main" java.lang.NullPointerException
at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommand.findAndProcessShardPath(RemoveCorruptedShardDataCommand.java:152)
at org.elasticsearch.index.shard.RemoveCorruptedShardDataCommand.execute(RemoveCorruptedShardDataCommand.java:282)
at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:86)
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124)
at org.elasticsearch.cli.MultiCommand.execute(MultiCommand.java:77)
at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:124)
at org.elasticsearch.cli.Command.main(Command.java:90)
at org.elasticsearch.index.shard.ShardToolCli.main(ShardToolCli.java:35)

/usr/share/elasticsearch/bin $ ls /data/nsm/elasticsearch/nodes/0/indices/TUa5c332RFGKmM6yZSK-Rw/0/
index _state translog

What I'd expect:
To be able to remove the bad shard, rather than having to rm -rf the entire node's data and rely on replicas.

Hopefully there's just an error in my steps.

Elasticsearch 7.3.0 (no plugins)
Oracle Linux 7.6
Network drives (vSAN) for Elasticsearch storage (though this also happens on physical boxes with Docker containers)

DaveCTurner added the :Distributed Coordination/Cluster Coordination label on Oct 2, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

@DaveCTurner (Contributor)

Sorry @ct0br0, I cannot reproduce this from the instructions you have given because they are too vague. What do you mean by "recreate the directory structure"? Can you share the sequence of specific commands you are running?

@ct0br0 (Author) commented Oct 2, 2019

That last part (the directory structure) I had to create myself, because it is either never created or has been deleted, while the metadata referencing it isn't deleted with it:

mkdir -p /data/nsm/elasticsearch/nodes/0/indices/TUa5c332RFGKmM6yZSK-Rw/0/{index,_state,translog}

@DaveCTurner (Contributor)

You should absolutely never modify the contents of the data path yourself. Can you go back a few steps and describe why you are doing this?

@ct0br0 (Author) commented Oct 2, 2019

If you were to take a directory that has shard data in it, nuke everything, and then re-create the empty directories, that is the state we end up in after planned or unplanned outages (even just stopping the Elasticsearch service can trigger it).

@ct0br0 (Author) commented Oct 2, 2019

I don't know how else to explain it: the data isn't there. Either it was never there and bad metadata was created, or it was in the middle of being deleted and the metadata wasn't updated. It usually happens when Elasticsearch stops, but it has also happened while running, and has been the reason Elasticsearch stopped before too. Physical drives, networked drives, it doesn't matter.

@DaveCTurner (Contributor)

I'm struggling to follow what you are trying to describe. Please slow down. Are you deleting things from the data path yourself or are you saying that this happens on its own? If it happens on its own then that's unexpected and we should address that. Can you share the logs from such a case?

@ct0br0 (Author) commented Oct 2, 2019

Yes, on its own. From the Elasticsearch log output:

java.io.IOException: failed to find metadata for existing index dce_rpc-2019.08.28 [location: TUa5c332RFGKmM6yZSK-Rw, generation: 34]
Caused by: java.io.IOException: failed to find metadata for existing index dce_rpc-2019.08.28 [location: TUa5c332RFGKmM6yZSK-Rw, generation: 34]
org.elasticsearch.bootstrap.StartupException: ElasticsearchException[failed to bind service]; nested: IOException[failed to find metadata for existing index dce_rpc-2019.08.28 [location: TUa5c332RFGKmM6yZSK-Rw, generation: 34]];
Caused by: java.io.IOException: failed to find metadata for existing index dce_rpc-2019.08.28 [location: TUa5c332RFGKmM6yZSK-Rw, generation: 34]

The fact that this is happening at all is most likely a separate issue from the behavior of elasticsearch-shard remove-corrupted-data.

@DaveCTurner (Contributor)

Ok, this means that some metadata is not where it should be. I wouldn't expect elasticsearch-shard remove-corrupted-data to help at all here, because this tool is only interested in corrupted data, not metadata.

Can you share a diagnostics bundle from your cluster please?

@ct0br0 (Author) commented Oct 2, 2019

It looks like you have to build it? If that's the case, I cannot run it in our environment.

Can this be a feature request then: add a remove-shard option? Or would that be a problem with how Elasticsearch works?

@DaveCTurner (Contributor)

OK, if you can't provide diagnostics then can you tell us a lot more about your cluster, about the node that's affected, and about the dce_rpc-2019.08.28 index? Things like the node's elasticsearch.yml and the outputs of GET _nodes, GET _nodes/stats, and GET dce_rpc-2019.08.28/_settings would all be useful.
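For reference, a minimal sketch of collecting those outputs with curl; the host, port, and user below are placeholder assumptions that need adjusting for this cluster (security and HTTPS are enabled, and -k matches the verification_mode: none setting shown later in the thread):

# assumed host, port, and user; curl prompts for the password
curl -k -u elastic 'https://localhost:9200/_nodes?pretty' > nodes.json
curl -k -u elastic 'https://localhost:9200/_nodes/stats?pretty' > nodes_stats.json
curl -k -u elastic 'https://localhost:9200/dce_rpc-2019.08.28/_settings?pretty' > dce_rpc_settings.json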

The fix for this isn't to add a tool to clean up some mess, it's to prevent the mess from happening in the first place. And for that we need to understand how it's happening.

@ct0br0 (Author) commented Oct 2, 2019

Sure thing. I'm not sure I can upload the node information with hostnames and IPs on here, though.

The cluster was 73 data nodes; a few went bad and we are currently at 69.
3 master nodes
2 coordinating nodes
12 Logstash instances

The dce_rpc index is from Bro/Zeek (https://github.com/zeek), fed in by Filebeat, but various other indexes have had this issue.

elasticsearch.yml


cluster.name: nsm
node.name: removed-data-node
path.data: "/data/nsm/elasticsearch"
path.logs: "/data/nsm/log/elasticsearch"
network.host: 0.0.0.0
http.port: '9200'
transport.tcp.port: '9300'
discovery.seed_hosts:
  - removed-docker-master1:9304
  - removed-docker-master2:9304
  - removed-docker-master3:9304
discovery.zen.minimum_master_nodes: '2'
cluster.join.timeout: '120s'
node.ml: false
node.data: true
node.ingest: true
node.master: false

xpack.security.http.ssl.key: /etc/elasticsearch/certs/host.key
xpack.security.http.ssl.certificate: /etc/elasticsearch/certs/host.pem
xpack.security.http.ssl.certificate_authorities: /etc/elasticsearch/certs/chain.pem
xpack.security.enabled: true
xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.verification_mode: none
xpack.security.transport.ssl.key: /etc/elasticsearch/certs/host.key
xpack.security.transport.ssl.certificate: /etc/elasticsearch/certs/host.pem
xpack.security.transport.ssl.certificate_authorities: /etc/elasticsearch/certs/chain.pem

xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: none
xpack.watcher.index.rest.direct_access: true
xpack.security.transport.ssl.supported_protocols: [ "TLSv1.2", "TLSv1.1"]
xpack.security.http.ssl.supported_protocols: [ "TLSv1.2", "TLSv1.1"]

dce_rpc-2019.08.28 settings:

"dce_rpc-2019.08.28" : {
"settings" : {
"index" : {
"verified_before_close" : "true",
"blocks" : {
"write" : "true"
},
"provided_name" : "dce_rpc-2019.08.28",
"frozen" : "true",
"creation_date" : "1566950340886",
"priority" : "0",
"number_of_replicas" : "5",
"uuid" : "TUa5c332RFGKmM6yZSK-Rw",
"version" : {
"created" : "7030099"
},
"lifecycle" : {
"name" : "suricata"
},
"routing" : {
"allocation" : {
"total_shards_per_node" : "1"
}
},
"search" : {
"throttled" : "true"
},
"number_of_shards" : "1"
}
}
}
}

@DaveCTurner (Contributor)

Thanks, that is very helpful. We think this could be another instance of #47276, because this index is frozen and was therefore (briefly) a closed replicated index. The fix is #47285 and, in the meantime, #47276 (comment) describes a possible workaround.

@ct0br0 (Author) commented Oct 2, 2019

We just have to remove the manifest*.st file?
Will try next time. Thanks, Dave!
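For reference, a minimal sketch of locating that file first, assuming it lives under the node-level _state directory of the data path used above (an assumption; the linked workaround comment in #47276 is authoritative). Stop Elasticsearch and take a full backup of the data path before removing anything:

# assumed location of the node-level manifest file; verify before deleting
find /data/nsm/elasticsearch/nodes/0/_state -maxdepth 1 -name 'manifest*.st'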

@ct0br0 (Author) commented Oct 7, 2019

Hey, wanted to let you know that works fantastically. Thanks for the workaround!

@ebadyano (Contributor)

Closing, as the user confirmed that the workaround worked for them.
