Tool to remove corrupted parts of corrupt shards #31389
Comments
Pinging @elastic/es-distributed |
+1. That setting is dangerous :( |
We (@elastic/es-distributed) discussed this today and decided:
|
It has been observed that elasticsearch/server/src/main/java/org/elasticsearch/index/shard/IndexShard.java Line 1292 in f04c579
|
I don't think this is correct elasticsearch/server/src/main/java/org/elasticsearch/index/shard/IndexShard.java Line 1890 in f04c579
UPDATE: It seems that method is dead code. |
|
description is updated |
As discussed, one of the upsides of a tool vs. running Lucene directly is the translation between index names and index folders. I think we should allow people to specify an index name and a shard id as parameters. |
+1 to +1 to pass in an index name and a shard rather than a folder.
Or maybe
I'd probably not expose these options at all, and always run with fast=true and crossCheckTermVectors=false.
Since the name of the command already implies data loss I'm not sure we need this one. Maybe turn it around and make it a |
@jpountz I like the idea of |
I'm open to ideas here as long as the fact that this command will cause data loss is clear. My thinking was that since the command is already called |
We've had a good discussion around this tool and have concluded the following:
We have run out of time and didn't discuss the parameters and tool naming. @vladimirdolzhenko can you post a suggestion here based on the above and we can discuss it further? |
|
Relates elastic#31389 (cherry picked from commit a3e8b83)
Closed by #32281. |
Today, if we detect shard corruption then we mark the store as corrupt and refuse to open it again. If there are no replicas then you might be able to use Lucene’s CheckIndex to remove the corrupted segments, although this does not remove the corruption marker, requires knowledge of our filesystem layout, and might be tricky to do in a containerised or heavily automated environment. The only way forward via the API is to force the allocation of an empty primary which drops all the data in the shard. We have an `index.shard.check_on_startup: fix` setting but this is suboptimal for a couple of reasons:
(it also does nothing in versions 6.0 and above, but that's another story)
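For reference, the API-only route mentioned above (forcing an empty primary) goes through the `allocate_empty_primary` command of the cluster reroute API. The following is a minimal sketch of building that request body; the helper name and the index, shard, and node values are placeholders chosen for illustration, not anything taken from this issue.

```python
import json


def empty_primary_reroute_body(index, shard, node):
    """Build a body for POST /_cluster/reroute that forces an empty primary.

    The helper name and arguments are hypothetical. The command discards all
    data in the shard, which is why accept_data_loss must be set explicitly.
    """
    return {
        "commands": [
            {
                "allocate_empty_primary": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }


if __name__ == "__main__":
    # Placeholder values, not taken from the issue.
    print(json.dumps(empty_primary_reroute_body("my-index", 0, "node-1"), indent=2))
```

This only constructs the JSON; sending it to a cluster is left out on purpose, since doing so is destructive.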
The Right Way™ to recover a corrupted shard is certainly to fail it and recover another copy from one of its replicas, assuming such a thing exists, but we’ve seen a couple of cases recently where a user was running without replicas, e.g. to do a bulk load of data (which we sorta suggest might be a good idea sometimes) and hit some corruption that they'd have preferred to recover from with a bit of data loss rather than by restarting the load or allocating an empty primary.
I propose removing the `fix` option of the `index.shard.check_on_startup` setting and instead adding another dangerous forced allocation command that can attempt to allocate a primary on top of a corrupt store by fixing the store and removing its corruption marker.
/cc @tsouza @ywelsch re. this forum thread
Actual points and open questions:
- `elasticsearch-shard` with subcommand `remove-corrupted-segments`
- `remove-corrupted-segments`:
  - `--index-name index_name` and `--shard-id shard_id` (mandatory)
  - `-d path_to_index_folder` or `--dir path_to_index_folder`
  - `--dry-run`: do a fast check without actually dropping corrupted segments
- `exorcise` - interactive keyboard confirmation is required
- `elasticsearch-translog` into `elasticsearch-shard`:
  - `elasticsearch-translog` becomes `elasticsearch-shard truncate-translog`
  - `elasticsearch-translog` has only the `-d` option to specify a folder - it would be nice to have `--index-name index_name` and `--shard-id shard_id`
- checkIndex
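To make the proposed option set easier to discuss, here is a rough sketch of how the `remove-corrupted-segments` arguments could be parsed. It is written in Python with `argparse` purely for illustration and is not the tool's implementation. It assumes that `--index-name`/`--shard-id` and `-d`/`--dir` are alternative ways to locate the shard, which is one reading of the list above.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the options listed in the issue.
    parser = argparse.ArgumentParser(prog="elasticsearch-shard")
    sub = parser.add_subparsers(dest="command", required=True)
    rcs = sub.add_parser("remove-corrupted-segments")
    rcs.add_argument("--index-name")
    rcs.add_argument("--shard-id", type=int)
    rcs.add_argument("-d", "--dir")
    rcs.add_argument("--dry-run", action="store_true")
    return parser


def parse(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Exactly one of --index-name / --shard-id given: incomplete pair.
    partial_name = (args.index_name is None) != (args.shard_id is None)
    by_name = args.index_name is not None and args.shard_id is not None
    by_dir = args.dir is not None
    if partial_name:
        parser.error("--index-name and --shard-id must be given together")
    # Assumption: the shard is located either by name/id or by folder, not both.
    if by_name == by_dir:
        parser.error("specify either --index-name/--shard-id or -d/--dir")
    return args


if __name__ == "__main__":
    print(parse(["remove-corrupted-segments", "--index-name", "my-index",
                 "--shard-id", "0", "--dry-run"]))
```

The interactive confirmation step is not modelled here; a real tool would prompt before dropping anything unless `--dry-run` is set.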