add RemoveCorruptedShardDataCommand #32281

Merged
merged 93 commits into from
Sep 19, 2018
Changes from 4 commits
Commits
843f977
drop `index.shard.check_on_startup: fix`
Jul 23, 2018
4f01609
Merge remote-tracking branch 'remotes/origin/master' into fix/31389_1
Jul 31, 2018
a8f1488
add RemoveCorruptedSegmentsCommand; merge elasticsearch-translog and …
Jul 23, 2018
5f6b084
fix test with ClusterAllocationExplanation
Aug 20, 2018
1fc72e9
fix test with ClusterAllocationExplanation
Aug 21, 2018
153e4f2
create corrupted marker on `check_on_startup: true`; split testIndexC…
Aug 21, 2018
2964fef
Merge remote-tracking branch 'remotes/origin/master' into fix/31389_1
Aug 21, 2018
c71e306
create manually corruption marker (but don't corrupt index files) to …
Aug 21, 2018
a7668d6
checkstyle fix
Aug 21, 2018
6ee74a0
merge into ResolveShardCorruptionCommand
Aug 22, 2018
ee955b0
check is _state folder exist before reading state
Aug 22, 2018
918ce41
merge two commands into a single remove-corrupted-segments
Aug 24, 2018
97fa399
Merge remote-tracking branch 'remotes/origin/master' into fix/31389_1
Aug 24, 2018
ebef6d2
Merge remote-tracking branch 'remotes/origin/fix/31389_1' into fix/31…
Aug 24, 2018
5cddefb
fixes after merge with remote-tracking branch 'remotes/origin/fix/313…
Aug 25, 2018
fd407bb
move corruptIndex to CorruptionUtils
Aug 25, 2018
4bc9c95
reworked resolveShardPath
Aug 25, 2018
b29aa9a
split testShardLock; testCorruptedBothIndexAndTranslog is added
Aug 26, 2018
9ceeaf4
simplified test
Aug 26, 2018
addb03f
test code cleanup
Aug 26, 2018
0f29f0f
test code cleanup
Aug 27, 2018
e6c6d70
checkstyle
Aug 27, 2018
c155b36
addressed unit test comments
Aug 27, 2018
85b7eef
keep `fix` for 6.x branch
Aug 27, 2018
7f292e3
drop unused class
Aug 27, 2018
43ae3a1
remove-corrupted-data subcommand instead of remove-corrupted-segments
Aug 27, 2018
087d558
remove-corrupted-data subcommand instead of remove-corrupted-segments…
Aug 27, 2018
ad819ec
dropped `index.shard.check_on_startup: fix` - it has to go with anoth…
Aug 27, 2018
75fcafa
amendment on a CLI tool name
Aug 27, 2018
cf6837f
a bit of clean up + show translog file names in sorted order instead …
Aug 27, 2018
260a5f4
keep node lock on shard shamanizing; fix allocate empty primary; inst…
Aug 27, 2018
3de84e2
fix node lock scope
Aug 28, 2018
073d29f
renamed to RemoveCorruptedShardDataCommand
Aug 28, 2018
03bbc5f
added test for multi-node layout for a single env
Aug 28, 2018
3231803
added `fix` deprecation log message + test
Aug 28, 2018
64c29db
dropped `dry-run`
Aug 28, 2018
d1805d6
keep elasticsearch-translog for 6.x
Aug 28, 2018
c2b5b8a
added `fix` deprecation log message + test
Aug 28, 2018
14e6175
adjusted `fix` deprecation log message
Aug 28, 2018
fee8a5b
dropped `fix` to avoid deprecation warnings
Aug 28, 2018
e1808d6
Merge remote-tracking branch 'remotes/origin/fix/31389_1' into fix/31…
Aug 28, 2018
5b5d516
set 755 to elasticsearch-shard, elasticsearch-translog
Aug 28, 2018
5cee2b9
skip files added by Lucene's ExtrasFS
Aug 28, 2018
b11670c
skip files added by Lucene's ExtrasFS
Aug 28, 2018
e38238a
skip files added by Lucene's ExtrasFS
Aug 28, 2018
ad62da0
Merge remote-tracking branch 'remotes/origin/master' into fix/31389_1
Aug 28, 2018
6f6ca5a
Merge remote-tracking branch 'remotes/origin/master' into fix/31389_1
Aug 29, 2018
6763cf9
Merge remote-tracking branch 'remotes/origin/master' into fix/31389_1
Aug 29, 2018
7f1f6f3
Merge branch 'fix/31389_1' into fix/31389_2
Aug 29, 2018
5083e83
Merge remote-tracking branch 'remotes/origin/master' into fix/31389_1
Aug 31, 2018
2a9dbeb
resolved conflicts on Merge remote-tracking branch 'remotes/origin/ma…
Aug 31, 2018
d165a6c
Merge branch 'fix/31389_1' into fix/31389_2
Aug 31, 2018
f985de4
resolve conflict after Merge branch 'fix/31389_1' into fix/31389_2
Aug 31, 2018
aa16487
Merge remote-tracking branch 'remotes/origin/master' into fix/31389_1
Aug 31, 2018
f74c058
Merge remote-tracking branch 'remotes/origin/fix/31389_1' into fix/31…
Aug 31, 2018
28c6a5a
checkstyle
Aug 31, 2018
24bc3d4
added comment on the reason to keep index lock
Aug 31, 2018
2d2dd2b
dropped left-over
Aug 31, 2018
e196e9e
addressed documentation review comments (links, clean up)
Aug 31, 2018
4d89496
removed misleading comments
Aug 31, 2018
5bdb069
clean up; inlining of resolveShardPath; text adjustments
Aug 31, 2018
5349c72
extracted lock logic from NodeEnvironment ctor into NodeLock; reused …
Aug 31, 2018
4286800
reworked resolve shard path
Aug 31, 2018
01be5af
added Lucene.SOFT_DELETES_FIELD to IndexWriter
Aug 31, 2018
af64fd4
polish a bit NodeLock
Aug 31, 2018
d26fbfb
Merge remote-tracking branch 'remotes/origin/master' into fix/31389_1
Aug 31, 2018
f8fd76a
Merge remote-tracking branch 'remotes/origin/fix/31389_1' into fix/31…
Aug 31, 2018
47fa3fa
Merge branch 'remote/origin/master' into fix/31389_2
Aug 31, 2018
3a4916a
checkstyle
Sep 1, 2018
9f3a7fb
dropped testCheckOnStartupDeprecatedValue due to wrong merge with master
Sep 1, 2018
abcff3c
fix NodeEnvironment.NodeLock
Sep 1, 2018
91dc295
Merge remote-tracking branch 'remotes/origin/master' into fix/31389_2
Sep 4, 2018
33f3a45
improved message on delete marker
Sep 4, 2018
c796417
minor test code style change
Sep 5, 2018
4181988
fix test
Sep 5, 2018
a1593e8
fix test
Sep 5, 2018
185adc9
Merge remote-tracking branch 'remotes/origin/master' into fix/31389_2
Sep 6, 2018
f5cf90a
move shard-tool doc next to other docs
Sep 6, 2018
8de0ae5
fix [float] Removing a corrupted data files header
Sep 6, 2018
5b29ad0
Merge remote-tracking branch 'remotes/origin/master' into fix/31716_2
Sep 10, 2018
8242bbb
after merge fixes
Sep 10, 2018
418c922
Tweaks to docs
DaveCTurner Sep 10, 2018
674d1ba
dropped unrelated checkIndexOnStartup = fix setting
Sep 10, 2018
53b404a
nodeEnv code style clean up
Sep 10, 2018
24ffdd1
do not expose node lock; code style adjustment; text comment adjustment
Sep 10, 2018
2a3f58d
Merge remote-tracking branch 'remotes/origin/master' into fix/31389_2
Sep 10, 2018
e1eb32f
tiny doc amendment
Sep 13, 2018
ee1f6a2
NodeEnvironment.NodeLock can skip node path if it is required
Sep 13, 2018
1df4685
Merge remote-tracking branch 'remotes/origin/master' into fix/31389_2
Sep 13, 2018
844adaf
after merge fix
Sep 13, 2018
8210f3b
after merge fix
Sep 14, 2018
dab5125
inline nodeLock
Sep 18, 2018
54c4030
add javadoc comment for pathFunction
Sep 18, 2018
5 changes: 5 additions & 0 deletions distribution/src/bin/elasticsearch-shard
@@ -0,0 +1,5 @@
#!/bin/bash

ES_MAIN_CLASS=org.elasticsearch.index.shard.ShardToolCli \
"`dirname "$0"`"/elasticsearch-cli \
"$@"
@@ -3,7 +3,7 @@
setlocal enabledelayedexpansion
setlocal enableextensions

-set ES_MAIN_CLASS=org.elasticsearch.index.translog.TranslogToolCli
+set ES_MAIN_CLASS=org.elasticsearch.index.shard.ShardToolCli
Contributor

The removal of the elasticsearch-translog tool is a breaking change, so this cannot happen in 6.5. At the moment this PR is only tagged for 7.0, which is ok, but it cannot be backported as-is.

call "%~dp0elasticsearch-cli.bat" ^
%%* ^
|| exit /b 1
5 changes: 0 additions & 5 deletions distribution/src/bin/elasticsearch-translog

This file was deleted.

12 changes: 6 additions & 6 deletions docs/reference/index-modules.asciidoc
@@ -63,12 +63,6 @@ corruption is detected, it will prevent the shard from being opened. Accepts:
Check for both physical and logical corruption. This is much more
expensive in terms of CPU and memory usage.

`fix`::

Check for both physical and logical corruption. Segments that were reported
as corrupted will be automatically removed. This option *may result in data loss*.
Use with extreme caution!

WARNING: Expert only. Checking shards may take a lot of time on large indices.
--

@@ -279,6 +273,10 @@ Other index settings are available in index modules:

Control over the transaction log and background flush operations.

<<index-modules-command-line-tools,Command-line tools>>::

Command-line tools to use if a shard is corrupted

Contributor

If this is the single elasticsearch-shard tool, I'd refer to it as a single tool. This should be a floated level 3 heading instead of an item in the settings list. I'd go with:

[float]
[[shard-recovery-tool]]
=== Shard recovery tool

You can use the <<index-modules-elasticsearch-shard,elasticsearch-shard>> recovery tool to remove corrupted translog or corrupted Lucene segments if a shard cannot be recovered automatically or restored from backup.

--

include::index-modules/analysis.asciidoc[]
@@ -297,4 +295,6 @@ include::index-modules/store.asciidoc[]

include::index-modules/translog.asciidoc[]

include::index-modules/command-line-tools.asciidoc[]

Contributor

@lcawl Instead of including this in the index-modules section, do we want to just go ahead and rename the X-Pack commands section "Command line tools" and include it there? It can still be linked to from the index-modules page, but it would be great if we could move toward having a single command reference.

Contributor

Yes, that would be great!

Contributor

I've created #33005 for that purpose.

include::index-modules/index-sorting.asciidoc[]
213 changes: 213 additions & 0 deletions docs/reference/index-modules/command-line-tools.asciidoc
@@ -0,0 +1,213 @@
[[index-modules-command-line-tools]]

== Command-line tools

Contributor

This topic should use the tool name in the anchor and heading like we do for the X-Pack tools:

[[shard-tool]]
== elasticsearch-shard

The filename should match the anchor: shard-tool.asciidoc. The include (wherever we include the file) will need to be updated accordingly.

In some cases (a bad drive, user error) the translog or Lucene index on a shard copy
can become corrupted. When this corruption is detected by Elasticsearch due to mismatching
checksums, Elasticsearch will fail that shard copy and refuse to use that copy
of the data.

*Note*: If there are other copies of the shard available then
Elasticsearch will automatically recover from one of them using the normal
shard allocation and recovery mechanism. In particular, if the corrupt shard
copy was the primary when the corruption was detected then one of its replicas
will be promoted in its place.

You can also use <<modules-snapshots,snapshot and restore>> to restore the index.

Please use this tool only as a last resort, when there is no copy of the data
from which Elasticsearch can recover successfully.

We provide a command-line tool for this: `elasticsearch-shard`.

The cost of using this tool is losing the corrupted data. The lost data can be of
any age: it may be recent or old.

Contributor

In general, try to stick to present tense. I don't think it's necessary to go into what causes data corruption or how the normal shard recovery process works. I'd pare this down to:

The elasticsearch-shard command enables you to remove a corrupted translog or corrupted Lucene segments if a shard cannot be recovered automatically or restored from backup.

WARNING: You will lose the corrupted data when you run elasticsearch-shard. This tool should only be used as a last resort if there is no way to recover from another copy of the shard or restore a snapshot.

When Elasticsearch detects that a shard's translog or Lucene index is corrupted, it fails that shard copy and quits using it. Under normal conditions, the shard is automatically recovered from another copy. If no good copy of the shard is available and you cannot restore from backup, you can use elasticsearch-shard to remove only the corrupted data and restore access to the data in unaffected segments.

This is somewhat redundant, but it reinforces when it's appropriate to use the tool.

[WARNING]
The `elasticsearch-shard` tool should *not* be run while Elasticsearch is
running. If you attempt to run this tool while Elasticsearch is running, you
will permanently lose the documents that were contained only in the translog!

Contributor

I'd flip this around: "Stop Elasticsearch before running elasticsearch-shard. Attempting to use elasticsearch-shard while Elasticsearch is running will result in data loss. Any documents contained only in the translog will be permanently deleted."

Contributor

I think the tool should (does?) protect against running concurrently with an Elasticsearch node by obtaining the node lock for itself, meaning that this warning need not be so emphatically scary. I also think the second sentence is incorrect, you risk losing arbitrarily many documents when running this tool, regardless of whether Elasticsearch is running or not, but you may or may not lose the documents contained only in the translog.

(I think this warning was a bit misleading when talking about the existing translog truncation tool, in the sense that truncating the translog means you permanently lose the documents that were contained only in the translog, regardless of whether Elasticsearch is running concurrently or not)

[WARNING]
After dropping the corrupted part the allocation id of the shard is changed.
Contributor

We generally try to avoid stacking Warnings and Notes. This second one is really more of a next step than a warning--I'd change it to a paragraph and include it in the truncate translog & remove shards sections.

When you use elasticsearch-shard to drop the corrupted data, the shard's allocation ID changes. After you restart the node, you must use the cluster reroute API to tell Elasticsearch to use the new ID. When you run the elasticsearch-shard command, it shows the request that you need to submit.

We aren't totally consistent in what we call a command, but we should avoid referring to Elasticsearch APIs as commands, as that just muddies things further.

`elasticsearch-shard` prints the command that you have to run after the node
restart to apply the allocation id changes:
[source,txt]
--------------------------------------------------
$ curl -XPOST 'http://localhost:9200/_cluster/reroute' -d '
{
"commands" : [
{
"allocate_stale_primary" : {
"index" : "twitter",
"shard" : 0,
"node" : "pAfJBgAAQACIfI2M_____w",
"accept_data_loss" : true
}
}
]
}'
--------------------------------------------------
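
As an illustration of the reroute step, the request shown above could also be assembled programmatically. The following is a minimal Java sketch and not part of this PR: the class name `RerouteRequestSketch` and its helper methods are hypothetical, and the index, shard, and node values are copied from the example. The `Content-Type` header is an addition, on the understanding that Elasticsearch 6.0 and later expect one for requests that carry a body. The sketch only builds the request and does not send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class RerouteRequestSketch {

    /**
     * Builds the JSON body for a single allocate_stale_primary command.
     * Hypothetical helper: values are inserted as-is, so this assumes the
     * index and node names contain no characters that need JSON escaping.
     */
    public static String rerouteBody(String index, int shard, String node) {
        return String.format(
            "{\"commands\":[{\"allocate_stale_primary\":{"
                + "\"index\":\"%s\",\"shard\":%d,\"node\":\"%s\",\"accept_data_loss\":true}}]}",
            index, shard, node);
    }

    /** Wraps the body in a POST request against the cluster reroute endpoint. */
    public static HttpRequest rerouteRequest(String baseUrl, String body) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_cluster/reroute"))
            .header("Content-Type", "application/json") // assumption: needed on 6.0+
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();
    }

    public static void main(String[] args) {
        // Values taken from the documentation example above.
        String body = rerouteBody("twitter", 0, "pAfJBgAAQACIfI2M_____w");
        HttpRequest request = rerouteRequest("http://localhost:9200", body);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(body);
    }
}
```

Keeping the body builder separate from the request builder makes it easy to inspect the JSON before anything is sent, which matters for a command that sets `accept_data_loss`.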

=== What to do if the translog becomes corrupted?

Contributor

We generally use a verb or gerund (-ing) pattern for tasks. I'd change this to Removing a corrupted translog

Make sure you explicitly specify an anchor for each heading. I think we want to float these headings so it's all on one page?

[float]
[[remove-corrupted-translog]]
=== Removing a corrupted translog

In order to drop a corrupted translog, use the `truncate-translog` subcommand:

Contributor

"In order to" can almost always be replaced by just "To...".

To remove corrupted translog files, use the truncate-translog subcommand. There are two ways to specify the translog:

  • Specify the index name and shard name with the --index and --shard-id options.
  • Use the -d option to specify the full path to the corrupted translog file.

* specify the index name with `--index` and the shard id with `--shard-id`
* or specify the full path to the corrupted translog with the `-d` option
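
The directory passed with `-d` in the example below carries the index UUID (not the index name) and the shard id in its path components. A minimal Java sketch of pulling those two values out of such a path follows. It assumes the `indices/<index uuid>/<shard id>` layout seen in that example path. The class `ShardPathSketch` is hypothetical and is not how the tool itself resolves shards.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class ShardPathSketch {

    /** Holds the pieces recovered from a shard directory path. */
    public static final class ShardLocation {
        public final String indexUuid;
        public final int shardId;

        ShardLocation(String indexUuid, int shardId) {
            this.indexUuid = indexUuid;
            this.shardId = shardId;
        }
    }

    /**
     * Finds the "indices" path component and reads the two components after it
     * as index UUID and shard id. The layout is an assumption taken from the
     * example path in the documentation.
     */
    public static ShardLocation parse(String dir) {
        Path path = Paths.get(dir).normalize();
        int count = path.getNameCount();
        for (int i = 0; i + 2 < count; i++) {
            if (path.getName(i).toString().equals("indices")) {
                String uuid = path.getName(i + 1).toString();
                int shard = Integer.parseInt(path.getName(i + 2).toString());
                return new ShardLocation(uuid, shard);
            }
        }
        throw new IllegalArgumentException("no indices/<uuid>/<shard> component in " + dir);
    }

    public static void main(String[] args) {
        ShardLocation loc = parse(
            "/var/lib/elasticsearchdata/nodes/0/indices/P45vf_YQRhqjfwLMUvSqDw/0/translog/");
        System.out.println(loc.indexUuid + " shard " + loc.shardId);
    }
}
```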

[source,txt]
--------------------------------------------------
$ bin/elasticsearch-shard truncate-translog -d /var/lib/elasticsearchdata/nodes/0/indices/P45vf_YQRhqjfwLMUvSqDw/0/translog/
Checking existing translog files
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! WARNING: Elasticsearch MUST be stopped before running this tool !
! !
! WARNING: Documents inside of translog files will be lost !
! !
! WARNING: The following files will be DELETED! !
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
--> data/nodes/0/indices/P45vf_YQRhqjfwLMUvSqDw/0/translog/translog-41.ckp
--> data/nodes/0/indices/P45vf_YQRhqjfwLMUvSqDw/0/translog/translog-6.ckp
--> data/nodes/0/indices/P45vf_YQRhqjfwLMUvSqDw/0/translog/translog-37.ckp
--> data/nodes/0/indices/P45vf_YQRhqjfwLMUvSqDw/0/translog/translog-24.ckp
--> data/nodes/0/indices/P45vf_YQRhqjfwLMUvSqDw/0/translog/translog-11.ckp

Continue and DELETE files? [y/N] y
Reading translog UUID information from Lucene commit from shard at [data/nodes/0/indices/P45vf_YQRhqjfwLMUvSqDw/0/index]
Translog Generation: 3
Translog UUID : AxqC4rocTC6e0fwsljAh-Q
Removing existing translog files
Creating new empty checkpoint at [data/nodes/0/indices/P45vf_YQRhqjfwLMUvSqDw/0/translog/translog.ckp]
Creating new empty translog at [data/nodes/0/indices/P45vf_YQRhqjfwLMUvSqDw/0/translog/translog-3.tlog]

Marking index with the new history uuid : TAUddBstTciV9wAiA6sKFA
Changing allocation id ceU7CskxT4yRw4M-GLO1mg to nFa2DcCsSlady4LlnJeaEQ
You should run follow command to apply allocation id changes:

$ curl -XPOST 'http://localhost:9200/_cluster/reroute' -d '
{
"commands" : [
{
"allocate_stale_primary" : {
"index" : "twitter",
"shard" : 0,
"node" : "pAfJBgAAQACIfI2M_____w",
"accept_data_loss" : true
}
}
]
}'
Done.
--------------------------------------------------

=== What to do if the Lucene index becomes corrupted?

Contributor

As above...I'd change this to:

[float]
[[remove-corrupted-index-segments]]
=== Removing corrupted index segments from a shard

In similar cases the index on a shard copy can become corrupted.
As with a corrupted translog, when index corruption is detected by Elasticsearch due
to mismatching checksums, Elasticsearch will fail that shard copy and refuse to use that copy of the data.
If there are other copies of the shard available then Elasticsearch will automatically recover from one of
them using the normal shard allocation and recovery mechanism.

Contributor

I don't think it's necessary to repeat this information.

In order to remove corrupted segments, use the `remove-corrupted-segments` subcommand.
It writes a new segments file that omits references to the problematic (corrupted) Lucene segments. Use it when there is
no copy of the data from which Elasticsearch can recover successfully.

Contributor

In order to > To (as above). I'd go with:

To remove corrupted Lucene index segments, use the remove-corrupted-segments subcommand. This command writes a new segments file, omitting references to the corrupted segments. There are two ways to specify the shard:

  • Specify the index name and shard name with the --index and --shard-id options.
  • Use the -d option to specify the full path to the corrupted segment.

It is highly recommended that you make a complete backup of your index before using this tool to remove corrupted documents
from your index!

Contributor

I'd change this to a warning (and also add it to the translog section?).

WARNING: Back up your data before running elasticsearch-shard. This is a destructive operation that removes corrupted data from the shard.

* specify the index name with `--index` and the shard id with `--shard-id`
* or specify the full path to the corrupted index with the `-d` option

You can get an overview of the corruption with the `--dry-run` option:

Contributor

I'd be inclined to break the dry-run option out into its own section:
[float]
[[list-corrupted-segments]]
=== Listing corrupted index segments

To see what segments are corrupted and will be dropped, use the --dry-run option with the remove-corrupted-segments subcommand. There are two ways to specify the shard:

  • Specify the index name and shard name with the --index and --shard-id options.
  • Use the -d option to specify the full path to the corrupted segment.

[source,txt]
--------------------------------------------------
$ bin/elasticsearch-shard remove-corrupted-segments --dry-run --index twitter --shard-id 0

Opening index @ /var/lib/elasticsearchdata/nodes/0/indices/P45vf_YQRhqjfwLMUvSqDw/0/index/

WARNING: Corrupted segments found - 568 documents are damaged

--------------------------------------------------

Running `remove-corrupted-segments` without `--dry-run` requires interactive confirmation to drop damaged segments:

Contributor

I'd just include this as part of the remove segments section and say, "You must confirm that you want to remove the corrupted segments:"

[source,txt]
--------------------------------------------------

$ bin/elasticsearch-shard remove-corrupted-segments --index twitter --shard-id 0

Opening index @ /var/lib/elasticsearchdata/nodes/0/indices/P45vf_YQRhqjfwLMUvSqDw/0/index/

Segments file=segments_8 numSegments=6 version=7.4.0 id=efcaej17mrbjqf9jf5js52e94 userData={history_uuid=3Mu-8x3zTMm8TIZxwTkTZw, local_checkpoint=1896, max_seq_no=1896, max_unsafe_auto_id_timestamp=-1, translog_generation=7, translog_uuid=2n8vuupLQWSh5LDQzRe2fQ}
1 of 2: name=_0 maxDoc=1
version=7.4.0
id=efcaej17mrbjqf9jf5js52e8k
codec=Lucene70
compound=true
numFiles=3
size (MB)=0.004
diagnostics = {java.runtime.version=10.0.2+13, java.vendor=Oracle Corporation, java.version=10.0.2, java.vm.version=10.0.2+13, lucene.version=7.4.0, os=Mac OS X, os.arch=x86_64, os.version=10.13.6, source=flush, timestamp=1532081797245}
no deletions
test: open reader.........OK [took 0.001 sec]
test: check integrity.....OK [took 0.000 sec]
test: check live docs.....OK [took 0.000 sec]
test: field infos.........OK [9 fields] [took 0.000 sec]
test: field norms.........OK [2 fields] [took 0.000 sec]
test: terms, freq, prox...OK [5 terms; 5 terms/docs pairs; 2 tokens] [took 0.000 sec]
test: stored fields.......OK [2 total field count; avg 2.0 fields per doc] [took 0.000 sec]
test: term vectors........OK [0 total term vector count; avg 0.0 term/freq vector fields per doc] [took 0.000 sec]
test: docvalues...........OK [5 docvalues fields; 0 BINARY; 3 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 2 SORTED_SET] [took 0.000 sec]
test: points..............OK [1 fields, 1 points] [took 0.000 sec]

2 of 2: name=_1 maxDoc=568
version=7.4.0
id=efcaej17mrbjqf9jf5js52e8q
codec=Lucene70
compound=true
numFiles=3
size (MB)=1.148
diagnostics = {java.runtime.version=10.0.2+13, java.vendor=Oracle Corporation, java.version=10.0.2, java.vm.version=10.0.2+13, lucene.version=7.4.0, os=Mac OS X, os.arch=x86_64, os.version=10.13.6, source=flush, timestamp=1532081798123}
no deletions
test: open reader.........FAILED
WARNING: exorciseIndex() would remove reference to this segment;

WARNING: 1 broken segments (containing 568 documents) detected
Took 0.049 sec total.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! WARNING: 568 documents will be lost. !
! !
! WARNING: YOU WILL LOSE DATA. !
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Continue and remove 568 docs from the index ? [y/N]

Writing...
OK
Wrote new segments file "segments_8"

Marking index with the new history uuid : TAUddBstTciV9wAiA6sKFA
Changing allocation id ceU7CskxT4yRw4M-GLO1mg to nFa2DcCsSlady4LlnJeaEQ
You should run follow command to apply allocation id changes:

$ curl -XPOST 'http://localhost:9200/_cluster/reroute' -d '
{
"commands" : [
{
"allocate_stale_primary" : {
"index" : "twitter",
"shard" : 0,
"node" : "pAfJBgAAQACIfI2M_____w",
"accept_data_loss" : true
}
}
]
}'
Deleted corrupt marker corrupted_cJv5hCxeTE2p3AucpgCJyg

--------------------------------------------------

You can also use the `-h` option to get a list of all options and parameters
that the `elasticsearch-shard` tool supports.
57 changes: 1 addition & 56 deletions docs/reference/index-modules/translog.asciidoc
@@ -88,59 +88,4 @@ file based sync. Defaults to `512mb`
The maximum duration for which translog files will be kept. Defaults to `12h`.


[float]
[[corrupt-translog-truncation]]
=== What to do if the translog becomes corrupted?

Contributor

To avoid having to set up redirects & help steer people to the new tool, I'd keep this section heading and just xref the new one.

In some cases (a bad drive, user error) the translog on a shard copy can become
corrupted. When this corruption is detected by Elasticsearch due to mismatching
checksums, Elasticsearch will fail that shard copy and refuse to use that copy
of the data. If there are other copies of the shard available then
Elasticsearch will automatically recover from one of them using the normal
shard allocation and recovery mechanism. In particular, if the corrupt shard
copy was the primary when the corruption was detected then one of its replicas
will be promoted in its place.

If there is no copy of the data from which Elasticsearch can recover
successfully, a user may want to recover the data that is part of the shard at
the cost of losing the data that is currently contained in the translog. We
provide a command-line tool for this, `elasticsearch-translog`.

[WARNING]
The `elasticsearch-translog` tool should *not* be run while Elasticsearch is
running. If you attempt to run this tool while Elasticsearch is running, you
will permanently lose the documents that were contained only in the translog!

In order to run the `elasticsearch-translog` tool, specify the `truncate`
subcommand as well as the directory for the corrupted translog with the `-d`
option:

[source,txt]
--------------------------------------------------
$ bin/elasticsearch-translog truncate -d /var/lib/elasticsearchdata/nodes/0/indices/P45vf_YQRhqjfwLMUvSqDw/0/translog/
Checking existing translog files
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! WARNING: Elasticsearch MUST be stopped before running this tool !
! !
! WARNING: Documents inside of translog files will be lost !
! !
! WARNING: The following files will be DELETED! !
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
--> data/nodes/0/indices/P45vf_YQRhqjfwLMUvSqDw/0/translog/translog-41.ckp
--> data/nodes/0/indices/P45vf_YQRhqjfwLMUvSqDw/0/translog/translog-6.ckp
--> data/nodes/0/indices/P45vf_YQRhqjfwLMUvSqDw/0/translog/translog-37.ckp
--> data/nodes/0/indices/P45vf_YQRhqjfwLMUvSqDw/0/translog/translog-24.ckp
--> data/nodes/0/indices/P45vf_YQRhqjfwLMUvSqDw/0/translog/translog-11.ckp

Continue and DELETE files? [y/N] y
Reading translog UUID information from Lucene commit from shard at [data/nodes/0/indices/P45vf_YQRhqjfwLMUvSqDw/0/index]
Translog Generation: 3
Translog UUID : AxqC4rocTC6e0fwsljAh-Q
Removing existing translog files
Creating new empty checkpoint at [data/nodes/0/indices/P45vf_YQRhqjfwLMUvSqDw/0/translog/translog.ckp]
Creating new empty translog at [data/nodes/0/indices/P45vf_YQRhqjfwLMUvSqDw/0/translog/translog-3.tlog]
Done.
--------------------------------------------------

You can also use the `-h` option to get a list of all options and parameters
that the `elasticsearch-translog` tool supports.
[float]
9 changes: 9 additions & 0 deletions docs/reference/migration/migrate_7_0/indices.asciidoc
@@ -78,3 +78,12 @@ The parent circuit breaker defines a new setting `indices.breaker.total.use_real
heap memory instead of only considering the reserved memory by child circuit breakers. When this
setting is `true`, the default parent breaker limit also changes from 70% to 95% of the JVM heap size.
The previous behavior can be restored by setting `indices.breaker.total.use_real_memory` to `false`.

==== `fix` value for `index.shard.check_on_startup` is removed
Contributor

I thought we planned on documenting this breakage in the 6.0 breaking changes, and its removal in the backport of #32279. In any case I think this shouldn't be part of this PR.

Contributor Author

++


The `elasticsearch-shard remove-corrupted-segments` tool has to be used instead of
the `index.shard.check_on_startup: fix` setting.
Contributor

Let's consistently refer to it as the "elasticsearch-shard" tool. I'd just say:

Use the elasticsearch-shard tool to remove corrupted Lucene index segments.


==== `elasticsearch-translog` tool merged into `elasticsearch-shard`

Instead of the `elasticsearch-translog` tool, you should use `elasticsearch-shard truncate-translog`.
Contributor

Similarly:

Use the elasticsearch-shard tool to remove corrupted translog data.

7 changes: 6 additions & 1 deletion libs/cli/src/main/java/org/elasticsearch/cli/Terminal.java
@@ -85,12 +85,17 @@ public final void println(Verbosity verbosity, String msg) {

/** Prints message to the terminal at {@code verbosity} level, without a newline. */
public final void print(Verbosity verbosity, String msg) {
-if (this.verbosity.ordinal() >= verbosity.ordinal()) {
+if (isPrintable(verbosity)) {
getWriter().print(msg);
getWriter().flush();
}
}

/** Checks whether a message at the given {@code verbosity} level would be printed. */
public final boolean isPrintable(Verbosity verbosity) {
return this.verbosity.ordinal() >= verbosity.ordinal();
}

/**
* Prompt for a yes or no answer from the user. This method will loop until 'y' or 'n'
* (or the default empty value) is entered.
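
To illustrate the check that `isPrintable` factors out of `print`, here is a minimal standalone sketch. It assumes the verbosity levels form an enum ordered from quietest to most verbose. The names `SILENT`, `NORMAL`, and `VERBOSE` are assumptions, since the enum is not part of this diff. `VerbositySketch` is a hypothetical stand-in for `Terminal` that collects output in a string instead of writing to a console.

```java
public class VerbositySketch {

    /** Assumed verbosity levels, ordered from least to most chatty. */
    public enum Verbosity { SILENT, NORMAL, VERBOSE }

    private final Verbosity configured;
    private final StringBuilder out = new StringBuilder();

    public VerbositySketch(Verbosity configured) {
        this.configured = configured;
    }

    /** True when the configured level is at least as verbose as the message level. */
    public boolean isPrintable(Verbosity messageLevel) {
        return configured.ordinal() >= messageLevel.ordinal();
    }

    /** Appends the message only if its level passes the check. */
    public void print(Verbosity messageLevel, String msg) {
        if (isPrintable(messageLevel)) {
            out.append(msg);
        }
    }

    public String output() {
        return out.toString();
    }

    public static void main(String[] args) {
        VerbositySketch terminal = new VerbositySketch(Verbosity.NORMAL);
        terminal.print(Verbosity.NORMAL, "shown ");
        terminal.print(Verbosity.VERBOSE, "hidden");
        System.out.println(terminal.output());
    }
}
```

Exposing the check as its own method lets a caller skip building an expensive message when it would be discarded anyway.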
@@ -325,4 +325,21 @@ public void test90SecurityCliPackaging() {
}
}

public void test100RepairIndexCliPackaging() {
assumeThat(installation, is(notNullValue()));

final Installation.Executables bin = installation.executables();
final Shell sh = new Shell();

Platforms.PlatformAction action = () -> {
final Result result = sh.run(bin.elasticsearchShard + " help");
assertThat(result.stdout, containsString("A CLI tool to manage shard"));
};

if (distribution().equals(Distribution.DEFAULT_TAR) || distribution().equals(Distribution.DEFAULT_ZIP)) {
Platforms.onLinux(action);
Platforms.onWindows(action);
}
}

}
@@ -186,7 +186,7 @@ private static void verifyOssInstallation(Installation es, Distribution distribu
"elasticsearch-env",
"elasticsearch-keystore",
"elasticsearch-plugin",
-"elasticsearch-translog"
+"elasticsearch-shard"
).forEach(executable -> {

assertThat(es.bin(executable), file(File, owner, owner, p755));