From 671027919939b8a84b1fc246d3477abb85e7db97 Mon Sep 17 00:00:00 2001 From: debadair Date: Fri, 4 Oct 2019 19:41:46 -0700 Subject: [PATCH] Reformats reindex API (#47483) * Reformats reindex API * Incorporated review feedback. --- docs/reference/docs/reindex.asciidoc | 1542 ++++++++++++-------------- 1 file changed, 719 insertions(+), 823 deletions(-) diff --git a/docs/reference/docs/reindex.asciidoc b/docs/reference/docs/reindex.asciidoc index 2f6d1561e67c1..009c8deb7785c 100644 --- a/docs/reference/docs/reindex.asciidoc +++ b/docs/reference/docs/reindex.asciidoc @@ -1,16 +1,20 @@ [[docs-reindex]] === Reindex API +++++ +Reindex +++++ -IMPORTANT: Reindex requires <> to be enabled for -all documents in the source index. +Copies documents from one index to another. -IMPORTANT: Reindex does not attempt to set up the destination index. It does -not copy the settings of the source index. You should set up the destination -index prior to running a `_reindex` action, including setting up mappings, shard -counts, replicas, etc. +[IMPORTANT] +================================================= +Reindex requires <> to be enabled for +all documents in the source index. -The most basic form of `_reindex` just copies documents from one index to another. -This will copy documents from the `twitter` index into the `new_twitter` index: +You must set up the destination index before calling `_reindex`. +Reindex does not copy the settings from the source index. +Mappings, shard counts, replicas, and so on must be configured ahead of time. +================================================= [source,console] -------------------------------------------------- @@ -26,7 +30,7 @@ POST _reindex -------------------------------------------------- // TEST[setup:big_twitter] -That will return something like this: +//// [source,console-result] -------------------------------------------------- @@ -52,145 +56,201 @@ That will return something like this: -------------------------------------------------- // TESTRESPONSE[s/"took" : 147/"took" : "$body.took"/] +//// + +[[docs-reindex-api-request]] +==== {api-request-title} + +`POST /_reindex` + +[[docs-reindex-api-desc]] +==== {api-description-title} + +Extracts the document source from the source index and indexes the documents into the destination index. +You can copy all documents to the destination index, or reindex a subset of the documents. + Just like <>, `_reindex` gets a snapshot of the source index but its target must be a **different** index so version conflicts are unlikely. The `dest` element can be configured like the -index API to control optimistic concurrency control. Just leaving out -`version_type` (as above) or setting it to `internal` will cause Elasticsearch +index API to control optimistic concurrency control. Omitting +`version_type` or setting it to `internal` causes Elasticsearch to blindly dump documents into the target, overwriting any that happen to have -the same type and id: - -[source,console] --------------------------------------------------- -POST _reindex -{ - "source": { - "index": "twitter" - }, - "dest": { - "index": "new_twitter", - "version_type": "internal" - } -} --------------------------------------------------- -// TEST[setup:twitter] +the same ID. -Setting `version_type` to `external` will cause Elasticsearch to preserve the +Setting `version_type` to `external` causes Elasticsearch to preserve the `version` from the source, create any documents that are missing, and update any documents that have an older version in the destination index than they do -in the source index: +in the source index. -[source,console] +Setting `op_type` to `create` causes `_reindex` to only create missing +documents in the target index. All existing documents will cause a version +conflict. + +By default, version conflicts abort the `_reindex` process. +To continue reindexing if there are conflicts, set the `"conflicts"` request body parameter to `proceed`. +In this case, the response includes a count of the version conflicts that were encountered. +Note that the handling of other error types is unaffected by the `"conflicts"` parameter. + +[[docs-reindex-task-api]] +===== Running reindex asynchronously + +If the request contains `wait_for_completion=false`, {es} +performs some preflight checks, launches the request, and returns a +<> you can use to cancel or get the status of the task. +{es} creates a record of this task as a document at `.tasks/task/${taskId}`. +When you are done with a task, you should delete the task document so +{es} can reclaim the space. + +[[docs-reindex-many-indices]] +===== Reindexing many indices +If you have many indices to reindex it is generally better to reindex them +one at a time rather than using a glob pattern to pick up many indices. That +way you can resume the process if there are any errors by removing the +partially completed index and starting over at that index. It also makes +parallelizing the process fairly simple: split the list of indices to reindex +and run each list in parallel. + +One-off bash scripts seem to work nicely for this: + +[source,bash] +---------------------------------------------------------------- +for index in i1 i2 i3 i4 i5; do + curl -HContent-Type:application/json -XPOST localhost:9200/_reindex?pretty -d'{ + "source": { + "index": "'$index'" + }, + "dest": { + "index": "'$index'-reindexed" + } + }' +done +---------------------------------------------------------------- +// NOTCONSOLE + +[[docs-reindex-throttle]] +===== Throttling + +Set `requests_per_second` to any positive decimal number (`1.4`, `6`, +`1000`, etc.) to throttle the rate at which `_reindex` issues batches of index +operations. Requests are throttled by padding each batch with a wait time. +To disable throttling, set `requests_per_second` to `-1`. + +The throttling is done by waiting between batches so that the `scroll` that `_reindex` +uses internally can be given a timeout that takes into account the padding. +The padding time is the difference between the batch size divided by the +`requests_per_second` and the time spent writing. By default the batch size is +`1000`, so if `requests_per_second` is set to `500`: + +[source,txt] -------------------------------------------------- -POST _reindex -{ - "source": { - "index": "twitter" - }, - "dest": { - "index": "new_twitter", - "version_type": "external" - } -} +target_time = 1000 / 500 per second = 2 seconds +wait_time = target_time - write_time = 2 seconds - .5 seconds = 1.5 seconds -------------------------------------------------- -// TEST[setup:twitter] -Settings `op_type` to `create` will cause `_reindex` to only create missing -documents in the target index. All existing documents will cause a version -conflict: +Since the batch is issued as a single `_bulk` request, large batch sizes +cause Elasticsearch to create many requests and then wait for a while before +starting the next set. This is "bursty" instead of "smooth". + +[[docs-reindex-rethrottle]] +===== Rethrottling + +The value of `requests_per_second` can be changed on a running reindex using +the `_rethrottle` API: [source,console] -------------------------------------------------- -POST _reindex -{ - "source": { - "index": "twitter" - }, - "dest": { - "index": "new_twitter", - "op_type": "create" - } -} +POST _reindex/r1A2WoRbTwKZ516z6NEs5A:36619/_rethrottle?requests_per_second=-1 -------------------------------------------------- -// TEST[setup:twitter] -By default, version conflicts abort the `_reindex` process. The `"conflicts"` request body -parameter can be used to instruct `_reindex` to proceed with the next document on version conflicts. -It is important to note that the handling of other error types is unaffected by the `"conflicts"` parameter. -When `"conflicts": "proceed"` is set in the request body, the `_reindex` process will continue on version conflicts -and return a count of version conflicts encountered: +The task ID can be found using the <>. + +Just like when setting it on the Reindex API, `requests_per_second` +can be either `-1` to disable throttling or any decimal number +like `1.7` or `12` to throttle to that level. Rethrottling that speeds up the +query takes effect immediately, but rethrottling that slows down the query will +take effect after completing the current batch. This prevents scroll +timeouts. + +[[docs-reindex-slice]] +===== Slicing + +Reindex supports <> to parallelize the reindexing process. +This parallelization can improve efficiency and provide a convenient way to +break the request down into smaller parts. + +NOTE: Reindexing from remote clusters does not support +<> or +<>. + +[[docs-reindex-manual-slice]] +====== Manual slicing +Slice a reindex request manually by providing a slice id and total number of +slices to each request: [source,console] --------------------------------------------------- +---------------------------------------------------------------- POST _reindex { - "conflicts": "proceed", "source": { - "index": "twitter" + "index": "twitter", + "slice": { + "id": 0, + "max": 2 + } }, "dest": { - "index": "new_twitter", - "op_type": "create" + "index": "new_twitter" } } --------------------------------------------------- -// TEST[setup:twitter] - -You can limit the documents by adding a query to the `source`. -This will only copy tweets made by `kimchy` into `new_twitter`: - -[source,console] --------------------------------------------------- POST _reindex { "source": { "index": "twitter", - "query": { - "term": { - "user": "kimchy" - } + "slice": { + "id": 1, + "max": 2 } }, "dest": { "index": "new_twitter" } } --------------------------------------------------- -// TEST[setup:twitter] +---------------------------------------------------------------- +// TEST[setup:big_twitter] -`index` in `source` can be a list, allowing you to copy from lots -of sources in one request. This will copy documents from the -`twitter` and `blog` indices: +You can verify this works by: [source,console] --------------------------------------------------- -POST _reindex +---------------------------------------------------------------- +GET _refresh +POST new_twitter/_search?size=0&filter_path=hits.total +---------------------------------------------------------------- +// TEST[continued] + +which results in a sensible `total` like this one: + +[source,console-result] +---------------------------------------------------------------- { - "source": { - "index": ["twitter", "blog"] - }, - "dest": { - "index": "all_together" + "hits": { + "total" : { + "value": 120, + "relation": "eq" + } } } --------------------------------------------------- -// TEST[setup:twitter] -// TEST[s/^/PUT blog\/post\/post1?refresh\n{"test": "foo"}\n/] +---------------------------------------------------------------- -NOTE: The Reindex API makes no effort to handle ID collisions so the last -document written will "win" but the order isn't usually predictable so it is -not a good idea to rely on this behavior. Instead, make sure that IDs are unique -using a script. +[[docs-reindex-automatic-slice]] +====== Automatic slicing -It's also possible to limit the number of processed documents by setting -`max_docs`. This will only copy a single document from `twitter` to -`new_twitter`: +You can also let `_reindex` automatically parallelize using <> to +slice on `_uid`. Use `slices` to specify the number of slices to use: [source,console] --------------------------------------------------- -POST _reindex +---------------------------------------------------------------- +POST _reindex?slices=5&refresh { - "max_docs": 1, "source": { "index": "twitter" }, @@ -198,104 +258,80 @@ POST _reindex "index": "new_twitter" } } --------------------------------------------------- -// TEST[setup:twitter] +---------------------------------------------------------------- +// TEST[setup:big_twitter] -If you want a particular set of documents from the `twitter` index you'll -need to use `sort`. Sorting makes the scroll less efficient but in some contexts -it's worth it. If possible, prefer a more selective query to `max_docs` and `sort`. -This will copy 10000 documents from `twitter` into `new_twitter`: +You can also this verify works by: [source,console] --------------------------------------------------- -POST _reindex -{ - "max_docs": 10000, - "source": { - "index": "twitter", - "sort": { "date": "desc" } - }, - "dest": { - "index": "new_twitter" - } -} --------------------------------------------------- -// TEST[setup:twitter] +---------------------------------------------------------------- +POST new_twitter/_search?size=0&filter_path=hits.total +---------------------------------------------------------------- +// TEST[continued] -The `source` section supports all the elements that are supported in a -<>. For instance, only a subset of the -fields from the original documents can be reindexed using `source` filtering -as follows: +which results in a sensible `total` like this one: -[source,console] --------------------------------------------------- -POST _reindex +[source,console-result] +---------------------------------------------------------------- { - "source": { - "index": "twitter", - "_source": ["user", "_doc"] - }, - "dest": { - "index": "new_twitter" + "hits": { + "total" : { + "value": 120, + "relation": "eq" + } } } --------------------------------------------------- -// TEST[setup:twitter] - -[[reindex-scripts]] -Like `_update_by_query`, `_reindex` supports a script that modifies the -document. Unlike `_update_by_query`, the script is allowed to modify the -document's metadata. This example bumps the version of the source document: +---------------------------------------------------------------- -[source,console] --------------------------------------------------- -POST _reindex -{ - "source": { - "index": "twitter" - }, - "dest": { - "index": "new_twitter", - "version_type": "external" - }, - "script": { - "source": "if (ctx._source.foo == 'bar') {ctx._version++; ctx._source.remove('foo')}", - "lang": "painless" - } -} --------------------------------------------------- -// TEST[setup:twitter] +Setting `slices` to `auto` will let Elasticsearch choose the number of slices +to use. This setting will use one slice per shard, up to a certain limit. If +there are multiple source indices, it will choose the number of slices based +on the index with the smallest number of shards. -Just as in `_update_by_query`, you can set `ctx.op` to change the -operation that is executed on the destination index: +Adding `slices` to `_reindex` just automates the manual process used in the +section above, creating sub-requests which means it has some quirks: -`noop`:: - -Set `ctx.op = "noop"` if your script decides that the document doesn't have -to be indexed in the destination index. This no operation will be reported -in the `noop` counter in the <>. +* You can see these requests in the <>. These +sub-requests are "child" tasks of the task for the request with `slices`. +* Fetching the status of the task for the request with `slices` only contains +the status of completed slices. +* These sub-requests are individually addressable for things like cancelation +and rethrottling. +* Rethrottling the request with `slices` will rethrottle the unfinished +sub-request proportionally. +* Canceling the request with `slices` will cancel each sub-request. +* Due to the nature of `slices` each sub-request won't get a perfectly even +portion of the documents. All documents will be addressed, but some slices may +be larger than others. Expect larger slices to have a more even distribution. +* Parameters like `requests_per_second` and `max_docs` on a request with +`slices` are distributed proportionally to each sub-request. Combine that with +the point above about distribution being uneven and you should conclude that +using `max_docs` with `slices` might not result in exactly `max_docs` documents +being reindexed. +* Each sub-request gets a slightly different snapshot of the source index, +though these are all taken at approximately the same time. -`delete`:: +[[docs-reindex-picking-slices]] +====== Picking the number of slices -Set `ctx.op = "delete"` if your script decides that the document must be - deleted from the destination index. The deletion will be reported in the - `deleted` counter in the <>. +If slicing automatically, setting `slices` to `auto` will choose a reasonable +number for most indices. If slicing manually or otherwise tuning +automatic slicing, use these guidelines. -Setting `ctx.op` to anything else will return an error, as will setting any -other field in `ctx`. +Query performance is most efficient when the number of `slices` is equal to the +number of shards in the index. If that number is large (e.g. 500), +choose a lower number as too many `slices` will hurt performance. Setting +`slices` higher than the number of shards generally does not improve efficiency +and adds overhead. -Think of the possibilities! Just be careful; you are able to -change: +Indexing performance scales linearly across available resources with the +number of slices. - * `_id` - * `_index` - * `_version` - * `_routing` +Whether query or indexing performance dominates the runtime depends on the +documents being reindexed and cluster resources. -Setting `_version` to `null` or clearing it from the `ctx` map is just like not -sending the version in an indexing request; it will cause the document to be -overwritten in the target index regardless of the version on the target or the -version type you use in the `_reindex` request. +[[docs-reindex-routing]] +===== Reindex routing By default if `_reindex` sees a document with routing then the routing is preserved unless it's changed by the script. You can set `routing` on the @@ -339,6 +375,8 @@ POST _reindex -------------------------------------------------- // TEST[s/^/PUT source\n/] + + By default `_reindex` uses scroll batches of 1000. You can change the batch size with the `size` field in the `source` element: @@ -376,289 +414,278 @@ POST _reindex -------------------------------------------------- // TEST[s/^/PUT source\n/] -[float] -[[reindex-from-remote]] -==== Reindex from Remote +[[docs-reindex-api-query-params]] +==== {api-query-parms-title} -Reindex supports reindexing from a remote Elasticsearch cluster: +include::{docdir}/rest-api/common-parms.asciidoc[tag=refresh] + +include::{docdir}/rest-api/common-parms.asciidoc[tag=timeout] + +include::{docdir}/rest-api/common-parms.asciidoc[tag=wait_for_active_shards] + +include::{docdir}/rest-api/common-parms.asciidoc[tag=wait_for_completion] + +include::{docdir}/rest-api/common-parms.asciidoc[tag=requests_per_second] + +include::{docdir}/rest-api/common-parms.asciidoc[tag=scroll] + +include::{docdir}/rest-api/common-parms.asciidoc[tag=slices] + +include::{docdir}/rest-api/common-parms.asciidoc[tag=max_docs] + +[[docs-reindex-api-request-body]] +==== {api-request-body-title} + +`conflicts`:: +(Optional, enum) Set to `proceed` to continue reindexing even if there are conflicts. +Defaults to `abort`. + +`source`:: +`index`::: +(Required, string) The name of the index you are copying _from_. +Also accepts a comma-separated list of indices to reindex from multiple sources. + +`max_docs`::: +(Optional, integer) The maximum number of documents to reindex. + +`query`::: +(Optional, <>) Specifies the documents to reindex using the Query DSL. + +`remote`::: +`host`:::: +(Optional, string) The URL for the remote instance of {es} that you want to index _from_. +Required when indexing from remote. +`username`:::: +(Optional, string) The username to use for authentication with the remote host. +`password`:::: +(Optional, string) The password to use for authentication with the remote host. +`socket_timeout`:::: +(Optional, <>) The remote socket read timeout. Defaults to 30 seconds. +`connect_timeout`:::: +(Optional, <>) The remote connection timeout. Defaults to 30 seconds. + +`size`::: +{Optional, integer) The number of documents to index per batch. +Use when indexing from remote to ensure that the batches fit within the on-heap buffer, +which defaults to a maximum size of 100 MB. + +`slice`::: +`id`:::: +(Optional, integer) Slice ID for <>. +`max`:::: +(Optional, integer) Total number of slices. + +`sort`::: +(Optional, list) A comma-separated list of `:` pairs to sort by before indexing. +Use in conjunction with `max_docs` to control what documents are reindexed. + +`_source`::: +(Optional, string) If `true` reindexes all source fields. +Set to a list to reindex select fields. +Defaults to `true`. + +`dest`:: +`index`::: +(Required, string) The name of the index you are copying _to_. + +`version_type`::: +(Optional, enum) The versioning to use for the indexing operation. +Valid values: `internal`, `external`, `external_gt`, `external_gte`. +See <> for more information. + +`op_type`::: +(Optional, enum) Set to create to only index documents that do not already exist (put if absent). +Valid values: `index`, `create`. Defaults to `index`. + +`script`:: +`source`::: +(Optional, string) The script to run to update the document source or metadata when reindexing. +`lang`::: +(Optional, enum) The script language: `painless`, `expression`, `mustache`, `java`. +For more information, see <>. + + +[[docs-reindex-api-response-body]] +==== {api-response-body-title} + +`took`:: + +(integer) The total milliseconds the entire operation took. + +`timed_out`:: + +{boolean) This flag is set to `true` if any of the requests executed during the +reindex timed out. + +`total`:: + +(integer) The number of documents that were successfully processed. + +`updated`:: + +(integer) The number of documents that were successfully updated. + +`created`:: + +(integer) The number of documents that were successfully created. + +`deleted`:: + +(integer) The number of documents that were successfully deleted. + +`batches`:: + +(integer) The number of scroll responses pulled back by the reindex. + +`noops`:: + +(integer) The number of documents that were ignored because the script used for +the reindex returned a `noop` value for `ctx.op`. + +`version_conflicts`:: + +{integer)The number of version conflicts that reindex hit. + +`retries`:: + +(integer) The number of retries attempted by reindex. `bulk` is the number of bulk +actions retried and `search` is the number of search actions retried. + +`throttled_millis`:: + +(integer) Number of milliseconds the request slept to conform to `requests_per_second`. + +`requests_per_second`:: + +(integer) The number of requests per second effectively executed during the reindex. + +`throttled_until_millis`:: + +(integer) This field should always be equal to zero in a `_reindex` response. It only +has meaning when using the <>, where it +indicates the next time (in milliseconds since epoch) a throttled request will be +executed again in order to conform to `requests_per_second`. + +`failures`:: + +(array) Array of failures if there were any unrecoverable errors during the process. If +this is non-empty then the request aborted because of those failures. Reindex +is implemented using batches and any failure causes the entire process to abort +but all failures in the current batch are collected into the array. You can use +the `conflicts` option to prevent reindex from aborting on version conflicts. + +[[docs-reindex-api-example]] +==== {api-examples-title} + +[[docs-reindex-select-query]] +===== Reindex select documents with a query + +You can limit the documents by adding a query to the `source`. +For example, the following request only copies tweets made by `kimchy` into `new_twitter`: [source,console] -------------------------------------------------- POST _reindex { "source": { - "remote": { - "host": "http://otherhost:9200", - "username": "user", - "password": "pass" - }, - "index": "source", + "index": "twitter", "query": { - "match": { - "test": "data" + "term": { + "user": "kimchy" } } }, "dest": { - "index": "dest" + "index": "new_twitter" } } -------------------------------------------------- -// TEST[setup:host] -// TEST[s/^/PUT source\n/] -// TEST[s/otherhost:9200",/\${host}"/] -// TEST[s/"username": "user",//] -// TEST[s/"password": "pass"//] - -The `host` parameter must contain a scheme, host, port (e.g. -`https://otherhost:9200`), and optional path (e.g. `https://otherhost:9200/proxy`). -The `username` and `password` parameters are optional, and when they are present `_reindex` -will connect to the remote Elasticsearch node using basic auth. Be sure to use `https` when -using basic auth or the password will be sent in plain text. -There are a range of <> available to configure the behaviour of the - `https` connection. +// TEST[setup:twitter] -Remote hosts have to be explicitly whitelisted in elasticsearch.yml using the -`reindex.remote.whitelist` property. It can be set to a comma delimited list -of allowed remote `host` and `port` combinations (e.g. -`otherhost:9200, another:9200, 127.0.10.*:9200, localhost:*`). Scheme is -ignored by the whitelist -- only host and port are used, for example: +[[docs-reindex-select-sort]] +===== Reindex select documents with sort +You can limit the number of processed documents by setting `max_docs`. +For example, this request copies a single document from `twitter` to +`new_twitter`: -[source,yaml] +[source,console] -------------------------------------------------- -reindex.remote.whitelist: "otherhost:9200, another:9200, 127.0.10.*:9200, localhost:*" +POST _reindex +{ + "max_docs": 1, + "source": { + "index": "twitter" + }, + "dest": { + "index": "new_twitter" + } +} -------------------------------------------------- +// TEST[setup:twitter] -The whitelist must be configured on any nodes that will coordinate the reindex. - -This feature should work with remote clusters of any version of Elasticsearch -you are likely to find. This should allow you to upgrade from any version of -Elasticsearch to the current version by reindexing from a cluster of the old -version. - -To enable queries sent to older versions of Elasticsearch the `query` parameter -is sent directly to the remote host without validation or modification. - -NOTE: Reindexing from remote clusters does not support -<> or -<>. +You can use `sort` in conjunction with `max_docs` to select the documents you want to reindex. +Sorting makes the scroll less efficient but in some contexts it's worth it. +If possible, it's better to use a more selective query instead of `max_docs` and `sort`. -Reindexing from a remote server uses an on-heap buffer that defaults to a -maximum size of 100mb. If the remote index includes very large documents you'll -need to use a smaller batch size. The example below sets the batch size to `10` -which is very, very small. +For example, following request copies 10000 documents from `twitter` into `new_twitter`: [source,console] -------------------------------------------------- POST _reindex { + "max_docs": 10000, "source": { - "remote": { - "host": "http://otherhost:9200" - }, - "index": "source", - "size": 10, - "query": { - "match": { - "test": "data" - } - } + "index": "twitter", + "sort": { "date": "desc" } }, "dest": { - "index": "dest" + "index": "new_twitter" } } -------------------------------------------------- -// TEST[setup:host] -// TEST[s/^/PUT source\n/] -// TEST[s/otherhost:9200/\${host}/] +// TEST[setup:twitter] -It is also possible to set the socket read timeout on the remote connection -with the `socket_timeout` field and the connection timeout with the -`connect_timeout` field. Both default to 30 seconds. This example -sets the socket read timeout to one minute and the connection timeout to 10 -seconds: +[[docs-reindex-multiple-indices]] +===== Reindex from multiple indices + +The `index` attribute in `source` can be a list, allowing you to copy from lots +of sources in one request. This will copy documents from the +`twitter` and `blog` indices: [source,console] -------------------------------------------------- POST _reindex { "source": { - "remote": { - "host": "http://otherhost:9200", - "socket_timeout": "1m", - "connect_timeout": "10s" - }, - "index": "source", - "query": { - "match": { - "test": "data" - } - } + "index": ["twitter", "blog"] }, "dest": { - "index": "dest" + "index": "all_together" } } -------------------------------------------------- -// TEST[setup:host] -// TEST[s/^/PUT source\n/] -// TEST[s/otherhost:9200/\${host}/] +// TEST[setup:twitter] +// TEST[s/^/PUT blog\/post\/post1?refresh\n{"test": "foo"}\n/] -[float] -[[reindex-ssl]] -===== Configuring SSL parameters +NOTE: The Reindex API makes no effort to handle ID collisions so the last +document written will "win" but the order isn't usually predictable so it is +not a good idea to rely on this behavior. Instead, make sure that IDs are unique +using a script. -Reindex from remote supports configurable SSL settings. These must be -specified in the `elasticsearch.yml` file, with the exception of the -secure settings, which you add in the Elasticsearch keystore. -It is not possible to configure SSL in the body of the `_reindex` request. +[[docs-reindex-filter-source]] +===== Reindex select fields with a source filter -The following settings are supported: +You can use source filtering to reindex a subset of the fields in the original documents. +For example, the following request only reindexes the `user` and `_doc` fields of each document: -`reindex.ssl.certificate_authorities`:: -List of paths to PEM encoded certificate files that should be trusted. -You cannot specify both `reindex.ssl.certificate_authorities` and -`reindex.ssl.truststore.path`. - -`reindex.ssl.truststore.path`:: -The path to the Java Keystore file that contains the certificates to trust. -This keystore can be in "JKS" or "PKCS#12" format. -You cannot specify both `reindex.ssl.certificate_authorities` and -`reindex.ssl.truststore.path`. - -`reindex.ssl.truststore.password`:: -The password to the truststore (`reindex.ssl.truststore.path`). -This setting cannot be used with `reindex.ssl.truststore.secure_password`. - -`reindex.ssl.truststore.secure_password` (<>):: -The password to the truststore (`reindex.ssl.truststore.path`). -This setting cannot be used with `reindex.ssl.truststore.password`. - -`reindex.ssl.truststore.type`:: -The type of the truststore (`reindex.ssl.truststore.path`). -Must be either `jks` or `PKCS12`. If the truststore path ends in ".p12", ".pfx" -or "pkcs12", this setting defaults to `PKCS12`. Otherwise, it defaults to `jks`. - -`reindex.ssl.verification_mode`:: -Indicates the type of verification to protect against man in the middle attacks -and certificate forgery. -One of `full` (verify the hostname and the certificate path), `certificate` -(verify the certificate path, but not the hostname) or `none` (perform no -verification - this is strongly discouraged in production environments). -Defaults to `full`. - -`reindex.ssl.certificate`:: -Specifies the path to the PEM encoded certificate (or certificate chain) to be -used for HTTP client authentication (if required by the remote cluster) -This setting requires that `reindex.ssl.key` also be set. -You cannot specify both `reindex.ssl.certificate` and `reindex.ssl.keystore.path`. - -`reindex.ssl.key`:: -Specifies the path to the PEM encoded private key associated with the -certificate used for client authentication (`reindex.ssl.certificate`). -You cannot specify both `reindex.ssl.key` and `reindex.ssl.keystore.path`. - -`reindex.ssl.key_passphrase`:: -Specifies the passphrase to decrypt the PEM encoded private key -(`reindex.ssl.key`) if it is encrypted. -Cannot be used with `reindex.ssl.secure_key_passphrase`. - -`reindex.ssl.secure_key_passphrase` (<>):: -Specifies the passphrase to decrypt the PEM encoded private key -(`reindex.ssl.key`) if it is encrypted. -Cannot be used with `reindex.ssl.key_passphrase`. - -`reindex.ssl.keystore.path`:: -Specifies the path to the keystore that contains a private key and certificate -to be used for HTTP client authentication (if required by the remote cluster). -This keystore can be in "JKS" or "PKCS#12" format. -You cannot specify both `reindex.ssl.key` and `reindex.ssl.keystore.path`. - -`reindex.ssl.keystore.type`:: -The type of the keystore (`reindex.ssl.keystore.path`). Must be either `jks` or `PKCS12`. -If the keystore path ends in ".p12", ".pfx" or "pkcs12", this setting defaults -to `PKCS12`. Otherwise, it defaults to `jks`. - -`reindex.ssl.keystore.password`:: -The password to the keystore (`reindex.ssl.keystore.path`). This setting cannot be used -with `reindex.ssl.keystore.secure_password`. - -`reindex.ssl.keystore.secure_password` (<>):: -The password to the keystore (`reindex.ssl.keystore.path`). -This setting cannot be used with `reindex.ssl.keystore.password`. - -`reindex.ssl.keystore.key_password`:: -The password for the key in the keystore (`reindex.ssl.keystore.path`). -Defaults to the keystore password. This setting cannot be used with -`reindex.ssl.keystore.secure_key_password`. - -`reindex.ssl.keystore.secure_key_password` (<>):: -The password for the key in the keystore (`reindex.ssl.keystore.path`). -Defaults to the keystore password. This setting cannot be used with -`reindex.ssl.keystore.key_password`. - -[float] -==== URL Parameters - -In addition to the standard parameters like `pretty`, the Reindex API also -supports `refresh`, `wait_for_completion`, `wait_for_active_shards`, `timeout`, -`scroll`, and `requests_per_second`. - -Sending the `refresh` url parameter will cause all indexes to which the request -wrote to be refreshed. This is different than the Index API's `refresh` -parameter which causes just the shard that received the new data to be -refreshed. Also unlike the Index API it does not support `wait_for`. - -If the request contains `wait_for_completion=false` then Elasticsearch will -perform some preflight checks, launch the request, and then return a `task` -which can be used with <> -to cancel or get the status of the task. Elasticsearch will also create a -record of this task as a document at `.tasks/task/${taskId}`. This is yours -to keep or remove as you see fit. When you are done with it, delete it so -Elasticsearch can reclaim the space it uses. - -`wait_for_active_shards` controls how many copies of a shard must be active -before proceeding with the reindexing. See <> -for details. `timeout` controls how long each write request waits for unavailable -shards to become available. Both work exactly how they work in the -<>. As `_reindex` uses scroll search, you can also specify -the `scroll` parameter to control how long it keeps the "search context" alive, -(e.g. `?scroll=10m`). The default value is 5 minutes. - -`requests_per_second` can be set to any positive decimal number (`1.4`, `6`, -`1000`, etc.) and throttles the rate at which `_reindex` issues batches of index -operations by padding each batch with a wait time. The throttling can be -disabled by setting `requests_per_second` to `-1`. - -The throttling is done by waiting between batches so that the `scroll` which `_reindex` -uses internally can be given a timeout that takes into account the padding. -The padding time is the difference between the batch size divided by the -`requests_per_second` and the time spent writing. By default the batch size is -`1000`, so if the `requests_per_second` is set to `500`: - -[source,txt] --------------------------------------------------- -target_time = 1000 / 500 per second = 2 seconds -wait_time = target_time - write_time = 2 seconds - .5 seconds = 1.5 seconds --------------------------------------------------- - -Since the batch is issued as a single `_bulk` request, large batch sizes will -cause Elasticsearch to create many requests and then wait for a while before -starting the next set. This is "bursty" instead of "smooth". The default value is `-1`. - -[float] -[[docs-reindex-response-body]] -==== Response body - -////////////////////////// [source,console] -------------------------------------------------- -POST /_reindex?wait_for_completion +POST _reindex { "source": { - "index": "twitter" + "index": "twitter", + "_source": ["user", "_doc"] }, "dest": { "index": "new_twitter" @@ -667,227 +694,8 @@ POST /_reindex?wait_for_completion -------------------------------------------------- // TEST[setup:twitter] -////////////////////////// - -The JSON response looks like this: - -[source,console-result] --------------------------------------------------- -{ - "took": 639, - "timed_out": false, - "total": 5, - "updated": 0, - "created": 5, - "deleted": 0, - "batches": 1, - "noops": 0, - "version_conflicts": 2, - "retries": { - "bulk": 0, - "search": 0 - }, - "throttled_millis": 0, - "requests_per_second": 1, - "throttled_until_millis": 0, - "failures": [ ] -} --------------------------------------------------- -// TESTRESPONSE[s/: [0-9]+/: $body.$_path/] - -`took`:: - -The total milliseconds the entire operation took. - -`timed_out`:: - -This flag is set to `true` if any of the requests executed during the -reindex timed out. - -`total`:: - -The number of documents that were successfully processed. - -`updated`:: - -The number of documents that were successfully updated. - -`created`:: - -The number of documents that were successfully created. - -`deleted`:: - -The number of documents that were successfully deleted. - -`batches`:: - -The number of scroll responses pulled back by the reindex. - -`noops`:: - -The number of documents that were ignored because the script used for -the reindex returned a `noop` value for `ctx.op`. - -`version_conflicts`:: - -The number of version conflicts that reindex hit. - -`retries`:: - -The number of retries attempted by reindex. `bulk` is the number of bulk -actions retried and `search` is the number of search actions retried. - -`throttled_millis`:: - -Number of milliseconds the request slept to conform to `requests_per_second`. - -`requests_per_second`:: - -The number of requests per second effectively executed during the reindex. - -`throttled_until_millis`:: - -This field should always be equal to zero in a `_reindex` response. It only -has meaning when using the <>, where it -indicates the next time (in milliseconds since epoch) a throttled request will be -executed again in order to conform to `requests_per_second`. - -`failures`:: - -Array of failures if there were any unrecoverable errors during the process. If -this is non-empty then the request aborted because of those failures. Reindex -is implemented using batches and any failure causes the entire process to abort -but all failures in the current batch are collected into the array. You can use -the `conflicts` option to prevent reindex from aborting on version conflicts. - -[float] -[[docs-reindex-task-api]] -==== Works with the Task API - -You can fetch the status of all running reindex requests with the -<>: - -[source,console] --------------------------------------------------- -GET _tasks?detailed=true&actions=*reindex --------------------------------------------------- -// TEST[skip:No tasks to retrieve] - -The response looks like: - -[source,console-result] --------------------------------------------------- -{ - "nodes" : { - "r1A2WoRbTwKZ516z6NEs5A" : { - "name" : "r1A2WoR", - "transport_address" : "127.0.0.1:9300", - "host" : "127.0.0.1", - "ip" : "127.0.0.1:9300", - "attributes" : { - "testattr" : "test", - "portsfile" : "true" - }, - "tasks" : { - "r1A2WoRbTwKZ516z6NEs5A:36619" : { - "node" : "r1A2WoRbTwKZ516z6NEs5A", - "id" : 36619, - "type" : "transport", - "action" : "indices:data/write/reindex", - "status" : { <1> - "total" : 6154, - "updated" : 3500, - "created" : 0, - "deleted" : 0, - "batches" : 4, - "version_conflicts" : 0, - "noops" : 0, - "retries": { - "bulk": 0, - "search": 0 - }, - "throttled_millis": 0, - "requests_per_second": -1, - "throttled_until_millis": 0 - }, - "description" : "", - "start_time_in_millis": 1535149899665, - "running_time_in_nanos": 5926916792, - "cancellable": true, - "headers": {} - } - } - } - } -} --------------------------------------------------- - -<1> This object contains the actual status. It is identical to the response JSON -except for the important addition of the `total` field. `total` is the total number -of operations that the `_reindex` expects to perform. You can estimate the -progress by adding the `updated`, `created`, and `deleted` fields. The request -will finish when their sum is equal to the `total` field. - -With the task id you can look up the task directly. The following example -retrieves information about the task `r1A2WoRbTwKZ516z6NEs5A:36619`: - -[source,console] --------------------------------------------------- -GET /_tasks/r1A2WoRbTwKZ516z6NEs5A:36619 --------------------------------------------------- -// TEST[catch:missing] - -The advantage of this API is that it integrates with `wait_for_completion=false` -to transparently return the status of completed tasks. If the task is completed -and `wait_for_completion=false` was set, it will return a -`results` or an `error` field. The cost of this feature is the document that -`wait_for_completion=false` creates at `.tasks/task/${taskId}`. It is up to -you to delete that document. - - -[float] -[[docs-reindex-cancel-task-api]] -==== Works with the Cancel Task API - -Any reindex can be canceled using the <>. For -example: - -[source,console] --------------------------------------------------- -POST _tasks/r1A2WoRbTwKZ516z6NEs5A:36619/_cancel --------------------------------------------------- - -The task ID can be found using the <>. - -Cancelation should happen quickly but might take a few seconds. The Tasks -API will continue to list the task until it wakes to cancel itself. - - -[float] -[[docs-reindex-rethrottle]] -==== Rethrottling - -The value of `requests_per_second` can be changed on a running reindex using -the `_rethrottle` API: - -[source,console] --------------------------------------------------- -POST _reindex/r1A2WoRbTwKZ516z6NEs5A:36619/_rethrottle?requests_per_second=-1 --------------------------------------------------- - -The task ID can be found using the <>. - -Just like when setting it on the Reindex API, `requests_per_second` -can be either `-1` to disable throttling or any decimal number -like `1.7` or `12` to throttle to that level. Rethrottling that speeds up the -query takes effect immediately, but rethrottling that slows down the query will -take effect after completing the current batch. This prevents scroll -timeouts. - -[float] [[docs-reindex-change-name]] -==== Reindex to change the name of a field +===== Reindex to change the name of a field `_reindex` can be used to build a copy of an index with renamed fields. Say you create an index containing documents that look like this: @@ -948,276 +756,364 @@ which will return: -------------------------------------------------- // TESTRESPONSE[s/"_seq_no": \d+/"_seq_no" : $body._seq_no/ s/"_primary_term": 1/"_primary_term" : $body._primary_term/] -[float] -[[docs-reindex-slice]] -==== Slicing +[[docs-reindex-daily-indices]] +===== Reindex daily indices -Reindex supports <> to parallelize the reindexing process. -This parallelization can improve efficiency and provide a convenient way to -break the request down into smaller parts. +You can use `_reindex` in combination with <> to reindex +daily indices to apply a new template to the existing documents. -NOTE: Reindexing from remote clusters does not support -<> or -<>. +Assuming you have indices that contain documents like: -[float] -[[docs-reindex-manual-slice]] -===== Manual slicing -Slice a reindex request manually by providing a slice id and total number of -slices to each request: +[source,console] +---------------------------------------------------------------- +PUT metricbeat-2016.05.30/_doc/1?refresh +{"system.cpu.idle.pct": 0.908} +PUT metricbeat-2016.05.31/_doc/1?refresh +{"system.cpu.idle.pct": 0.105} +---------------------------------------------------------------- + +The new template for the `metricbeat-*` indices is already loaded into Elasticsearch, +but it applies only to the newly created indices. Painless can be used to reindex +the existing documents and apply the new template. + +The script below extracts the date from the index name and creates a new index +with `-1` appended. All data from `metricbeat-2016.05.31` will be reindexed +into `metricbeat-2016.05.31-1`. [source,console] ---------------------------------------------------------------- POST _reindex { "source": { - "index": "twitter", - "slice": { - "id": 0, - "max": 2 - } + "index": "metricbeat-*" }, "dest": { - "index": "new_twitter" - } -} -POST _reindex -{ - "source": { - "index": "twitter", - "slice": { - "id": 1, - "max": 2 - } + "index": "metricbeat" }, - "dest": { - "index": "new_twitter" + "script": { + "lang": "painless", + "source": "ctx._index = 'metricbeat-' + (ctx._index.substring('metricbeat-'.length(), ctx._index.length())) + '-1'" } } ---------------------------------------------------------------- -// TEST[setup:big_twitter] +// TEST[continued] -You can verify this works by: +All documents from the previous metricbeat indices can now be found in the `*-1` indices. [source,console] ---------------------------------------------------------------- -GET _refresh -POST new_twitter/_search?size=0&filter_path=hits.total +GET metricbeat-2016.05.30-1/_doc/1 +GET metricbeat-2016.05.31-1/_doc/1 ---------------------------------------------------------------- // TEST[continued] -which results in a sensible `total` like this one: - -[source,console-result] ----------------------------------------------------------------- -{ - "hits": { - "total" : { - "value": 120, - "relation": "eq" - } - } -} ----------------------------------------------------------------- +The previous method can also be used in conjunction with <> +to load only the existing data into the new index and rename any fields if needed. -[float] -[[docs-reindex-automatic-slice]] -===== Automatic slicing +[[docs-reindex-api-subset]] +===== Extract a random subset of an index -You can also let `_reindex` automatically parallelize using <> to -slice on `_uid`. Use `slices` to specify the number of slices to use: +`_reindex` can be used to extract a random subset of an index for testing: [source,console] ---------------------------------------------------------------- -POST _reindex?slices=5&refresh +POST _reindex { + "max_docs": 10, "source": { - "index": "twitter" + "index": "twitter", + "query": { + "function_score" : { + "query" : { "match_all": {} }, + "random_score" : {} + } + }, + "sort": "_score" <1> }, "dest": { - "index": "new_twitter" + "index": "random_twitter" } } ---------------------------------------------------------------- // TEST[setup:big_twitter] -You can also this verify works by: +<1> `_reindex` defaults to sorting by `_doc` so `random_score` will not have any +effect unless you override the sort to `_score`. -[source,console] ----------------------------------------------------------------- -POST new_twitter/_search?size=0&filter_path=hits.total ----------------------------------------------------------------- -// TEST[continued] +[[reindex-scripts]] +===== Modify documents during reindexing -which results in a sensible `total` like this one: +Like `_update_by_query`, `_reindex` supports a script that modifies the +document. Unlike `_update_by_query`, the script is allowed to modify the +document's metadata. This example bumps the version of the source document: -[source,console-result] ----------------------------------------------------------------- +[source,console] +-------------------------------------------------- +POST _reindex { - "hits": { - "total" : { - "value": 120, - "relation": "eq" - } + "source": { + "index": "twitter" + }, + "dest": { + "index": "new_twitter", + "version_type": "external" + }, + "script": { + "source": "if (ctx._source.foo == 'bar') {ctx._version++; ctx._source.remove('foo')}", + "lang": "painless" } } ----------------------------------------------------------------- +-------------------------------------------------- +// TEST[setup:twitter] -Setting `slices` to `auto` will let Elasticsearch choose the number of slices -to use. This setting will use one slice per shard, up to a certain limit. If -there are multiple source indices, it will choose the number of slices based -on the index with the smallest number of shards. +Just as in `_update_by_query`, you can set `ctx.op` to change the +operation that is executed on the destination index: -Adding `slices` to `_reindex` just automates the manual process used in the -section above, creating sub-requests which means it has some quirks: +`noop`:: -* You can see these requests in the <>. These -sub-requests are "child" tasks of the task for the request with `slices`. -* Fetching the status of the task for the request with `slices` only contains -the status of completed slices. -* These sub-requests are individually addressable for things like cancelation -and rethrottling. -* Rethrottling the request with `slices` will rethrottle the unfinished -sub-request proportionally. -* Canceling the request with `slices` will cancel each sub-request. -* Due to the nature of `slices` each sub-request won't get a perfectly even -portion of the documents. All documents will be addressed, but some slices may -be larger than others. Expect larger slices to have a more even distribution. -* Parameters like `requests_per_second` and `max_docs` on a request with -`slices` are distributed proportionally to each sub-request. Combine that with -the point above about distribution being uneven and you should conclude that -using `max_docs` with `slices` might not result in exactly `max_docs` documents -being reindexed. -* Each sub-request gets a slightly different snapshot of the source index, -though these are all taken at approximately the same time. +Set `ctx.op = "noop"` if your script decides that the document doesn't have +to be indexed in the destination index. This no operation will be reported +in the `noop` counter in the <>. -[float] -[[docs-reindex-picking-slices]] -====== Picking the number of slices +`delete`:: -If slicing automatically, setting `slices` to `auto` will choose a reasonable -number for most indices. If slicing manually or otherwise tuning -automatic slicing, use these guidelines. +Set `ctx.op = "delete"` if your script decides that the document must be + deleted from the destination index. The deletion will be reported in the + `deleted` counter in the <>. -Query performance is most efficient when the number of `slices` is equal to the -number of shards in the index. If that number is large (e.g. 500), -choose a lower number as too many `slices` will hurt performance. Setting -`slices` higher than the number of shards generally does not improve efficiency -and adds overhead. +Setting `ctx.op` to anything else will return an error, as will setting any +other field in `ctx`. -Indexing performance scales linearly across available resources with the -number of slices. +Think of the possibilities! Just be careful; you are able to +change: -Whether query or indexing performance dominates the runtime depends on the -documents being reindexed and cluster resources. + * `_id` + * `_index` + * `_version` + * `_routing` -[float] -==== Reindexing many indices -If you have many indices to reindex it is generally better to reindex them -one at a time rather than using a glob pattern to pick up many indices. That -way you can resume the process if there are any errors by removing the -partially completed index and starting over at that index. It also makes -parallelizing the process fairly simple: split the list of indices to reindex -and run each list in parallel. +Setting `_version` to `null` or clearing it from the `ctx` map is just like not +sending the version in an indexing request; it will cause the document to be +overwritten in the target index regardless of the version on the target or the +version type you use in the `_reindex` request. -One-off bash scripts seem to work nicely for this: +[[reindex-from-remote]] +==== Reindex from remote -[source,bash] ----------------------------------------------------------------- -for index in i1 i2 i3 i4 i5; do - curl -HContent-Type:application/json -XPOST localhost:9200/_reindex?pretty -d'{ - "source": { - "index": "'$index'" +Reindex supports reindexing from a remote Elasticsearch cluster: + +[source,console] +-------------------------------------------------- +POST _reindex +{ + "source": { + "remote": { + "host": "http://otherhost:9200", + "username": "user", + "password": "pass" }, - "dest": { - "index": "'$index'-reindexed" + "index": "source", + "query": { + "match": { + "test": "data" + } } - }' -done ----------------------------------------------------------------- -// NOTCONSOLE + }, + "dest": { + "index": "dest" + } +} +-------------------------------------------------- +// TEST[setup:host] +// TEST[s/^/PUT source\n/] +// TEST[s/otherhost:9200",/\${host}"/] +// TEST[s/"username": "user",//] +// TEST[s/"password": "pass"//] -[float] -==== Reindex daily indices +The `host` parameter must contain a scheme, host, port (e.g. +`https://otherhost:9200`), and optional path (e.g. `https://otherhost:9200/proxy`). +The `username` and `password` parameters are optional, and when they are present `_reindex` +will connect to the remote Elasticsearch node using basic auth. Be sure to use `https` when +using basic auth or the password will be sent in plain text. +There are a range of <> available to configure the behaviour of the + `https` connection. -Notwithstanding the above advice, you can use `_reindex` in combination with -<> to reindex daily indices to apply -a new template to the existing documents. +Remote hosts have to be explicitly whitelisted in elasticsearch.yml using the +`reindex.remote.whitelist` property. It can be set to a comma delimited list +of allowed remote `host` and `port` combinations (e.g. +`otherhost:9200, another:9200, 127.0.10.*:9200, localhost:*`). Scheme is +ignored by the whitelist -- only host and port are used, for example: -Assuming you have indices consisting of documents as follows: -[source,console] ----------------------------------------------------------------- -PUT metricbeat-2016.05.30/_doc/1?refresh -{"system.cpu.idle.pct": 0.908} -PUT metricbeat-2016.05.31/_doc/1?refresh -{"system.cpu.idle.pct": 0.105} ----------------------------------------------------------------- +[source,yaml] +-------------------------------------------------- +reindex.remote.whitelist: "otherhost:9200, another:9200, 127.0.10.*:9200, localhost:*" +-------------------------------------------------- -The new template for the `metricbeat-*` indices is already loaded into Elasticsearch, -but it applies only to the newly created indices. Painless can be used to reindex -the existing documents and apply the new template. +The whitelist must be configured on any nodes that will coordinate the reindex. -The script below extracts the date from the index name and creates a new index -with `-1` appended. All data from `metricbeat-2016.05.31` will be reindexed -into `metricbeat-2016.05.31-1`. +This feature should work with remote clusters of any version of Elasticsearch +you are likely to find. This should allow you to upgrade from any version of +Elasticsearch to the current version by reindexing from a cluster of the old +version. + +To enable queries sent to older versions of Elasticsearch the `query` parameter +is sent directly to the remote host without validation or modification. + +NOTE: Reindexing from remote clusters does not support +<> or +<>. + +Reindexing from a remote server uses an on-heap buffer that defaults to a +maximum size of 100mb. If the remote index includes very large documents you'll +need to use a smaller batch size. The example below sets the batch size to `10` +which is very, very small. [source,console] ----------------------------------------------------------------- +-------------------------------------------------- POST _reindex { "source": { - "index": "metricbeat-*" + "remote": { + "host": "http://otherhost:9200" + }, + "index": "source", + "size": 10, + "query": { + "match": { + "test": "data" + } + } }, "dest": { - "index": "metricbeat" - }, - "script": { - "lang": "painless", - "source": "ctx._index = 'metricbeat-' + (ctx._index.substring('metricbeat-'.length(), ctx._index.length())) + '-1'" + "index": "dest" } } ----------------------------------------------------------------- -// TEST[continued] - -All documents from the previous metricbeat indices can now be found in the `*-1` indices. - -[source,console] ----------------------------------------------------------------- -GET metricbeat-2016.05.30-1/_doc/1 -GET metricbeat-2016.05.31-1/_doc/1 ----------------------------------------------------------------- -// TEST[continued] - -The previous method can also be used in conjunction with <> -to load only the existing data into the new index and rename any fields if needed. - -[float] -==== Extracting a random subset of an index +-------------------------------------------------- +// TEST[setup:host] +// TEST[s/^/PUT source\n/] +// TEST[s/otherhost:9200/\${host}/] -`_reindex` can be used to extract a random subset of an index for testing: +It is also possible to set the socket read timeout on the remote connection +with the `socket_timeout` field and the connection timeout with the +`connect_timeout` field. Both default to 30 seconds. This example +sets the socket read timeout to one minute and the connection timeout to 10 +seconds: [source,console] ----------------------------------------------------------------- +-------------------------------------------------- POST _reindex { - "max_docs": 10, "source": { - "index": "twitter", + "remote": { + "host": "http://otherhost:9200", + "socket_timeout": "1m", + "connect_timeout": "10s" + }, + "index": "source", "query": { - "function_score" : { - "query" : { "match_all": {} }, - "random_score" : {} + "match": { + "test": "data" } - }, - "sort": "_score" <1> + } }, "dest": { - "index": "random_twitter" + "index": "dest" } } ----------------------------------------------------------------- -// TEST[setup:big_twitter] +-------------------------------------------------- +// TEST[setup:host] +// TEST[s/^/PUT source\n/] +// TEST[s/otherhost:9200/\${host}/] -<1> `_reindex` defaults to sorting by `_doc` so `random_score` will not have any -effect unless you override the sort to `_score`. +[[reindex-ssl]] +===== Configuring SSL parameters + +Reindex from remote supports configurable SSL settings. These must be +specified in the `elasticsearch.yml` file, with the exception of the +secure settings, which you add in the Elasticsearch keystore. +It is not possible to configure SSL in the body of the `_reindex` request. + +The following settings are supported: + +`reindex.ssl.certificate_authorities`:: +List of paths to PEM encoded certificate files that should be trusted. +You cannot specify both `reindex.ssl.certificate_authorities` and +`reindex.ssl.truststore.path`. + +`reindex.ssl.truststore.path`:: +The path to the Java Keystore file that contains the certificates to trust. +This keystore can be in "JKS" or "PKCS#12" format. +You cannot specify both `reindex.ssl.certificate_authorities` and +`reindex.ssl.truststore.path`. + +`reindex.ssl.truststore.password`:: +The password to the truststore (`reindex.ssl.truststore.path`). +This setting cannot be used with `reindex.ssl.truststore.secure_password`. + +`reindex.ssl.truststore.secure_password` (<>):: +The password to the truststore (`reindex.ssl.truststore.path`). +This setting cannot be used with `reindex.ssl.truststore.password`. + +`reindex.ssl.truststore.type`:: +The type of the truststore (`reindex.ssl.truststore.path`). +Must be either `jks` or `PKCS12`. If the truststore path ends in ".p12", ".pfx" +or "pkcs12", this setting defaults to `PKCS12`. Otherwise, it defaults to `jks`. + +`reindex.ssl.verification_mode`:: +Indicates the type of verification to protect against man in the middle attacks +and certificate forgery. +One of `full` (verify the hostname and the certificate path), `certificate` +(verify the certificate path, but not the hostname) or `none` (perform no +verification - this is strongly discouraged in production environments). +Defaults to `full`. + +`reindex.ssl.certificate`:: +Specifies the path to the PEM encoded certificate (or certificate chain) to be +used for HTTP client authentication (if required by the remote cluster) +This setting requires that `reindex.ssl.key` also be set. +You cannot specify both `reindex.ssl.certificate` and `reindex.ssl.keystore.path`. + +`reindex.ssl.key`:: +Specifies the path to the PEM encoded private key associated with the +certificate used for client authentication (`reindex.ssl.certificate`). +You cannot specify both `reindex.ssl.key` and `reindex.ssl.keystore.path`. + +`reindex.ssl.key_passphrase`:: +Specifies the passphrase to decrypt the PEM encoded private key +(`reindex.ssl.key`) if it is encrypted. +Cannot be used with `reindex.ssl.secure_key_passphrase`. + +`reindex.ssl.secure_key_passphrase` (<>):: +Specifies the passphrase to decrypt the PEM encoded private key +(`reindex.ssl.key`) if it is encrypted. +Cannot be used with `reindex.ssl.key_passphrase`. + +`reindex.ssl.keystore.path`:: +Specifies the path to the keystore that contains a private key and certificate +to be used for HTTP client authentication (if required by the remote cluster). +This keystore can be in "JKS" or "PKCS#12" format. +You cannot specify both `reindex.ssl.key` and `reindex.ssl.keystore.path`. + +`reindex.ssl.keystore.type`:: +The type of the keystore (`reindex.ssl.keystore.path`). Must be either `jks` or `PKCS12`. +If the keystore path ends in ".p12", ".pfx" or "pkcs12", this setting defaults +to `PKCS12`. Otherwise, it defaults to `jks`. + +`reindex.ssl.keystore.password`:: +The password to the keystore (`reindex.ssl.keystore.path`). This setting cannot be used +with `reindex.ssl.keystore.secure_password`. + +`reindex.ssl.keystore.secure_password` (<>):: +The password to the keystore (`reindex.ssl.keystore.path`). +This setting cannot be used with `reindex.ssl.keystore.password`. + +`reindex.ssl.keystore.key_password`:: +The password for the key in the keystore (`reindex.ssl.keystore.path`). +Defaults to the keystore password. This setting cannot be used with +`reindex.ssl.keystore.secure_key_password`. + +`reindex.ssl.keystore.secure_key_password` (<>):: +The password for the key in the keystore (`reindex.ssl.keystore.path`). +Defaults to the keystore password. This setting cannot be used with +`reindex.ssl.keystore.key_password`.