Basic reindex implementation #15125

nik9000 · 2015-11-30T20:43:51Z

Note: this has been edited from my first comment.

This creates basic reindex and update-by-query implementations that work like delete-by-query. They scroll documents and then turn around and issue bulk requests for each.
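As a rough illustration of the scroll-then-bulk pattern described above, here is a minimal in-memory sketch. It does not use the Elasticsearch client; `scroll_batches`, `bulk_index`, and `reindex` are hypothetical names, and plain dicts stand in for indices.

```python
from typing import Any, Dict, Iterator, List


def scroll_batches(index: Dict[str, Any], batch_size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield hits in fixed-size batches, standing in for a scroll search."""
    hits = [{"_id": doc_id, "_source": source} for doc_id, source in index.items()]
    for start in range(0, len(hits), batch_size):
        yield hits[start:start + batch_size]


def bulk_index(dest: Dict[str, Any], batch: List[Dict[str, Any]]) -> int:
    """Write one batch into dest, standing in for a single bulk request."""
    for hit in batch:
        dest[hit["_id"]] = hit["_source"]
    return len(batch)


def reindex(source: Dict[str, Any], dest: Dict[str, Any], batch_size: int = 2) -> int:
    """Scroll the source and issue one bulk write per scrolled batch."""
    return sum(bulk_index(dest, batch) for batch in scroll_batches(source, batch_size))
```

The sketch leaves out retries on bulk rejection and any task tracking.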

This isn't the last word on reindex. Before we merge the feature branch down to master we'll need to fix it to retry on bulk rejection and integrate it with the task management API.

For those of you reading the issues: all the discussion about APIs below is just that, discussion. The API we're going to merge is the result of those discussions. You can read about it in the asciidoc files in the PR.

nik9000 · 2015-11-30T20:45:40Z

@tlrx this is my start. I'm intentionally not copying things from delete-by-query because I'm trying to understand why it has the pieces it has.

clintongormley · 2015-12-01T10:21:47Z

docs/plugins/index-by-search.asciidoc

+The index-by-search plugin adds support for indexing all documents that match
+a query. Internally it uses {ref}/search-request-scroll.html[Scroll] and
+{ref}/docs-bulk.html[Bulk] APIs much like
+{ref}/plusin-delete-by-query[Delete by Query] though it might not use that


plusin -> plugin

clintongormley · 2015-12-01T12:45:31Z

Some thoughts...

Initially I was a fan of specifying the index and type in the URL, but now I think it is confusing. For instance, what if the source index is in a remote cluster? I'm leaning more towards specifying the source and destination explicitly, eg:

POST _index_by_search
{
  "src": {
    "index": ["index_1", "index_2"],
    "query": {
      "match": {
        "status": "published"
      }
    }
  },
  "dest": {
    "index": "new_index"
  }
}

Essentially, the src would accept anything that a search request would accept, and dest would be used to configure the bulk request.
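To make that split concrete, here is a small sketch, assuming the request shape proposed above. `split_request` and `to_bulk_payload` are hypothetical helpers, not part of this PR; the payload follows the newline-delimited action/source layout of the Bulk API.

```python
import json
from typing import Any, Dict, List, Tuple


def split_request(body: Dict[str, Any]) -> Tuple[List[str], Dict[str, Any], Dict[str, Any]]:
    """Split a request body into (source indices, search body, dest settings)."""
    src = dict(body["src"])          # copy so the caller's dict is untouched
    indices = src.pop("index")
    if isinstance(indices, str):     # allow a single index name as well as a list
        indices = [indices]
    return indices, src, dict(body["dest"])


def to_bulk_payload(dest: Dict[str, Any], hits: List[Dict[str, Any]]) -> str:
    """Render scrolled hits as newline-delimited bulk "index" actions."""
    lines = []
    for hit in hits:
        lines.append(json.dumps({"index": {"_index": dest["index"], "_id": hit["_id"]}}))
        lines.append(json.dumps(hit["_source"]))
    return "\n".join(lines) + "\n"
```

Everything left in `src` after removing `index` is passed through as the search body, which is why `query` (or anything else a search accepts) needs no special handling here.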

Routing could be changed or removed as follows:

POST _index_by_search
{
  "src": {
    "index": ["index_1", "index_2"],
    "routing": ["foo", "bar"],
    "query": {
      "match": {
        "status": "published"
      }
    }
  },
  "dest": {
    "index": "new_index",
    "routing": null
  }
}

The version_type parameter could be used as follows:

external: set the new version to the same value as the existing version (and fail if the document already exists)
internal: update the document only if it exists and is the same version as retrieved by the scroll request
none: write the new document regardless of whether it exists or not, or whether it has already been updated or not
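The three proposed modes can be written down as a small decision function. This is only a sketch of the proposal as worded above, not of how Elasticsearch versioning actually behaves; `decide_write` and its arguments are hypothetical.

```python
from typing import Optional, Tuple


def decide_write(version_type: str, scrolled: int, existing: Optional[int]) -> Tuple[bool, Optional[int]]:
    """Return (should_write, new_version) for one document.

    scrolled: version seen by the scroll request.
    existing: current version in the destination, or None if absent.
    """
    if version_type == "external":
        # Keep the source version; fail if the destination already has the doc.
        return (existing is None, scrolled if existing is None else None)
    if version_type == "internal":
        # Only update a doc that exists and is unchanged since the scroll.
        ok = existing is not None and existing == scrolled
        return (ok, existing + 1 if ok else None)
    if version_type == "none":
        # Always write; bump the version as an unversioned index would.
        return (True, 1 if existing is None else existing + 1)
    raise ValueError(f"unknown version_type: {version_type!r}")
```

For example, `decide_write("internal", 4, 5)` reports a conflict because the destination moved on after the scroll saw version 4.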

nik9000 · 2015-12-01T20:14:37Z

> Initially I was a fan of specifying the index and type in the URL, but now I think it is confusing.

I think I'm with you. I'll move it to the single endpoint.

> none: write the new document regardless of whether it exists or not, or whether it has already been updated or not

We actually have force for that now.

> "routing": null

Right now it copies routing values if they are set. I figure that should be the default. Maybe call it "routing": "keep" and make "routing": "discard" cause it to just throw the routing away.

> external: set the new version to the same value as the existing version (and fail if the document already exists)

Can we follow the usual external semantics and write the document only if the version is newer?

> internal: update the document only if it exists and is the same version as retrieved by the scroll request

I figured internal would just work like internal versioning works in bulk requests and write the document with version 1 if it doesn't exist.

One thing that comes up when I play with this is that it's simpler to just support "index" requests in the bulk, at least for now. I'll think about it some more, because update requests would be useful: they'd let you use scripts for free. You could probably get away with only returning smaller portions of the document during the query too.

clintongormley · 2015-12-02T11:26:22Z

> > none: write the new document regardless of whether it exists or not, or whether it has already been updated or not
> We actually have force for that now.

force is slightly different, in that it will forcibly set the version to the specified value (which could cause the version number to drop), while what I was envisaging was just incrementing the version number, the same way we would if a version hadn't been specified.

> > "routing": null
> Right now it copies routing values if they are set. I figure that should be the default. Maybe call it "routing": "keep" and make "routing": "discard" cause it to just throw the routing away.

The nice thing about routing: null is that you could also set it to routing: foo.

> Can we follow the usual external semantics and write the document only if the version is newer?

Not if you're indexing into the same index - nothing would be updated. The use case I'm seeing here is: we use external versioning so eg a db is the source of the version number, now we want to add some multi-field with a new mapping and backfill existing docs, so we reindex but keep the version number the same.

We could use force to index the doc regardless and to keep the same version number, but that wouldn't allow for skipping newer docs... not sure if this is an issue.

> I figured internal would just work like internal versioning works in bulk requests and write the document with version 1 if it doesn't exist.

This isn't how it works. Try this:

PUT t/t/1?version=4
{}

You'll get a conflict exception because the document doesn't exist. Instead we could use none (which should just delete the version number before indexing).

This versioning thing is complex. We should probably have a brainstorming session to make sure that we nail it down correctly.

bleskes · 2015-12-02T12:10:48Z

when re-indexing into the same index, internal versioning is what I think we need, i.e., only re-index if it has changed.

For another index it's tricky. If people use external versioning in general, then it's a good fit. If they don't, and we concurrently index into a remote target while other processes also index into it, then external version semantics don't really make sense. It doesn't mean much that the version in the source index is higher than the one in the target index. In that case I think the only two options are either create (i.e., only reindex if the document doesn't exist) or override (i.e., always index, no guarantees over which copy survives).
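A toy version of those two options, using a dict as the target index. `write_doc` and the `op_type` values are hypothetical names chosen to mirror the wording above.

```python
from typing import Any, Dict


def write_doc(dest: Dict[str, Any], doc_id: str, source: Any, op_type: str) -> bool:
    """Write one document and return True if dest was modified."""
    if op_type == "create":
        # Only reindex when the target has no copy of the document.
        if doc_id in dest:
            return False
        dest[doc_id] = source
        return True
    if op_type == "override":
        # Always index; whichever writer lands last wins.
        dest[doc_id] = source
        return True
    raise ValueError(f"unknown op_type: {op_type!r}")
```

Neither branch looks at version numbers, which is the point: the source version carries no meaning in the target.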

In all cases I don't think force makes sense. It can be used as a measure of last resort where people use one index as a source of truth to another and want to override existing docs, but it's super expert and dangerous.

clintongormley · 2015-12-02T13:06:26Z

> when re-indexing into the same index, internal versioning is what I think we need, i.e., only re-index if it has changed.

We probably also need the ability to specify what should happen if there is a version conflict: ignore and continue, or throw an exception.
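One way to picture the two behaviours is the sketch below, which applies an internal-style version check to each scrolled hit. The `on_conflict` parameter, its values, and `VersionConflict` are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


class VersionConflict(Exception):
    """Raised when a document changed after the scroll read it."""


def update_batch(dest_versions: Dict[str, int],
                 hits: List[Tuple[str, int]],
                 on_conflict: str = "ignore") -> Tuple[int, int]:
    """Apply updates for (doc_id, scrolled_version) hits.

    Returns (updated, conflicts). With on_conflict="fail" the first
    conflict raises instead of being counted.
    """
    updated = conflicts = 0
    for doc_id, seen in hits:
        if dest_versions.get(doc_id) != seen:
            if on_conflict == "fail":
                raise VersionConflict(doc_id)
            conflicts += 1           # ignore and continue
            continue
        dest_versions[doc_id] = seen + 1
        updated += 1
    return updated, conflicts
```

Returning the conflict count keeps the "ignore" mode observable, so a caller can still tell that some documents were skipped.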

nik9000 · 2015-12-02T15:13:19Z

> The nice thing about routing: null is that you could also set it to routing: foo.

I implemented routing: keep, routing: discard and routing: =foo last night. It feels a bit better than relying on unset being different from null. I'm happy to iterate on it. It's simple to change.
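The three settings could be resolved along these lines. This is a sketch of the idea only, not the code in the PR; `resolve_routing` is a hypothetical name.

```python
from typing import Optional


def resolve_routing(setting: str, original: Optional[str]) -> Optional[str]:
    """Pick the routing for the destination document."""
    if setting == "keep":
        return original              # copy whatever the source document had
    if setting == "discard":
        return None                  # drop routing entirely
    if setting.startswith("="):
        return setting[1:]           # "=foo" sets routing to the literal "foo"
    raise ValueError(f"unsupported routing setting: {setting!r}")
```

The `=` prefix is what lets a literal routing value of `keep` or `discard` be expressed, as `=keep` or `=discard`.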

> when re-indexing into the same index, internal versioning is what I think we need, i.e., only re-index if it has changed.

But that'll bump the version number which isn't what someone using external versioning wants. I wonder if those folks are just out of luck here?

> either create (i.e., only reindex if the document doesn't exist) or override (i.e., always index, no guarantees over which copy survives).

I think this'd work fine for now. The big use case for writing to another index is when someone wants to change analysis or sharding.

bleskes · 2015-12-02T18:33:57Z

> > when re-indexing into the same index, internal versioning is what I think we need, i.e., only re-index if it has changed.

> But that'll bump the version number which isn't what someone using external versioning wants. I wonder if those folks are just out of luck here?

Good point, and the first valid use for EXTERNAL_GTE I've heard :). In general I was talking about the defaults that make sense for the operations. I think we should just allow people to specify whatever versioning support they want.

rjernst · 2015-12-02T19:48:47Z

plugins/index-by-search/licenses/no_deps.txt

@@ -0,0 +1 @@
+This plugin has no third party dependencies


This directory/file is no longer needed.

Will remove.


nik9000 · 2015-12-02T20:02:53Z

Rebased so I could remove the file. I figure rebase is ok here because no one is actively reviewing the code.

clintongormley · 2016-01-12T15:12:59Z

docs/plugins/reindex.asciidoc

+}
+--------------------------------------------------
+
+The `src` parameter can also be specified as `source` for those that like that


Let's keep the naming simple: src and dest

See, I think "source" is more simple.

I'd also be happy with source and dest

nik9000 · 2016-01-12T15:19:53Z

@s1monw I've pushed a few test changes that move some of the tests from integration to unit tests - tests for failure states and tests for how scripts interact with the request. I think some things should remain integration tests - some smoke tests for scripts, the test that update_by_query never reverts changes from another updater, some of the basic smoke tests. But I'll work to make more things unit tests where possible. How far do you think I should go before I merge this into the feature branch and get to work on integrating it with task management? Task management will involve lots more unit tests for things like cancel-ability, status reporting, and eventually throttling.

clintongormley · 2016-01-12T15:19:58Z

docs/plugins/reindex.asciidoc

+preserved unless it's changed by scripting. You can set `routing` on the `dest`
+request to change this:
+
+`keep`::


meh, i don't like these options. I'd prefer to go with:

routing is preserved by default

set it to null to clear

set it to another string to set it to that string

use a script to make a decision per doc

> set it to null to clear

This means null is different from the default. Are we ok with that?
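Telling an absent `routing` key apart from an explicit `null` needs a sentinel once the JSON has been parsed. A minimal sketch follows, with `dest_routing` and `_UNSET` as hypothetical names; the per-document script option is left out.

```python
import json
from typing import Any, Dict, Optional

_UNSET = object()  # marks "key absent", which JSON null (None) cannot express


def dest_routing(dest: Dict[str, Any], original: Optional[str]) -> Optional[str]:
    """Resolve routing: absent keeps the original, null clears, a string sets."""
    value = dest.get("routing", _UNSET)
    if value is _UNSET:
        return original
    return value                     # None clears routing, a string replaces it
```

With this, `json.loads('{"index": "new_index", "routing": null}')` clears routing, while `json.loads('{"index": "new_index"}')` preserves it.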

clintongormley · 2016-01-12T15:26:10Z

Just looked through the docs - looking awesome! I assume you're planning to move this to a module rather than a plugin before merging.

nik9000 · 2016-01-12T15:31:17Z

> I assume you're planning to move this to a module rather than a plugin before merging.

I dunno what the plan is. I'm happy to module-ify it before merging it to master. Right now I just want to get this into the feature branch so I can start working on task management.

Now its just source and dest.

s1monw · 2016-01-13T11:01:41Z

@nik9000 thanks for working on the tests - I think we can improve over time, so if nobody objects let's get this going and move it into master?

nik9000 · 2016-01-13T14:38:59Z

> move it into master

I believe the plan is to integrate it with task management, probably move it to a module, and then move it to master. But I'll certainly merge to the feature branch.

nik9000 · 2016-01-13T14:46:08Z

Now that I've merged this to feature/reindex I'm actually going to squash it to one commit. I should have done that before merging but I got excited to click the big green button.

danielmitterdorfer · 2016-01-13T14:52:35Z

Incredible, the first huge step is done! Congratulations! :)

tlrx · 2016-01-14T08:45:38Z

Well done @nik9000 !

nik9000 added :Reindex API labels Nov 30, 2015

clintongormley reviewed Dec 1, 2015

clintongormley mentioned this pull request Dec 2, 2015

Upgrade fields with dot character to 2.0.0 #15122

Closed

rjernst reviewed Dec 2, 2015

nik9000 added 15 commits December 2, 2015 14:56

Basic index-by-search

839bb21

This creates an index-by-search plugin with a very basic shell of an implementation that is very like delete-by-query. At some point we'll integrate it with the task management work, but for now this works.

Fix vagrant tests

e918bc1

Docs

39e228e

Basic REST

7501c00

Remove leftovers

eee9583

tabs!

f4ec860

More docs

542c656

Fixes to make the example work

Copy routing

d02f277

Adds a test for routing in general and a test for the grandparent case.

Add support for preserving external versions

e5ae49f

Be more careful with the search request

169e45e

Clean up rest

6db33aa

More docs and tests

81430f9

Small fix around sorts, validation, etc.

Fix more validation

f81e4e4

Add support for routing

dd8fa09

Remove file that is no longer needed

222c986

nik9000 force-pushed the first_reindex branch from 8c6ae74 to 222c986 Compare December 2, 2015 20:01

clintongormley reviewed Jan 12, 2016

nik9000 added 2 commits January 12, 2016 10:26

[docs] Make a list of metadata

c0f3635

[docs] Describe timeout

8e19e4f

nik9000 added 4 commits January 12, 2016 10:56

[doc] Add more callouts

6b85c37

Default conflicts to ignore

e07a49b

Remove src and destination

ae8dbe2

Now its just source and dest.

Move metadata tests to unit tests

a8a3467

nik9000 merged this pull request into elastic:feature/reindex Jan 13, 2016

eskibars mentioned this pull request Jan 13, 2016

Reindex from _source by document ID or Query #492

Closed

ofavre mentioned this pull request Jan 13, 2016

Not working on ES 2.0 yakaz/elasticsearch-action-updatebyquery#46

Open

clintongormley removed the v5.0.0-alpha1 label Feb 13, 2016

nik9000 mentioned this pull request Jun 6, 2016

Expose version type to Update & Delete by query #18750

Closed

lcawl added :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. and removed :Reindex API labels Feb 13, 2018
