Basic reindex implementation #15125
@tlrx this is my start. I'm intentionally not copying things from delete-by-query because I'm trying to understand why it has the pieces it has. |
The index-by-search plugin adds support for indexing all documents that match | ||
a query. Internally it uses {ref}/search-request-scroll.html[Scroll] and | ||
{ref}/docs-bulk.html[Bulk] APIs much like | ||
{ref}/plusin-delete-by-query[Delete by Query] though it might not use that |
plusin -> plugin
Some thoughts... Initially I was a fan of specifying the index and type in the URL, but now I think it is confusing. For instance, what if the source index is in a remote cluster? I'm leaning more towards specifying the source and destination explicitly, eg:
Essentially, the Routing could be changed or removed as follows:
The
|
I think I'm with you. I'll move it to the single endpoint.
We actually have
Right now it copies routing values if they are set. I figure that should be the default. Maybe call it
Can we follow the usual
I figured One thing that comes up when I play with this is that it's simpler to just support "index" requests in the bulk, at least for now. I'll think about it some more because update requests would be useful because they'd let you use scripts for free. You could probably get away with only returning smaller portions of the document during the query too. |
The nice thing about
Not if you're indexing into the same index - nothing would be updated. The use case I'm seeing here is: we use external versioning so eg a db is the source of the version number, now we want to add some multi-field with a new mapping and backfill existing docs, so we reindex but keep the version number the same. We could use
This isn't how it works. Try this:
You'll get a conflict exception because the document doesn't exist. Instead we could use This versioning thing is complex. We should probably have a brainstorming session to make sure that we nail it down correctly. |
When re-indexing into the same index, internal versioning is what I think we need, i.e., only re-index if it has changed. For another index it's tricky. If people use external versioning in general, then it's a good fit. If they don't, and we concurrently index into a remote while other processes also index into it, then external version semantics don't really make sense. It doesn't mean much that the version in the source index is higher than the one in the target index. In that case I think the only two options are either create (i.e., only reindex if the document doesn't exist) or override (i.e., always index, no guarantees over which copy survives). In all cases I don't think force makes sense. It can be used as a measure of last resort where people use one index as a source of truth for another and want to override existing docs, but it's super expert and dangerous. |
We probably also need the ability to specify what should happen if there is a version conflict: ignore and continue, or throw an exception. |
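To make the options in this thread concrete, here is a minimal Python sketch of the write semantics being discussed. It is not the plugin's code: the `write` function, the mode names `create`, `override`, and `external`, and the `on_conflict` flag are hypothetical names used only for illustration, and the destination index is modeled as a plain dict.

```python
class VersionConflict(Exception):
    """Raised when a write loses a version check."""


def write(dest, doc_id, source, version, mode, on_conflict="abort"):
    """Apply one reindex write to `dest`, a dict of id -> (version, source).

    Hypothetical modes, mirroring the options discussed above:
      create   - only write if the document does not exist yet
      override - always write, bumping the destination's own version
      external - write only if the incoming version is strictly higher
    Returns True if the document was written, False if a conflict was skipped.
    """
    existing = dest.get(doc_id)
    if mode == "create":
        conflict = existing is not None
        new_version = 1
    elif mode == "override":
        conflict = False
        new_version = existing[0] + 1 if existing else 1
    elif mode == "external":
        conflict = existing is not None and existing[0] >= version
        new_version = version
    else:
        raise ValueError(f"unknown mode: {mode}")

    if conflict:
        if on_conflict == "proceed":
            return False  # ignore the conflict and continue
        raise VersionConflict(doc_id)
    dest[doc_id] = (new_version, source)
    return True
```

Note that `override` bumps the destination's version, which is the behaviour that does not suit users of external versioning, while `external` keeps the source's version number but rejects writes that are not newer.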
I implemented
But that'll bump the version number which isn't what someone using external versioning wants. I wonder if those folks are just out of luck here?
I think this'd work fine for now. The big use case for writing to another index is when someone wants to change analysis or sharding. |
Good point and the first valid use for |
@@ -0,0 +1 @@ | |||
This plugin has no third party dependencies |
This directory/file is no longer needed.
Will remove.
This creates an index-by-search plugin with a very basic shell of an implementation that is very like delete-by-query. At some point we'll integrate it with the task management work, but for now this works.
Adds a test for routing in general and a test for the grandparent case.
Small fix around sorts, validation, etc.
8c6ae74 to 222c986
Rebased so I could remove the file. I figure rebase is ok here because no one is actively reviewing the code. |
} | ||
-------------------------------------------------- | ||
|
||
The `src` parameter can also be specified as `source` for those that like that |
Let's keep the naming simple: `src` and `dest`
See, I think "source" is more simple.
I'd also be happy with `source` and `dest`
Done.
@s1monw I've pushed a few test changes that move some of the tests from integration to unit tests - tests for failure states and tests for how scripts interact with the request. I think some things should remain integration tests - some smoke tests for scripts, the test that update_by_query never reverts changes from another updater, some of the basic smoke tests. But I'll work to make more things unit tests where possible. How far do you think I should go before I merge this into the feature branch and get to work on integrating it with task management? Task management will involve lots more unit tests for things like cancel-ability, status reporting, and eventually throttling. |
preserved unless it's changed by scripting. You can set `routing` on the `dest` | ||
request to change this: | ||
|
||
`keep`:: |
meh, i don't like these options. I'd prefer to go with:
- routing is preserved by default
- set it to `null` to clear
- set it to another string to set it to that string
- use a script to make a decision per doc
> set it to `null` to clear

This means `null` is different from the default. Are we ok with that?
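The distinction between "not specified" and an explicit `null` can be sketched with a sentinel value. This is only an illustration in Python: `resolve_routing` and `_KEEP` are hypothetical names, not part of the plugin, and per-document scripting is left out.

```python
_KEEP = object()  # sentinel: routing was not specified on the dest request


def resolve_routing(source_routing, dest_routing=_KEEP):
    """Return the routing value to use for the reindexed document.

    - dest routing not specified: preserve the source document's routing
    - dest routing is None (null): clear the routing
    - dest routing is a string: use that string
    """
    if dest_routing is _KEEP:
        return source_routing
    return dest_routing
```

With this shape, `resolve_routing("r1")` preserves `"r1"`, while `resolve_routing("r1", None)` clears it, which is exactly the case where `null` differs from the default.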
Just looked through the docs - looking awesome! I assume you're planning to move this to a module rather than a plugin before merging. |
I dunno what the plan is. I'm happy to module-ify it before merging it to master. Right now I just want to get this into the feature branch so I can start working on task management. |
Now it's just `source` and `dest`.
@nik9000 thanks for working on the tests - I think we can improve over time, so if nobody objects let's get this going and move it into master? |
I believe the plan is to integrate it with task management, probably move it to a module, and then move it to master. But I'll certainly merge to the feature branch. |
Now that I've merged this to feature/reindex I'm actually going to squash it to one commit. I should have done that before merging but I got excited to click the big green button. |
Incredible, the first huge step is done! Congratulations! :) |
Well done @nik9000 ! |
Note: this has been edited from my first comment.
This creates basic reindex and update-by-query implementations that work like delete-by-query. They scroll documents and then turn around and issue bulk requests for each.
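The scroll-then-bulk loop described above can be sketched roughly as follows. This is a simplified Python illustration using in-memory stand-ins, not the actual implementation: `scroll`, `reindex`, and the `batch_size` parameter are hypothetical names, the source is a list of dicts, and the destination is a plain dict.

```python
def scroll(docs, query, batch_size):
    """Yield matching documents in batches, standing in for the Scroll API."""
    batch = []
    for doc in docs:
        if query(doc):
            batch.append(doc)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch  # final, possibly short, batch


def reindex(source_docs, dest, query=lambda doc: True, batch_size=2):
    """Copy matching docs into `dest`, issuing one bulk request per batch.

    Returns the number of bulk requests issued.
    """
    bulk_requests = 0
    for batch in scroll(source_docs, query, batch_size):
        # Stand-in for a single Bulk request made of "index" actions.
        for doc in batch:
            dest[doc["_id"]] = doc["_source"]
        bulk_requests += 1
    return bulk_requests
```

A real implementation would also have to handle bulk rejections and report progress, which this sketch leaves out.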
This isn't the last word on reindex. Before we merge the feature branch down to master we'll need to fix it to retry on bulk rejection and integrate it with the task management API.
For those of you reading the issues, realize that all the discussion about APIs is discussion. The API we're going to merge is the result of those discussions. You can read about it by reading the asciidoc files in the PR.