Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Basic reindex implementation #15125

Merged
merged 129 commits into from
Jan 13, 2016
Merged

Conversation

nik9000
Copy link
Member

@nik9000 nik9000 commented Nov 30, 2015

Note: this has been edited from my first comment.

This creates basic reindex and update-by-query implementations that work like delete-by-query. They scroll documents and then turn around and issue bulk requests for each.

This isn't the last word on reindex. Before we merge the feature branch down to master we'll need to fix it to retry on bulk rejection and integrate it with the task management api.

For those of you reading the issues, realize that all the discussion about APIs is discussion. The API we're going to merge is the result of those discussions. You can read about it by reading the asciidoc files in the PR.

@nik9000
Copy link
Member Author

nik9000 commented Nov 30, 2015

@tlrx this is my start. I'm intentionally not copying things from delete-by-query because I'm trying to understand why it has the pieces it has.

The index-by-search plugin adds support for indexing all documents that match
a query. Internally it uses {ref}/search-request-scroll.html[Scroll] and
{ref}/docs-bulk.html[Bulk] APIs much like
{ref}/plusin-delete-by-query[Delete by Query] though it might not use that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plusin -> plugin

@clintongormley
Copy link
Contributor

Some thoughts...

Initially I was a fan of specifying the index and type in the URL, but now I think it is confusing. For instance, what if the source index is in a remote cluster? I'm leaning more towards specifying the source and destination explicitly, eg:

POST _index_by_search
{
  "src": {
    "index": ["index_1", "index_2"],
    "query": {
      "match": {
        "status": "published"
      }
    }
  },
  "dest": {
    "index": "new_index"
  }
}

Essentially, the src would accept anything that a search request would accept, and dest would be used to configure the bulk request.

Routing could be changed or removed as follows:

POST _index_by_search
{
  "src": {
    "index": ["index_1", "index_2"],
    "routing": ["foo", "bar"],
    "query": {
      "match": {
        "status": "published"
      }
    }
  },
  "dest": {
    "index": "new_index",
    "routing": null
  }
}

The version_type parameter could be used as follows:

  • external: set the new version to the same value as the existing version (and fail if the document already exists)
  • internal: update the document only if it exists and is the same version as retrieved by the scroll request
  • none: write the new document regardless of whether it exists or not, or whether it has already been updated or not

@nik9000
Copy link
Member Author

nik9000 commented Dec 1, 2015

Initially I was a fan of specifying the index and type in the URL, but now I think it is confusing.

I think I'm with you. I'll move it to the single endpoint.

none: write the new document regardless of whether it exists or not, or whether it has already been updated or not

We actually have force for that now.

"routing": null

Right now it copies routing values if they are set. I figure that should be the default. Maybe call it "routing": "keep" and make "routing": "discard" cause it to just throw the routing away.

external: set the new version to the same value as the existing version (and fail if the document already exists)

Can we follow the usual external semantics and write the document only if the version is newer?

internal: update the document only if it exists and is the same version as retrieved by the scroll request

I figured internal would just work like internal versioning works in bulk requests and write the document with version 1 if it doesn't exist.

One thing that comes up when I play with this is that its simpler to just support "index" requests in the bulk, at least for now. I'll think about it some more because update requests would be useful because they'd let you use scripts for free. You could probably get away with only returning smaller portions of the document during the query too.

@clintongormley
Copy link
Contributor

none: write the new document regardless of whether it exists or not, or whether it has already been updated or not
We actually have force for that now.

force is a slightly different, in that it will forcibly set the version to the specified value (which could cause the version number to drop), while what I was envisaging was just incrementing the version number, the same way we would if a version hadn't been specified.

"routing": null
Right now it copies routing values if they are set. I figure that should be the default. Maybe call it "routing": "keep" and make "routing": "discard" cause it to just throw the routing away.

The nice thing about routing: null is that you could also set it to routing: foo.

Can we follow the usual external semantics and write the document only if the version is newer?

Not if you're indexing into the same index - nothing would be updated. The use case I'm seeing here is: we use external versioning so eg a db is the source of the version number, now we want to add some multi-field with a new mapping and backfill existing docs, so we reindex but keep the version number the same.

We could use force to index the doc regardless and to keep the same version number, but that wouldn't allow for skipping newer docs... not sure if this is an issue.

I figured internal would just work like internal versioning works in bulk requests and write the document with version 1 if it doesn't exist.

This isn't how it works. Try this:

PUT t/t/1?version=4
{}

You'll get a conflict exception because the document doesn't exist. Instead we could use none (which should just delete the version number before indexing).

This versioning thing is complex. We should probably have a brainstorming session to make sure that we nail it down correctly.

@bleskes
Copy link
Contributor

bleskes commented Dec 2, 2015

when re-indexing into the same index, internal versioning is what I think we need, i.e., only re-index if it has changed.

For another index it's tricky. If people use external versioning in general, then it's a good fit. If they don't and we concurrently index into a remote while other processes also index it then external version semantics doesn't really make sense . It doesn't mean much that the version in the source index is higher than the one in the target index. In that case I think the only two options are either create (i.e., only reindex if the document doesn't exist) or override (i.e., always index, no guarantees over which copy survives).

In all cases I don't think force makes sense. It can be used as a measure of last resort where people use one index as a source of truth to another and want to override existing docs, but it's super expert and dangerous.

@clintongormley
Copy link
Contributor

when re-indexing into the same index, internal versioning is what I think we need, i.e., only re-index if it has changed.

We probably also need the ability to specify what should happen if there is a version conflict: ignore and continue, or throw an exception.

@nik9000
Copy link
Member Author

nik9000 commented Dec 2, 2015

The nice thing about routing: null is that you could also set it to routing: foo.

I implemented routing: keep, routing: discard and routing: =foo last night. It feels a bit better than relying unset being different from null. I'm happy to iterate on it. Its simple to change.

when re-indexing into the same index, internal versioning is what I think we need, i.e., only re-index if it has changed.

But that'll bump the version number which isn't what someone using external versioning wants. I wonder if those folks are just out of luck here?

either create (i.e., only reindex if the document doesn't exist) or override (i.e., always index, no guarantees over which copy survives).

I think this'd work fine for now. The big use case for writing to an other index is when someone wants to change analysis or sharding.

@bleskes
Copy link
Contributor

bleskes commented Dec 2, 2015

when re-indexing into the same index, internal versioning is what I think we need, i.e., only re-index if it has changed.

But that'll bump the version number which isn't what someone using external versioning wants. I wonder if those folks are just out of luck here?

Good point and the first valid use for EXTERNAL_GTE I heard :) .In general I was talking about the defaults for the operations that make sense. In general I think we should just allow people to specify what ever versioning support they want.

@@ -0,0 +1 @@
This plugin has no third party dependencies
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This directory/file is no longer needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove.

This creates an index-by-search plugin with a very basic, shell of an
implementation of that is very like delete-by-query. At some point we'll
integrate it with the task managament work, but for now this works.
Fixes to make the example work
Adds a test for routing in general and a test for the grandparent case.
Small fix around sorts, validation, etc.
@nik9000
Copy link
Member Author

nik9000 commented Dec 2, 2015

Rebased so I could remove the file. I figure rebase is ok here because no one is actively reviewing the code.

}
--------------------------------------------------

The `src` parameter can also be specified as `source` for those that like that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep the naming simple : src and dest

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See, I think "source" is more simple.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also be happy with source and dest

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@nik9000
Copy link
Member Author

nik9000 commented Jan 12, 2016

@s1monw I've pushed a few test changes that move some of the tests from integration to unit tests - tests for failure states and tests for how scripts interact with the request. I think some things should remain integration tests - some smoke tests for scripts, the test that update_by_query never reverts changes from another updater, some of the basic smoke tests. But I'll work to make more things unit tests where possible. How far do you think I should go before I merge this into the feature branch and get to work on integrating it with task management? Task management will involve lots more unit tests for things like cancel-ability, status reporting, and eventually throttling.

preserved unless it's changed by scripting. You can set `routing` on the `dest`
request to change this:

`keep`::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

meh, i don't like these options. I'd prefer to go with:

  • routing is preserved by default
  • set it to null to clear
  • set it to another string to set it to that string
  • use a script to make a decision per doc

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set it to null to clear

This means null is different from the default. Are we ok with that?

@clintongormley
Copy link
Contributor

Just looked through the docs - looking awesome! I assume you're planning to move this to a module rather than a plugin before merging.

@nik9000
Copy link
Member Author

nik9000 commented Jan 12, 2016

I assume you're planning to move this to a module rather than a plugin before merging.

I dunno what the plan is. I'm happy to module-ify it before merging it to master. Right now I just want to get this into the feature branch so I can start working on task management.

@s1monw
Copy link
Contributor

s1monw commented Jan 13, 2016

@nik9000 thanks for working on the tests - I think we can improve over time so if nobody objects lets get this going and move it into master?

@nik9000
Copy link
Member Author

nik9000 commented Jan 13, 2016

move it into master

I believe the plan is to integrate it with task management, probably move it to a module, and then move it to master. But I'll certainly merge to the feature branch.

@nik9000 nik9000 merged this pull request into elastic:feature/reindex Jan 13, 2016
@nik9000
Copy link
Member Author

nik9000 commented Jan 13, 2016

Now that I've merged this to feature/reindex I'm actually going to squash it to one commit. I should have done that before merging but I got excited to click the big green button.

@danielmitterdorfer
Copy link
Member

Incredible, the first huge step is done! Congratulations! :)

@tlrx
Copy link
Member

tlrx commented Jan 14, 2016

Well done @nik9000 !

@lcawl lcawl added :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. and removed :Reindex API labels Feb 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search.
Projects
None yet
Development

Successfully merging this pull request may close these issues.