Refactor parent/child to not be Lucene queries #8134
Comments
+1 to make the parent/child logic less fragile. Just wondering how these two separate searches are executed:
|
Wondering if it would be possible to take parent-level filters into account within the child queries. For instance, imagine you have a filter on the parent which matches only 1% of your parents, but the |
+1 one million times @martijnvg @clintongormley |
The optimization @clintongormley suggests would be very beneficial for our use cases at Totango, and probably for @martijnvg's as well. Currently, the latency of such queries is not acceptable product-wise, and we'll have to work around it with ugly denormalization. What we really want is for parent-child queries to run faster, so I am happy that you're pushing for it. |
I too ran into the issue described in #5116 using rescore with has_child. Performing the query X times, once for each set of rescores, works and is the difference between 60ms (each rescore) and 10,000ms for my data. It would be better, though, to just use the parent's data set during rescore, or better still if function_score in the original query supported summing child document values to generate the score on each node before combining on the cluster. |
I tried to think more about it and essentially:
|
@jpountz I think that option 2 is fair. The terms collected in phase 1 are always tied to a given IndexSearcher/DirectoryReader, so if that changes, the collected terms are invalid. I think we should forbid the usage of parent/child queries in the percolator. What I also wanted to tackle with this issue is making parent/child queries smarter in terms of execution. Currently, most of the time parent/child queries need to evaluate either all parent or all child documents (depending on whether The best thing I can come up with is to make the parent/child queries execute lazily. For example, if a parent document matches all the other queries, we push it to has_child and check whether it has a child document that matches the inner query. Maybe we need to buffer all matching parent docs in order to make this efficient. The easiest place to apply this approach is when has_child is defined as a post filter. If has_child is defined in a bool query in the main query, applying this approach is trickier because of how query execution works at the moment. |
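The lazy execution idea in the comment above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: documents are plain dicts, and `build_children_index`, `lazy_has_child`, `cheap_filter` and `child_query` are hypothetical names for illustration, not Elasticsearch or Lucene APIs. The point it shows is that the child check runs only for parents that already passed the rest of the query.

```python
from collections import defaultdict


def build_children_index(children):
    """Map parent id -> child docs (hypothetical stand-in for a join structure)."""
    index = defaultdict(list)
    for child in children:
        index[child["parent"]].append(child)
    return index


def lazy_has_child(parents, cheap_filter, children_by_parent, child_query):
    """Check children only for parents that already match the rest of the query."""
    checked = 0  # number of parents whose children were actually inspected
    hits = []
    for parent in parents:
        if not cheap_filter(parent):
            continue  # rejected by the other clauses, so no child lookup at all
        checked += 1
        candidates = children_by_parent.get(parent["id"], [])
        if any(child_query(child) for child in candidates):
            hits.append(parent["id"])
    return hits, checked


# 1000 parents, of which 1% (10) are in region "eu"; one child per parent.
parents = [{"id": f"p{i}", "region": "eu" if i % 100 == 0 else "us"} for i in range(1000)]
children = [
    {"id": f"c{i}", "parent": f"p{i}", "color": "red" if i % 200 == 0 else "blue"}
    for i in range(1000)
]

hits, checked = lazy_has_child(
    parents,
    cheap_filter=lambda p: p["region"] == "eu",
    children_by_parent=build_children_index(children),
    child_query=lambda c: c["color"] == "red",
)
```

With a selective parent-level filter, only the 10 surviving parents have their children inspected, rather than all 1000 child documents being evaluated up front.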
I believe this could be done with two-phase iteration that has been introduced in Lucene 5.1: https://issues.apache.org/jira/browse/LUCENE-6198 We just discussed this issue a bit more and agreed on the following points:
|
Cool, with that the second phase matching logic in
+1 I think we should rewrite the JoinUtil in the join module to do what |
+1 this sounds great. |
This is a breaking change: 1) A parent type needs to be marked as parent in the _parent field mapping of the parent type. 2) The top_children query will be removed. The top_children query was somewhat of an alternative to has_child when it came to speed, but it isn't accurate and wasn't always faster. Indices created before 2.0 will use field data and the old way of executing queries, but indices created on or after 2.0 will use the Lucene join. Closes elastic#6107 Closes elastic#6511 Closes elastic#8134
What are your plans on this task? Asking because it blocks #2917, which (at least for us) would save an enormous amount of effort and maintenance involved in roundabout solutions. |
This is a breaking change: 1) A parent type needs to be marked as parent in the _parent field mapping of the parent type. 2) The has_child and has_parent queries can't be used in index aliases any more, because they require the search context to be set at query parse time. This is the case during normal _search API usage, but not when adding an index alias. Indices created before 2.0 will use field data and the old way of executing queries, but indices created on or after 2.0 will use the Lucene join and encode the parent/child relation at index time in a special join doc values field. Closes elastic#6107 Closes elastic#6511 Closes elastic#8134
Does your commit mean that you've solved this? If so, that's really great. |
* Cut the `has_child` and `has_parent` queries over to use Lucene's query time global ordinal join. The main benefit of this change is that parent/child queries can now execute efficiently when they are wrapped in a bigger boolean query. If the rest of the query only hits a few documents, the `has_child` and `has_parent` queries no longer need to evaluate all parent or child documents.
* Cut the `_parent` field over to use doc values. This significantly reduces the on-heap memory footprint of parent/child, because the parent id values are never loaded into memory.
Breaking changes:
* The `type` option on the `_parent` field can only point to a parent type that doesn't exist yet, which means that an existing type/mapping can no longer become a parent type.
* The `has_child` and `has_parent` queries can no longer be used in alias filters.
All these changes, improvements and breaks in compatibility only apply to indices created with ES version 2.0 or higher. For indices created with an ES version before 2.0, the older implementation is used. It is highly recommended to re-index all your indices with parent and child documents to benefit from all the improvements that come with this refactoring. The easiest way to achieve this is to use the scan and bulk APIs from a simple script.
Closes elastic#6107 Closes elastic#8134
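To illustrate what a join over global ordinals amounts to, here is a minimal Python sketch. It is a toy model of the idea only, not Lucene's actual join implementation: the `join` key, `build_global_ordinals` and `has_child_by_ordinals` are hypothetical names, and a `bytearray` stands in for a bitset. Each distinct join value gets a small integer ordinal, every document stores the ordinal of its join value in a column (much like a doc values field), and the join works on integers instead of id strings.

```python
def build_global_ordinals(join_values):
    """Assign each distinct join value a dense integer ordinal, in sorted order."""
    return {value: ordinal for ordinal, value in enumerate(sorted(set(join_values)))}


def has_child_by_ordinals(docs, doc_ords, num_ords, child_query):
    """Flag the ordinals of matching children, then keep parents whose ordinal is flagged."""
    matched = bytearray(num_ords)  # one flag per ordinal
    for doc, ordinal in zip(docs, doc_ords):
        if doc["type"] == "child" and child_query(doc):
            matched[ordinal] = 1
    return [
        doc["id"]
        for doc, ordinal in zip(docs, doc_ords)
        if doc["type"] == "parent" and matched[ordinal]
    ]


# "join" holds the parent's own id for parents and the parent id for children.
docs = [
    {"id": "p1", "type": "parent", "join": "p1"},
    {"id": "p2", "type": "parent", "join": "p2"},
    {"id": "c1", "type": "child", "join": "p1", "color": "red"},
    {"id": "c2", "type": "child", "join": "p2", "color": "blue"},
    {"id": "c3", "type": "child", "join": "p1", "color": "red"},
]

ordinals = build_global_ordinals(doc["join"] for doc in docs)
doc_ords = [ordinals[doc["join"]] for doc in docs]  # per-document ordinal column

result = has_child_by_ordinals(
    docs, doc_ords, len(ordinals), lambda doc: doc.get("color") == "red"
)
```

Because the per-document column holds small integers rather than id strings, the matching state is one flag per distinct parent id, which is in line with the reduced heap footprint the commit message describes.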
Parent/child queries have an undesirable property: given a parent/child query Q, updating a document in segment A might change the set of matching documents in another segment B.
This is an issue because it means that parent/child queries and filters cannot be cached per segment, so we had to add logic to make sure these queries don't get cached, either directly or as part of a cached parent filter (e.g. under a cached `bool` filter). The propagation logic can be a bit fragile, so I think we should work on a better fix.
One idea could be to change the abstraction we have for matching documents from a single Lucene query to something that could perform several Lucene queries. For instance, in the case of `has_child`, we could have a first query that would collect parent ids and then build a new query based on these ids. This is the same execution logic, but each query on its own would depend solely on data that is stored in the current segment, so they would be cacheable (even though it might not be a good idea to cache them).
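The two-query idea for `has_child` could look roughly like the following Python sketch. It is a toy model under stated assumptions: each inner list stands in for one Lucene segment, and `Doc`, `collect_parent_ids`, `match_parents` and `has_child` are hypothetical names rather than Lucene or Elasticsearch APIs. The first pass collects parent ids from matching children across all segments; the second pass is an ids-style query whose result for a segment depends only on that segment and the fixed id set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Doc:
    doc_id: str
    doc_type: str  # "parent" or "child"
    parent_id: Optional[str] = None
    text: str = ""


def collect_parent_ids(segments, child_matches):
    """First query: visit every segment and collect parent ids of matching children."""
    ids = set()
    for segment in segments:
        for doc in segment:
            if doc.doc_type == "child" and child_matches(doc):
                ids.add(doc.parent_id)
    return frozenset(ids)


def match_parents(segment, parent_ids):
    """Second query: an ids-style lookup that reads only the segment it runs on."""
    return [
        doc.doc_id
        for doc in segment
        if doc.doc_type == "parent" and doc.doc_id in parent_ids
    ]


def has_child(segments, child_matches):
    parent_ids = collect_parent_ids(segments, child_matches)
    return [match_parents(segment, parent_ids) for segment in segments]


# Parents live in one segment and their children in another.
segments = [
    [Doc("p1", "parent"), Doc("p2", "parent")],
    [
        Doc("c1", "child", parent_id="p1", text="red"),
        Doc("c2", "child", parent_id="p2", text="blue"),
    ],
]

result = has_child(segments, lambda doc: doc.text == "red")
```

In this model, `match_parents(segment, parent_ids)` is a pure function of its two arguments, so its result could be cached per segment for a given id set, whereas the combined `has_child` result for the first segment changes whenever a child in the second segment changes.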