Refactor parent/child to not be Lucene queries #8134

jpountz · 2014-10-17T12:24:00Z

Parent/child queries have a non-desirable property: given a parent/child query Q, updating a document in segment A might change the set of matching document in another segment B.

This is an issue because it means that parent/child queries and filters cannot be cached per segment, so we had to add logic to make sure these queries don't get cached, either directly or as part of a cached parent filter (eg. under a cached bool filter). The propagation logic can be a bit fragile so I think we should work on a better fix.

One idea could be to change the abstraction we have to match document from a single Lucene query to something that could perform several Lucene queries. For instance in the case of has_child, we could have a first query that would collect parent ids and then build a new query based on these ids. This is the same execution logic, but each query on its own would solely depend on data that is stored in the current segment, so they would be cacheable (even though it might not a good idea to cache them).

The text was updated successfully, but these errors were encountered:

martijnvg · 2014-10-20T21:08:17Z

+1 to make the parent/child logic less fragile.

Just wondering how these two separate searches are executed:

We could run the first search in the query parser and pass the matching parent ordinals (optionally with scores) as argument to the query/filter that is returned by the parser.
We could back to how has_child was executed around version 0.19 where there was scoped queries infrastructure. If a query implemented ScopePhase interface it ran twice as separate queries. First just the inner query ran and it collected parent ids, then the actual has_child query ran and only emitted parent docs that had a parent id that was collected before. Running with two distinct queries would mean that we wouldn't run into this filter cache issue. I believe this mechanism was removed because it added additional complexity in the query phase that was unwanted at that time. Maybe was can bring it back to live and simplify it.

clintongormley · 2014-10-21T08:08:16Z

Wondering if it would be possible to take parent-level filters into account within the child queries. For instance, imagine you have a filter on the parent which matches only 1% of your parents, but the has_child query/filter matches 90% of your docs. Currently, the has_child clause can't take advantage of the parent-level filter and so has to match all 90% of parents regardless.

gmenegatti · 2014-10-22T21:02:36Z

+1 one million times @martijnvg @clintongormley
For a company like ours that currently has more than 500 million parents and in few months around 60 billion children, and where the goal is to count and filter the number of parent based on some data that is stored on the child, this is definitely important.

iddoav · 2014-10-25T07:35:00Z

The optimization @clintongormley suggests would be very beneficial for our use-cases in Totango. Probably also @martijnvg's. Currently, the latency for such queries is not acceptable product-wise, and we'll have to work around with ugly denormalization. What we really want, is to have the parent-child queries to work faster. So, I am happy that you're pushing for it.

RossLieberman · 2015-01-28T19:03:13Z

I too ran into the issue described in #5116 using rescore with has_child. Performing the query X times for each set of rescore works and is the difference between 60ms (each rescore) versus 10,000ms for my data, but it would be more ideal to just use the parent's data set during rescore or even better if function_score in the original query supported summing child document values to generate the score on each node before combining on the cluster.

jpountz · 2015-02-13T08:51:47Z

I tried to think more about it and essentially:

the current design has the issue that queries are not cacheable per segment because whether a document matches a document depends on what happens in other segments as well
as a work-around, we could run the first phase (collecting ids) in the query parser and then return a query for the second phase (which would essentially be a fielddata terms filter with scores for each term). However this query would be specific to a given index searcher and it would not work for things that want to keep queries alive for a long time like filtered aliases or the percolator
using a different abstraction is tempting, but then you could only use parent/child queries as top-level queries (ie. no more nesting in boolean queries)

martijnvg · 2015-02-13T09:17:15Z

@jpountz I think that option 2 is fair. The terms collected in phase1 are always tied to a given IndexSearch/DirectoryReader, so if that changes the collected terms are invalid. I think for the percolator we should forbid the usage of parent/child queries.

What I also wanted to tackle with this issue is making parent/child queries smarter in terms of execution. Currently most of the times parent/child queries need either evaluate all parent or children documents (depending on whether has_child or has_parent is used) if there is a match with the inner query, which is unneeded because usually parent/child queries are part of a bigger query and therefor not all parents/children will ever match.

The best thing I can come up with is to make the parent/child queries execute lazily. So for example if a parent document matches with all other queries we push it to has_child and check it has a child document and if it matches with the inner query. Maybe we need to buffer all matching parent docs in order to make this efficient. The easiest place this approach could be applied is if has_child was defined as a post filter. If has_child was defined in a bool query in the main query applying this approach is trickier, because of how the query execution works at the moment.

jpountz · 2015-02-20T10:06:44Z

The best thing I can come up with is to make the parent/child queries execute lazily. So for example if a parent document matches with all other queries we push it to has_child and check it has a child document and if it matches with the inner query.

I believe this could be done with two-phase iteration that has been introduced in Lucene 5.1: https://issues.apache.org/jira/browse/LUCENE-6198

We just discussed this issue a bit more and agreed on the following points:

the limitations of option 2 are fair (no use in filtered aliases or the percolator)
we need to move the infrastructure to Lucene

martijnvg · 2015-02-22T21:54:37Z

I believe this could be done with two-phase iteration that has been introduced in Lucene 5.1

Cool, with that the second phase matching logic in has_child and has_parent then only needs to be executed for docs we know that are going to match. This should speed things up significantly.

we need to move the infrastructure to Lucene

+1 I think we should rewrite the JoinUtil in the join module to do what has_child and has_parent do.

rmuir · 2015-02-25T16:27:20Z

+1 this sounds great.

This a breaking change: 1) A parent type needs be marked as parent in the _parent field mapping of the parent type. 2) top_children query will be removed. The top_children query was somewhat an alternative to has_child when it came to speed, but it isn't accurate and wasn't always faster. Indices created before 2.0 will use field data and the old way of executing queries, but indices created on or after 2.0 will use the Lucene join. Closes elastic#6107 Closes elastic#6511 Closes elastic#8134

pauleil · 2015-05-06T18:02:31Z

What are your plans on this task? Asking because it blocks #2917, which (at least for us) would save an enormous amount of effort and maintenance involved in roundabout solutions.

This a breaking change: 1) A parent type needs be marked as parent in the _parent field mapping of the parent type. 2) top_children query will be removed. The top_children query was somewhat an alternative to has_child when it came to speed, but it isn't accurate and wasn't always faster. Indices created before 2.0 will use field data and the old way of executing queries, but indices created on or after 2.0 will use the Lucene join. Closes elastic#6107 Closes elastic#6511 Closes elastic#8134

This a breaking change: 1) A parent type needs be marked as parent in the _parent field mapping of the parent type. 2) The has_child and has_parent queries can't be used in index aliases any more, because during query parse time it requires the search context to be set. During normal _search api usage this is the case, but not when adding an index alias. Indices created before 2.0 will use field data and the old way of executing queries, but indices created on or after 2.0 will use the Lucene join and encode the parent/child relation at index time in a special join doc values field. Closes elastic#6107 Closes elastic#6511 Closes elastic#8134

pauleil · 2015-05-28T14:01:24Z

Does your commit mean that you've solved this? If so, that's really great.

* Cut the `has_child` and `has_parent` queries over to use Lucene's query time global ordinal join. The main benefit of this change is that parent/child queries can now efficiently execute if parent/child queries are wrapped in a bigger boolean query. If the rest of the query only hit a few documents both has_child and has_parent queries don't need to evaluate all parent or child documents any more. * Cut the `_parent` field over to use doc values. This significantly reduces the on heap memory footprint of parent/child, because the parent id values are never loaded into memory. Breaking changes: * The `type` option on the `_parent` field can only point to a parent type that doesn't exist yet, so this means that an existing type/mapping can't become a parent type any longer. * The `has_child` and `has_parent` queries can no longer be use in alias filters. All these changes, improvements and breaks in compatibility only apply for indices created with ES version 2.0 or higher. For indices creates with ES <= 2.0 the older implementation is used. It is highly recommended to re-index all your indices with parent and child documents to benefit from all the improvements that come with this refactoring. The easiest way to achieve this is by using the scan and bulk apis using a simple script. Closes elastic#6107 Closes elastic#8134

jpountz added high hanging fruit discuss PITA labels Oct 17, 2014

clintongormley mentioned this issue Dec 29, 2014

poor performance of parent/child queries in function_score query and rescore #5116

Closed

martijnvg self-assigned this Mar 25, 2015

This was referenced Mar 25, 2015

Refactoring parent/child to improve memory consumption and query execution. #6511

Merged

Alias with filter is not reliable #10135

Closed

martijnvg mentioned this issue Apr 16, 2015

Sorting based on parent/child relationship #2917

Closed

clintongormley removed the discuss label May 28, 2015

martijnvg closed this as completed in 1cfb6a7 May 29, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor parent/child to not be Lucene queries #8134

Refactor parent/child to not be Lucene queries #8134

jpountz commented Oct 17, 2014

martijnvg commented Oct 20, 2014

clintongormley commented Oct 21, 2014

gmenegatti commented Oct 22, 2014

iddoav commented Oct 25, 2014

RossLieberman commented Jan 28, 2015

jpountz commented Feb 13, 2015

martijnvg commented Feb 13, 2015

jpountz commented Feb 20, 2015

martijnvg commented Feb 22, 2015

rmuir commented Feb 25, 2015

pauleil commented May 6, 2015

pauleil commented May 28, 2015

Refactor parent/child to not be Lucene queries #8134

Refactor parent/child to not be Lucene queries #8134

Comments

jpountz commented Oct 17, 2014

martijnvg commented Oct 20, 2014

clintongormley commented Oct 21, 2014

gmenegatti commented Oct 22, 2014

iddoav commented Oct 25, 2014

RossLieberman commented Jan 28, 2015

jpountz commented Feb 13, 2015

martijnvg commented Feb 13, 2015

jpountz commented Feb 20, 2015

martijnvg commented Feb 22, 2015

rmuir commented Feb 25, 2015

pauleil commented May 6, 2015

pauleil commented May 28, 2015