Add doc values support to _parent field data #6107

Closed
jpountz opened this issue May 9, 2014 · 17 comments

Comments

@jpountz
Contributor

jpountz commented May 9, 2014

The _parent field can easily have a high cardinality; as a consequence, field data for this field can take a lot of memory. It would be useful to have the ability to store this mapping on disk using Lucene doc values.

Doc values proved to perform very well for aggregations in combination with global ordinals (#5672). So now that parent/child queries use global ordinals as well (#5846) I think doc values could even be the default for the _parent field?

@martijnvg
Member

+1! The tricky bit here is that the ParentChildIndexFieldData is a combination of the _uid and the _parent fields, so I think at index time the doc values field for _parent field data needs to be based on both of these fields.

@martijnvg
Member

Doc values for parent/child aren't so tricky. The parent/child doc values field should just contain the values of both the _uid and _parent fields. Each parent type should have its own parent/child doc values field (the type name can be used as a suffix for the doc values field). This logic can be implemented in ParentFieldMapper#parseCreateField.

The ParentChildIndexFieldData should just check whether a parent/child doc values field exists for a type. If that isn't the case, it should load the ParentChildAtomicFieldData based on the _uid and _parent fields, which is what it does today.
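
A minimal sketch of the idea above, written in Python rather than the Java of the actual mapper. The field prefix `_parent_join#`, the function names, and the `type#id` value encoding are assumptions made for illustration, not names taken from the Elasticsearch code base.

```python
from typing import Dict, Optional, Set

# Hypothetical prefix; the comment only says the type name is used as a suffix.
JOIN_PREFIX = "_parent_join#"


def join_doc_values(doc_type: str, doc_id: str,
                    parent_type: Optional[str], parent_id: Optional[str],
                    parent_types: Set[str]) -> Dict[str, str]:
    """Return the doc values fields to add for one document at index time."""
    fields: Dict[str, str] = {}
    if doc_type in parent_types:
        # A parent document contributes its own uid.
        fields[JOIN_PREFIX + doc_type] = f"{doc_type}#{doc_id}"
    if parent_type is not None and parent_id is not None:
        # A child document contributes the uid of its parent.
        fields[JOIN_PREFIX + parent_type] = f"{parent_type}#{parent_id}"
    return fields


def field_data_source(parent_type: str, existing_fields: Set[str]) -> str:
    """Prefer the doc values field when it exists, else fall back to the old path."""
    if JOIN_PREFIX + parent_type in existing_fields:
        return "doc_values"
    return "uid_and_parent_terms"
```

Because a parent and its children write the same value into the same per-type field, the join can be resolved from that single field.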

@jpountz
Contributor Author

jpountz commented May 9, 2014

This sounds good to me!

@ostersc

ostersc commented Aug 5, 2014

Will this actually address the problem raised in #3516 (as it was closed in favor of this one)? This seems to imply a reduction in the space needed for _parent, but our problem is that the id_cache is growing linearly with the number of parents, even if there are no children.

@clintongormley
Contributor

@ostersc the id_cache has been removed and instead the p/c data is stored in the fielddata cache. it still uses up memory. this issue is about moving the p/c data to disk, which will save a lot of memory.

@ostersc

ostersc commented Aug 5, 2014

@clintongormley thanks for clarifying. yes, we are seeing this issue manifest in the fielddata_breaker, but in verifying it was related to parent/child issues, we ran
GET /_nodes/stats/indices/id_cache?human
and found all the memory residing in:
"indices": {
"id_cache": {
"memory_size":
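
For readers who want to total this figure across nodes, here is a small sketch. It assumes the parsed response has the shape `nodes.<node_id>.indices.id_cache.memory_size_in_bytes`; that shape and the function name `id_cache_bytes` are assumptions for illustration, not something shown in the output above.

```python
from typing import Any, Dict


def id_cache_bytes(stats: Dict[str, Any]) -> int:
    """Sum the id_cache memory reported by every node in a parsed stats response."""
    total = 0
    for node in stats.get("nodes", {}).values():
        id_cache = node.get("indices", {}).get("id_cache", {})
        # Assumed key; nodes that do not report it count as zero.
        total += id_cache.get("memory_size_in_bytes", 0)
    return total
```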

@clintongormley
Contributor

@ostersc yes, that's just for bwc. the actual store is in the fielddata.

@ostersc

ostersc commented Aug 5, 2014

@clintongormley gotcha. Is there any known workaround for this? I was quite surprised to find the field cache grow for each parent doc (even if there are no children), so we are looking at needing to move away from using parent-child, as we need to support hundreds of millions of parent docs with sparsely populated children.

@clintongormley
Contributor

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Aug 10, 2014
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Aug 11, 2014
Indices created on or after 1.4.0 will also store the _parent field as doc values.
Also added the `index._parent.doc_values` option, which controls whether doc values are used for parent/child field data. If set to false, parent/child field data will be created on the fly based on the _parent field inverted index.
The `index._parent.doc_values` defaults to true.
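
As a sketch of how this option might be supplied when creating an index, the snippet below builds a create-index request body. The setting name comes from the commit message above; the function name `index_settings` and the placement under `settings` are assumptions, and nothing here confirms that the option shipped in a release under this name.

```python
import json


def index_settings(parent_doc_values: bool = True) -> str:
    """Build a create-index body that sets the option named in the commit message."""
    body = {"settings": {"index._parent.doc_values": parent_doc_values}}
    return json.dumps(body, sort_keys=True)


# Opting out would fall back to building field data from the inverted index.
print(index_settings(False))
```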

Closes elastic#6107
Closes elastic#6511
@gmenegatti

+1

@martijnvg martijnvg removed the v1.5.0 label Jan 30, 2015
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Mar 25, 2015
This is a breaking change:
1) A parent type needs to be marked as parent in the _parent field mapping of the parent type.
2) top_children query will be removed. The top_children query was somewhat an alternative to has_child when it came to speed, but it isn't accurate and wasn't always faster.

Indices created before 2.0 will use field data and the old way of executing queries, but indices created on or after 2.0 will use the Lucene join.

Closes elastic#6107
Closes elastic#6511
Closes elastic#8134
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Apr 19, 2015
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Apr 19, 2015
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue May 2, 2015
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue May 10, 2015
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue May 10, 2015
This is a breaking change:
1) A parent type needs to be marked as parent in the _parent field mapping of the parent type.
2) The has_child and has_parent queries can't be used in index aliases any more, because parsing these queries requires the search context to be set. That is the case during normal _search api usage, but not when adding an index alias.

Indices created before 2.0 will use field data and the old way of executing queries, but indices created on or after 2.0 will use the Lucene join and encode the parent/child relation at index time in a special join doc values field.

Closes elastic#6107
Closes elastic#6511
Closes elastic#8134
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue May 11, 2015
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue May 15, 2015
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue May 19, 2015
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue May 29, 2015
* Cut the `has_child` and `has_parent` queries over to use Lucene's query time global ordinal join. The main benefit of this change is that parent/child queries can now execute efficiently when they are wrapped in a bigger boolean query. If the rest of the query only hits a few documents, the `has_child` and `has_parent` queries don't need to evaluate all parent or child documents any more.
* Cut the `_parent` field over to use doc values. This significantly reduces the on heap memory footprint of parent/child, because the parent id values are never loaded into memory.

Breaking changes:
* The `type` option on the `_parent` field can only point to a parent type that doesn't exist yet, so this means that an existing type/mapping can't become a parent type any longer.
* The `has_child` and `has_parent` queries can no longer be used in alias filters.
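
One way to satisfy the first restriction is to declare the parent and child types together when the index is created, so the parent type never exists on its own beforehand. The sketch below builds such a request body; the type names and the function name `parent_child_mappings` are hypothetical, and the approach is an assumption drawn from the restriction as worded here.

```python
import json
from typing import Any, Dict


def parent_child_mappings(parent_type: str, child_type: str) -> Dict[str, Any]:
    """Build a create-index body that declares both types in one request."""
    return {
        "mappings": {
            parent_type: {},
            # The child mapping points at the parent type via the _parent field.
            child_type: {"_parent": {"type": parent_type}},
        }
    }


print(json.dumps(parent_child_mappings("blog", "comment"), indent=2))
```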

All these changes, improvements and compatibility breaks only apply to indices created with ES version 2.0 or higher. For indices created with ES < 2.0 the older implementation is used.

It is highly recommended to re-index all your indices with parent and child documents to benefit from all the improvements that come with this refactoring. The easiest way to achieve this is by using the scan and bulk apis using a simple script.
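
A rough sketch of the conversion step such a script needs: turning hits read from a scan/scroll into a bulk request body for the new index while carrying the parent id along. It assumes each hit exposes the parent id either as a top-level `_parent` key or under `fields._parent`; that assumption, the `_parent` key in the bulk action line, and the function name `hits_to_bulk` are illustrative and should be checked against the version in use. The HTTP calls themselves are left out.

```python
import json
from typing import Any, Dict, Iterable


def hits_to_bulk(hits: Iterable[Dict[str, Any]], target_index: str) -> str:
    """Convert search hits into a newline-delimited bulk body for target_index."""
    lines = []
    for hit in hits:
        meta = {"_index": target_index, "_type": hit["_type"], "_id": hit["_id"]}
        # Assumed locations of the parent id in a hit.
        parent = hit.get("_parent")
        if parent is None:
            parent = hit.get("fields", {}).get("_parent")
        if isinstance(parent, list):
            parent = parent[0] if parent else None
        if parent is not None:
            meta["_parent"] = parent
        lines.append(json.dumps({"index": meta}, sort_keys=True))
        lines.append(json.dumps(hit["_source"], sort_keys=True))
    # The bulk format expects a trailing newline after the last line.
    return "\n".join(lines) + "\n" if lines else ""
```

Carrying the parent id matters because a child document is routed by its parent; re-indexing it without that id would place it on a shard chosen from its own id.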

Closes elastic#6107
Closes elastic#8134
@tmcerwin

tmcerwin commented Jun 4, 2015

Is there any estimate on when the 2.0 version will be released? We will likely have to remove usage of parent / child relationships, until they are moved to doc-values. The in memory relationships are not scaling well with our application.

@jpountz
Contributor Author

jpountz commented Jun 4, 2015

As you can see from #9970 there are 3 remaining boxes to tick and good progress is being made these days, so the first release candidate should happen pretty soon. However it might still take time between the release candidate and the GA depending on feedback.

@tmcerwin

tmcerwin commented Jun 4, 2015

Thanks for the quick response! I will run some testing on the master branch.

@ghost

ghost commented Aug 19, 2015

Is this still happening for 2.0?

@martijnvg
Member

@alexkavon yes, doc values support for parent/child will be included in the first 2.0 release. If you want to try it out, just make sure that you create a new index once upgraded to 2.0. The doc values support is only enabled on indices created on or after version 2.0. Indices that existed before the upgrade to 2.0 will continue to work and perform in the same way they did on previous 1.x releases.

@ghost

ghost commented Sep 2, 2015

So a reindexing should take care of this then?

@martijnvg
Member

@alexkavon Yes, once upgraded to 2.0.x a reindex would use the new p/c implementation.
