Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rethink how to index nested documents when types are gone #24362

Closed
jpountz opened this issue Apr 27, 2017 · 5 comments · Fixed by #51100
Closed

Rethink how to index nested documents when types are gone #24362

jpountz opened this issue Apr 27, 2017 · 5 comments · Fixed by #51100
Labels
help wanted adoptme >non-issue :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch

Comments

@jpountz
Copy link
Contributor

jpountz commented Apr 27, 2017

Currently nested documents reuse the _type field in order to identify root documents from children: the value of the type field starts with __ for children. However with types going away, we would probably like to stop indexing the _type field so we would need to find other ways to identify root/children.

For efficiency reasons, we need a fast query that identifies root documents. One way we could do that would be by adding a special field/value pair to root documents. However, we would probably only want to do that when there are nested mappings, which means we would have to reject adding nested objects to existing mappings.

@jpountz jpountz added :Search Foundations/Mapping Index mappings, including merging and defining field types discuss labels Apr 27, 2017
@martijnvg
Copy link
Member

I think this is a fair tradeoff. Especially since it would still allow the addition of additional nested fields, so it would only affect indices where no nested fields have been defined.

@jpountz
Copy link
Contributor Author

jpountz commented May 5, 2017

Discussed in Fixit-Friday: we will benchmark the overhead of a field that always has the same value. If it does not matter, we can add a meta-data field that identifies nested documents, otherwise we can either:

  • enforce nested mappings to be configured at index-creation time and only add the metadata field for root docs when there are nested mappings
  • or add an option to disable the fied that identifies root documents so that users who do not need the feature can get some performance back

@s1monw
Copy link
Contributor

s1monw commented Nov 15, 2017

I looked at this yesterday and after a closer look I think we can get the best of both worlds without indexing / adding anything to the documents. Today we index _version, _seq_id, and _primary_term for every nested document. While this is unnecessary and we can omit it we can use a DocValuesFieldExistsQuery once we don't index this for nested document. The values for these fields are all dummy values anyway so they have no real value for the user. Yet, when we look at the place where these dummy values are added we see that there is a reason for this:

https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/index/mapper/SeqNoFieldMapper.java#L247

// In the case of nested docs, let's fill nested docs with seqNo=1 and
// primaryTerm=0 so that Lucene doesn't write a Bitset for documents
// that don't have the field. This is consistent with the default value
// for efficiency.

This is a good reason for at least _seq_id and _version but given that we only access _primary_term rarely it might be ok to drop the value for nested docs in such a case. This would allow us to find parents very easily without paying any extra cost on indexing.

@bleskes I wonder what you think if this is a feasible solution?

s1monw added a commit to s1monw/elasticsearch that referenced this issue Nov 21, 2017
This change stops indxing the `_primary_term` field for nested documents
to allow fast retrieval of parent documents. Today we create a docvalues
field for children to ensure we have a dense datastructure on disk. Yet,
since we only use the primary term to tie-break on when we see the same
seqID on indexing having a dense datastructure is less important. We can
use this now to improve the nested docs performance and it's memory footprint.

Relates to elastic#24362
s1monw added a commit that referenced this issue Nov 21, 2017
This change stops indexing the `_primary_term` field for nested documents
to allow fast retrieval of parent documents. Today we create a docvalues
field for children to ensure we have a dense datastructure on disk. Yet,
since we only use the primary term to tie-break on when we see the same
seqID on indexing having a dense datastructure is less important. We can
use this now to improve the nested docs performance and it's memory footprint.

Relates to #24362
s1monw added a commit that referenced this issue Nov 21, 2017
This change stops indexing the `_primary_term` field for nested documents
to allow fast retrieval of parent documents. Today we create a docvalues
field for children to ensure we have a dense datastructure on disk. Yet,
since we only use the primary term to tie-break on when we see the same
seqID on indexing having a dense datastructure is less important. We can
use this now to improve the nested docs performance and it's memory footprint.

Relates to #24362
@javanna
Copy link
Member

javanna commented Mar 16, 2018

@elastic/es-search-aggs

@jpountz
Copy link
Contributor Author

jpountz commented Sep 4, 2018

We discussed it, this issue incorporated two things:

  • the need to identify root documents efficiently, which @s1monw addressed as explained above
  • the removal of the _type field, which still needs to be addressed

We agreed to rename _type to _nested to address the second point.

@jtibshirani jtibshirani mentioned this issue Apr 15, 2019
66 tasks
romseygeek added a commit that referenced this issue Jan 22, 2020
Currently nested documents repurpose the _type field to store their nested paths.
This commit adds a dedicated _nested_path field instead, which decouples this
information from types and will allow the removal of the _type field entirely further
down the line. To preserve backwards compatibility, references to this field are
mediated via methods that take an index settings object, and indexes created before
8x still use the _type field.

Relates to #41059
Closes #24362
@javanna javanna added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Jul 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted adoptme >non-issue :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch
Projects
None yet
5 participants