Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store XContentType of source in translog and index #22863

Open
jaymode opened this issue Jan 30, 2017 · 12 comments
Open

Store XContentType of source in translog and index #22863

jaymode opened this issue Jan 30, 2017 · 12 comments
Assignees
Labels
:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >tech debt

Comments

@jaymode
Copy link
Member

jaymode commented Jan 30, 2017

As part of #22691 (comment), more automated content type detection places were discovered and one of these is in the Translog. When we store an Index operation in the translog, we only store the bytes and do not store the xcontent type so we rely on content type detection for replaying the translog.

We can modify the translog format to store the xcontent type and use that when reading from the translog. However, we then lose the type information when storing data in the index. Should we also store the xcontent type in the index so that we can make use of it when returning responses and avoiding auto detection there as well?

@rjernst
Copy link
Member

rjernst commented Jan 30, 2017

However, we then lose the type information when storing data in the index

Do you mean _source? Isn't this always json? Or, it should be?

@javanna
Copy link
Member

javanna commented Jan 30, 2017

Do you mean _source? Isn't this always json? Or, it should be?

I was not sure either, but as far as I can see, and I hope I am wrong, we simply store the bytes array of the incoming document, whatever content type that is. the conversion from a type to another happens when returning the source.

@jaymode
Copy link
Member Author

jaymode commented Jan 30, 2017

Do you mean _source? Isn't this always json?

Correct _source. I just traced through the code to be sure and we take the bytes from the user and store them in the source field without modification; if the source is filtered we still convert it back to the original type.

Or, it should be?

I think that's tricky since the idea of _source is to store the original document so it can retrieved later on. From our docs:

The _source field contains the original JSON document body that was passed at index time. The _source field itself is not indexed (and thus is not searchable), but it is stored so that it can be returned when executing fetch requests, like get or search.

Maybe we intended it to always be JSON at some point or it is a docs bug...

@rjernst
Copy link
Member

rjernst commented Jan 30, 2017

The debate about whether _source should be the exact original bytes has been discussed in many issues. One in particular would allow us to remove the stored feature of mappings, and possibly compress better: #9034. I only bring this up because any solutions here seem complex (eg having an additional field _source_content_type..what happens if it is not correctly updated? What happens with update scripts?), so maybe we can avoid the need for it and get better compression with a similar amount of work.

@bleskes
Copy link
Contributor

bleskes commented Feb 3, 2017

we discussed this on fix it friday and as before this boils down to knowing how much performance boost (if at all) alternative/binary JSON formats (most notably Smile) gives us. This will help motivate keeping the current accept all methodology or give us the data we need to remove it and settle on, probably, textual JSON. I will talk to @danielmitterdorfer to see when we can make this happen.

@javanna
Copy link
Member

javanna commented Feb 4, 2017

Double checking (as remove was mentioned above) that this issue is not about removing support for any currently supported format. I'd expect the outcome of this issue to be about whether it is necessary to store in the translog and in the lucene index exactly the same bytes that were received, in any supported content type, rather than settling on a format and handling conversions internally.

@bleskes
Copy link
Contributor

bleskes commented Feb 6, 2017

@javanna Indeed this is issue is about explicitly storing the content type of a document alongside wherever we store the original documents bytes (translog and lucene). This is needed if we are going to remove the automatic content detection but are going to keep supporting multiple formats. However, if end up removing supporting multiple formats (or converting convenience formats like YAML to JSON) we don't need this explicit information. I hope this clarifies why these are tied together.

@javanna
Copy link
Member

javanna commented Feb 6, 2017

Things are clear on my end, yet I think I should clarify that we should not discuss again if we want to remove support for multiple formats at REST, cause that won't happen anytime soon, see #22811 . The two options here are 1) we add the type to translog and index 2) we settle on one internal format and internally convert all of the others (for bw comp we will still need some auto-detection).

@clintongormley clintongormley added :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Translog :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. labels Feb 13, 2018
@dnhatn
Copy link
Member

dnhatn commented Mar 27, 2018

Closing for now as we are in favor to remove support for CBOR, Smile and Yaml. We will reconsider this in the future.

@dnhatn dnhatn closed this as completed Mar 27, 2018
@dnhatn
Copy link
Member

dnhatn commented Mar 27, 2018

To be clear, this was closed in favour of #22811.

@javanna
Copy link
Member

javanna commented Jan 12, 2022

Given that we closed #22811 as a won't fix and this was closed in favour of it, I think we should reconsider what we need to do here. The information about the content type is still not stored in the translog and guessed when reading from the translog, using a method (XContentHelper#xContentType(BytesReference)) that we deprecated years ago. Maybe it is just a matter of embracing that we are not going to get rid of automatic type detection in some places, but it would be good then to un-deprecate the needed methods as they are going to stay around for the time being.

@javanna javanna reopened this Jan 12, 2022
@original-brownbear original-brownbear added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Aug 18, 2022
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >tech debt
Projects
None yet
Development

No branches or pull requests

9 participants