Store XContentType of source in translog and index #22863

jaymode · 2017-01-30T15:46:49Z

As part of #22691 (comment), more automated content type detection places were discovered and one of these is in the Translog. When we store an Index operation in the translog, we only store the bytes and do not store the xcontent type so we rely on content type detection for replaying the translog.

We can modify the translog format to store the xcontent type and use that when reading from the translog. However, we then lose the type information when storing data in the index. Should we also store the xcontent type in the index so that we can make use of it when returning responses and avoiding auto detection there as well?

rjernst · 2017-01-30T19:33:02Z

However, we then lose the type information when storing data in the index

Do you mean _source? Isn't this always json? Or, it should be?

javanna · 2017-01-30T19:57:23Z

Do you mean _source? Isn't this always json? Or, it should be?

I was not sure either, but as far as I can see, and I hope I am wrong, we simply store the bytes array of the incoming document, whatever content type that is. the conversion from a type to another happens when returning the source.

jaymode · 2017-01-30T19:58:21Z

Do you mean _source? Isn't this always json?

Correct _source. I just traced through the code to be sure and we take the bytes from the user and store them in the source field without modification; if the source is filtered we still convert it back to the original type.

Or, it should be?

I think that's tricky since the idea of _source is to store the original document so it can retrieved later on. From our docs:

The _source field contains the original JSON document body that was passed at index time. The _source field itself is not indexed (and thus is not searchable), but it is stored so that it can be returned when executing fetch requests, like get or search.

Maybe we intended it to always be JSON at some point or it is a docs bug...

rjernst · 2017-01-30T20:08:50Z

The debate about whether _source should be the exact original bytes has been discussed in many issues. One in particular would allow us to remove the stored feature of mappings, and possibly compress better: #9034. I only bring this up because any solutions here seem complex (eg having an additional field _source_content_type..what happens if it is not correctly updated? What happens with update scripts?), so maybe we can avoid the need for it and get better compression with a similar amount of work.

bleskes · 2017-02-03T13:51:04Z

we discussed this on fix it friday and as before this boils down to knowing how much performance boost (if at all) alternative/binary JSON formats (most notably Smile) gives us. This will help motivate keeping the current accept all methodology or give us the data we need to remove it and settle on, probably, textual JSON. I will talk to @danielmitterdorfer to see when we can make this happen.

javanna · 2017-02-04T09:44:03Z

Double checking (as remove was mentioned above) that this issue is not about removing support for any currently supported format. I'd expect the outcome of this issue to be about whether it is necessary to store in the translog and in the lucene index exactly the same bytes that were received, in any supported content type, rather than settling on a format and handling conversions internally.

bleskes · 2017-02-06T10:11:21Z

@javanna Indeed this is issue is about explicitly storing the content type of a document alongside wherever we store the original documents bytes (translog and lucene). This is needed if we are going to remove the automatic content detection but are going to keep supporting multiple formats. However, if end up removing supporting multiple formats (or converting convenience formats like YAML to JSON) we don't need this explicit information. I hope this clarifies why these are tied together.

javanna · 2017-02-06T11:06:28Z

Things are clear on my end, yet I think I should clarify that we should not discuss again if we want to remove support for multiple formats at REST, cause that won't happen anytime soon, see #22811 . The two options here are 1) we add the type to translog and index 2) we settle on one internal format and internally convert all of the others (for bw comp we will still need some auto-detection).

dnhatn · 2018-03-27T16:35:14Z

Closing for now as we are in favor to remove support for CBOR, Smile and Yaml. We will reconsider this in the future.

dnhatn · 2018-03-27T19:28:09Z

To be clear, this was closed in favour of #22811.

javanna · 2022-01-12T12:48:35Z

Given that we closed #22811 as a won't fix and this was closed in favour of it, I think we should reconsider what we need to do here. The information about the content type is still not stored in the translog and guessed when reading from the translog, using a method (XContentHelper#xContentType(BytesReference)) that we deprecated years ago. Maybe it is just a matter of embracing that we are not going to get rid of automatic type detection in some places, but it would be good then to un-deprecate the needed methods as they are going to stay around for the time being.

elasticsearchmachine · 2022-08-18T13:00:05Z

Pinging @elastic/es-distributed (Team:Distributed)

jaymode added :Translog discuss labels Jan 30, 2017

jaymode mentioned this issue Jan 30, 2017

Optionally require a valid content type for all rest requests with content #22691

Merged

dnhatn closed this as completed Mar 27, 2018

javanna reopened this Jan 12, 2022

javanna added team-discuss and removed discuss labels Jan 12, 2022

original-brownbear added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Aug 18, 2022

original-brownbear added >tech debt and removed team-discuss labels Sep 6, 2022

tlrx self-assigned this Sep 28, 2022

tlrx mentioned this issue Sep 29, 2022

Add XContentType of source in translog #90498

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Store XContentType of source in translog and index #22863

Store XContentType of source in translog and index #22863

jaymode commented Jan 30, 2017 •

edited

Loading

rjernst commented Jan 30, 2017

javanna commented Jan 30, 2017

jaymode commented Jan 30, 2017

rjernst commented Jan 30, 2017 •

edited

Loading

bleskes commented Feb 3, 2017

javanna commented Feb 4, 2017

bleskes commented Feb 6, 2017

javanna commented Feb 6, 2017

dnhatn commented Mar 27, 2018

dnhatn commented Mar 27, 2018

javanna commented Jan 12, 2022

elasticsearchmachine commented Aug 18, 2022

Store XContentType of source in translog and index #22863

Store XContentType of source in translog and index #22863

Comments

jaymode commented Jan 30, 2017 • edited Loading

rjernst commented Jan 30, 2017

javanna commented Jan 30, 2017

jaymode commented Jan 30, 2017

rjernst commented Jan 30, 2017 • edited Loading

bleskes commented Feb 3, 2017

javanna commented Feb 4, 2017

bleskes commented Feb 6, 2017

javanna commented Feb 6, 2017

dnhatn commented Mar 27, 2018

dnhatn commented Mar 27, 2018

javanna commented Jan 12, 2022

elasticsearchmachine commented Aug 18, 2022

jaymode commented Jan 30, 2017 •

edited

Loading

rjernst commented Jan 30, 2017 •

edited

Loading