Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reflect latest changes in synthetic source documentation #109501

Merged
merged 15 commits into from
Jul 4, 2024

Conversation

lkts
Copy link
Contributor

@lkts lkts commented Jun 7, 2024

This PR changes synthetic source documentation to reflect recent improvements.

@lkts lkts added >docs General docs changes :StorageEngine/Mapping The storage related side of mappings labels Jun 7, 2024
Copy link
Contributor

github-actions bot commented Jun 7, 2024

Documentation preview:

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-docs (Team:Docs)

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@@ -61,10 +60,8 @@ types:
** <<numeric-synthetic-source,`long`>>
** <<range-synthetic-source,`range` types>>
** <<numeric-synthetic-source,`scaled_float`>>
** <<search-as-you-type-synthetic-source,`search_as_you_type`>>
** <<numeric-synthetic-source,`short`>>
** <<text-synthetic-source,`text`>>
Copy link
Contributor Author

@lkts lkts Jun 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

text is controversial - if there is no keyword subfield it will store text in a field which is kind of the same as non-native approach

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have a category "It's complicated" :P

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo native means the field type handling synthetic source without relying on ignored source. It says nothing about how it implements synthetic source.

@lkts
Copy link
Contributor Author

lkts commented Jun 7, 2024

I wonder if we should expand "modifications" section on this page and mention f.e. precision loss in point fields.

@lkts
Copy link
Contributor Author

lkts commented Jun 7, 2024

Another question is if we should add section for synthetic source for all field types. It is not too useful in my opinion since it will just say "Synthetic source is supported by default.".


There are a couple of restrictions to be aware of:
<<synthetic-source-fields-native-list, Majority of field types>> natively support synthetic `_source`. Source for the field is constructed using existing data, most commonly <<doc-values,`doc_values`>>, meaning no additional space is used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: s/Majority/ The majority/ ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "Source for the field is usually reconstructed using doc values. However, sometimes stored fields are used, which can result in slight storage overhead."

@kkrik-es
Copy link
Contributor

kkrik-es commented Jun 8, 2024

Wanna mention something about objects (disabled, dynamic, ignore_*) and nested objects?

Also, shall we add the highlight note as mentioned in the mtg?

@kkrik-es
Copy link
Contributor

kkrik-es commented Jun 8, 2024

I wonder if we should expand "modifications" section on this page and mention f.e. precision loss in point fields.

++

@kkrik-es
Copy link
Contributor

kkrik-es commented Jun 8, 2024

Another question is if we should add section for synthetic source for all field types. It is not too useful in my opinion since it will just say "Synthetic source is supported by default.".

Yeah maybe use italics or smth to stress the by default part.

<<synthetic-source-modifications,modifications>> compared to the original JSON.
* Synthetic `_source` can be used with indices that contain only these field
types:
If a field type does not have native synthetic `_source` support, original data of this field is stored as is in the same way as the `_source` field in non-synthetic mode. Similarly, malformed values of fields that use <<ignore-malformed,`ignore_malformed`>> need to be stored as is, requiring additional storage space.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add that the part of the _source is then stored in a special source field called _ignore_source?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we either explain somewhere the difference between native and non-native synthetic source...so that we can use the terminology later or we don't use such terminology and explain it differently like for instance:

"If a field type supports document reconstruction by means of doc values...otherwise, if doc values are not available the field value is stored as is..."

I am not in favor of one or the other...just saying that if we use "native synthetic source" we should explain it beforehand...or don't use it at all.

@@ -7,8 +7,7 @@ searchable), but it is stored so that it can be returned when executing
_fetch_ requests, like <<docs-get,get>> or <<search-search,search>>.

If disk usage is important to you then have a look at
<<synthetic-source,synthetic `_source`>> which shrinks disk usage at the cost of
only supporting a subset of mappings and slower fetches or (not recommended)
<<synthetic-source,synthetic `_source`>> which shrinks disk usage at the cost of slower fetches or (not recommended)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rename at the cost of slower fetches or (not recommended) into at the cost of slower access to _source, in get and search APIs

@lkts
Copy link
Contributor Author

lkts commented Jun 10, 2024

@kkrik-es Yes, i'll add a highlight separately. We are still accumulating highlights :)

Copy link
Contributor

@shainaraskas shainaraskas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This info is super helpful! 🎉 Added some minor grammar/flow recommendations.

@@ -54,7 +54,7 @@ documents, the document `_id` is a hash of the document's dimensions and
`@timestamp`. A TSDS doesn't support custom document `_id` values.

* A TSDS uses <<synthetic-source,synthetic `_source`>>, and as a result is
subject to a number of <<synthetic-source-restrictions,restrictions>>.
subject to <<synthetic-source-modifications,modifications>> applied to `_source` field.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
subject to <<synthetic-source-modifications,modifications>> applied to `_source` field.
subject to <<synthetic-source-modifications,modifications>> applied to the `_source` field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing the haunts me all my life, thank you

@@ -7,8 +7,7 @@ searchable), but it is stored so that it can be returned when executing
_fetch_ requests, like <<docs-get,get>> or <<search-search,search>>.

If disk usage is important to you then have a look at
Copy link
Contributor

@shainaraskas shainaraskas Jun 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If disk usage is important to you then have a look at
If disk usage is important to you, then consider the following options:

Comment on lines 10 to 12
<<synthetic-source,synthetic `_source`>> which shrinks disk usage at the cost of slower fetches or (not recommended)
<<disable-source-field,disabling the `_source` field>> which also shrinks disk
usage but disables many features.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider defining these briefly up here. if you don't want to define them here, I'd remove the downsides and keep the links only.

Suggested change
<<synthetic-source,synthetic `_source`>> which shrinks disk usage at the cost of slower fetches or (not recommended)
<<disable-source-field,disabling the `_source` field>> which also shrinks disk
usage but disables many features.
- Using <<synthetic-source,synthetic `_source`>>, which reconstructs source content at the time of retrieval instead of storing it on disk. This shrinks disk usage, at the cost of slower access to `_source` in <<docs-get,Get>> and <<search-search,Search>> queries.
- <<disable-source-field,Disabling the `_source` field completely>>. This shrinks disk
usage but disables features that rely on `_source`.


There are a couple of restrictions to be aware of:
<<synthetic-source-fields-native-list, Majority of field types>> natively support synthetic `_source`. Source for the field is constructed using existing data, most commonly <<doc-values,`doc_values`>>, meaning no additional space is used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<<synthetic-source-fields-native-list, Majority of field types>> natively support synthetic `_source`. Source for the field is constructed using existing data, most commonly <<doc-values,`doc_values`>>, meaning no additional space is used.
<<synthetic-source-fields-native-list,Most field types>> natively support synthetic `_source`. The source for natively supported fields is constructed using existing data, most commonly <<doc-values,`doc_values`>>, meaning that no additional space is used.

@lkts
Copy link
Contributor Author

lkts commented Jun 27, 2024

@elasticmachine update branch

@elasticsearchmachine
Copy link
Collaborator

Hi @lkts, I've created a changelog YAML for you. Note that since this PR is labelled release highlight, you need to update the changelog YAML to fill out the extended information sections.


[[synthetic-source-fields-native-list]]
===== Field types that support synthetic source with no storage overhead
Following field types support synthetic source using data from <<doc-values,`doc_values`>> and <<stored-fields, stored fields>> and require no additional storage space to construct `_source` field. Additional storage is required to store ignored field values produced as a result of using <<ignore-malformed,`ignore_malformed`>> or <<ignore-above,`ignore_above`>> parameters.
Copy link
Contributor

@kkrik-es kkrik-es Jun 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Following/The following/
s/construct _source field/construct the _source field/



[[synthetic-source-fields-native-list]]
===== Field types that support synthetic source with no storage overhead
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on the fence on whether to report the fields that have additional indexing overhead when ignore_malformed or ignore_above are set. Wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fair, i imagine normally you expect malformed values to be a small portion of your data.

Copy link
Contributor

@kkrik-es kkrik-es Jun 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the copying parser always copies the current value preemptively, iiuc? This should increase overhead for all indexed docs, not just the ones with malformed values..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean. I don't know if it should be in this section but we can mention it somewhere. Let me follow up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's wait with mentioning the copy parser overhead until we have a better understanding how much it actually is? I suspect it is minimal.

Copy link
Contributor

@kkrik-es kkrik-es left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're getting there.

Shall we call out that object params are supported too? Esp disabled is an interesting one, as it increases storage but works as expected.

Copy link

@tylerperk tylerperk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just some tiny changes. Thanks

docs/reference/mapping/fields/synthetic-source.asciidoc Outdated Show resolved Hide resolved
@lkts lkts requested a review from tylerperk July 2, 2024 22:01
Copy link
Contributor

@shainaraskas shainaraskas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approved with a couple of small additional suggestions!

docs/reference/mapping/fields/synthetic-source.asciidoc Outdated Show resolved Hide resolved
docs/reference/mapping/fields/synthetic-source.asciidoc Outdated Show resolved Hide resolved
docs/reference/mapping/fields/synthetic-source.asciidoc Outdated Show resolved Hide resolved
@@ -178,4 +153,38 @@ that ordering.

[[synthetic-source-modifications-ranges]]
====== Representation of ranges
Range field vales (e.g. `long_range`) are always represented as inclusive on both sides with bounds adjusted accordingly. See <<range-synthetic-source-inclusive, examples>>.
Range field vales (e.g. `long_range`) are always represented as inclusive on both sides with bounds adjusted accordingly. See <<range-synthetic-source-inclusive, examples>>.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this say values?

docs/reference/mapping/fields/synthetic-source.asciidoc Outdated Show resolved Hide resolved
@elasticsearchmachine
Copy link
Collaborator

Hi @lkts, I've updated the changelog YAML for you. Note that since this PR is labelled release highlight, you need to update the changelog YAML to fill out the extended information sections.

@lkts lkts added v8.15.0 and removed v8.16.0 labels Jul 4, 2024
Copy link
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lkts lkts merged commit 276ae12 into elastic:main Jul 4, 2024
5 checks passed
@lkts lkts deleted the rework_synthetic_source_docs branch July 4, 2024 16:48
@lkts lkts added the auto-backport Automatically create backport pull requests when merged label Jul 4, 2024
lkts added a commit to lkts/elasticsearch that referenced this pull request Jul 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged >docs General docs changes release highlight :StorageEngine/Mapping The storage related side of mappings Team:Docs Meta label for docs team Team:StorageEngine v8.15.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants