Reflect latest changes in synthetic source documentation #109501

Merged: 15 commits, Jul 4, 2024
14 changes: 14 additions & 0 deletions docs/changelog/109501.yaml
@@ -0,0 +1,14 @@
pr: 109501
summary: Reflect latest changes in synthetic source documentation
area: Mapping
type: enhancement
issues: []
highlight:
  title: Synthetic `_source` improvements
  body: |-
    There are multiple improvements to synthetic `_source` functionality:

    * Synthetic `_source` is now supported for all field types including `nested` and `object`. `object` fields are supported with `enabled` set to `false`.

    * Synthetic `_source` can be enabled together with `ignore_malformed` and `ignore_above` parameters for all field types that support them.
  notable: false
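
For illustration only (this snippet is not part of the changelog file), the combination described in the highlight might look like the index-creation body below, written as a Python dict. The field names (`message`, `status_code`, `payload`) are hypothetical, and the `"_source": {"mode": "synthetic"}` form is assumed to be the mapping syntax in use at the time of this PR; `ignore_malformed`, `ignore_above`, and `enabled` are the parameters named in the highlight.

```python
import json

# Hypothetical index body; the field names are made up for illustration.
index_body = {
    "mappings": {
        # Assumed syntax for enabling synthetic `_source` in the mapping.
        "_source": {"mode": "synthetic"},
        "properties": {
            # Strings longer than 256 characters are ignored for indexing.
            "message": {"type": "keyword", "ignore_above": 256},
            # Values that fail to parse as integers are ignored, not rejected.
            "status_code": {"type": "integer", "ignore_malformed": True},
            # An object field that is kept but not parsed or indexed.
            "payload": {"type": "object", "enabled": False},
        },
    }
}

# Serialize to the JSON that would be sent as the body of `PUT <index>`.
request_json = json.dumps(index_body, indent=2)
print(request_json)
```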
3 changes: 2 additions & 1 deletion docs/reference/data-streams/tsds.asciidoc
@@ -53,8 +53,9 @@ shard segments by `_tsid` and `@timestamp`.
documents, the document `_id` is a hash of the document's dimensions and
`@timestamp`. A TSDS doesn't support custom document `_id` values.


* A TSDS uses <<synthetic-source,synthetic `_source`>>, and as a result is
subject to a number of <<synthetic-source-restrictions,restrictions>>.
subject to some <<synthetic-source-restrictions,restrictions>> and <<synthetic-source-modifications,modifications>> applied to the `_source` field.

NOTE: A time series index can contain fields other than dimensions or metrics.

12 changes: 6 additions & 6 deletions docs/reference/mapping/fields/source-field.asciidoc
@@ -6,11 +6,11 @@ at index time. The `_source` field itself is not indexed (and thus is not
searchable), but it is stored so that it can be returned when executing
_fetch_ requests, like <<docs-get,get>> or <<search-search,search>>.

If disk usage is important to you then have a look at
<<synthetic-source,synthetic `_source`>> which shrinks disk usage at the cost of
only supporting a subset of mappings and slower fetches or (not recommended)
<<disable-source-field,disabling the `_source` field>> which also shrinks disk
usage but disables many features.
If disk usage is important to you, then consider the following options:

- Using <<synthetic-source,synthetic `_source`>>, which reconstructs source content at the time of retrieval instead of storing it on disk. This shrinks disk usage, at the cost of slower access to `_source` in <<docs-get,Get>> and <<search-search,Search>> queries.
- <<disable-source-field,Disabling the `_source` field completely>>. This shrinks disk
usage but disables features that rely on `_source`.

include::synthetic-source.asciidoc[]

@@ -43,7 +43,7 @@ available then a number of features are not supported:
* The <<docs-update,`update`>>, <<docs-update-by-query,`update_by_query`>>,
and <<docs-reindex,`reindex`>> APIs.

* In the {kib} link:{kibana-ref}/discover.html[Discover] application, field data will not be displayed.
* In the {kib} link:{kibana-ref}/discover.html[Discover] application, field data will not be displayed.

* On the fly <<highlighting,highlighting>>.

83 changes: 48 additions & 35 deletions docs/reference/mapping/fields/synthetic-source.asciidoc
@@ -28,45 +28,22 @@ PUT idx

While this on the fly reconstruction is *generally* slower than saving the source
documents verbatim and loading them at query time, it saves a lot of storage
space.
space. Additional latency can be avoided by not loading `_source` field in queries when it is not needed.
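
As a rough illustration (not part of this diff), a search request can skip `_source` loading by setting `"_source": false` in the request body and asking for specific values through the `fields` option instead. The field name `status_code` is hypothetical.

```python
import json

# Hypothetical search body: do not load `_source`, fetch one field instead.
search_body = {
    "query": {"match_all": {}},
    "_source": False,            # skip synthetic `_source` reconstruction
    "fields": ["status_code"],   # hypothetical field returned via `fields`
}

# Serialize to the JSON body of a search request.
search_json = json.dumps(search_body)
print(search_json)
```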

[[synthetic-source-fields]]
===== Supported fields
Synthetic `_source` is supported by all field types. Depending on implementation details, field types have different properties when used with synthetic `_source`.

<<synthetic-source-fields-native-list, Most field types>> construct synthetic `_source` using existing data, most commonly <<doc-values,`doc_values`>> and <<stored-fields, stored fields>>. For these field types, no additional space is needed to store the contents of `_source` field. Due to the storage layout of <<doc-values,`doc_values`>>, the generated `_source` field undergoes <<synthetic-source-modifications, modifications>> compared to original document.

For all other field types, the original value of the field is stored as is, in the same way as the `_source` field in non-synthetic mode. In this case there are no modifications and field data in `_source` is the same as in the original document. Similarly, malformed values of fields that use <<ignore-malformed,`ignore_malformed`>> or <<ignore-above,`ignore_above`>> need to be stored as is. This approach is less storage efficient since data needed for `_source` reconstruction is stored in addition to other data required to index the field (like `doc_values`).

[[synthetic-source-restrictions]]
===== Synthetic `_source` restrictions

There are a couple of restrictions to be aware of:
Synthetic `_source` cannot be used together with field mappings that use <<copy-to,`copy_to`>>.
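
A minimal sketch (not part of this diff) of how one might check a mapping for this restriction before enabling synthetic `_source`. The helper `find_copy_to_fields` and the example mapping are hypothetical; the helper only walks `properties` and does not inspect multi-fields.

```python
def find_copy_to_fields(properties, prefix=""):
    """Return dotted paths of fields whose mapping uses `copy_to`."""
    found = []
    for name, field in properties.items():
        path = f"{prefix}{name}"
        if "copy_to" in field:
            found.append(path)
        # Recurse into sub-fields of object or nested mappings.
        sub_properties = field.get("properties")
        if sub_properties:
            found.extend(find_copy_to_fields(sub_properties, prefix=f"{path}."))
    return found


# Hypothetical mapping used only to exercise the helper.
example_properties = {
    "first_name": {"type": "text", "copy_to": "full_name"},
    "full_name": {"type": "text"},
    "address": {
        "properties": {
            "city": {"type": "keyword", "copy_to": "full_name"},
            "zip": {"type": "keyword"},
        }
    },
}

offending = find_copy_to_fields(example_properties)
print(offending)
```

An empty result would suggest the mapping does not hit the `copy_to` restriction; field-type-specific restrictions would still need to be checked separately.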

* When you retrieve synthetic `_source` content it undergoes minor
<<synthetic-source-modifications,modifications>> compared to the original JSON.
* Synthetic `_source` can be used with indices that contain only these field
types:

** <<aggregate-metric-double-synthetic-source, `aggregate_metric_double`>>
** {plugins}/mapper-annotated-text-usage.html#annotated-text-synthetic-source[`annotated-text`]
** <<binary-synthetic-source,`binary`>>
** <<boolean-synthetic-source,`boolean`>>
** <<numeric-synthetic-source,`byte`>>
** <<date-synthetic-source,`date`>>
** <<date-nanos-synthetic-source,`date_nanos`>>
** <<dense-vector-synthetic-source,`dense_vector`>>
** <<numeric-synthetic-source,`double`>>
** <<flattened-synthetic-source, `flattened`>>
** <<numeric-synthetic-source,`float`>>
** <<geo-point-synthetic-source,`geo_point`>>
** <<geo-shape-synthetic-source,`geo_shape`>>
** <<numeric-synthetic-source,`half_float`>>
** <<histogram-synthetic-source,`histogram`>>
** <<numeric-synthetic-source,`integer`>>
** <<ip-synthetic-source,`ip`>>
** <<keyword-synthetic-source,`keyword`>>
** <<numeric-synthetic-source,`long`>>
** <<range-synthetic-source,`range` types>>
** <<numeric-synthetic-source,`scaled_float`>>
** <<search-as-you-type-synthetic-source,`search_as_you_type`>>
** <<numeric-synthetic-source,`short`>>
** <<text-synthetic-source,`text`>>
** <<token-count-synthetic-source,`token_count`>>
** <<version-synthetic-source,`version`>>
** <<wildcard-synthetic-source,`wildcard`>>
Some field types have additional restrictions. These restrictions are documented in the **synthetic `_source`** section of the field type's <<mapping-types,documentation>>.

[[synthetic-source-modifications]]
===== Synthetic `_source` modifications
@@ -178,4 +155,40 @@ that ordering.

[[synthetic-source-modifications-ranges]]
====== Representation of ranges
Range field vales (e.g. `long_range`) are always represented as inclusive on both sides with bounds adjusted accordingly. See <<range-synthetic-source-inclusive, examples>>.
Range field values (e.g. `long_range`) are always represented as inclusive on both sides with bounds adjusted accordingly. See <<range-synthetic-source-inclusive, examples>>.
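
A small sketch (not part of this diff) of what "inclusive on both sides with bounds adjusted accordingly" means for an integer range such as `long_range`. The helper name is hypothetical, and it assumes whole-number bounds only; other range types (dates, IPs, floating point) and open-ended bounds are not handled here.

```python
def to_inclusive_long_range(value):
    """Rewrite exclusive gt/lt bounds of an integer range as gte/lte."""
    inclusive = {}
    if "gte" in value:
        inclusive["gte"] = value["gte"]
    elif "gt" in value:
        # "greater than n" is the same as "greater than or equal to n + 1".
        inclusive["gte"] = value["gt"] + 1
    if "lte" in value:
        inclusive["lte"] = value["lte"]
    elif "lt" in value:
        # "less than n" is the same as "less than or equal to n - 1".
        inclusive["lte"] = value["lt"] - 1
    return inclusive


normalized = to_inclusive_long_range({"gt": 200, "lt": 300})
print(normalized)
```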

[[synthetic-source-precision-loss-for-point-types]]
====== Reduced precision of `geo_point` values
Values of `geo_point` fields are represented in synthetic `_source` with reduced precision. See <<geo-point-synthetic-source, examples>>.
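
To give a feel for the size of the precision loss, here is a sketch (not part of this diff) that assumes each coordinate is quantized to a 32-bit grid, similar to the point encoding in Lucene. The exact rounding used by Elasticsearch may differ, so treat the step sizes and the sample coordinates as illustrative assumptions.

```python
import math

# Assumed grid: 2**32 steps across the latitude and longitude ranges.
LAT_STEP = 180.0 / (1 << 32)
LON_STEP = 360.0 / (1 << 32)


def quantize(value, step):
    """Snap a coordinate down to the nearest multiple of `step`."""
    return math.floor(value / step) * step


# Hypothetical coordinates.
lat, lon = 41.12, -71.34
q_lat = quantize(lat, LAT_STEP)
q_lon = quantize(lon, LON_STEP)

# The reconstructed values are close to, but usually not equal to, the inputs.
print(q_lat, q_lon)
```

Under this assumption the error stays below about one ten-millionth of a degree per coordinate.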


[[synthetic-source-fields-native-list]]
===== Field types that support synthetic source with no storage overhead

Contributor:
I'm on the fence on whether to report the fields that have additional indexing overhead when ignore_malformed or ignore_above are set. Wdyt?

Contributor Author:
I think it's fair, I imagine normally you expect malformed values to be a small portion of your data.

Contributor (@kkrik-es, Jun 29, 2024):
Well, the copying parser always copies the current value preemptively, iiuc? This should increase overhead for all indexed docs, not just the ones with malformed values.

Contributor Author:
I see what you mean. I don't know if it should be in this section but we can mention it somewhere. Let me follow up.

Member:
Let's wait with mentioning the copy parser overhead until we have a better understanding of how much it actually is? I suspect it is minimal.

The following field types support synthetic source using data from <<doc-values,`doc_values`>> or <<stored-fields, stored fields>>, and require no additional storage space to construct the `_source` field.

NOTE: If you enable the <<ignore-malformed,`ignore_malformed`>> or <<ignore-above,`ignore_above`>> settings, then additional storage is required to store ignored field values for these types.

** <<aggregate-metric-double-synthetic-source, `aggregate_metric_double`>>
** {plugins}/mapper-annotated-text-usage.html#annotated-text-synthetic-source[`annotated-text`]
** <<binary-synthetic-source,`binary`>>
** <<boolean-synthetic-source,`boolean`>>
** <<numeric-synthetic-source,`byte`>>
** <<date-synthetic-source,`date`>>
** <<date-nanos-synthetic-source,`date_nanos`>>
** <<dense-vector-synthetic-source,`dense_vector`>>
** <<numeric-synthetic-source,`double`>>
** <<flattened-synthetic-source, `flattened`>>
** <<numeric-synthetic-source,`float`>>
** <<geo-point-synthetic-source,`geo_point`>>
** <<numeric-synthetic-source,`half_float`>>
** <<histogram-synthetic-source,`histogram`>>
** <<numeric-synthetic-source,`integer`>>
** <<ip-synthetic-source,`ip`>>
** <<keyword-synthetic-source,`keyword`>>
** <<numeric-synthetic-source,`long`>>
** <<range-synthetic-source,`range` types>>
** <<numeric-synthetic-source,`scaled_float`>>
** <<numeric-synthetic-source,`short`>>
** <<text-synthetic-source,`text`>>
** <<version-synthetic-source,`version`>>
** <<wildcard-synthetic-source,`wildcard`>>