Improve block loader fallback to source when source mode is synthetic. #115394

Open
3 tasks
martijnvg opened this issue Oct 23, 2024 · 9 comments
Labels
:Analytics/Compute Engine Analytics in ES|QL >enhancement Meta :StorageEngine/Mapping The storage related side of mappings Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:StorageEngine

Comments

@martijnvg
Member

martijnvg commented Oct 23, 2024

Sometimes MappedFieldType#blockLoader(...) implementations fall back to an implementation that uses source, for example when a field has doc values or stored fields disabled, or when ignore above or ignore malformed has been configured. In that case the block loader reads the _source field, extracts the relevant field from it, and returns that as the value.

When synthetic source is enabled, the source is instead computed from many doc values or stored fields, and only then is the relevant field extracted. This is very slow and should be improved. The interesting part with synthetic source is that we don't need to compute the source in order to provide fallback values as part of the block loaders returned by MappedFieldType#blockLoader(...).

Synthetic source details relevant to block loader fallback logic:

  • If a field value exceeds the configured ignore above, the value is stored in a separate stored field with the suffix _original.
  • If a field value is malformed, the value is stored in a stored field with the same name as defined in the mapping, regardless of whether stored fields are enabled.
  • If a field has no stored field or doc values, it is stored in the _ignored_source stored field.
  • Ignored source is the fallback for synthetic source to avoid content in source getting lost. So for example, if an object field is disabled or the number of allowed mapped fields is exceeded, the field values / content should end up in ignored source.

In the case of synthetic source, the block loaders returned by MappedFieldType#blockLoader(...) can be made aware of these details and, instead of returning a BlockSourceReader based implementation, return an implementation that uses the right stored field or uses ignored source.
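
The selection described above could be sketched roughly as follows. This is only an illustration of the decision logic, not Elasticsearch code: `FallbackPlanSketch`, `FieldConfig`, and `storedFieldsToRead` are hypothetical names, and the exact `._original` separator is an assumption (the description above only says the stored field carries the suffix _original).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; these types are not part of Elasticsearch. */
final class FallbackPlanSketch {

    // Assumed naming: the issue only states that the suffix is "_original".
    static final String ORIGINAL_SUFFIX = "._original";
    static final String IGNORED_SOURCE = "_ignored_source";
    static final String SOURCE = "_source";

    /** Hypothetical summary of the mapping settings that matter for the fallback. */
    static final class FieldConfig {
        final String name;
        final boolean hasDocValues;
        final boolean isStored;
        final boolean ignoreAboveSet;
        final boolean ignoreMalformedSet;

        FieldConfig(String name, boolean hasDocValues, boolean isStored,
                    boolean ignoreAboveSet, boolean ignoreMalformedSet) {
            this.name = name;
            this.hasDocValues = hasDocValues;
            this.isStored = isStored;
            this.ignoreAboveSet = ignoreAboveSet;
            this.ignoreMalformedSet = ignoreMalformedSet;
        }
    }

    /**
     * Returns the stored fields a fallback block loader would need to read,
     * assuming the caller has already decided that a fallback is required.
     */
    static List<String> storedFieldsToRead(FieldConfig field, boolean syntheticSource) {
        List<String> fields = new ArrayList<>();
        if (!syntheticSource) {
            // Stored source: read _source and extract the field from it.
            fields.add(SOURCE);
            return fields;
        }
        if (field.ignoreAboveSet) {
            // Values exceeding ignore above live in a separate stored field.
            fields.add(field.name + ORIGINAL_SUFFIX);
        }
        if (field.ignoreMalformedSet) {
            // Malformed values live in a stored field named like the mapped field.
            fields.add(field.name);
        }
        if (!field.hasDocValues && !field.isStored) {
            // Nothing else holds the value, so it ended up in ignored source.
            fields.add(IGNORED_SOURCE);
        }
        if (fields.isEmpty()) {
            // No cheaper location known: keep today's behaviour and synthesize the source.
            fields.add(SOURCE);
        }
        return fields;
    }
}
```

For a keyword-like field with doc values and ignore above configured, this sketch would read only the `_original` stored field under synthetic source, instead of synthesizing the whole document.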

Tasks:

  • Handle ignore above more efficiently.
  • Handle ignore malformed more efficiently.
  • Handle reading field values that are stored in ignored source.
@martijnvg martijnvg added :Analytics/Compute Engine Analytics in ES|QL :StorageEngine/Mapping The storage related side of mappings >enhancement Meta labels Oct 23, 2024
@elasticsearchmachine elasticsearchmachine added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:StorageEngine labels Oct 23, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@martijnvg
Member Author

This POC (#114886) shows how falling back to ignored source can work.

@felixbarny
Member

Does this also cover cases where source filtering is used? In other words, when you only need to retrieve a specific field from _source, can we avoid synthesizing the full _source, which means fetching all fields?

@martijnvg
Member Author

This issue is in the context of synthetic source, but the idea is to avoid synthesizing the full source when only a subset of fields is required. This isn't the case today.

@felixbarny
Member

My comment was also in the context of synthetic source. I was asking whether we could also add an optimization that avoids reconstructing the full _source when source filtering is used, and instead fetches only the fields required by the source filtering configuration. So if a doc has 100 fields but the search request contains "_source": [ "foo", "bar" ], we could fetch only those two fields when synthesizing the source, instead of synthesizing the full source and then filtering it afterwards.
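
To make the difference concrete, here is a minimal sketch of the two orderings. It is not how Elasticsearch implements source filtering: `FilteredSynthesisSketch` and both methods are hypothetical, each field is modelled as a `Supplier` standing in for a doc values or stored field read, and only exact top-level field names are handled (no wildcards or nested paths).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch; not an Elasticsearch API. */
final class FilteredSynthesisSketch {

    /** Loads every field to build the full source, then keeps only the requested ones. */
    static Map<String, Object> synthesizeThenFilter(Map<String, Supplier<Object>> loaders,
                                                    List<String> includes) {
        Map<String, Object> full = new LinkedHashMap<>();
        for (Map.Entry<String, Supplier<Object>> e : loaders.entrySet()) {
            full.put(e.getKey(), e.getValue().get()); // one load per mapped field
        }
        Map<String, Object> filtered = new LinkedHashMap<>();
        for (String field : includes) {
            if (full.containsKey(field)) {
                filtered.put(field, full.get(field));
            }
        }
        return filtered;
    }

    /** Loads only the requested fields, so the cost scales with the filter size. */
    static Map<String, Object> synthesizeFiltered(Map<String, Supplier<Object>> loaders,
                                                  List<String> includes) {
        Map<String, Object> filtered = new LinkedHashMap<>();
        for (String field : includes) {
            Supplier<Object> loader = loaders.get(field);
            if (loader != null) {
                filtered.put(field, loader.get()); // one load per requested field
            }
        }
        return filtered;
    }
}
```

Both methods return the same map for the same inputs; with 100 mapped fields and a filter of "foo" and "bar", the first performs 100 loads and the second performs 2.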

@martijnvg
Member Author

Currently this issue is about ES|QL's fallback mechanism to source when source mode is synthetic.

There is an issue for the search api: #94001
And @jimczi did relevant work recently that does source filtering correctly for synthetic source in the get api (#113827).

@jimczi
Contributor

jimczi commented Oct 25, 2024

I am currently focused on #114618 and was planning to resume and finish the work on #113827 after that. @felixbarny are you interested in this change for the get or the search API? I am asking because get is much simpler to achieve than search, which is why we started there.

@felixbarny
Member

I'm interested in _search. It's not something urgent. The question came up in the context of refactoring the APM UI to use fields. But some places still use _source with source filtering. Before the refactoring, there was a lot of usage of _source with filtering in the context of search. So I was suspecting that there may be other places that make use of _search + filtering that would have a performance regression when using synthetic _source.

4 participants