Add ability to filter query based on data tier #68135

Closed
dakrone opened this issue Jan 28, 2021 · 21 comments · Fixed by #69288
Labels
>enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch

Comments

@dakrone
Member

dakrone commented Jan 28, 2021

Some use cases call for querying data within a certain tier (or set of tiers). For example, with a data stream or alias managed by ILM, a user may want to query only the data in the "hot" tier. (See #47881, where users have asked for ILM to support aliases so that queries can target a specific lifecycle phase of the data.)

It would be nice to have a general-purpose query, usable for regular searches as well as aggregations, that allows specifying a "tier" of data to query. This would allow a query like:

{
  "query": {
    "bool": {
      "filter": {
        "tier": "hot"
      }
    }
  },
  "aggs": {...}
}

This is especially nice when users start using searchable snapshots for their data, as it would allow bypassing indices in other tiers (such as "cold" and "frozen") without requiring any sort of download of data.

One question that may come up is "why not just use a time range filter for getting the most recent data?". A time range filter works when only consuming a single set of data (such as a single data stream). With a first-class query for data tier searching, though, multiple data streams and aliases with differing "hot" tier definitions could be queried together, without requiring the user to know the timing for each tier or to split the range filter by index pattern. For example, when searching three data streams that keep data in the hot phase for 7, 14, and 21 days respectively, using tier: hot is much simpler than specifying three different range filters tied to three different data stream index names.
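To make the comparison concrete, here is a minimal sketch of the two request bodies. The data stream names, the `@timestamp` field, and the `.ds-<name>-*` backing index pattern are assumptions made for illustration. The `_tier` field name and the `data_hot` value follow the field that the commits at the end of this thread introduce, not the `"tier": "hot"` syntax proposed above.

```python
import json

# Hypothetical data streams and the number of days each keeps data in its hot phase.
HOT_DAYS = {"logs-app": 7, "logs-audit": 14, "metrics-host": 21}


def range_based_query(hot_days):
    """Build one clause per data stream: its index pattern plus its own time window."""
    clauses = []
    for stream, days in hot_days.items():
        clauses.append({
            "bool": {
                "filter": [
                    # Assumed backing index naming; adjust to the real index names.
                    {"wildcard": {"_index": {"value": f".ds-{stream}-*"}}},
                    {"range": {"@timestamp": {"gte": f"now-{days}d"}}},
                ]
            }
        })
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}


def tier_based_query():
    """Build a single filter that does not need to know any per-stream timing."""
    return {"query": {"bool": {"filter": [{"term": {"_tier": "data_hot"}}]}}}


if __name__ == "__main__":
    print(json.dumps(range_based_query(HOT_DAYS), indent=2))
    print(json.dumps(tier_based_query(), indent=2))
```

The range-based body grows with every data stream and has to be updated whenever a lifecycle policy changes. The tier-based body stays the same.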

This also helps some of the use cases in #47881 while being accessible to both data streams and aliases.

If this is of interest, we could perform the filtering prior to any query execution, since the tier is accessible through the index metadata: the query could be rewritten to exclude indices that aren't in the specified tier.
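The pre-filtering idea can be sketched outside Elasticsearch as a plain function over index metadata. This is only an illustration, not the actual rewrite code. It assumes the tier is read from the `index.routing.allocation.include._tier_preference` setting discussed later in this thread and that the first listed value counts as the index's tier. The index names are made up.

```python
TIER_PREFERENCE = "index.routing.allocation.include._tier_preference"

# Hypothetical index metadata: index name -> index settings.
EXAMPLE_SETTINGS = {
    "logs-000001": {TIER_PREFERENCE: "data_cold,data_warm,data_hot"},
    "logs-000002": {TIER_PREFERENCE: "data_warm,data_hot"},
    "logs-000003": {TIER_PREFERENCE: "data_hot"},
    "legacy-index": {},  # no tier preference configured
}


def first_tier(settings):
    """Return the first tier named in the preference list, or None if unset."""
    preference = settings.get(TIER_PREFERENCE, "")
    first = preference.split(",")[0].strip()
    return first or None


def indices_in_tiers(index_settings, wanted_tiers):
    """Keep only the indices whose first tier preference is one of wanted_tiers."""
    return [
        name
        for name, settings in index_settings.items()
        if first_tier(settings) in wanted_tiers
    ]


if __name__ == "__main__":
    # Only these indices would need to be searched for a hot-tier query.
    print(indices_in_tiers(EXAMPLE_SETTINGS, {"data_hot"}))
```

Because the decision needs only metadata, indices in other tiers can be skipped without touching their data.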

@dakrone dakrone added >enhancement :Search/Search Search-related issues that do not fall into other categories labels Jan 28, 2021
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Jan 28, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@skearns64
Contributor

++ to this concept.

If I'm understanding it right, having this would make it easy to use DLS to make roles that can only query specific data tiers? E.g. regular users can query hot/warm, and power users can also query cold/frozen.

@dakrone
Member Author

dakrone commented Jan 28, 2021

having this would make it easy to use DLS to make roles that can only query specific data tiers?

Yes that could definitely be a use for this as well!
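As a rough sketch of what such a role might look like, the helper below builds a role body with a document level security `query` that restricts reads to a set of tiers. The role layout (`indices`, `names`, `privileges`, `query`) follows the Elasticsearch role API. The index pattern, the tier values, and the assumption that a `_tier` filter is accepted inside a DLS query are illustrative and not confirmed anywhere in this thread.

```python
import json


def tier_limited_role(index_patterns, tiers):
    """Build a role body whose DLS query only matches documents in the given tiers."""
    return {
        "indices": [
            {
                "names": list(index_patterns),
                "privileges": ["read"],
                # Assumption: a terms query on _tier is usable as a DLS query.
                "query": {"terms": {"_tier": list(tiers)}},
            }
        ]
    }


if __name__ == "__main__":
    # Hypothetical roles: regular users see hot/warm, power users also see cold/frozen.
    regular = tier_limited_role(["logs-*"], ["data_hot", "data_warm"])
    power = tier_limited_role(
        ["logs-*"], ["data_hot", "data_warm", "data_cold", "data_frozen"]
    )
    print(json.dumps(regular, indent=2))
    print(json.dumps(power, indent=2))
```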

@jpountz jpountz added :Search Foundations/Mapping Index mappings, including merging and defining field types and removed :Search/Search Search-related issues that do not fall into other categories labels Feb 1, 2021
@jpountz
Contributor

jpountz commented Feb 1, 2021

The easiest way to make it work would be to define a _tier metadata field, similar to the _index field.

@markharwood
Contributor

Given the way the tier-checking code works, would it make sense to have a metadata field called node_role which can be queried for values like data_warm?

I can see arguments both ways for tier == warm or node_role == data_warm fields/queries. The former provides some abstraction over implementation details while the latter allows for other future node roles to be testable too.

@markharwood
Contributor

Started with a PR for this but struggling - not sure how the field mapper API lets me get hold of the current DiscoveryNode to read the node role settings, so that I can call a tester method on DataTier similar to this one.

The alternative is perhaps for the field mapper to get hold of the index settings and test index.routing.allocation.include._tier_preference, which is more of an aspiration than a guarantee of tier.

@jimczi
Contributor

jimczi commented Feb 19, 2021

You only need the node settings to call the static function you linked. You can access them through IndexSettings#getNodeSettings. I don't think we should expose the discovery node in the SearchExecutionContext, looking at the node settings should be enough if you add a static function in DataTier that returns the list of roles.

@markharwood
Contributor

You can access them through IndexSettings#getNodeSettings

I see only a subset of settings that way - it is only the settings part of what I get if I call http://localhost:9200/_nodes/MY_NODE/settings (see full output here).
It doesn't include the node roles info provided in that fuller output.

@jimczi
Contributor

jimczi commented Feb 19, 2021

That's because the default values are not materialized in the Settings. You need to retrieve the values through NodeRoleSettings#NODE_ROLES_SETTING.
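The distinction can be illustrated with a small Python model. This is not the Elasticsearch Java API. The `Setting` class below is a hypothetical stand-in for a setting definition that knows its own default, and the default role list is made up for the example. The idea is that the raw settings map only holds explicitly configured keys, so reading through the setting definition is what supplies the default.

```python
class Setting:
    """Hypothetical stand-in for a setting definition that carries a default value."""

    def __init__(self, key, default):
        self.key = key
        self.default = default

    def get(self, settings):
        # Fall back to the default when the key was never explicitly configured.
        return settings.get(self.key, self.default)


# Illustrative default only; the real default role list is not given in this thread.
NODE_ROLES = Setting("node.roles", ["master", "data_hot", "data_content"])

# A node that configures nothing explicitly: the raw map has no roles entry.
implicit_node_settings = {}
# A node that configures its roles explicitly.
explicit_node_settings = {"node.roles": ["data_warm"]}

if __name__ == "__main__":
    print("node.roles" in implicit_node_settings)   # the raw map lacks the key
    print(NODE_ROLES.get(implicit_node_settings))   # the definition supplies the default
    print(NODE_ROLES.get(explicit_node_settings))   # the explicit value wins
```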

@skearns64
Contributor

Is the thinking to allow filtering on the properties of the node an index lives on, or on the properties of the index (e.g. ILM Phase, or whether a given index is partial or full searchable snapshot), or other?

@jpountz
Contributor

jpountz commented Feb 22, 2021

I can see arguments both ways for tier == warm or node_role == data_warm fields/queries. The former provides some abstraction over implementation details while the latter allows for other future node roles to be testable too.

In my opinion, these should be different features. Mappings are about the data so the _tier field should be the tier of the index, not the role of the node that hosts a shard. If we wanted to introduce a way to filter by node role, I'd rather expose it via a different feature such as preference.

@markharwood
Contributor

markharwood commented Feb 22, 2021

I'd rather expose it via a different feature such as preference.

Search node-routing preference or querying index-allocation preference?

I can see arguments both ways for tier == warm or node_role == data_warm fields/queries. The former provides some abstraction over implementation details while the latter allows for other future node roles to be testable too.

That question was about how we test node roles and whether we offer a focused subset of them (tier = hot/cold/warm/frozen) or whether we allow testing any node role (anyKey = anyValue).

The other question I raised and Steve mentioned is about testing hot/warm/cold etc against the index-allocation preference.

Filtering by node role seems to make more sense to me as that determines the real performance characteristics while the index-allocation preference is a perhaps-unfulfilled aspiration and the index may be stuck on the wrong node.

@jpountz
Contributor

jpountz commented Feb 22, 2021

I was thinking of making the _tier (or _data_tier?) field mean the current ILM phase of the index and making it possible to filter by node role via search preference, e.g. something like _only_roles=data_hot,data_warm.

Filtering by node role seems to make more sense to me as that determines the real performance characteristics while the index-allocation preference is a perhaps-unfulfilled aspiration and the index may be stuck on the wrong node.

My point is not really about which use-case is more important, it's more that I would only use a metadata field to expose something that is intrinsic to the data, not an allocation detail. I have no objections to enabling filtering by node role, but I don't think we should do it using mappings or the query DSL: preference feels like a better fit to me for this use-case.

@markharwood
Contributor

I don't think we should do it using mappings or the query DSL: preference feels like a better fit to me for this use-case.

I agree - my assumption was this was just being done in the query syntax because it was somehow easier for Kibana to express.

@jimczi
Contributor

jimczi commented Feb 22, 2021

I agree that filtering by node role should be done through the preference, but that seems like another feature, as Adrien noted. The _tier metadata field seems a better fit for the specific ask in this issue. It would represent the current ILM phase of the index and would be queryable like any other metadata field. It's also a good fit for the terms enum API, so I am not sure we really need the filtering by node role at all. The intent here is to filter indices based on their status: if they are cold but not migrated yet, the filtering should still work.

@dakrone
Member Author

dakrone commented Feb 22, 2021

It would represent the current ILM phase of the index and would be queryable like any other metadata field.

I would suggest that we consider filtering on the _tier_preference parameter rather than the ILM phase, as it would allow a user to query for things in the data_content tier that don't necessarily use ILM. It would also mean that external tools (like curator) could make use of it.

@markharwood
Contributor

I would suggest that maybe we consider using the _tier_preference parameter

I updated the PR to consult the index.routing.allocation.include._tier_preference setting.
Should we update the queryable _tier field name to _tier_preference too?

@jpountz
Contributor

jpountz commented Feb 23, 2021

I'd keep it _tier. The _tier_preference setting is just an implementation detail we use to know about the current tier of the index?

@markharwood
Contributor

markharwood commented Feb 23, 2021

The _tier_preference setting is just the implementation details

My thinking was that the tier setting for an index might actually be wrong - while it has a preference to be allocated to a particular tier of node, it may be stuck somewhere else. Generally things gravitate from hot to colder tiers, so a delayed movement is likely to leave the index in a warmer tier than expected, and that's probably not a problem for searches that will tend to prefer the warmer end of things (at least for autocomplete).
The word "preference" helps convey that it might not be the case - like when we use "hint" in execution_hint.

@jpountz
Contributor

jpountz commented Feb 23, 2021

I think _tier is fine as it is meant to mean the logical tier of the data, even if it might not be honored temporarily for practical reasons.

@markharwood
Contributor

I found some unintended consequences from treating this as a queryable index field as opposed to a node-routing preference:

  • ML needs "_tier" adding to a list of fields excluded from dataframe analytics
  • SQL tests broke when I added the keyword family type - potentially another field blacklist to be updated
  • I was advised there's likely to be some ML UI that has a blacklist for field drop-downs (there may be other UIs e.g. enterprise search)

While the ability to query the tier as a field may be a convenience to some (preferable to node routing), it may be worth considering the number of field blacklists that need to be maintained where the new field is an inconvenience.

markharwood added a commit to markharwood/elasticsearch that referenced this issue Mar 31, 2021
…es on the roles defined (explicitly or implicitly) for a node.

Closes elastic#68135
markharwood added a commit that referenced this issue Mar 31, 2021
New _tier metadata field that supports term, terms, exists and wildcard queries on the first data tier preference stated for an index.

Closes #68135
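For reference, the four query types named in this commit message could be written as request bodies along these lines. This is a sketch: the tier values (`data_hot`, `data_warm`) and the `data_*` pattern are assumptions based on the tier names used elsewhere in this thread.

```python
import json


def tier_queries():
    """Return one example search body per query type the commit message lists for _tier."""
    return {
        # Exact match on a single tier.
        "term": {"query": {"term": {"_tier": "data_hot"}}},
        # Match any of several tiers.
        "terms": {"query": {"terms": {"_tier": ["data_hot", "data_warm"]}}},
        # Match indices that have any tier preference at all.
        "exists": {"query": {"exists": {"field": "_tier"}}},
        # Pattern match on the tier name.
        "wildcard": {"query": {"wildcard": {"_tier": {"value": "data_*"}}}},
    }


if __name__ == "__main__":
    for name, body in tier_queries().items():
        print(name, json.dumps(body))
```

Since the field reflects only the first tier preference stated for an index, an index that has not yet moved to its preferred tier would still match on that preferred tier.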
markharwood added a commit to markharwood/elasticsearch that referenced this issue Apr 7, 2021
…rd queries on the first data tier preference stated for an index.

Backport of 3aee4c1

Closes elastic#68135
markharwood added a commit that referenced this issue Apr 7, 2021
* New _tier metadata field that supports term, terms, exists and wildcard queries on the first data tier preference stated for an index.

Backport of 3aee4c1

Closes #68135
jtibshirani added a commit that referenced this issue Apr 8, 2021
Now that the `fields` option allows fetching metadata fields, we can support
loading the new `_tier` metadata field.

Relates to #63569 and #68135.
jtibshirani added a commit that referenced this issue Apr 8, 2021
Now that the `fields` option allows fetching metadata fields, we can support
loading the new `_tier` metadata field.

Relates to #63569 and #68135.
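A minimal sketch of what these two commits enable: asking for `_tier` through the `fields` option of a search request, so each hit reports the tier of its index. The wrapped query and the choice to disable `_source` are assumptions made to keep the example small.

```python
import json


def search_with_tier(query):
    """Wrap a query so that each hit also returns the _tier metadata field."""
    return {
        "query": query,
        "fields": ["_tier"],
        # Optional: skip the document source when only the tier is of interest.
        "_source": False,
    }


if __name__ == "__main__":
    print(json.dumps(search_with_tier({"match_all": {}}), indent=2))
```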
@javanna javanna added Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch and removed Team:Search Meta label for search team labels Jul 16, 2024