Add ability to filter query based on data tier #68135

dakrone · 2021-01-28T17:13:12Z

Some use cases call for querying data within a certain tier (or set of tiers). For example, with a data stream or alias managed by ILM, a user may want to query only the data in the "hot" tier. (See #47881, where users have asked for ILM to support aliases so that queries can target a specific lifecycle phase of the data.)

It would be nice to have a general-purpose query, usable for regular searches as well as aggregations, that allows specifying a "tier" of data to query. This would allow a query like:

{
  "query": {
    "bool": {
      "filter": {
        "tier": "hot"
      }
    }
  },
  "aggs": {...}
}

This is especially nice when users start using searchable snapshots for their data, as it would allow bypassing indices in other tiers (such as "cold" and "frozen") without requiring any sort of download of data.

One question that may come up is "why not just use a time range filter for getting the most recent data?". A time range filter works well when consuming a single set of data (such as a single data stream). With a first-class query for data tiers, however, a search could span multiple data streams and aliases that have differing "hot" tier definitions, without requiring the user to know the timing of each tier and to split the range filter by index pattern. For example, when searching three data streams that keep data in the hot phase for 7, 14, and 21 days respectively, using tier: hot is much simpler than specifying three different range filters tied to three different data stream index names.
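
To make the comparison concrete, here is a minimal Python sketch that builds both request bodies as plain dictionaries. The index-name patterns (standing in for the backing indices of three data streams), the `@timestamp` field, and the helper names are assumptions for illustration. The `tier` filter uses the syntax proposed above, which is not an existing query type.

```python
# Hypothetical index-name patterns mapped to their hot-phase retention in days.
HOT_RETENTION_DAYS = {
    ".ds-logs-app-*": 7,
    ".ds-logs-db-*": 14,
    ".ds-logs-audit-*": 21,
}


def range_filter_query(retention_by_pattern, timestamp_field="@timestamp"):
    """Approximate "hot" data with one time range per index pattern."""
    clauses = []
    for pattern, days in retention_by_pattern.items():
        clauses.append({
            "bool": {
                "filter": [
                    {"wildcard": {"_index": pattern}},
                    {"range": {timestamp_field: {"gte": f"now-{days}d"}}},
                ]
            }
        })
    return {"query": {"bool": {"should": clauses, "minimum_should_match": 1}}}


def tier_filter_query(tier):
    """Same intent, expressed with the tier filter proposed in this issue."""
    return {"query": {"bool": {"filter": {"tier": tier}}}}


by_range = range_filter_query(HOT_RETENTION_DAYS)
by_tier = tier_filter_query("hot")
```

The range-based body grows with every data stream and has to be kept in sync with each ILM policy, while the tier-based body stays the same.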

This also helps some of the use cases in #47881 while being accessible to both data streams and aliases.

If this is of interest, we could perform the filtering prior to any query execution: the tier is accessible through the index metadata, so the query could be rewritten to exclude indices that aren't in the specified tier.
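
A minimal sketch of that pre-filtering step, assuming the tier of each index can be read from a simple metadata record. The `IndexMeta` type, the index names, and the `indices_in_tier` helper are hypothetical and only model the idea; they are not Elasticsearch internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexMeta:
    name: str
    tier: str  # e.g. "hot", "warm", "cold", "frozen"


def indices_in_tier(candidates, wanted_tiers):
    """Return names of indices in a wanted tier, decided from metadata alone."""
    wanted = set(wanted_tiers)
    return [meta.name for meta in candidates if meta.tier in wanted]


cluster_indices = [
    IndexMeta("logs-000003", "hot"),
    IndexMeta("logs-000002", "warm"),
    IndexMeta("logs-000001", "frozen"),
]

# Only the hot index remains; the frozen index is skipped without reading its data.
targets = indices_in_tier(cluster_indices, ["hot"])
```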

elasticmachine · 2021-01-28T17:13:15Z

Pinging @elastic/es-search (Team:Search)

skearns64 · 2021-01-28T17:22:14Z

++ to this concept.

If I'm understanding it right, having this would make it easy to use DLS to make roles that can only query specific data tiers? E.g. regular users can query hot/warm, while power users can also query cold/frozen.
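
As a rough sketch of that idea, the Python below builds role bodies that attach a tier filter as a document-level security query. The layout follows the `indices` / `names` / `privileges` / `query` shape of the Elasticsearch role API. The `_tier` field name, the `data_*` tier values, the `logs-*` pattern, and the assumption that a tier filter is permitted inside a role query are all illustrative and were not settled at this point in the discussion.

```python
import json


def tier_restricted_role(index_pattern, allowed_tiers):
    """Build a role body whose DLS query only admits documents from the given tiers."""
    return {
        "indices": [
            {
                "names": [index_pattern],
                "privileges": ["read"],
                # Assumed: the tier is exposed as a queryable `_tier` metadata field.
                "query": {"terms": {"_tier": list(allowed_tiers)}},
            }
        ]
    }


regular_user_role = tier_restricted_role("logs-*", ["data_hot", "data_warm"])
power_user_role = tier_restricted_role(
    "logs-*", ["data_hot", "data_warm", "data_cold", "data_frozen"]
)

print(json.dumps(regular_user_role, indent=2))
```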

dakrone · 2021-01-28T17:24:23Z

> having this would make it easy to use DLS to make roles that can only query specific data tiers?

Yes that could definitely be a use for this as well!

jpountz · 2021-02-01T18:06:18Z

The easiest way to make it work would be to define a _tier metadata field, similar to the _index field.

markharwood · 2021-02-18T17:21:03Z

Given the way the tier-checking code works, would it make sense to have a metadata field called node_role which can be queried for values like data_warm?

I can see arguments both ways for tier == warm or node_role == data_warm fields/queries. The former provides some abstraction over implementation details while the latter allows for other future node roles to be testable too.

markharwood · 2021-02-19T12:41:57Z

I started on a PR for this but am struggling: I'm not sure how the field mapper API lets me get hold of the current DiscoveryNode so that I can read the node role settings and call a tester method on DataTier similar to this one.

The alternative is perhaps for the field mapper to get hold of the index settings and test index.routing.allocation.include._tier_preference, which is more of an aspiration than a guarantee of tier.

jimczi · 2021-02-19T13:04:37Z

You only need the node settings to call the static function you linked. You can access them through IndexSettings#getNodeSettings. I don't think we should expose the discovery node in the SearchExecutionContext; looking at the node settings should be enough if you add a static function in DataTier that returns the list of roles.

markharwood · 2021-02-19T14:43:21Z

> You can access them through IndexSettings#getNodeSettings

I see only a subset of settings using that. It is only the settings part of what I get if I call http://localhost:9200/_nodes/MY_NODE/settings (see full output here).
This doesn't include the node roles info provided in that fuller output.

jimczi · 2021-02-19T14:58:07Z

That's because the default values are not materialized in the Settings. You need to retrieve the values through NodeRoleSettings#NODE_ROLES_SETTING.
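
The difference between a raw settings map and a setting definition that carries a default can be illustrated with a toy Python model. This is not Elasticsearch code: the `Setting` class, the `node.roles` key as used here, and the default role list are stand-ins chosen for the example.

```python
class Setting:
    """Toy setting definition: a key plus a default used when the key is absent."""

    def __init__(self, key, default):
        self.key = key
        self.default = default

    def get(self, settings):
        # Fall back to the default when the value was never written explicitly.
        return settings.get(self.key, self.default)


# Hypothetical default: a node with no explicit roles holds several data roles.
NODE_ROLES = Setting("node.roles", ["data_hot", "data_warm", "data_cold"])

# Only explicitly configured values are stored, so the roles key is missing here.
explicit_settings = {"cluster.name": "demo"}

raw_lookup = explicit_settings.get("node.roles")  # None: default not materialized
resolved = NODE_ROLES.get(explicit_settings)      # default applied by the definition
```

Reading through the setting definition rather than the raw map is what makes the implicit roles visible.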

skearns64 · 2021-02-19T19:32:05Z

Is the thinking to allow filtering on the properties of the node an index lives on, or on the properties of the index (e.g. ILM phase, or whether a given index is a partial or full searchable snapshot), or something else?

jpountz · 2021-02-22T13:53:50Z

> I can see arguments both ways for tier == warm or node_role == data_warm fields/queries. The former provides some abstraction over implementation details while the latter allows for other future node roles to be testable too.

In my opinion, these should be different features. Mappings are about the data so the _tier field should be the tier of the index, not the role of the node that hosts a shard. If we wanted to introduce a way to filter by node role, I'd rather expose it via a different feature such as preference.

markharwood · 2021-02-22T14:27:50Z

> I'd rather expose it via a different feature such as preference.

Search node-routing preference or querying index-allocation preference?

> I can see arguments both ways for tier == warm or node_role == data_warm fields/queries. The former provides some abstraction over implementation details while the latter allows for other future node roles to be testable too.

That question was about how we test node roles and whether we offer a focused subset of them (tier = hot/cold/warm/frozen) or whether we allow testing any node role (anyKey = anyValue).

The other question I raised and Steve mentioned is about testing hot/warm/cold etc against the index-allocation preference.

Filtering by node role seems to make more sense to me as that determines the real performance characteristics while the index-allocation preference is a perhaps-unfulfilled aspiration and the index may be stuck on the wrong node.

jpountz · 2021-02-22T15:16:41Z

I was thinking of making the _tier (or _data_tier?) field mean the current ILM phase of the index and making it possible to filter by node role via search preference, e.g. something like _only_roles=data_hot,data_warm.

> Filtering by node role seems to make more sense to me as that determines the real performance characteristics while the index-allocation preference is a perhaps-unfulfilled aspiration and the index may be stuck on the wrong node.

My point is not really about which use-case is more important, it's more that I would only use a metadata field to expose something that is intrinsic to the data, not an allocation detail. I have no objections to enabling filtering by node role, but I don't think we should do it using mappings or the query DSL: preference feels like a better fit to me for this use-case.

markharwood · 2021-02-22T15:56:18Z

> I don't think we should do it using mappings or the query DSL: preference feels like a better fit to me for this use-case.

I agree - my assumption was this was just being done in the query syntax because it was somehow easier for Kibana to express.

jimczi · 2021-02-22T17:37:20Z

I agree that filtering by node role should be done through the preference, but that seems like another feature, as Adrien noted. The _tier metadata field seems a better fit for the specific ask in this issue. It would represent the current ILM phase of the index and would be queryable like any other metadata field. It's also a good fit for the terms enum API, so I am not sure we really need filtering by node role at all. The intent here is to filter indices based on their status: if they are cold but not migrated yet, the filtering should still work.

dakrone · 2021-02-22T18:58:56Z

> It would represent the current ILM phase of the index and would be queryable like any other metadata field.

I would suggest that maybe we consider using the _tier_preference parameter rather than the ILM phase for the object of filtering, as it would allow a user to query for things in the data_content tier that don't necessarily use ILM. It would also mean that external tools (like curator) could make use of it.

markharwood · 2021-02-23T11:37:02Z

> I would suggest that maybe we consider using the _tier_preference parameter

I updated the PR to consult the index.routing.allocation.include._tier_preference setting.
Should we update the queryable _tier field name to _tier_preference too?
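
One way to picture this: the setting holds a comma-separated list of tiers in order of preference, and the queryable value is taken from the first entry (which is how the merged change later describes the field). The Python sketch below models that lookup. The function name and the handling of a missing or empty setting are assumptions, not the PR's code.

```python
TIER_PREFERENCE_KEY = "index.routing.allocation.include._tier_preference"


def resolve_tier(index_settings):
    """Return the first preferred tier, or None when no preference is set."""
    raw = index_settings.get(TIER_PREFERENCE_KEY, "")
    tiers = [part.strip() for part in raw.split(",") if part.strip()]
    return tiers[0] if tiers else None


warm_index = {TIER_PREFERENCE_KEY: "data_warm,data_hot"}
plain_index = {}  # no tier preference configured

warm_tier = resolve_tier(warm_index)    # first entry of the preference list
plain_tier = resolve_tier(plain_index)  # no value to query on
```

Because only the first entry is used, the value reflects where the index would like to live, not necessarily the node it currently sits on.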

jpountz · 2021-02-23T13:01:11Z

I'd keep it _tier. The _tier_preference setting is just the implementation details we use to know about the current tier of the index?

markharwood · 2021-02-23T14:15:22Z

> The _tier_preference setting is just the implementation details

My thinking was that the tier setting for an index might actually be wrong: while the index has a preference to be allocated to a particular tier of node, it may be stuck somewhere else. Data generally gravitates from hot to colder tiers, so an index whose movement is delayed is likely to sit on a warmer tier than expected, and that's probably not a problem for searches that tend to prefer the warmer end of things (at least for autocomplete).
The word "preference" helps convey that the stated tier might not be the actual one, like when we use "hint" in execution_hint.

jpountz · 2021-02-23T16:09:27Z

I think _tier is fine as it is meant to mean the logical tier of the data, even if it might not be honored temporarily for practical reasons.

markharwood · 2021-02-24T12:12:22Z

I found some unintended consequences from treating this as a queryable index field as opposed to a node-routing preference:

* ML needs "_tier" adding to a list of fields excluded from dataframe analytics
* SQL tests broke when I added the keyword family type - potentially another field blacklist to be updated
* I was advised there's likely to be some ML UI that has a blacklist for field drop-downs (there may be other UIs e.g. enterprise search)

While the ability to query the tier as a field may be a convenience to some (preferable to node routing) it may be worth considering the number of field blacklists that need to be maintained where the new field is an inconvenience.

dakrone added >enhancement :Search/Search Search-related issues that do not fall into other categories labels Jan 28, 2021

elasticmachine added the Team:Search Meta label for search team label Jan 28, 2021

dakrone mentioned this issue Jan 28, 2021

Add ILM action to add/remove aliases #47881

Open

jpountz added :Search Foundations/Mapping Index mappings, including merging and defining field types and removed :Search/Search Search-related issues that do not fall into other categories labels Feb 1, 2021

markharwood mentioned this issue Feb 18, 2021

New terms_enum API for discovering terms in the index. #66452

Merged

markharwood mentioned this issue Feb 19, 2021

New queryable "_tier" metadata field #69288

Merged

markharwood added a commit to markharwood/elasticsearch that referenced this issue Mar 31, 2021

New _tier metadata field that supports term, terms and wildcard queri…

d8c5822

…es on the roles defined (explicitly or implicitly)for a node. Closes elastic#68135

markharwood closed this as completed in #69288 Mar 31, 2021

markharwood added a commit that referenced this issue Mar 31, 2021

New queryable "_tier" metadata field (#69288)

3aee4c1

New _tier metadata field that supports term, terms, exists and wildcard queries on the first data tier preference stated for an index. Closes #68135

markharwood mentioned this issue Mar 31, 2021

New queryable "_tier" metadata field (#69288) #71123

Merged

jtibshirani mentioned this issue Apr 6, 2021

Support fetching _tier field value #71379

Merged

markharwood added a commit to markharwood/elasticsearch that referenced this issue Apr 7, 2021

New _tier metadata field that supports term, terms, exists and wildca…

08a188d

…rd queries on the first data tier preference stated for an index. Backport of 3aee4c1 Closes elastic#68135

jtibshirani mentioned this issue Apr 7, 2021

Make sure _tier field handles missing setting #71439

Merged

jtibshirani added a commit that referenced this issue Apr 8, 2021

Support fetching _tier field value (#71379)

3da738e

Now that the `fields` option allows fetching metadata fields, we can support loading the new `_tier` metadata field. Relates to #63569 and #68135.

jtibshirani added a commit that referenced this issue Apr 8, 2021

Support fetching _tier field value (#71379)

4205b04

Now that the `fields` option allows fetching metadata fields, we can support loading the new `_tier` metadata field. Relates to #63569 and #68135.

markharwood mentioned this issue Apr 12, 2021

Meta issue - new auto complete API #71550

Closed

9 tasks

stevejgordon mentioned this issue Apr 21, 2021

7.13.0 Meta Ticket elastic/elasticsearch-net#5584

Closed

62 tasks

javanna added Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch and removed Team:Search Meta label for search team labels Jul 16, 2024
