Caching of filters #16108

santanusinha · 2016-01-20T04:32:16Z

Hi,
As I can see in Filter Auto Caching section, the control has been taken away from users for disabling caching in known cases.
This would cause a lot of problems in situations where Elasticsearch is being used as a time-series database. Typically analysts might run some one-off queries over older time ranges causing the filter cache to blow up without any reason. In previous versions, we would have turned off caching for the time range filter for queries over older ranges. Getting the data would be slower, but given that the data comes from an older date range, people could live with it.
I might be missing something, but right now it seems impossible to mimic this behaviour due to the lack of configurability in choosing which queries to cache in the filter segment and which not to. The strategy discussed in the aforementioned documentation makes sense in most general situations, but not 100% of the time. I feel that the option should at least be present so that people can use it in times of need.
If you don't mind my asking, what was the rationale behind the decision to remove the filter caching configuration, and are there any chances that this will be brought back in the future?

jimczi · 2016-01-20T10:00:49Z

> As I can see in Filter Auto Caching section, the control has been taken away from users for disabling caching in known cases.

That's partly true: in a boolean query the filter clauses are cacheable, whereas must and should clauses are not.
That's also only partly true ( ;) ), because the should and must clauses are cacheable if they appear in a context where the score is not needed.
I agree that fine-grained tuning of the cache is not easy in 2.x and that we'll need to add some documentation about it. @jpountz I also agree that it could be useful to add (re-add?) an explicit "do not cache" option for each query that could be eligible for the cache. It is sometimes difficult to control, especially because some query rewrites are outside the user's control (constant score queries are sometimes added under the hood).
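
To make the filter/must distinction above concrete, here is a minimal sketch of a 2.x-style bool query body built as a Python dict. The field names (message, timestamp), the search term, and the date range are assumptions chosen only for illustration; the helper clause_contexts is hypothetical and simply lists which clause types the body uses.

```python
import json

# Hypothetical field names and values, for illustration only.
query = {
    "query": {
        "bool": {
            # Scoring clause: it contributes to _score, so per the comment
            # above it is not a cache candidate in this position.
            "must": [{"match": {"message": "timeout"}}],
            # Filter clause: no score is needed, so it is eligible for
            # the filter cache.
            "filter": [
                {"range": {"timestamp": {"gte": "2015-01-01", "lt": "2015-02-01"}}}
            ],
        }
    }
}


def clause_contexts(body):
    """Return the clause types used by the top-level bool query."""
    return sorted(body["query"]["bool"].keys())


print(json.dumps(query, indent=2))
print(clause_contexts(query))
```

Note that nothing in this body lets the user say "never cache this range clause", which is the gap being discussed.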

santanusinha · 2016-01-20T10:27:03Z

@jimferenczi hmm .. yes, query rewrites might cause an issue here, but if the intent of the user is to not do caching for a particular query, maybe cache=false should be set in the rewritten queries also .. right?

jimczi · 2016-01-20T10:44:50Z

@santanusinha yes, this is my point. If the user knows that the query should not be cached (even if the score is not needed), then we should have something in the query that states clearly that we don't want this part of the query to enter the filter cache.

clintongormley · 2016-01-20T11:00:00Z

> Typically analysts might run some one-off queries over older time ranges causing the filter cache to blow up without any reason

This should no longer happen, as filters will only be cached after repeated use. This is one of the reasons for the rewrite: to stop overcaching filters by default. Really, this is something that the user shouldn't have to think about; Elasticsearch should be smart enough to figure it out for itself. Of course, these algorithms need iteration to improve.
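
The idea of caching a filter only after repeated use can be sketched roughly as follows. This is a toy model, not Elasticsearch's actual policy: the class name UsageTrackingPolicy, the history size, and the min_uses threshold are all assumptions made for illustration.

```python
from collections import Counter, deque


class UsageTrackingPolicy:
    """Toy policy: a filter becomes cacheable only after it has been
    seen min_uses times within the last history_size filter uses."""

    def __init__(self, history_size=256, min_uses=5):
        # Both defaults are illustrative assumptions.
        self.history = deque(maxlen=history_size)
        self.min_uses = min_uses
        self.counts = Counter()

    def on_use(self, filter_key):
        # When the history is full, the oldest entry is about to be
        # evicted by append(), so forget its contribution first.
        if len(self.history) == self.history.maxlen:
            oldest = self.history[0]
            self.counts[oldest] -= 1
            if self.counts[oldest] == 0:
                del self.counts[oldest]
        self.history.append(filter_key)
        self.counts[filter_key] += 1

    def should_cache(self, filter_key):
        return self.counts[filter_key] >= self.min_uses


policy = UsageTrackingPolicy(history_size=4, min_uses=2)
policy.on_use("range:2014-01")      # one-off query over an old range
policy.on_use("range:last-hour")
policy.on_use("range:last-hour")    # repeated filter
print(policy.should_cache("range:2014-01"))
print(policy.should_cache("range:last-hour"))
```

Under a policy of this shape, a one-off query over an old time range never reaches the threshold and so never occupies the cache.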

Another feature which is already there, but needs improvement, is shard request caching. To explain: a typical use case is showing page views per hour for the last month. Using the index-per-day model, only the data for today's index is changing. The request cache (you need to turn on caching) will cache the aggregation results for all of the other indices, and only recalculate the results for today's index, which is a huge improvement.
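
The effect described above can be sketched with a toy per-index cache. This is a simplified model under stated assumptions, not the real request cache: the class ShardRequestCache, the per-index version number (standing in for "the index changed since the last request"), and the index names are all hypothetical.

```python
class ShardRequestCache:
    """Toy cache keyed by (index, index version, request)."""

    def __init__(self):
        self._entries = {}
        self.computations = 0  # how often results had to be recomputed

    def get_or_compute(self, index, version, request_key, compute):
        key = (index, version, request_key)
        if key not in self._entries:
            self.computations += 1
            self._entries[key] = compute(index)
        return self._entries[key]


def page_views(cache, indices, versions, request_key, compute):
    """Sum per-index results, reusing cached results where possible."""
    return sum(
        cache.get_or_compute(i, versions[i], request_key, compute)
        for i in indices
    )


# Hypothetical daily indices with a page-view count each.
data = {"logs-2016.01.18": 10, "logs-2016.01.19": 20, "logs-2016.01.20": 5}
versions = {name: 1 for name in data}
cache = ShardRequestCache()

first = page_views(cache, list(data), versions, "views-per-hour", data.get)

# Only today's index receives new data, so only its version changes.
data["logs-2016.01.20"] = 7
versions["logs-2016.01.20"] = 2
second = page_views(cache, list(data), versions, "views-per-hour", data.get)

print(first, second, cache.computations)
```

The second request recomputes only the index that changed; the older indices are served from the cache.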

But there are a couple of issues that we are working on fixing. The first is that the JSON request must be exactly the same in order to retrieve the cached version. This can be tricky because the order of keys in JSON can vary. The search refactoring happening in #10217 will fix this because we'll use the parsed representation of the query for caching instead of the JSON.
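
The key-order problem can be illustrated with a small sketch. The two functions below are hypothetical: raw_cache_key hashes the request bytes as sent, while canonical_cache_key hashes a sorted re-serialization of the parsed request, used here only as a stand-in for "the parsed representation of the query".

```python
import hashlib
import json


def raw_cache_key(request_json: str) -> str:
    """Key derived from the exact bytes of the request."""
    return hashlib.sha256(request_json.encode("utf-8")).hexdigest()


def canonical_cache_key(request_json: str) -> str:
    """Key derived from the parsed request, independent of key order."""
    parsed = json.loads(request_json)
    canonical = json.dumps(parsed, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same request, different key order.
a = '{"size": 0, "query": {"match_all": {}}}'
b = '{"query": {"match_all": {}}, "size": 0}'

print(raw_cache_key(a) == raw_cache_key(b))
print(canonical_cache_key(a) == canonical_cache_key(b))
```

With byte-based keys the two requests miss each other's cache entries; with keys derived from the parsed form they share one entry.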

The second is that these queries usually use a time range. If you use now, that time will change on every request, so the request won't use the cache. If you use now/h (now rounded down to the hour), then it can use the cached entry for the whole hour.
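
Why rounding helps can be seen by resolving the two expressions at different moments. The sketch below only mimics the effect with Python datetimes; the function names are hypothetical and this is not how Elasticsearch parses date math.

```python
from datetime import datetime, timezone


def resolve_now(at: datetime) -> str:
    """Mimics 'now': the resolved value differs on every request."""
    return at.isoformat()


def resolve_now_rounded_to_hour(at: datetime) -> str:
    """Mimics 'now/h': the resolved value is stable within an hour."""
    return at.replace(minute=0, second=0, microsecond=0).isoformat()


t1 = datetime(2016, 1, 20, 11, 0, 5, 123000, tzinfo=timezone.utc)
t2 = datetime(2016, 1, 20, 11, 47, 30, tzinfo=timezone.utc)

# Two requests in the same hour: different with 'now', equal with 'now/h'.
print(resolve_now(t1) == resolve_now(t2))
print(resolve_now_rounded_to_hour(t1) == resolve_now_rounded_to_hour(t2))
```

Since the resolved value is part of what identifies the request, the rounded form yields the same cache key for every request made within the same hour.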

Once the search refactoring is done, we can improve this situation by checking whether the min and max values in the range query are lower/higher respectively than the min/max values for a particular shard and, if so, rewrite the range query as a match_all. This would mean that, even though now is used with millisecond resolution, the request cache would still work.
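
The proposed rewrite can be sketched as a small function. This is only an illustration of the idea as described, not the eventual implementation: rewrite_for_shard and its parameters are hypothetical, and values are plain integers (for example epoch milliseconds).

```python
def rewrite_for_shard(field, gte, lte, shard_min, shard_max):
    """Rewrite a range query as match_all when the queried range covers
    every value the shard holds for the field (hypothetical helper)."""
    if gte <= shard_min and lte >= shard_max:
        # Every document on this shard matches, so the exact bounds
        # (including a millisecond-resolution 'now') no longer matter.
        return {"match_all": {}}
    return {"range": {field: {"gte": gte, "lte": lte}}}


# A shard holding an old day's data, queried with two different upper
# bounds that stand in for two different resolutions of 'now'.
shard_min, shard_max = 1_000, 2_000
first = rewrite_for_shard("timestamp", 500, 9_001, shard_min, shard_max)
second = rewrite_for_shard("timestamp", 500, 9_002, shard_min, shard_max)

print(first == second)
print(rewrite_for_shard("timestamp", 1_500, 9_001, shard_min, shard_max))
```

Both requests rewrite to the same match_all on the old shard, so they can share a cache entry even though their upper bounds differ.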

santanusinha · 2016-01-20T11:06:15Z

Thanks for the explanation. Will keep my eyes out for issues and report back if we see anything.

jimczi added the discuss, :Search/Search, and :Cache labels Jan 20, 2016

santanusinha closed this as completed Jan 20, 2016

clintongormley mentioned this issue Feb 2, 2016 in "Remove in-memory fielddata support #14113" (closed)
