
Filter Editor Suggestions can take forever and drastically increase the load in a cluster. #12692

Closed
trevan opened this issue Jul 6, 2017 · 19 comments


@trevan
Contributor

trevan commented Jul 6, 2017

I'm currently upgrading to the latest master and noticed the Filter Editor suggestions from #11375. I looked at the search requests that it makes and there is a BIG problem. I took one of the search requests and ran it against our production data to see how it would fare. I let it run for 3 minutes before I killed it and then spent the next several minutes trying to keep the cluster afloat. All nodes in our cluster had their load average go from the 20s to the 70s.

I think the main issues are that:

  1. It doesn't have any time filter. It searches across all shards. In my case, we have 6000 shards.
  2. It runs an aggregation on any searchable field. This includes fields that use "fielddata".
@Bargs
Contributor

Bargs commented Jul 6, 2017

@lukasolson

@trevan
Contributor Author

trevan commented Jul 7, 2017

After looking at it more, I think that if there were some sort of time filter or shard limit, this would not be a problem. The aggregation on fields with fielddata is only problematic (for me, at least) when you try to run it across all shards.
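
Something along these lines is roughly what I have in mind (index pattern, timestamp field, and suggestion field are just placeholders, not the actual request Kibana builds):

POST logstash-*/_search
{
  "size": 0,
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-24h"
      }
    }
  },
  "aggs": {
    "suggestions": {
      "terms": {
        "field": "some_field.keyword"
      }
    }
  }
}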

@lukasolson
Member

Adding the time filter to the request has pros and cons. If you include the time filter as part of the request for suggestions, then you're assuming the user will always update the time filter prior to adding filters in the filter editor UI. Users might be confused and expect values to show up as suggestions because they know the field contains those values. On the other hand, users might actually expect to only see suggestions for the given time range; it would obviously improve performance and would make suggestions more specific and applicable to the data you're actually viewing.

What kind of impact does including the time range in the terms request have for you, @trevan? Also, what would you think about including shard_size in the request to improve performance? Does specifying "shard_size": 10 make any noticeable difference in the performance of the query?
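
Roughly, that would just mean adding shard_size to the terms aggregation in the suggestions request, something like this (index pattern and field name are placeholders):

POST health-*/_search
{
  "size": 0,
  "aggs": {
    "suggestions": {
      "terms": {
        "field": "device.search",
        "include": ".*0x.*",
        "shard_size": 10
      }
    }
  }
}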

Thanks for helping us out here, @trevan!

@trevan
Contributor Author

trevan commented Jul 8, 2017

@lukasolson, there appears to be some sort of cache. I'm limiting the number of indices in my test query now to prevent issues, but once I find a set of indices that takes 30 seconds, subsequent queries take only 5-6 seconds. I've tried _cache/clear and request_cache=false, but they don't seem to do anything.

But if I take the "cached" queries that take 5-6 seconds, adding "shard_size": 10 does nothing; they still stay between 5 and 6 seconds. Setting "size": 0 also doesn't affect the speed, and removing the leading regex part doesn't affect it either. Adding a timespan (24 hours, which shrinks the number of shards to 36 for my test query) brings it down to less than 1 second.

I then tried using the 24 hour timespan on the full index list. It took 26 seconds to run the first time. The second time was 7 seconds. The third time was 3 seconds. The next several runs were all around 3 seconds.

I wonder if you could edit your PR and add an additional option that would have the suggestions use the timespan. I understand your concerns, but I would much rather have suggestions limited to a timespan than not be able to use the suggestions from your PR at all.

@trevan
Contributor Author

trevan commented Jul 10, 2017

@lukasolson, I ran the query against the full dataset during a downtime. It took a little over 4 minutes the first time. Subsequent runs took around 40-50 seconds.

Adding the "shard_size:10" didn't seem to make any change to the time. It still took around 40 seconds.

Adding a 24 hour limit didn't affect the first run, but subsequent runs took only 6 seconds.

I tried adding routing=1 to the request, and that brought the original query down to 12-15 seconds and the 24 hour limit query down to 1-2 seconds. We don't use custom routing, but it appears to give a decent win.
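
For reference, that just means appending the routing parameter to the search URL, roughly like this (aggregation body trimmed, field name is a placeholder):

POST logstash-*/_search?routing=1
{
  "size": 0,
  "aggs": {
    "suggestions": {
      "terms": {
        "field": "some_field.keyword"
      }
    }
  }
}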

@lukasolson
Member

Thanks for troubleshooting, @trevan. I'm reaching out to the Elasticsearch team to see if we can preserve this functionality without causing the problems you've run into.

@trevan
Contributor Author

trevan commented Jul 11, 2017

@lukasolson, do we want to keep this issue open since it would be nice to have suggestions without a performance hit?

@lukasolson
Member

@trevan Yup, unintentionally closed this by tagging it in the PR.

lukasolson reopened this Jul 11, 2017
@lukasolson
Member

@trevan could you try another thing for me with your dataset? In my tests it doesn't seem to make a big impact but a second opinion is always nice 😄

Here's the query it currently makes when using suggestions:

POST health-*/_search
{
  "size": 0,
  "aggs": {
    "suggestions": {
      "terms": {
        "field": "device.search",
        "include": ".*0x.*"
      }
    }
  }
}

Someone suggested we try using sampling. Here's how it might look instead:

POST health-*/_search
{
  "size": 0,
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 10
      },
      "aggs": {
        "suggestions": {
          "terms": {
            "field": "device.search",
            "include": ".*0x.*"
          }
        }
      }
    }
  }
}

Would you mind giving that a try and seeing if it has a large impact on your performance?

@lukasolson
Member

Also, if you want to see how they perform without considering the cache, you could always clear the cache for a specific set of indices. Probably wouldn't want to do this on production, but yeah.

https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-clearcache.html
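
For example, something along these lines (same index pattern and field as the example above; the first call drops the caches for those indices, the second bypasses the request cache for a single search):

POST health-*/_cache/clear

POST health-*/_search?request_cache=false
{
  "size": 0,
  "aggs": {
    "suggestions": {
      "terms": {
        "field": "device.search",
        "include": ".*0x.*"
      }
    }
  }
}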

@trevan
Contributor Author

trevan commented Jul 19, 2017

@lukasolson, I ran that query. It took it from 40 seconds to 18 seconds, so it was definitely an improvement. I tried it with shard_size: 1, but that didn't decrease it noticeably.

Also, I did try _cache/clear as well as request_cache back when I first mentioned the cache issue. If I let Elasticsearch use the cache, then after a while (about 4-5 requests) the request only takes 1-2 seconds.

@filex

filex commented Jul 25, 2017

Same problem here. The filter editor's term-suggestion searches were saturating all CPUs in our 5-node ES5 cluster.

As soon as a keyword field is selected and I choose the is operator, Kibana sends two XHRs. Both go to /api/kibana/suggestions/values/logstash-* (according to the currently selected index pattern). But they differ in payload:

one is

{"field":"@source_host.keyword"}

the other is

{"query":"","field":"@source_host.keyword"}

They both ran into timeouts for logstash-*. They get shorter when I limit the time with a smaller index pattern like logstash-2017.07.2*.

Not only is the query extremely expensive, there are also two of them running in parallel. Do we need both queries? When they eventually finish, both return the same list of 10 terms.

The find-as-you-type style makes it even worse by firing new searches with every key I press.

I agree that it is important to bring down query times. But could it also be beneficial to limit the number of suggestion searches one client may run at a time?

@lukasolson
Member

@filex Good point, there should only be one at a time for sure. Thanks for pointing this out.

@lruslan

lruslan commented Jul 25, 2017

Same problem. We have a cluster of 4 data nodes with 1.5 TB of daily indices and 1500 fields.
For us there is no use case for such filter suggestions - users know their data and in most cases filter on values they already know in advance.

It would be handy to let administrators enable/disable auto suggestions through Management -> Advanced Settings; that way we could disable this feature and save the cluster from being stressed.

Also, I don't know why, but from the moment I open the filter editor and fill in a complete expression, I see approximately 30-50 requests hitting a single data node in Elasticsearch (I have the slow log enabled for all queries).
For example, here I'm trying to fill in the filter expression below, without hitting the enter/save button:
"field:kubernetes.container_name" + "is" + "slowtest"
You can see the resulting number of queries in the attached query log (our slow log is itself parsed by a separate ES instance): filterlog.txt

@lukasolson
Member

lukasolson commented Aug 2, 2017

@trevan Could you play around with adding the terminate_after parameter in your request? I'm thinking about exposing this as an advanced setting in the filter editor suggestions request, and if it's set to 0, disabling suggestions altogether. It can be used in the request like this:

POST health-*/_search?request_cache=false
{
  "size": 0,
  "terminate_after": 100000,
  "aggs": {
    "suggestions": {
      "terms": {
        "field": "device.search",
        "include": ".*0x.*",
        "execution_hint": "map"
      }
    }
  }
}

Basically, this tells the request to stop after hitting 100000 documents per shard. The higher this number, the better the suggestions, but the longer the query will run.

@trevan
Contributor Author

trevan commented Aug 3, 2017

@lukasolson, that gives really impressive results. With that value of terminate_after, the query takes around 1 second. I increased terminate_after by 10x and it took 5 seconds. Setting terminate_after to 0, though, doesn't seem to disable the query. Instead, it just disables the terminate_after setting because it took 40-50 seconds.

I'm not sure about the execution_hint change. With terminate_after set to 1000000, it takes 5 seconds without the execution_hint and 12 seconds with the execution_hint. At 100000, it seems to add a few tenths of a second as well.

@lukasolson
Member

@trevan When I mentioned that setting the value to 0 would disable the suggestions entirely, I meant that we'd have a condition in the code that turned them off, not that the query itself would be disabled.

@LoadingZhang

Still the same problem, with ES/Kibana 6.4.0.
Our indices in ES total 16 TB+ and 150+ billion documents, with high-cardinality fields such as uid.
If I turn on the autocomplete feature and then type uid: in the search bar, Kibana sends this query to ES:

{
  "size": 0,
  "timeout": "1s",
  "terminate_after": 100000,
  "query": {
    "match_all": {
      "boost": 1.0
    }
  },
  "aggregations": {
    "suggestions": {
      "terms": {
        "field": "uid",
        "size": 10,
        "shard_size": 10,
        "min_doc_count": 1,
        "shard_min_doc_count": 0,
        "show_term_doc_count_error": false,
        "execution_hint": "map",
        "order": [
          {
            "_count": "desc"
          },
          {
            "_key": "asc"
          }
        ],
        "include": ".*"
      }
    }
  }
}

After 10+ minutes, it loads 160 GB+ of fielddata into memory; it seems loading fielddata can't be terminated properly.
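
For reference, fielddata usage per node can be inspected and the fielddata cache cleared roughly like this (the index pattern is a placeholder):

GET _cat/fielddata?v&fields=uid

POST logstash-*/_cache/clear?fielddata=true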

@LoadingZhang

And this is the hot threads output:

10/10 snapshots sharing following 41 elements
       org.apache.lucene.index.OrdinalMap.build(OrdinalMap.java:168)
       org.apache.lucene.index.OrdinalMap.build(OrdinalMap.java:147)
       org.elasticsearch.index.fielddata.ordinals.GlobalOrdinalsBuilder.build(GlobalOrdinalsBuilder.java:65)
       org.elasticsearch.index.fielddata.plain.SortedSetDVOrdinalsIndexFieldData.localGlobalDirect(SortedSetDVOrdinalsIndexFieldData.java:130)
       org.elasticsearch.index.fielddata.plain.SortedSetDVOrdinalsIndexFieldData.localGlobalDirect(SortedSetDVOrdinalsIndexFieldData.java:47)
       org.elasticsearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache.lambda$load$1(IndicesFieldDataCache.java:165)
       org.elasticsearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache$$Lambda$3267/1059313702.load(Unknown Source)
       org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:433)
       org.elasticsearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache.load(IndicesFieldDataCache.java:162)
       org.elasticsearch.index.fielddata.plain.SortedSetDVOrdinalsIndexFieldData.loadGlobal(SortedSetDVOrdinalsIndexFieldData.java:118)
       org.elasticsearch.search.aggregations.support.ValuesSource$Bytes$WithOrdinals$FieldData.globalOrdinalsValues(ValuesSource.java:151)
       org.elasticsearch.search.aggregations.support.ValuesSource$Bytes$WithOrdinals.globalMaxOrd(ValuesSource.java:124)
       org.elasticsearch.search.aggregations.bucket.terms.TermsAggregatorFactory.getMaxOrd(TermsAggregatorFactory.java:215)
       org.elasticsearch.search.aggregations.bucket.terms.TermsAggregatorFactory.doCreateInternal(TermsAggregatorFactory.java:137)
       org.elasticsearch.search.aggregations.support.ValuesSourceAggregatorFactory.createInternal(ValuesSourceAggregatorFactory.java:55)
       org.elasticsearch.search.aggregations.AggregatorFactory.create(AggregatorFactory.java:216)
       org.elasticsearch.search.aggregations.AggregatorFactories.createTopLevelAggregators(AggregatorFactories.java:216)
       org.elasticsearch.search.aggregations.AggregationPhase.preProcess(AggregationPhase.java:55)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:105)
       org.elasticsearch.indices.IndicesService.lambda$loadIntoContext$17(IndicesService.java:1184)
       org.elasticsearch.indices.IndicesService$$Lambda$3192/791757691.accept(Unknown Source)
       org.elasticsearch.indices.IndicesService.lambda$cacheShardLevelResult$18(IndicesService.java:1237)
       org.elasticsearch.indices.IndicesService$$Lambda$3194/1117148458.get(Unknown Source)
       org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:160)
       org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:143)
       org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:433)
       org.elasticsearch.indices.IndicesRequestCache.getOrCompute(IndicesRequestCache.java:116)
       org.elasticsearch.indices.IndicesService.cacheShardLevelResult(IndicesService.java:1243)
       org.elasticsearch.indices.IndicesService.loadIntoContext(IndicesService.java:1183)
       org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:322)
       org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:357)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:333)
       org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:329)
       org.elasticsearch.search.SearchService$3.doRun(SearchService.java:1019)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)

Looks like ES is building the global ordinals map.
