Search API: Exponential performance degradation and increased std. deviation when specifying "size" param greater than 999999 #5466
Comments
To extract data from Elasticsearch you should use the scan & scroll API. Closing. Feel free to reopen if I misunderstood your use case.
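For reference, a minimal scan & scroll sketch using the official Python client's `scan` helper; the host, index name, and query below are placeholders, not from this thread:

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("http://localhost:9200")

# `scan` drives the scroll API under the hood, streaming hits in
# batches instead of materializing one enormous result set in a
# single response.
for hit in scan(
    es,
    index="my-index",  # placeholder index name
    query={"query": {"match_all": {}}},
    size=500,  # batch size per scroll request, not a total cap
):
    print(hit["_id"])
```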
It's a performance defect that someone could exploit. My point is that some unsuspecting user typing a simple search query with a size > 999999 could have a significant performance impact on a cluster, as response time seems to increase exponentially until MAX_INT, at which point you get an index-out-of-bounds error from the JSON parser. Of course, specifying a size that large is not optimal... but I'm not always the one writing the query. I'm unable to reopen, but if I could, I would. :)
It might be worth having a cluster-wide max size that could be configured.
I agree that we should perhaps come up with reasonable defaults (500?) which can be set. Reopening.
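For what it's worth, a configurable cap along these lines did eventually ship as the per-index `index.max_result_window` setting (default 10,000), which rejects requests where from + size exceeds the limit. A sketch of adjusting it via the Python client, with a placeholder index name and value:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Raise or lower the per-index cap on from + size; requests beyond it
# fail fast instead of allocating huge per-shard data structures.
es.indices.put_settings(
    index="my-index",  # placeholder index name
    body={"index": {"max_result_window": 10000}},
)
```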
It's not just about the JSON... There are many factors at play here (the priority queues that are responsible for the sorting and the size of the docs, to name a couple). I agree that this should ideally be handled gracefully, probably by introducing another circuit breaker, and the response should be handled like all other CBs we have. It's a bit tricky, though, to figure out the proper thresholds for this CB... Will require some thought. Btw, I'm not sure that the error belongs in the 500 range, as this should not be perceived as a system error...
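As a rough illustration of the circuit-breaker direction, later releases do expose a tunable request breaker; a sketch assuming the `indices.breaker.request.limit` cluster setting (the percentage here is illustrative, not a recommendation):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The request breaker trips a request before its per-request data
# structures can claim more than the configured share of the heap.
es.cluster.put_settings(
    body={"persistent": {"indices.breaker.request.limit": "40%"}},
)
```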
I had a run-in with this bug today, where the maximum size was set to an "arbitrarily high value" (9999999) because the number of potential responses was small (<100) and it was meant to retrieve "all". That query degraded to a point where it brought down the whole cluster, node by node. That's a naive approach I see from time to time. In the end, the query retrieved only < 100kb of payload.

The response time grew with the number of documents in the whole index, not showing itself on small datasets (dev), less on slightly larger ones (stage), and disastrously (largest time was 18m) on the live system. In prod, it led to sudden allocations of huge chunks of heap that would immediately be collected by a stop-the-world collection lasting multiple seconds, leading to nodes dropping out of the cluster and the search queue exploding. It seems like there is something leading such a query to visit more documents than necessary. I'll try to build a testcase.

While this is misuse, I would expect Elasticsearch to handle such cases more gracefully. Also, the behaviour of "size" should be better documented.
Closing in favour of #4026
(Now, first of all, this scenario is ridiculous and quite unscientific in its method. Furthermore, this is not blocking us in any way, shape, or form, but we found it interesting, so I thought I would report it anyway.)
One of our clients has a use case where they want all search hits to be returned without pagination. Typically, in this case, result sets max out at around 300 documents. Since they want all results, and there is no ALL option for size, the developer chose to use 999999999 (9 9's) as the size. While silly, this is still well within the limits of an Integer in Java and was just meant to signify something like MAX_INT.
The result was a query that took on average between 2000 and 5000ms for 229 total hits. They reported this issue to us and we investigated. Now the interesting part: reducing the size parameter by a factor of 10 (removing one 9) showed a similar factor-of-10 reduction in response time. So 99999999 (8 9's) loads on average in 200 to 500ms. Reduce by another factor of 10 (7 9's) and it drops by almost another factor of 10: at 7 9's, load times are between 40 and 150ms. This is still a much higher std deviation than is typical for repeated runs of the exact same search... which is likely cached.
At 6 9's, the results are more in line with expectations and have a much smaller std deviation, at 20-30ms on average.
This test was done on an isolated 2-node cluster with no other activity on the system. The same query was executed for all tests, with the only difference being the size param.
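A rough sketch of a reproduction along these lines, timing the same match_all query while the size parameter grows by factors of ten; the host, index name, and query are placeholders (note that on modern clusters, sizes above `index.max_result_window` are rejected outright, so this only reproduces on versions from this era):

```python
import time

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Run the identical query with size = 999999, 9999999, ... and time
# each call, mirroring the "remove one 9" measurements above.
for nines in (6, 7, 8, 9):
    size = int("9" * nines)
    start = time.perf_counter()
    res = es.search(
        index="my-index",  # placeholder index name
        body={"query": {"match_all": {}}, "size": size},
    )
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"size={size}: {elapsed_ms:.0f} ms, "
          f"hits returned={len(res['hits']['hits'])}")
```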