
Search API: Exponential performance degradation and increased std. deviation when specifying "size" param greater than 999999 #5466

Closed
bobbyhubbard opened this issue Mar 19, 2014 · 7 comments

@bobbyhubbard

(Now, first of all, this scenario is ridiculous and quite unscientific in its method. Furthermore, this is not blocking us in any way, shape, or form, but we found it interesting so I thought I would report it anyway.)

One of our clients has a use case where they want all search hits to be returned without pagination. Typically, in this case, result sets max out at around 300 documents. Since they want all results, and there is no ALL option for size, the developer chose to use 999999999 (9 9's) as the size. While silly, this is still well within the limits of a Java Integer and was just meant to signify something like MAX_INT.

The result was a query that took on average between 2000 and 5000ms for 229 total hits. They reported this issue to us and we investigated. Now the interesting part: reducing the size parameter by a factor of 10 (removing one 9) showed a similar factor-of-10 reduction in response time. So 99999999 (8 9's) loads on average in 200 to 500ms. Reduce by another factor of 10 (7 9's) and it drops by almost another factor of 10: at 7 9's, load times are between 40 and 150ms. This is still a much higher std deviation than is typical for repeated runs of the exact same search... likely cached.

At 6 9's, the results are more in line with expectations and have a much smaller std deviation, at 20-30ms on average.

This test was done on an isolated 2-node cluster with no other activity on the system. The same query was executed for all tests, with the only difference being the size param.
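
A minimal reproduction along these lines (the host, index name, and match_all query body below are illustrative assumptions, not details from the report) would time the identical query at each size:

```python
# Times the same query with progressively larger "size" values against one node.
# HOST, INDEX, and the query body are placeholders for illustration only.
import time
import requests

HOST = "http://localhost:9200"
INDEX = "myindex"

def time_query(size, runs=5):
    """Run the identical query `runs` times and return per-run latencies in ms."""
    latencies = []
    for _ in range(runs):
        start = time.time()
        resp = requests.post(
            f"{HOST}/{INDEX}/_search",
            json={"query": {"match_all": {}}, "size": size},
        )
        resp.raise_for_status()
        latencies.append((time.time() - start) * 1000)
    return latencies

for size in (999999, 9999999, 99999999, 999999999):  # 6 to 9 nines
    runs = time_query(size)
    print(f"size={size}: avg={sum(runs) / len(runs):.0f}ms")
```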

@dadoonet (Member)

To extract all the data from Elasticsearch you should use the scan & scroll API.

Using size is not the way to go.
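
A rough sketch of that flow against the 1.x-era REST API (the host, index name, batch size, and match_all query are illustrative assumptions, not part of this thread):

```python
# Open a scan-type search context, then pull results in batches via scroll.
# HOST, INDEX, the 100-per-shard batch size, and the query are placeholders.
import requests

HOST = "http://localhost:9200"
INDEX = "myindex"

# Step 1: start the scan; this returns a scroll id but no hits yet.
resp = requests.post(
    f"{HOST}/{INDEX}/_search",
    params={"search_type": "scan", "scroll": "1m", "size": 100},
    json={"query": {"match_all": {}}},
).json()
scroll_id = resp["_scroll_id"]

# Step 2: keep fetching batches until an empty page comes back.
all_hits = []
while True:
    page = requests.post(
        f"{HOST}/_search/scroll",
        params={"scroll": "1m"},
        data=scroll_id,  # the 1.x API accepts the scroll id as the raw request body
    ).json()
    hits = page["hits"]["hits"]
    if not hits:
        break
    all_hits.extend(hits)
    scroll_id = page["_scroll_id"]
```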

Closing. Feel free to reopen if I misunderstood your use case.

@bobbyhubbard (Author)

It's a performance defect that someone could exploit.

My point is that some unsuspecting user typing a simple search query with a size > 999999 could have a significant performance impact on a cluster, as response time seems to increase exponentially until MAX_INT, at which point you get an index-out-of-bounds error from the JSON parser. Of course specifying a size that large is not optimal... but I'm not always the one writing the query.

I'm unable to reopen, but if I could, I would. :)

@nik9000 (Member) commented Mar 19, 2014

It might be worth having a cluster-wide max size that could be configured to reject requests like this. It opens up a can of worms, too; for example, you should probably exempt scan-type queries.
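
A limit along these lines did eventually ship as the index-level index.max_result_window setting (default 10,000), which rejects requests whose from + size exceeds it. A sketch of adjusting it over the REST API, with the host and index name as placeholder assumptions:

```python
# Raising (or lowering) the later index.max_result_window limit on one index;
# the host and index name below are placeholders for illustration.
import requests

requests.put(
    "http://localhost:9200/myindex/_settings",
    json={"index": {"max_result_window": 20000}},
)
```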

@dadoonet (Member)

I agree that we should perhaps come up with a reasonable default (500?) which can be configured. What do others think?

Reopening.

@dadoonet dadoonet reopened this Mar 19, 2014
@uboness (Contributor) commented Mar 20, 2014

It's not just about the JSON... There are many factors at play here (the priority queues that are responsible for the sorting and the size of the docs, to name a couple). I agree that this should ideally be handled gracefully, probably by introducing another circuit breaker - and the response should be handled like all the other circuit breakers we have. It's a bit tricky, though, to figure out the proper thresholds for this circuit breaker... It will require some thought.
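
As a loose analogy (the real code lives in Lucene's priority queue and differs in detail), the per-shard top-hits queue at the time was pre-sized to roughly from + size, so the allocation scales with the requested size rather than with the number of matching documents:

```python
# Loose Python analogy only (not Elasticsearch/Lucene internals): a top-hits
# priority queue whose backing array is allocated up front at the requested
# capacity, before a single document has been collected.
import sys

def presized_hit_queue(from_, size):
    capacity = from_ + size
    sentinel = (float("-inf"), -1)      # placeholder (score, doc_id) entries
    return [sentinel] * capacity        # cost scales with size, not with hits

queue = presized_hit_queue(0, 999_999)          # 6 nines: a few MB per shard
# presized_hit_queue(0, 999_999_999) would need ~1e9 slots per shard, which is
# the kind of sudden heap growth and GC pressure reported later in this thread.
print(sys.getsizeof(queue))
```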

Btw, I'm not sure that the error belongs in the 500 range, as this should not be perceived as a system error...

@skade (Contributor) commented Sep 23, 2014

I had a run-in with this bug today, where the maximum size was set to an "arbitrarily high value" (9999999) because the number of potential responses was small (<100) and it was meant to retrieve "all". That query degraded to the point where it brought down the whole cluster, node by node. It's a naive approach I see from time to time. In the end, the query retrieved less than 100 KB of payload.
The query was sent to the type-specific endpoint /<index>/<type>.

The response time grew with the number of documents in the whole index: it did not show up at all on small datasets (dev), only mildly on slightly larger ones (stage), and disastrously (the largest observed time was 18 minutes) on the live system. In prod it led to sudden allocations of huge chunks of heap that were immediately collected by stop-the-world collections lasting multiple seconds, causing nodes to drop out of the cluster and the search queue to explode. It seems like something causes such a query to visit more documents than necessary. I'll try to build a test case.

While this is misuse, I would expect Elasticsearch to handle such cases more gracefully. Also, the behaviour of "size" should be better documented.

@clintongormley (Contributor)

Closing in favour of #4026
