Find and kill long running queries #7157

avleen · 2014-08-04T22:50:15Z

We run all queries with timeout=160s, but I understand this only really bounds the collection phase of the search?

We had someone run a query using aggregations today over billions of records, which brought the cluster down.
The cluster never OOM'd, but it did run into constant GC as the heap got full.
Then another query was run, again over potentially huge amounts of data. The cluster had indexing disabled while it recovered from the previous event, and since that query started the cluster has been at 100% CPU for about 40 minutes. The query, afaict, is still running. But we have no way of knowing which query it is, and no way to kill it other than restarting the entire cluster.

I'd like to request a feature that lists all queries currently executing on every node, as well as a way to kill them while they're in progress.
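
For context, here is a minimal sketch of how the per-request timeout mentioned above is typically set and checked. It assumes the `timeout` field of the search request body and the `timed_out` flag in the search response; the helper names and the sample response are hypothetical and only illustrate the shape. Whether the timeout bounds every phase of the search is exactly the open question in this issue, so treat `timed_out` as a hint, not a guarantee.

```python
import json


def build_search_body(query, timeout="160s"):
    """Return a search request body carrying a per-request timeout.

    Hypothetical helper: it only builds the JSON body, it does not send it.
    """
    return {"timeout": timeout, "query": query}


def is_partial(response):
    """Return True if the response reports that the timeout was hit."""
    return bool(response.get("timed_out", False))


# Build the body that would be POSTed to the _search endpoint.
body = build_search_body({"match_all": {}})
print(json.dumps(body, sort_keys=True))

# Hypothetical response, shaped like a search response that hit the timeout.
sample_response = {"took": 160012, "timed_out": True, "hits": {"total": 0, "hits": []}}
print(is_partial(sample_response))
```

A client could log or alert whenever `is_partial` returns True, which at least surfaces the queries that ran up against the limit.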


imotov · 2014-08-04T23:56:02Z

This looks like a duplicate of #4586 and #6914. So, I am going to close it. Please feel free to reopen it if I am wrong and you think that this issue is substantially different.

clintongormley · 2014-08-05T11:37:21Z

@avleen just to add another comment to this ticket: the circuit breakers in 1.3 should handle this better than the version you're running, and the circuit breakers in 1.4 better still.

kimchy · 2014-08-05T11:40:27Z

aye, that was my thought. Better timeout logic is very important (to properly bound requests across all phases of their execution, and to make them cancelable), but the circuit breaker in 1.3 should do a good job of refusing to load expensive data structures that are too large to load, and the improved circuit breaker in 1.4 will be able to break on expensive requests that need too many resources at the request level (like the resources required just to compute sig terms).

avleen · 2014-08-05T13:03:32Z

Fantastic. Thanks folks. We upgraded to 1.3 while the cluster was down.
Fingers crossed!

imotov closed this as completed Aug 4, 2014

shikhar mentioned this issue Aug 5, 2014

Task management #6914

Closed
