
Log queries before execution #9172

Closed
avleen opened this issue Jan 6, 2015 · 9 comments
Labels
:Core/Infra/Logging (Log management and logging utilities), discuss, >feature

Comments


avleen commented Jan 6, 2015

Currently the best way to emulate a full query log is to set the time on the slow query log to 0s.
The downside to this is that queries aren't logged until they finish running.
If the query takes a really long time (we've seen queries that can overload a box, cause GC problems and basically never end), you never find out what the query was.

Can we have a separate log which records all queries before they are run?
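For reference, the workaround described above amounts to dropping the search slow log thresholds to zero on the indices of interest. A rough sketch (the index name is illustrative, and exact settings keys may vary by version):

```sh
# Emulate a full query log by lowering the search slow log thresholds
# to 0s on an example index ("my-index").
curl -XPUT 'localhost:9200/my-index/_settings' -d '{
  "index.search.slowlog.threshold.query.warn": "0s",
  "index.search.slowlog.threshold.fetch.warn": "0s"
}'
```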

@souravmitra
Contributor

@clintongormley: Do you have any concerns regarding the implementation of this request? What factors should somebody take into account if they were to implement it? Thanks.


avleen commented Jan 27, 2015

From an operational perspective, it would be great to see two lines per query:

  1. The first, before the query is run.
  2. The second, after it finishes, logging the time it took.

The lines would need to have some kind of common uuid logged (or something) so they can be easily paired up.
Specifically for the use case I mentioned, being able to grep for UUIDs and seeing ones that only appeared once would immediately make problematic queries obvious!
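A rough sketch of how that pairing could be checked, assuming a hypothetical log format where every line carries a query_id=<uuid> field, written once when the query starts and once when it finishes:

```sh
# UUIDs that appear only once belong to queries that started but never
# logged a completion line, i.e. the problematic ones.
grep -oE 'query_id=[0-9a-f-]+' query.log \
  | sort | uniq -c \
  | awk '$1 == 1 {print $2}'
```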

@vvcephei

This is something that we'd dearly love to see as well. In fact, we would be satisfied with just a regular full query log. We almost took a crack at implementing it and sending a PR, but decided just to update and use the Jetty plugin instead.

We were a little apprehensive about using the 0ms slow query log, as we have pretty high query volume (8k requests/sec), and we weren't sure we could trust the slow query logger not to bottleneck performance (and also not to fill the disk).

The other big problem with the slow query log is that it logs fetch and query separately at the shard level, which I agree is useful for analyzing slow queries, but it's not what you want if you're trying to measure the full query execution time.

The jetty plugin works pretty well for query logging, but:

  • I'm not stoked about replacing netty with jetty just to get query logging
  • It only logs the full execution time after the request, so it doesn't fully solve @avleen's issue
  • I can't comment on production performance over a long period of time, since we only turn on query logs when we want to collect a sample of logs for a few hours or days.


nirmalc commented Aug 19, 2015

+1. @clintongormley, is this an "adoptme" one? We need this feature too and are open to working on it.

clintongormley added the >feature and :Core/Infra/Logging (Log management and logging utilities) labels on Aug 24, 2015
@clintongormley
Contributor

It looks like this has been marked as discuss for a while, but hasn't actually been discussed yet :)

Just some thoughts to get the discussion going:

  • Doc values by default in 2.0 will help with many cases of OOM/slow GC (but not all)
  • Often slow GCs are the compound result of a number of requests, rather than a single bad request (although one bad request can be responsible)
  • The top-searches feature (Add _top/searches API #12187) will help to identify current long queries
  • I think not all requests using the Java API can be rendered as JSON currently (Log slow queries as json, not binary. #12992)
  • Would we log on the coordinating node or the data node? If the latter, per shard or per node?


nik9000 commented Aug 24, 2015

I'm a fan of this request. In a previous life we did this logging on the client side and used it to find a few bugs.

Often slow GCs are the compound result of a number of requests, rather than a single bad request (although one bad request can be responsible)

But it'll still be visible in the logs. The logs might not be the right tool for identifying what is causing slow GCs, but they could be helpful.

The top-searches feature (#12187) will help to identify current long queries

It's more black-box-ish than a log at start and stop. Logs around the query starting and stopping are more bulletproof, I think.

I think not all requests using the Java API can be rendered as JSON currently (#12992)

This is probably worth fixing.

Would we log on the coordinating node or the data node? If the latter, per shard or per node?

Probably all of them, and make turning it on and off an index-level dynamic setting. I'd settle for just doing it on the coordinating node, à la SearchSlowLog, as a first cut. That is only slightly better than what clients can do.
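None of this exists yet, but purely as a sketch of the index-level dynamic toggle idea (the setting name index.search.request_log.enabled is invented for illustration):

```sh
# Hypothetical: switch a per-index request log on and off dynamically,
# the same way the slow log thresholds are configured today.
curl -XPUT 'localhost:9200/my-index/_settings' -d '{
  "index.search.request_log.enabled": true
}'
```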

@paullovessearch

+1


evanvolgas commented Jul 22, 2016

This should probably xref Etsy's es-restlog (https://github.com/etsy/es-restlog#overview). The points they bring up are very good, especially the point about how

{the slow query log} operates at the shard-request level so you end up with lots of lines logged in case there are multiple shards or query phases involved.

I'm also a little unclear on the state of this discussion. It kinda seems like this got de-prioritized in favor of #12187, which then maybe got rolled into #15117 (I'm actually not sure I follow how #15117 addresses this issue or #12187 exactly, unless searches, including failed searches, are tasks too... or maybe it's just Friday afternoon and I'm overlooking something very obvious?).

The original discussion seemed to focus on troubleshooting GC, for example. For me, with ES 2.x, my experience has been exactly what @clintongormley suspected, e.g.:

  • Doc values by default in 2.0 will help with many cases of OOM/slow GC (but not all)
  • Often slow GCs are the compound result of a number of requests, rather than a single bad request (although one bad request can be responsible)

That being said, I still think this idea has a lot of merit. One of the things the MySQL community got very right was the work they did with the Percona query digest (https://www.percona.com/doc/percona-toolkit/2.1/pt-query-digest.html). A DBA with the output of PT Query Digest can usually zoom in on a handful of bad query "fingerprints" that are causing an inordinate amount of work on the database and clean them up to great effect.
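For context, pt-query-digest is typically pointed at a MySQL slow query log and emits a report ranking query fingerprints by their total impact; the paths below are illustrative:

```sh
# Summarize a MySQL slow query log into a ranked report of query fingerprints.
pt-query-digest /var/log/mysql/mysql-slow.log > slow-report.txt
```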

This kind of analysis -- figuring out what queries caused the database to do the most work -- fundamentally requires the ability to review and analyze the queries the cluster is responding to, at the per-query (as opposed to per-shard) level. That's not really feasible with the slow query log.

And, specifically where that overlaps what this ticket is talking about, it's also not possible to identify query fingerprints that have a tendency to result in timeouts or failures and might need some help. In some cases, being unable to identify these fingerprints might be devastating. Suppose you have a query on a cron job that's set to alert you if a certain event threshold is exceeded (or suppose you use Watcher or Elastalert and set up the same thing there). What if your data grows or your keys get skewed and a query that started out just fine develops a tendency to time out every time you run it? Will this cron job / Watcher / Elastalert alert you to the fact that the query it's running keeps failing? The cron job might... and Elastalert and Watcher will certainly complain in their log files (maybe there's a way to push alerts on query failure but, if there is, I haven't noticed it). But will the developer / administrator of the cluster see the failures in the logs? It's not difficult to imagine that they wouldn't.

Personally, I think logging queries before execution is a great idea. Further, to the point @avleen made, I think it would also be a great idea to log queries after they execute, and to log them on the query (as opposed to shard) level.

Logging slow queries at the shard level makes great sense. But when you're trying to figure out what queries are failing most often and/or what queries your cluster spends the most time answering, I don't think the shard level is the right place to keep track of that information. As a datastore administrator, it is considerably easier for me to help developers understand which of their queries are problematic and/or failing when they're working with MySQL. With Elasticsearch... it can be done, but honestly it's a lot harder than it probably should be, because I don't have a cluster-level record of the following (a hypothetical example entry is sketched after this list):

  1. What queries users sent to the cluster
  2. Whether or not the query failed
  3. How long the query took from the time the cluster accepted it to the time it issued a response to the client
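A hypothetical example of what one such per-query record could look like (all field names and values are invented for illustration; nothing like this exists today):

```json
{
  "timestamp": "2016-07-22T14:03:11Z",
  "query_id": "9b2d6c1e-4f3a-4a7e-8c21-0d5e7f6a1b2c",
  "indices": ["my-index"],
  "source": "{\"query\":{\"match\":{\"user\":\"kimchy\"}}}",
  "status": "timeout",
  "took_ms": 30000
}
```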

@clintongormley
Contributor

Logging queries at the start of execution means that you need a second log line for the execution time, which then complicates log parsing. You can now check the task manager for long running queries, and optionally kill them.
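For reference, the task manager approach mentioned above looks roughly like this (the task id shown is illustrative):

```sh
# List currently running search tasks, including their descriptions.
curl -XGET 'localhost:9200/_tasks?actions=*search*&detailed=true'

# Cancel a long-running task by its id.
curl -XPOST 'localhost:9200/_tasks/oTUltX4IQMOUUVeiohTt8A:12345/_cancel'
```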
