Don't serialize the entire query when query parsing fails #51843

jimczi · 2020-02-04T00:42:58Z

Today when the query parsing fails on a shard, we create a QueryShardException that serializes the entire QueryBuilder in the message. All these exceptions are kept on a per-shard basis by the coordinating node so we except them to be light weight. For very large boolean queries (with lots of clauses that trip the max_clause_count for instance) this can cause memory pressure on the coordinating node since we don't de-duplicate exceptions coming from different shards.
The simplest fix would be to limit the size of the QueryShardException's message by truncating the QueryBuilder serialization if it is above a certain threshold (1k ?).

The text was updated successfully, but these errors were encountered:

elasticmachine · 2020-02-04T00:43:00Z

Pinging @elastic/es-search (:Search/Search)

elasticmachine · 2020-02-04T00:43:01Z

Pinging @elastic/es-core-infra (:Core/Infra/Logging)

QueryBuilders that throw exceptions on shards when building the Lucene query returns the full serialization of the query builder in the exception message. For large queries that fails to execute due to the max boolean clause, this means that we keep a reference of these big messages for every shard that participate in the request. In order to limit the memory needed to hold these query shard exceptions in the coordinating node, this change removes the query builder serialization from the shard exception. The query is known by the user so there should be no need to repeat it on every shard exception. We could also omit the entire stack trace for known bad request exception but it would deserve a separate issue/pr. Closes elastic#51843 Closes elastic#48910

altinp · 2020-02-05T15:50:33Z

Thanks @jimczi, this should sort out #48910 ! One question (and pardon my ignorance): is it not possible for the coordinating node to do some (basic/partial?) parsing just to ensure that global circuit breakers like too_many_clauses are not tripped, so as to not even send the query to all shards? If not, when putting together the response, it could check the first shard failure's reason.caused_by.caused_by.{type|reason} and if something global like too_many_clauses, it could send back a much smaller response, omitting the rest of the shard failures.

…ge (#51885) QueryBuilders that throw exceptions on shards when building the Lucene query returns the full serialization of the query builder in the exception message. For large queries that fails to execute due to the max boolean clause, this means that we keep a reference of these big messages for every shard that participate in the request. In order to limit the memory needed to hold these query shard exceptions in the coordinating node, this change removes the query builder serialization from the shard exception. The query is known by the user so there should be no need to repeat it on every shard exception. We could also omit the entire stack trace for known bad request exception but it would deserve a separate issue/pr. Closes #51843 Closes #48910

jimczi · 2020-02-06T08:33:50Z

is it not possible for the coordinating node to do some (basic/partial?) parsing just to ensure that global circuit breakers like too_many_clauses are not tripped, so as to not even send the query to all shards?

The coordinating node doesn't have the mapping so it's not possible at the moment. The change I committed is just a small fix, there's a lot of follow up that we could try. One thing that I'd like to test is whether we can de-duplicate the shard failures at the coordinator level. We shouldn't have to save every shard failures if they are all the same (parsing failure, too many clauses, ...). Another possibility would be to try to early terminate the query if a shard failure is reported and allow_partial_results is set to false.

altinp · 2020-02-06T15:06:47Z

The coordinating node doesn't have the mapping so it's not possible at the moment.

Ah, true. Would this (and other?) use cases perhaps benefit from an integrated "schema" (mapping) registry? This could be replicated on all nodes for fast access, since the overall size would be small.
Even before getting to that though, would it not be possible to do some mapping-unaware, Lucene-syntactical parsing on the as-is user supplied query, like checking for the number of terms in a terms query, or of single term clauses, or of boolean OR clauses in query_string. I.e. skim over the query just to rule out a few early-termination opportunities. (But yes, perhaps worth waiting for a comprehensive approach instead.)

I also do agree on deduping the shard failures as they stream in. Even without peeking too deep into the error strings and relying on string similarity etc., you could perhaps check reason.caused_by.caused_by.type etc. for some global limit breach like too_many_clauses and collapse on that.

Early termination when allow_partial_search_results=false would be great for us! In all our current use cases, and probably many others', partial results are worse than none. This would be a solid last line of defense since it's not dependent on the type of query/failure. (As you noted, the case of a large query_string would need a different fix than #51885)

Thanks again!

altinp · 2020-02-06T15:43:12Z

@jimczi just to reconfirm: current behavior is for coordinating node to hold on to all shard failures regardless of allow_partial_search_results, right? I.e. this param only governs what is sent to the client, so a code change is needed.

One remaining worry is that this is only a search param. We have users who go through a UI facade that we control tightly but also data scientists who have more freedom in writing queries directly to ES. In the latter case, we'd still be vulnerable if they omit this param in their searches.

Perhaps this could also be made available as a global setting?

Finally, would you still want/have to distinguish between shard failures that can be retried against another replica and those that should cause the query to be terminated right away?

jimczi added >enhancement :Search/Search Search-related issues that do not fall into other categories :Core/Infra/Logging Log management and logging utilities labels Feb 4, 2020

jimczi mentioned this issue Feb 4, 2020

Remove the query builder serialization from QueryShardException message #51885

Merged

jimczi closed this as completed in #51885 Feb 6, 2020

codebrain mentioned this issue Apr 1, 2020

7.7.0 meta ticket (Part 2) elastic/elasticsearch-net#4533

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't serialize the entire query when query parsing fails #51843

Don't serialize the entire query when query parsing fails #51843

jimczi commented Feb 4, 2020 •

edited

Loading

elasticmachine commented Feb 4, 2020

elasticmachine commented Feb 4, 2020

altinp commented Feb 5, 2020

jimczi commented Feb 6, 2020

altinp commented Feb 6, 2020 •

edited

Loading

altinp commented Feb 6, 2020

Don't serialize the entire query when query parsing fails #51843

Don't serialize the entire query when query parsing fails #51843

Comments

jimczi commented Feb 4, 2020 • edited Loading

elasticmachine commented Feb 4, 2020

elasticmachine commented Feb 4, 2020

altinp commented Feb 5, 2020

jimczi commented Feb 6, 2020

altinp commented Feb 6, 2020 • edited Loading

altinp commented Feb 6, 2020

jimczi commented Feb 4, 2020 •

edited

Loading

altinp commented Feb 6, 2020 •

edited

Loading