SOLR-17195: Add 'minPrefixLength' soft limit #2499
Conversation
Prefix queries have a memory cost that increases in proportion to the number of indexed terms that start with that prefix. This can cause stability problems, especially for short prefixes, which can match a large number of indexed terms. This commit introduces a new solrconfig.xml setting, `minPrefixLength`, similar to the existing `maxBooleanClauses`. This setting causes an error to be thrown any time a prefix query is created with a prefix whose length is less than the configured minimum. The setting defaults to '2', disallowing only single-character prefix queries (see the sketch after the TODO list below).

Still TODO:
- ref guide documentation
- more error-case testing
- handle the faceting invocation
- consider whether this should be disable-able on a per-request basis
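As a quick illustration of the behavior described above (a standalone sketch, not the PR's actual code; the field name and check logic are simplified for illustration):

```java
public class MinPrefixLengthDemo {
  // Simplified form of the proposed check: prefixes shorter than the
  // configured minimum are rejected before the query is executed.
  static boolean isAllowed(String prefix, int minPrefixLength) {
    return prefix.length() >= minPrefixLength;
  }

  public static void main(String[] args) {
    int minPrefixLength = 2; // the proposed default
    System.out.println(isAllowed("a", minPrefixLength));  // false -> "name:a*" would error
    System.out.println(isAllowed("ab", minPrefixLength)); // true  -> "name:ab*" is accepted
  }
}
```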
Three questions I'd like to answer before this is merged:
Other TODOs:
Also expands the setting to govern text field searches.
I think this should be a query-parser-based limit, not a field-type-enforced one.
```
@@ -103,6 +103,8 @@ public class SolrConfig implements MapSerializable {
  private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());

  public static final String DEFAULT_CONF_FILE = "solrconfig.xml";
  public static final String MIN_PREFIX_LENGTH = "minPrefixLength";
```
From a naming standpoint, this is very ambiguous. Prefix of what? And should this be here at all, vs. a request parameter interpreted by some (not all) query parsers?
Naming-wise, I was very much following the model of maxBooleanClauses. But in hindsight, I think you're right that this is too ambiguous. For now I'll change this to minPrefixQueryTermLength, but if there's another option there you like better, lmk.
Does the "Term" in that add anything? I suggest removing that component. Not a strong opinion though.
I wish we had brainstormed this a bit on the dev list or something prior to you showing up with a PR, where there is maybe inertia to continue with it the way it is.
I don't really think that's a fair characterization. The JIRA ticket was open and advertised to all of us on […]. I could've created a […].

In any case, I'm happy to discuss/brainstorm the overall approach now. I think it stands on its own without considering "inertia".

In short, my thought process was that the more broadly applicable the limit is, the more effective it will be at protecting stability. It's hard to achieve stability if the safeguards you're relying on are opt-in, "off" by default, or only apply to a subset of relevant queries. So I wanted to make the "minPrefix" limit be opt-out, apply broadly, etc. Solr already has a query-time limit that fits this "broadly applicable" model pretty well, also in service of stability: maxBooleanClauses.

So that's mostly what I've tried to do in this PR: (1) make it "broad" by design, and (2) follow maxBooleanClauses as a model in terms of how to configure, document, and implement.

@dsmiley - You noted above your preference for something parser-specific. Can you elaborate on that preference a bit?
I'm sorry for challenging your due diligence on soliciting input.
This runs the real risk of an obscure limit showing itself in production, for a use-case that should absolutely not have such a limit, that would not have been caught in a test setup. Generally "q" is provided by a user, uses a more user-facing query parser (namely edismax), and would be a nice place for such a limit. But otherwise, queries show up in tons of other places for internal use-cases written by systems programmers (i.e. us devs) using the default "lucene" query parser, where I don't think such a limit makes sense at all (at least not by default). Random examples: delete-by-query, facet.query, join query.

The same is so for maxBooleanClauses! I was hoping you wouldn't bring that up, but of course you did; it's the precedent for your decision. That unfortunate precedent caused a huge ugly debate, with only one person holding veto power who insisted on maxBooleanClauses as it exists today. That might even have been part of what made Yonik leave to do Heliosearch, if any one straw broke that back. I hope we don't repeat that mistake. I've only ever increased maxBooleanClauses to something insane - not because there isn't a risk of users doing expensive queries that we want to limit, but because the correct (IMO) place to enforce query constraints is at the query parser. And we do this where I work for various things that in effect constrain query complexity.

@markrmiller you and most others were involved in that ugly debate... what do you think of Jason's approach here?
I'm not sure the edismax=user-queries, lucene=system-queries generalization is a safe one to make. And even if it is, I think you're making an additional assumption here that "system" queries wouldn't also benefit from this check. I'm not sure I agree. System devs are more likely than the average user to know about Solr's performance pitfalls, but is that safe to assume across the board? I think our many years with the project put us in danger of overestimating the knowledge and context other users might have, and underestimating the value of safeguards.

That's the key issue for me IMO - if I were more comfortable with those generalizations, I'd be happy with an edismax fix. But as-is they feel like pretty shaky ground to base cluster stability on.
But conceptually I think […]
Thinking about this a bit more, I wonder if the real issue here is that there's no way to tell Solr unequivocally, on a request-by-request basis: "Don't second-guess me, I know this is safe". If users had a way to communicate that, there wouldn't be any need for us as developers to try to generalize about which queries came from "end" users, what users may or may not know about Solr performance, etc.

Would adding something like that make you feel any better about the parser-agnostic approach, @dsmiley? It'd give savvy "system" users a way to sidestep the "hitting a limit on a safe query in prod" false positive you expressed concern about above.
> what do you think of Jason's approach here?

It is a kind of thorny practice that evades a good definitive judgment. I suppose the idea is, at some perhaps fairly arbitrary limit, you fail things to force the user to reckon with what they are doing more carefully, and to actively raise limits if they think that's the correct action. I say perhaps fairly arbitrary because it's difficult to do any better, given varying hardware and data. The cost is that things that worked fine in testing may start blowing up randomly in production, perhaps needlessly, until you address the situation.

As a user, I certainly would want to be able to opt in, though yes, that makes the effort likely of little value, particularly if it's opt-in per limit. A universal opt-in would be better, but would likely still end up of little value then. As a user, I'll still be pissed when production starts failing things the first time, and I judge it should not be doing that for my case. And at that point, I would be very unhappy if there was not a universal opt-out. Thorny. I do see the value, but in most cases, that value would not likely be high enough to convince me as a user that this was a good default system practice. Other than for very particular things, it doesn't seem like a common practice in other systems, because of how thorny it is.

I think some limits have a bit more clear value, though. Things that can build up over time and then are very difficult to address. Something along the lines of no limit to the number of fields on a document, where things can work fine for the first 6 months, but over time you are using more and more fields, and eventually the system just kind of seizes under the weight, and you can't do much but start over. Maybe hitting something much earlier that said "you are using a strangely large number of fields; this is not a very scalable approach; you have to change a configuration to continue down this path" would make a lot more sense.
One use case where opt-in still has a lot of value is when one team runs Solr as a service for other teams. In that case, it's much more of an available feature than automatic user self-protection.
@markrmiller mentioned to me offline that it was clear that my preference was for cluster stability over convenience/admin-friendliness, but that I never made the rationale behind that preference explicit. He suggested I take one last crack at making the case for across-the-board limiting, as explicitly as possible. So here it is:

Ultimately my rationale is pretty simple: aggressive prefix queries have the potential to block all traffic if they trigger an OOM. That's a worst-case scenario for an administrator. There's pain for administrators either way. Realizing that some setting has errantly blocked some "known safe" queries definitely brings a serious "annoyance factor". But to put myself in an administrator's shoes, I'd rather deal with that on 10, 20, even 100 clusters than deal with an "all traffic" outage at 3am because of a runaway prefix query.

So that's my last pitch for across-the-board limiting. I'll give this a few days to see if it sways anyone; if no one else is convinced enough to chime in or reverse course, though, I'll go forward with something like the more limited approach that David championed.
I've been on vacation lately. |
Did you have particular parsers in mind, @dsmiley? Or is there somewhere else that the project has relied on this "end-user" vs "admin" parser distinction, that I could get a list from? I'm on board with expanding the set of parsers that will enforce this limit, but I don't have a great sense of what you or others would consider "end-user" parsers. If you have a concrete subset in mind, then I'm happy to run with that.
I think I agree that field type is a better predictor of "user query or not" than the query parser would be. Maybe my experience is idiosyncratic, but prefix queries on TextField aren't the common case from what I've seen. Anecdotally, prefix queries are much, much more common on StrField. We could apply the limit only on TextField, but if that only covers (e.g.) 20% of the prefix queries that'll hit the index - is it providing stability? I guess it still could, if the terms dictionary was large enough. But still, leaving the majority prefix-query use-case uncovered doesn't feel great... I'm just thinking aloud/rambling at this point, I guess.

Anyway, let me think through which of these two approaches (field-based vs parser-based) I like better, and I'll push up a modification shortly. Thanks for sticking with this through your time off and everything @dsmiley; hope you had a good trip!
The more I think through this today, the more I start to wonder whether I was wrong to oppose an "opt-in" solution originally. It doesn't benefit all users the way an opt-out solution would, but maybe that's for the best given the high level of concern around annoyance and false positives. Shrewd users, and folks who especially need it (e.g. the "service" use case Mark mentioned), would get a really comprehensive limit at their disposal once they enable it. And of course, it'd keep us out of the business of trying to guess which queries are likely to be "user queries". Is that something you could get behind, @dsmiley?
One idea: maybe we'd want a page in the ref guide called "The Defensive Solr Setup" that could reference this type of feature and other features that enable this? Maybe someday we ship a configset that represents a very protective configuration? To make it easier for non-experts to adopt this type of feature?
"user" query parsers: complexphrase, dismax, edismax. All others seem internal to me. We could have an abstract method on QParserPlugin (or QParser) that compels plugin authors to indicate what the default prefix query limit should be (if any). Field types: (I accept this approach too, albeit I like it less)
I'm unsure exactly what you are asking me. I'm definitely behind "opt-in" :-)
Sorry - I should've been more concrete. I'm asking whether you could live with a parser-agnostic (i.e. global) limit like what the PR currently implements, if it were made "opt-in" rather than "opt-out". AFAICT that combination would address your concerns around false positives, making work for admins, etc. And it'd be my preference over a parser-aware approach, as I'm not totally comfortable with the idea of inferring query origin from the parser implementation. If you're -1 on any sort of "global" limit, even as an opt-in feature, I'll implement parser-aware. But I didn't want to do that without checking where you stood on the global+optIn approach.
I think we can have it both ways. Opt-in (thus no limit by default), and a relatively minor change in this PR to consult the QParser, which really is the right place to have a hook for this. The FieldType can continue to be involved, but I don't see that this is a FieldType-specific option.
```
@@ -165,6 +167,20 @@ public Query getFieldTermQuery(QParser parser, SchemaField field, String externa
    return new TermQuery(new Term(field.getName(), br));
  }

  @Override
  public Query getPrefixQuery(QParser parser, SchemaField sf, String termStr) {
    final var query = super.getPrefixQuery(parser, sf, termStr);
```
Why override these methods in TextField & StrField with duplicative code when you could modify FieldType.getPrefixQuery instead? Doing it there also allows the cool existence query thing.
> Why override these methods in TextField & StrField with duplicative code when you could modify FieldType.getPrefixQuery instead?
Moving the logic into FieldType.getPrefixQuery would have it apply to all field types, which I'm not sure we want. The limit will probably always be unhelpful for "boolean" and "enum" type fields, for instance, since they're generally so low-cardinality as to be immune to the "runaway prefix query" problem.
For now I've acquiesced and moved the logic to FieldType, but I'm curious to hear your thoughts on some of those exceptional cases.
I see that the limit doesn't make much sense for those two yet it's also harmless (I think). I prefer a consistent approach with no duplication. Thanks for doing this.
Well there is some harm, albeit small. A prefix query on a "known safe" field type like Enum might trigger the limit, and cause the "Admin aggravation/annoyance" we've discussed and tried to minimize throughout this PR.
That'd require a user to submit a query like my_enum_field:i*, which is presumably a pretty rare case. But you can imagine a user maybe doing that as a shorthand way to get docs matching a few target enum values.
```
@@ -134,6 +135,19 @@ protected Query newWildcardQuery(org.apache.lucene.index.Term t) {
    }
  }

  @Override
  protected Query getPrefixQuery(String field, String termStr) throws ParseException {
    // TODO check the field type and call QueryUtils.ensureBlah here
```
It's not clear why this parser needs modifications like this
I noticed this in testing initially and was surprised as well.
It looks like ComplexPhraseQParserPlugin assumes "text" and doesn't rely on FieldType (or its subclasses) to do query-construction in a schema-aware manner. I'd be curious to see what other QParsers act similarly, but I'm not familiar enough with our query-parsing code generally to say whether it makes sense or might be problematic. But definitely orthogonal to this PR either way.
Hopefully a simple fix to this parser to consult the field type will work. That's the right thing to do. Perhaps whoever wrote this didn't know better or it didn't exist at the time.
I did find a way to make this change, but ended up backing off of it. (I pushed that change to this PR but then reverted it, if you're curious as to what was involved.)
A few reasons for that back off:
- It ended up causing some complications on the Lucene-side. ComplexPhraseQueryParser (the Lucene class) makes certain type assumptions about the Query instances it handles. Injecting our own FieldType/schema aware query-creation logic into that class breaks those expectations in a pretty big way and causes a lot of test failures 😦
- It's orthogonal to my main goal in this PR, and I'm leery of scope creep.
- If we do want to address this, it's probably something that merits its own ticket to investigate and fix. After all, this probably isn't the only place in Solr that calls directly into a Lucene QueryParser without somehow creating sub-queries in a FieldType-aware way.
```
@@ -83,6 +85,25 @@ public static boolean isConstantScoreQuery(Query q) {
    }
  }

  public static void ensurePrefixQueryObeysMinimumPrefixLength(
      Query query, String prefix, int minPrefixLength) {
    // TODO Should we provide a query-param to disable the limit on a request-by-request basis? I
```
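A self-contained sketch of what the truncated helper above plausibly does (the exception type and message are illustrative; the real implementation may differ):

```java
// Sketch of the QueryUtils helper: throw if the prefix is shorter than the
// configured minimum; a non-positive minimum disables the check entirely.
final class QueryUtilsSketch {
  static void ensurePrefixQueryObeysMinimumPrefixLength(
      Object query, String prefix, int minPrefixLength) {
    if (minPrefixLength > 0 && prefix.length() < minPrefixLength) {
      throw new IllegalArgumentException(
          "Query " + query + " uses prefix '" + prefix
              + "', which is shorter than the configured minimum of " + minPrefixLength);
    }
  }
}
```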
Yes indeed; in fact the QParser default algorithm could look up a local param (not request-level) like prefixQueryMinimumLength. This will be straightforward once you create the QParser default method as I suggested.
Done!
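For readers: the per-query override landed as a `minPrefixQueryTermLength` local param (per the final PR description below). A hedged SolrJ sketch of using it; the parser choice and field name are made up:

```java
import org.apache.solr.client.solrj.SolrQuery;

public class LocalParamOverrideExample {
  public static void main(String[] args) {
    // Relax the limit for this one request via the local param; 'name' is a
    // hypothetical field, and '1' permits a single-character prefix here.
    SolrQuery query = new SolrQuery("{!lucene minPrefixQueryTermLength=1}name:a*");
    System.out.println(query); // prints the URL-encoded q parameter
  }
}
```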
```
     minimum length requirement. Prefix queries that don't meet this requirement return an
     error to users.
-->
<minPrefixLength>${solr.min.prefixLength:2}</minPrefixLength>
```
Again; the name is confusing, and I disagree with the choice of solr.min.prefixLength. The second component should be a category like "query", so it could be "solr.query.minPrefixLength", for example.
To avoid fragmentation, let's continue the XML-tag naming discussion in this thread here.

For the governing sysprop, I've changed it to 'solr.query.minPrefixLength' as suggested.
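For anyone unfamiliar with the mechanism: the renamed sysprop feeds solrconfig.xml's `${...}` substitution. A trivial sketch of the property-lookup side (in practice it's usually passed as `-Dsolr.query.minPrefixLength=...` at JVM startup):

```java
public class SysPropExample {
  public static void main(String[] args) {
    // Solr substitutes ${solr.query.minPrefixLength:2} from this system property,
    // falling back to the default after the colon when the property is unset.
    String value = System.getProperty("solr.query.minPrefixLength", "2");
    System.out.println("effective minPrefixLength: " + value);
  }
}
```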
```
@@ -235,6 +235,18 @@ This is the same system property used in the xref:configuring-solr-xml.adoc#glob
<maxBooleanClauses>${solr.max.booleanClauses:1024}</maxBooleanClauses>
----

=== <minPrefixLength> Element
```
Hmm; weird that a documentation file named "caches-warming" would document this query-specific limit.
Agreed. Looking at the file, it covers everything in the <query> section of solrconfig.xml. A bunch of that is cache-related, but not all of it, obviously.

Maybe the scope of the file changed over time and no one ever noticed as the name became more and more out of date? Or perhaps someone noticed, but the name was intentionally retained for SEO or other reasons, since it appears in the ref-guide URL?
I don't think we should keep this named as such for legacy SEO purposes. Anyway, it'd be a follow-up PR, not here.
Yeah, maybe not SEO purposes. But I know there have been discussions about tracking changes to page names (I think for the purposes of the version-switcher our ref-guide has now?)
Anyway, something for another day...
Alright @dsmiley - I took in most of your suggestions. Lmk what you think; hopefully this is close.
Definitely in great shape.
Causes type inference issues on the Lucene side, but provided on this PR as an example for anyone curious.
This reverts commit df18b2b.
AFAIK the last outstanding question here is whether to bundle in a separate change to make ComplexPhraseQParser schema-aware in how it creates its queries. There are a few technical hurdles there, unfortunately - so my plan (unless I hear any vetoes) is to punt on that for now. Full discussion of the idea and subsequent issues here. I'll give a little more time for feedback, but I'll look to merge sometime this week otherwise.
+1 looks good.
Prefix queries have a memory cost that increases in proportion to the number of indexed terms that start with that prefix. This can cause stability problems, especially for short prefixes, which can match a large number of indexed terms.

This commit introduces a new solrconfig.xml setting, `minPrefixQueryTermLength`, similar to the existing `maxBooleanClauses`. This setting causes an error to be thrown any time a prefix query is created with a prefix shorter than the configured minimum. The limit is set at '-1' in the default configset, essentially disabling it. But users may change the value in their own configs or use the 'solr.query.minPrefixLength' property as an override. The limit may also be overridden on a per-query basis by providing a 'minPrefixQueryTermLength' local param.
https://issues.apache.org/jira/browse/SOLR-17195
Description
Prefix-based queries consume memory in proportion to the number of terms in the index that start with the prefix. Short prefixes tend to match many more indexed terms, and consume more memory as a result, often causing instability issues on the node.
Yet Solr (prior to this PR) offers no way to restrict the prefixes used in queries.
Solution
This PR adds a solrconfig.xml property, `minPrefixLength`, which operates similarly to the existing `maxBooleanClauses`. Users who submit a query with a prefix shorter than the minimal acceptable length will receive an error.

Some notes on the implementation:
- The value can be overridden via the `solr.min.prefixLength` sysprop/env-var
sysprop/env-varTests
Tests for solrconfig.xml in `SolrCoreTest`. Tests for the limiting itself in `PrefixQueryTest`.
Checklist
Please review the following and check all that apply:
- I have developed this patch against the `main` branch.
- I have run `./gradlew check`.