
SOLR-17195: Add 'minPrefixLength' soft limit #2499

Merged · 12 commits · Jul 9, 2024

Conversation

@gerlowskija (Contributor) commented Jun 5, 2024

https://issues.apache.org/jira/browse/SOLR-17195

Description

Prefix-based queries consume memory in proportion to the number of indexed terms that start with the prefix. Short prefixes tend to match many more indexed terms and therefore consume more memory, often destabilizing the node.

Yet Solr (prior to this PR) offers no way to restrict the prefixes used in queries.

Solution

This PR adds a solrconfig.xml property, minPrefixLength, which operates similarly to the existing maxBooleanClauses. Users who submit a query with a prefix shorter than the minimum acceptable length will receive an error of the form:

Query <snip> does not meet the minimum prefix length [2] (actual=[1]).  Please try with a larger prefix, or adjust minPrefixLength in your solrconfig.xml
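Conceptually, the enforcement is just a length check made before an expensive prefix query is built. A minimal sketch in Java (the helper name matches the QueryUtils method added by this PR; the wrapper class and the non-positive-disables behavior are illustrative assumptions):

import org.apache.lucene.search.Query;
import org.apache.solr.common.SolrException;

final class MinPrefixLengthSketch {
  // Illustrative assumption: a non-positive limit disables the check.
  // Otherwise, prefixes shorter than the limit are rejected up front.
  static void ensurePrefixQueryObeysMinimumPrefixLength(
      Query query, String prefix, int minPrefixLength) {
    if (minPrefixLength <= 0 || prefix.length() >= minPrefixLength) {
      return;
    }
    throw new SolrException(
        SolrException.ErrorCode.BAD_REQUEST,
        "Query " + query + " does not meet the minimum prefix length ["
            + minPrefixLength + "] (actual=[" + prefix.length() + "]).  "
            + "Please try with a larger prefix, or adjust minPrefixLength "
            + "in your solrconfig.xml");
  }
}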

Some notes on the implementation:

  • currently only enforced for 'string' and 'text' fields, where this cardinality problem occurs most frequently
  • defaults to '2' in the default configset, prohibiting only single-character prefixes.
  • can be overridden in the default configset with the solr.min.prefixLength sysprop/env-var

Tests

Tests for the solrconfig.xml property are in SolrCoreTest. Tests for the limiting itself are in PrefixQueryTest.

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide.

Prefix queries have a memory cost that increases in proportion with the
number of indexed terms that start with that prefix.  This can cause
stability problems, especially for short prefixes which can match a
large number of indexed-terms.

This commit introduces a new solrconfig.xml setting, `minPrefixLength`,
similar to the existing `maxBooleanClauses`.  This setting causes an
error to be thrown any time a prefix query is created with a prefix
whose length is less than the configured minimum.

This setting is set at '2' by default, disallowing only single-character
prefix queries.

Still TODO:
  - ref guide documentation
  - more error-case testing
  - handle the faceting invocation
  - consider whether this should be disable-able on a per request basis
@gerlowskija (Contributor Author) commented Jun 5, 2024

Three questions I'd like to answer before this is merged:

  1. I can imagine scenarios where an advanced user might want to enable this feature, but still be able to exempt certain queries. (Perhaps they know that certain fields are low-cardinality enough to be "safe", etc.) Is it worth adding a query-param feature flag, that can be used to disable the limit on a query-by-query basis?
  2. Some faceting codepaths run prefix queries in the course of processing certain types of facets. Should we enforce this limit on these codepaths or skip it?
  3. Is there any value in expanding the limit to cover field types other than 'string' and 'text'?

Other TODOs:

  • ref-guide docs
  • additional testing
  • tidy/check.

Also expands the setting to govern text field searches.
@github-actions bot added the documentation label Jun 5, 2024
@gerlowskija marked this pull request as ready for review June 5, 2024 19:14
@dsmiley (Contributor) left a comment

I think this should be a query-parser based limit, not a field type enforced one.

@@ -103,6 +103,8 @@ public class SolrConfig implements MapSerializable {
  private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());

  public static final String DEFAULT_CONF_FILE = "solrconfig.xml";
  public static final String MIN_PREFIX_LENGTH = "minPrefixLength";
dsmiley (Contributor):

from a naming standpoint, this is very ambiguous. Prefix of what? And should this be here at all vs. a request parameter, interpreted by some (not all) query parsers?

gerlowskija (Contributor Author):

Naming-wise, I was very much following the model of maxBooleanClauses. But in hindsight, I think you're right that this is too ambiguous.

For now I'll change this to minPrefixQueryTermLength, but if there's another option there you like better, lmk.

dsmiley (Contributor):

Does the "Term" in that add anything? I suggest removing that component. Not a strong opinion though.

@dsmiley (Contributor) commented Jun 6, 2024

I wish we had brainstormed this a bit on the dev list or something prior to you showing up with a PR, where there is maybe inertia to continue with it the way it is.

@gerlowskija (Contributor Author) commented Jun 7, 2024

I don't really think that's a fair characterization. The JIRA ticket was open and advertised to all of us on issues@ 3 months ago, and no one chimed in there.

I could've created a dev@ thread to discuss approaches. And now knowing your preference, maybe I will next time. But obviously not every JIRA needs a dev@ thread. I used my best judgement. Sometimes the brainstorming and feedback just doesn't appear until there's a 'patch' to stimulate it, so I wrote one 🤷

In any case, I'm happy to discuss/brainstorm the overall approach now. I think it stands on its own without considering "inertia".


In short, my thought process was that the more broadly applicable the limit is, the more effective it will be at protecting stability. It's hard to achieve stability if the safeguards you're relying on are opt-in, "off" by default, or only apply to a subset of relevant queries. So I wanted to make the "minPrefix" limit be opt-out, apply broadly, etc.

Solr already has a query-time limit that fits this "broadly applicable" model pretty well, also in service of stability: maxBooleanClauses. mBC has caused its share of issues, but the underlying model is sound and has been a part of Solr for at least a decade. Users are already familiar with this sort of limit coming in under solrconfig.xml, etc.

So that's mostly what I've tried to do in this PR: (1) make it "broad" by design, and (2) follow mBC as a model in terms of how to configure, document, and implement.

@dsmiley - You noted above your preference for something parser-specific. Can you elaborate on that preference a bit?

@dsmiley (Contributor) commented Jun 7, 2024

I'm sorry for challenging your due diligence on soliciting input.

In short, my thought process was that the more broadly applicable the limit is, the more effective it will be at protecting stability. It's hard to achieve stability if the safeguards you're relying on are opt-in, "off" by default, or only apply to a subset of relevant queries. So I wanted to make the "minPrefix" limit be opt-out, apply broadly, etc.

This runs the real risk of an obscure limit showing itself in production that would not have been caught in a test setup, for a use-case that should absolutely not have such a limit. Generally "q" is provided by a user, uses a more user-facing query parser (namely edismax), and would be a nice place for such a limit. But otherwise, queries show up in tons of other places for internal use-cases written by systems programmers (i.e. us devs) using the default "lucene" query parser, where I don't think such a limit makes sense at all (at least not by default). Random examples -- delete-by-query, facet.query, join query.

The same is so for maxBooleanClauses! I was hoping you wouldn't bring that up, but of course you did; it's the precedent for your decision. That unfortunate precedent caused a huge, ugly debate, with only one person holding veto power who insisted on maxBooleanClauses as it exists today. It might even have been a straw that broke the camel's back and made Yonik leave to do Heliosearch. I hope we don't repeat that mistake. I've only ever increased maxBooleanClauses to something insane, not because there isn't a risk of users running expensive queries that we want to limit, but because the correct (IMO) place to enforce query constraints is at the query parser. And we do this where I work, for various things that in effect constrain query complexity.

@markrmiller you and most others were involved in that ugly debate... what do you think of Jason's approach here?

@gerlowskija (Contributor Author) commented Jun 10, 2024

Generally "q" is provided by a user, uses a more user-facing query parser (namely edismax), and would be a nice place for such a limit. But otherwise, queries show up in tons of other places for internal use-cases written by systems programmers (i.e. us devs) using the default "lucene" query parser

I'm not sure the edismax=user-queries, lucene=system-queries generalization is a safe one to make.

And even if it is, I think you're making an additional assumption here that "system" queries wouldn't also benefit from this check. I'm not sure I agree. System-devs are more likely than the average user to know about Solr's performance pitfalls, but is that safe to assume across the board? I think our many years with the project put us in danger of overestimating the knowledge and context other users might have, and underestimating the value of safeguards.

That's the key issue for me IMO - if I was more comfortable with those generalizations I'd be happy with an edismax fix. But as-is they feel like pretty shaky ground to base cluster-stability on.

I've only ever increased maxBooleanClauses to something insane, not because there isn't a risk of users doing expensive queries that we want to limit

maxBooleanClauses is, admittedly, a pain. For one, it's terribly calibrated, or at least terribly out of date. (Is 1024 clauses even a lot on today's hardware? It seems like nothing!). I'm not here to defend it. And I can't speak to the old debate without reading up on it. (Do you have a pointer handy @dsmiley ?)

But conceptually I think maxBooleanClauses takes the right response to a "DANGER" stimulus. If something has the potential to take down a cluster, a global block is the safest thing to do. It's not ideal, but in the absence of other data to block in a more targeted manner, stability is worth prioritizing. Solr should be stable by default, even if that means admins have a little extra work to do to "unlock" certain query patterns. That's mostly what I was trying to convey by the comparison to mBC.

@gerlowskija (Contributor Author) commented Jun 10, 2024

Generally "q" is provided by a user, uses a more user-facing query parser (namely edismax), and would be a nice place for such a limit. But otherwise, queries show up in tons of other places for internal use-cases written by systems programmers (i.e. us devs) using the default "lucene" query parser

Thinking about this a bit more, I wonder if the real issue here is that there's no way to tell Solr unequivocally, on a request-by-request basis: "Don't second-guess me, I know this is safe". If users had a way to communicate that, there wouldn't be any need for us as developers to try to generalize about which queries came from "end" users, what users may or may not know about Solr performance, etc.

Would adding something like that make you feel any better about the parser-agnostic approach @dsmiley ? It'd give savvy "system" users a way to sidestep the "hitting a limit on a safe query in prod" false-positive you expressed concern about above.

@markrmiller (Member) commented Jun 11, 2024

“what do you think of Jason's approach here?”

It is a kind of thorny practice that evades a good definitive judgment.

I suppose the idea is, at some, perhaps fairly arbitrary limit, you fail things to force the user to reckon with what they are doing more carefully and actively raise limits if they think that's the correct action. I say perhaps fairly arbitrary because it's difficult to do any better, given varying hardware and data.

The cost is that things that worked fine in testing may start blowing up randomly in production, perhaps needlessly, until you address the situation.
Thorny. Benefit / reward, what lines are drawn where, it's all pretty subjective. So I don't know.

As a user, I certainly would want to be able to opt in, though yes, that likely makes the effort of little value, particularly if it's opt-in per limit. A universal opt-in would be better, but would likely still end up of little value. As a user, I'll still be pissed when production starts failing things the first time and I judge it should not be doing that for my case. And at that point, I would be very unhappy if there was not a universal opt-out. Thorny.

I do see the value, but in most cases, that value would not likely be high enough to convince me as a user that this was a good default system practice. Other than for very particular things, it doesn't seem like a common practice in other systems because of how thorny it is.

I think some limits have more clear value, though: things that can build up over time and then are very difficult to address. Something along the lines of no limit on the number of fields on a document, where things work fine for the first 6 months, but over time you use more and more fields until eventually the system just kind of seizes under the weight and you can't do much but start over. Maybe hitting something much earlier that said "you are using a strangely large number of fields; this is not a very scalable approach; you have to change a configuration to continue down this path" would make a lot more sense.

@markrmiller (Member) commented:

One use case where opt-in still has a lot of value is when one team runs Solr as a service or for other teams. In that case, it's much more of an available feature than automatic user self protection.

@gerlowskija (Contributor Author) commented:

@markrmiller mentioned to me offline that it was clear that my preference was for cluster-stability over convenience/admin-friendliness, but that I never made the rationale behind that preference explicit. He suggested I take one last crack at making the case for across-the-board limiting, as explicitly as possible. So here it is:

Ultimately my rationale is pretty simple: aggressive prefix queries have the potential to block all traffic if they trigger an OOM. That's a worst-case scenario for an administrator.

There's pain for administrators either way. Realizing that some setting has errantly blocked some "known safe" queries definitely brings a serious "annoyance-factor". But to put myself in an administrator's shoes, I'd rather deal with that on 10, 20, even 100 clusters than deal with an "all traffic" outage at 3am because of a runaway prefix query.


So that's my last pitch for across-the-board limiting. I'll give this a few days to see if it sways anyone; if no one else is convinced enough to chime in or reverse course though I'll go forward with something like the more limited approach that David championed.

@dsmiley (Contributor) commented Jun 21, 2024

I've been on vacation lately.
I suggested a limit based on the query parser. This suggestion isn't limited to edismax; it'd include some others that users are likely to use. Another angle is to employ this limit only on a TextField, which generally should be used for keyword search and thus user queries. I think I can be comfortable with that. But not StrField in particular (in terms of opt-in vs opt-out); that's where my concern is most likely to show up in practice.

@gerlowskija (Contributor Author) commented Jun 25, 2024

I suggested a limit based on the query parser. This suggestion isn't limited to edismax; it'd include some others that users are likely to use.

Did you have particular parsers in mind @dsmiley ? Or is there somewhere else that the project relied on this "end-user" vs "admin" parser distinction, that I could get a list from?

I'm on board with expanding the set of parsers that will enforce this limit, but I don't have a great sense what you or others would consider "end-user" parsers. If you have a concrete subset in mind, then I'm happy to run with that.

Another angle is to employ this limit only on a TextField, which generally should be used for keyword search and thus user queries.

I think I agree that "field-type" is a better predictor of "user query or not" than the query-parser would be.

Maybe my experience is idiosyncratic, but prefix queries on TextField aren't the common case from what I've seen. Anecdotally, prefix queries are much much more common on StrField.

We could apply the limit only on TextField, but if that only covers (e.g.) 20% of the prefix queries that'll hit the index - is it providing stability? I guess it still could, if the terms dictionary was large enough. But still, leaving the majority prefix-query use-case uncovered doesn't feel great... I'm just thinking-aloud/rambling at this point I guess.


Anyway, let me think through which of these two approaches (field-based vs parser-based) I like better, and I'll push up a modification shortly.

Thanks for sticking with this through your time off and everything @dsmiley ; hope you had a good trip!

@gerlowskija (Contributor Author) commented:

The more I think through this today, the more I start to wonder whether I was wrong to oppose an "opt-in" solution originally.

It doesn't benefit all users the way an opt-out solution would, but maybe that's for the best given the high level of concern around annoyance and false-positives. Shrewd users, and folks who especially need it (e.g. the "service" use case Mark mentioned) would get a really comprehensive limit at their disposal once they enable it. And of course, it'd keep us out of the business of trying to guess which queries are likely to be "user queries".

Is that something you could get behind @dsmiley ?

@epugh (Contributor) commented Jun 26, 2024

One idea: maybe we want a page in the ref guide called "The Defensive Solr Setup" that references this type of feature and the other features that enable it? Maybe someday we even ship a configset representing a very protective configuration, to make it easier for non-experts to adopt this type of feature?

@dsmiley (Contributor) commented Jun 26, 2024

"user" query parsers: complexphrase, dismax, edismax. All others seem internal to me. We could have an abstract method on QParserPlugin (or QParser) that compels plugin authors to indicate what the default prefix query limit should be (if any).

Field types: (I accept this approach too, albeit I like it less)

Is that something you could get behind @dsmiley ?

I'm unsure exactly what you are asking me. I'm definitely behind "opt-in" :-)

@gerlowskija (Contributor Author) commented:

I'm unsure exactly what you are asking me. I'm definitely behind "opt-in" :-)

Sorry - I should've been more concrete. I'm asking if you could live with a parser-agnostic (i.e. global) limit like what the PR currently implements, if it was made "opt-in" rather than "opt-out".

AFAICT that combination would address your concerns around false-positives, making work for admins, etc. And it'd be my preference over a parser-aware approach, as I'm not totally comfortable with the idea of inferring query origin from the parser implementation.

If you're -1 on any sort of a "global" limit, even as an opt-in feature, I'll implement parser-aware. But I didn't want to do that without checking where you stood on the global+optIn approach.

@dsmiley (Contributor) left a comment

I think we can have it both ways. Opt-in (thus no limit by default). And a relatively minor change in this PR to consult the QParser which really is the right place to have a hook for this. The FieldType can continue to be involved, but I don't see that this is a FieldType specific option.

solr/core/src/java/org/apache/solr/schema/StrField.java (outdated review thread)
@@ -165,6 +167,20 @@ public Query getFieldTermQuery(QParser parser, SchemaField field, String externa
    return new TermQuery(new Term(field.getName(), br));
  }

  @Override
  public Query getPrefixQuery(QParser parser, SchemaField sf, String termStr) {
    final var query = super.getPrefixQuery(parser, sf, termStr);
dsmiley (Contributor):

Why override these methods in TextField & StrField with duplicative code when you could modify FieldType.getPrefixQuery instead? Doing it there also allows the cool existence query thing.

gerlowskija (Contributor Author):

Why override these methods in TextField & StrField with duplicative code when you could modify FieldType.getPrefixQuery instead?

Moving the logic into FieldType.getPrefixQuery would have it apply to all field types, which I'm not sure we want. The limit will probably always be unhelpful for "boolean" and "enum" type fields for instance, since they're generally so low cardinality as to be immune to the "runaway prefix query" problem.

For now I've acquiesced and moved the logic to FieldType, but curious to hear your thoughts on some of those exceptional cases.
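Roughly, the consolidated version amounts to the sketch below (simplified; getMinPrefixQueryLength is an illustrative accessor name, and the real FieldType.getPrefixQuery builds the query somewhat differently):

// Inside FieldType: enforce the limit once for every field type, rather than
// duplicating the check in StrField and TextField.  Simplified sketch only.
public Query getPrefixQuery(QParser parser, SchemaField sf, String termStr) {
  final Query query = new PrefixQuery(new Term(sf.getName(), termStr));
  QueryUtils.ensurePrefixQueryObeysMinimumPrefixLength(
      query, termStr, parser.getMinPrefixQueryLength()); // illustrative accessor
  return query;
}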

dsmiley (Contributor):

I see that the limit doesn't make much sense for those two yet it's also harmless (I think). I prefer a consistent approach with no duplication. Thanks for doing this.

gerlowskija (Contributor Author):

Well there is some harm, albeit small. A prefix query on a "known safe" field type like Enum might trigger the limit, and cause the "Admin aggravation/annoyance" we've discussed and tried to minimize throughout this PR.

That'd require a user to submit a query like my_enum_field:i*, which is presumably a pretty rare case. But you can imagine a user doing that as a shorthand way to get docs matching a few target enum values.

@@ -134,6 +135,19 @@ protected Query newWildcardQuery(org.apache.lucene.index.Term t) {
    }
  }

  @Override
  protected Query getPrefixQuery(String field, String termStr) throws ParseException {
    // TODO check the field type and call QueryUtils.ensureBlah here
dsmiley (Contributor):

It's not clear why this parser needs modifications like this

gerlowskija (Contributor Author):

I noticed this in testing initially and was surprised as well.

It looks like ComplexPhraseQParserPlugin assumes "text" and doesn't rely on FieldType (or its subclasses) to do query-construction in a schema-aware manner. I'd be curious to see what other QParsers act similarly, but I'm not familiar enough with our query-parsing code generally to say whether it makes sense or might be problematic. But definitely orthogonal to this PR either way.

dsmiley (Contributor):

Hopefully a simple fix to this parser to consult the field type will work. That's the right thing to do. Perhaps whoever wrote this didn't know better or it didn't exist at the time.

gerlowskija (Contributor Author):

I did find a way to make this change, but ended up backing off of it. (I pushed that change to this PR but then reverted it, if you're curious to what was involved.)

A few reasons for that back off:

  1. It ended up causing some complications on the Lucene-side. ComplexPhraseQueryParser (the Lucene class) makes certain type assumptions about the Query instances it handles. Injecting our own FieldType/schema aware query-creation logic into that class breaks those expectations in a pretty big way and causes a lot of test failures 😦
  2. It's orthogonal to my main goal in this PR, and I'm leery of scope creep.
  3. If we do want to address this, it's probably something that merits its own ticket to investigate and fix. After all, this probably isn't the only place in Solr that calls directly into a Lucene QueryParser without somehow creating sub-queries in a FieldType-aware way.

solr/core/src/java/org/apache/solr/search/QueryUtils.java (outdated review thread)
@@ -83,6 +85,25 @@ public static boolean isConstantScoreQuery(Query q) {
    }
  }

  public static void ensurePrefixQueryObeysMinimumPrefixLength(
      Query query, String prefix, int minPrefixLength) {
    // TODO Should we provide a query-param to disable the limit on a request-by-request basis? I
dsmiley (Contributor):

Yes indeed; in fact the QParser default algorithm could look up a local-param (not request-level) like prefixQueryMinimumLength. This will be straightforward once you create the QParser default method as I suggested.

gerlowskija (Contributor Author):

Done!
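For reference, the resolution order this ended up as can be sketched roughly like so (the 'minPrefixQueryTermLength' local-param name matches the final commit message below; the helper shape is an illustrative assumption):

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.search.QParser;

class MinPrefixResolutionSketch {
  // Rough sketch: a local param overrides the solrconfig.xml value.
  static int resolveMinPrefixLength(QParser parser, int configuredDefault) {
    final SolrParams localParams = parser.getLocalParams();
    return localParams == null
        ? configuredDefault
        : localParams.getInt("minPrefixQueryTermLength", configuredDefault);
  }
}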

minimum length requirement. Prefix queries that don't meet this requirement return an
error to users.
-->
<minPrefixLength>${solr.min.prefixLength:2}</minPrefixLength>
dsmiley (Contributor):

Again, the name is confusing, and I disagree with the choice of solr.min.prefixLength. The second component should be a category like "query", so it could be "solr.query.minPrefixLength", for example.

gerlowskija (Contributor Author):

To avoid fragmentation, let's continue XML-tag naming discussion in this thread here.

For the governing sys prop, I've changed that to 'solr.query.minPrefixLength' as suggested.

@@ -235,6 +235,18 @@ This is the same system property used in the xref:configuring-solr-xml.adoc#glob
<maxBooleanClauses>${solr.max.booleanClauses:1024}</maxBooleanClauses>
----

=== <minPrefixLength> Element
dsmiley (Contributor):

hmm; weird that a documentation file named "caches-warming" would document this query specific limit

gerlowskija (Contributor Author):

Agreed. Looking at the file, it covers everything in the <query> section of solrconfig.xml. A bunch of that is cache related, but obviously not all of it.

Maybe the scope of the file changed over time and no one ever noticed as the name became more and more out of date? Or perhaps someone noticed, but the name was intentionally retained for SEO or other reasons since it appears in the ref-guide URL?

dsmiley (Contributor):

I don't think we should keep this named as such for legacy SEO purposes. Anyway, it'd be a follow-up PR, not here.

gerlowskija (Contributor Author):

Yeah, maybe not SEO purposes. But I know there have been discussions about tracking changes to page names (I think for the purposes of the version-switcher our ref-guide has now?)

Anyway, something for another day...

@gerlowskija (Contributor Author) commented:

Alright @dsmiley - I took in most of your suggestions:

  1. The logic for determining the limit now lives in QParser
  2. QParser looks at the solrconfig.xml setting, but uses a local-param override if specified
  3. The setting is now "opt-in", with the flag value "-1" used in the default configset.

Lmk what you think; hopefully this is close.

@dsmiley (Contributor) left a comment

Definitely in great shape.

Causes type inference issues on the Lucene side, but providing it on this
PR as an example for anyone curious.
@gerlowskija (Contributor Author) commented:

AFAIK the last outstanding question here is whether to bundle in a separate change to make ComplexPhraseQParser schema-aware in how it creates its queries. There are a few technical hurdles there unfortunately, so my plan (unless I hear any vetoes) is to punt on that for now. Full discussion of the idea and subsequent issues here.

I'll give a little more time for feedback, but otherwise I'll look to merge sometime this week.

@dsmiley (Contributor) left a comment

+1 looks good.

@gerlowskija merged commit 27ef6d5 into apache:main Jul 9, 2024
2 of 3 checks passed
@gerlowskija deleted the SOLR-17195-min-prefix-soft-limit branch July 9, 2024 17:29
gerlowskija added a commit that referenced this pull request Jul 9, 2024
Prefix queries have a memory cost that increases in proportion with the
number of indexed terms that start with that prefix.  This can cause
stability problems, especially for short prefixes which can match a
large number of indexed-terms.

This commit introduces a new solrconfig.xml setting,
`minPrefixQueryTermLength`, similar to the existing
`maxBooleanClauses`.  This setting causes an error to be thrown any
time a prefix query is created with a prefix shorter than the configured
minimum.

The limit is set at '-1' in the default configset, essentially disabling it.  But
users may change the value in their own configs or use the
'solr.query.minPrefixLength' property as an override.  The limit may also be
overridden on a per-query basis by providing a 'minPrefixQueryTermLength'
local param.
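To illustrate the per-query override from SolrJ (a hypothetical usage sketch; it assumes a negative value disables the check, matching the configset default above):

import org.apache.solr.client.solrj.SolrQuery;

public class MinPrefixOverrideExample {
  public static void main(String[] args) {
    // Disable the limit for this one request via the local param.
    SolrQuery query = new SolrQuery("{!lucene minPrefixQueryTermLength=-1}id:a*");
    System.out.println(query.getQuery());
  }
}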
Labels: cat:schema, cat:search, configs, documentation, tests
4 participants