
[BUG] OpenSearch is exposed to ReDoS attack #687

Closed
oridool opened this issue May 11, 2021 · 26 comments


oridool commented May 11, 2021

Describe the bug
By using a specific regexp query, I am able to cause 100% CPU usage for a long time.

See also discussion here:
https://discuss.opendistrocommunity.dev/t/is-opendistro-opensearch-exposed-to-redos-attack/5898

To Reproduce

POST regexp-test1/_doc/test01
{
    "stringvalue" : "aaaaaaaaaaaaaaaaaaaaaaaaaasssssssssssssssssssssssssssssss"
}


GET regexp-test1/_search
{
  "query": {
    "regexp": {
      "stringvalue": {
        "value": "(.*a){2000}"
      }
    }
  }
}

GET /_nodes/stats/process

Expected behavior
I would expect the internal Lucene regexp engine to limit the execution after a short period, but unfortunately it doesn't.

Host/Environment (please complete the following information):

  • OS: CentOS Linux 7.x
  • Version: latest OpenSearch built from source (2021/05/11)

tlfeng commented May 15, 2021

Thanks @oridool for providing the detailed steps 👍, the issue is reproducible.


oridool commented May 15, 2021

> Thanks @oridool for providing the detailed steps 👍, the issue is reproducible.

@tlfeng , I think the problem ultimately lies inside the Lucene regexp engine (package org.apache.lucene.util.automaton) and in the lack of a timeout or other stop condition in case of runaway execution.
Perhaps adding a configurable limit of 500 ms to the Lucene regexp engine and aborting the execution when the timeout is exceeded is the approach to take here.
See how this approach was implemented in .NET:
https://docs.microsoft.com/en-us/dotnet/standard/base-types/best-practices?redirectedfrom=MSDN#use-time-out-values
Also see this nice implementation of the same idea for standard Java regex:
https://www.exratione.com/2017/06/preventing-unbounded-regular-expression-operations-in-java/
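
The second link describes a neat trick for plain java.util.regex: wrap the input in a CharSequence that checks a deadline on every charAt() call, so a runaway match aborts once its time budget is spent. Below is a minimal sketch of that technique (my own illustration; the class name and exception choice are hypothetical, and this is separate from the Lucene automaton engine that the regexp query actually uses):

import java.util.regex.Pattern;

// Hypothetical helper: aborts a java.util.regex match once a wall-clock deadline passes.
final class DeadlineCharSequence implements CharSequence {
    private final CharSequence inner;
    private final long deadlineNanos;

    static DeadlineCharSequence withTimeout(CharSequence inner, long timeoutMillis) {
        return new DeadlineCharSequence(inner, System.nanoTime() + timeoutMillis * 1_000_000L);
    }

    private DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
        this.inner = inner;
        this.deadlineNanos = deadlineNanos;
    }

    @Override
    public char charAt(int index) {
        // The regex engine calls charAt() for every character it inspects, so this
        // check fires often enough to stop a runaway (backtracking) match.
        if (System.nanoTime() > deadlineNanos) {
            throw new IllegalStateException("regex evaluation exceeded its time budget");
        }
        return inner.charAt(index);
    }

    @Override
    public int length() {
        return inner.length();
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
    }

    public static void main(String[] args) {
        CharSequence input = withTimeout(
                "aaaaaaaaaaaaaaaaaaaaaaaaaasssssssssssssssssssssssssssssss", 500);
        try {
            // This pattern is likely to backtrack for a long time in java.util.regex too,
            // so the deadline should abort it after roughly 500 ms.
            Pattern.compile("(.*a){2000}").matcher(input).matches();
        } catch (IllegalStateException timedOut) {
            System.out.println("aborted: " + timedOut.getMessage());
        }
    }
}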

Ori.


tlfeng commented May 17, 2021

Hi Ori,
Thank you so much for providing your opinion of the solution and the useful links! 👍👍

I spent some time looking into how Elasticsearch handles the problem. I think currently there is nothing but a blunt setting that disables regexp queries along with other expensive queries (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html#_allow_expensive_queries_6).
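For example, disabling them cluster-wide is a dynamic settings update, roughly:

PUT _cluster/settings
{
  "persistent": {
    "search.allow_expensive_queries": false
  }
}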
Besides, there is an open issue in the Elasticsearch repository (https://github.com/elastic/elasticsearch/issues/49871), and in my opinion it implies adding the timeout setting you proposed.

I will keep posting updates here. In the meantime, we would welcome contributions if you have time.

(Additional context for others to learn about the ReDoS attack 🙂: https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS)


oridool commented May 18, 2021

> Besides, there is an open issue in the Elasticsearch repository (elastic/elasticsearch#49871), and in my opinion it implies adding the timeout setting you proposed.

That issue is about limiting loops in Painless scripts, not regexp. From my experience, that was already implemented.


tlfeng commented May 18, 2021

> That issue is about limiting loops in Painless scripts, not regexp. From my experience, that was already implemented.

Ah, thanks for pointing that out. Then there is no issue in the Elasticsearch repo that deals with the timeout setting. 😅
The meta issue for Painless safety (https://github.com/elastic/elasticsearch/issues/30139) covered the regex issue, and a timeout check for Java was also proposed there, but unfortunately nothing was planned for that kind of solution...
Anyway, it shows that Elasticsearch indeed doesn't have a plan for this issue.


tlfeng commented May 19, 2021

Hi Ori, we need some time to take a deeper look into a solution for this issue. Meanwhile, we would be happy to review a PR if you have a solution.


oridool commented May 19, 2021

> Hi Ori, we need some time to take a deeper look into a solution for this issue. Meanwhile, we would be happy to review a PR if you have a solution.

Hi @tlfeng , unfortunately I don't have a PR / solution for this issue :(


tlfeng commented May 28, 2021

I realized the above Elasticsearch issues I mentioned are not quite related to this issue.
The regular expression in "Painless script" is parsed by standard Java Regex, while the "Regexp query" is parsed by Lucene, along with other kinds of queries.
In my opinion, there can be several options to solve the issue:

  1. (existing) Disable regexp queries across the whole cluster by setting "search.allow_expensive_queries" to false.
  2. Request code changes in the Lucene search engine to add a time limit for parsing regexes.
  3. Add a periodic timeout check when calling Lucene to execute the search (https://github.com/opensearch-project/OpenSearch/blob/1.0/server/src/main/java/org/opensearch/search/query/QueryPhase.java#L354); see the sketch after this list.
  4. Add a timeout check for the entire search request, because there were many defects in the existing "search.default_search_timeout" setting (https://github.com/elastic/elasticsearch/issues/30897).

Adding a periodic timeout check will impact search performance, at least when using regexp queries, so the solution needs careful consideration.
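
To make option 3 more concrete, here is a minimal sketch of a periodic timeout check at the Lucene level, assuming Lucene 8.x's ExitableDirectoryReader and QueryTimeoutImpl (both in org.apache.lucene.index); the TimeoutSearcherFactory name is mine. Note that this only guards index access during query execution, so it would not stop CPU spent determinizing the automaton while the RegexpQuery object itself is being built.

import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.ExitableDirectoryReader;
import org.apache.lucene.index.QueryTimeoutImpl;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;

// Hypothetical helper: builds a searcher whose reader aborts with
// ExitableDirectoryReader.ExitingReaderException once the time budget is spent
// while Lucene walks terms and postings during query execution.
final class TimeoutSearcherFactory {
    static IndexSearcher withTimeout(Directory dir, long timeoutMillis) throws IOException {
        DirectoryReader reader = DirectoryReader.open(dir);
        DirectoryReader exitable = ExitableDirectoryReader.wrap(reader, new QueryTimeoutImpl(timeoutMillis));
        return new IndexSearcher(exitable);
    }
}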


nknize commented May 28, 2021

/cc @mikemccand @rmuir This is a fun find. Is this something either of you have seen before?


rmuir commented May 28, 2021

we have to look at what the threads are doing (check "hot threads" a few times for stacktraces or use a profiler).
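
For anyone following along, the hot threads check mentioned here is a single request against the cluster:

GET /_nodes/hot_threads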

timeout in lucene isn't the answer for this. Any "attack" or "security" nomenclature won't change that, just require auth so that unauthenticated users can't monopolize resources.

Most likely (going on intuition), the performance problem has nothing to do with any index activity and instead is some worst case in finite state algorithms, during query formulation: either determinize() or minimize().

For determinize(), we already provide a limit option called maxDeterminizedStates with what looks like a reasonable default (10000). But maybe something is fishy here / limit not working as it should? Does ES change this default?

Separately, I see the RegExp compiler calling minimize(), which seems both unnecessary and scary. The CPU problem may also be here. Yeah, we use https://en.wikipedia.org/wiki/DFA_minimization#Hopcroft's_algorithm which has nice theoretical guarantees, but this stuff is complicated; I wouldn't be surprised if the CPU is being spent here. We removed minimization elsewhere in the finite state queries because it wasn't worth our time; I think RegExp is the last one still doing it.

Anyway, sorry I haven't dug in yet. I'm interested, I'm just super busy.


rmuir commented May 28, 2021

OK i wrote a simple lucene test (attached), this is what I see:

"TEST-TestOpensearch687.testInteresting-seed#[4B9C20A027A9850C]" #15 prio=5 os_prio=0 cpu=56960.04ms elapsed=57.49s tid=0x00007fff7006ca80 nid=0x231c8 runnable  [0x00007fff8b7f0000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.lucene.util.automaton.SortedIntSet.decr(SortedIntSet.java:106)
	at org.apache.lucene.util.automaton.Operations.determinize(Operations.java:769)
	at org.apache.lucene.util.automaton.Operations.getCommonSuffixBytesRef(Operations.java:1155)
	at org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:247)
	at org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:104)
	at org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:82)
	at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:138)
	at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:114)
	at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:72)
	at org.apache.lucene.search.RegexpQuery.<init>(RegexpQuery.java:62)
	at org.apache.lucene.TestOpensearch687.testInteresting(TestOpensearch687.java:42)

TestOpensearch687.java.txt
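
The attached test isn't inlined in the thread; a rough reconstruction of a Lucene test along those lines (my sketch based on the stack trace above, not the actual attachment) could look like:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.RegexpQuery;
import org.apache.lucene.util.LuceneTestCase;

public class TestOpensearch687 extends LuceneTestCase {
  public void testInteresting() {
    // Constructing the query is enough to reproduce the CPU burn: the automaton is
    // determinized (and its common suffix computed) inside the constructor,
    // before any index is touched.
    new RegexpQuery(new Term("stringvalue", "(.*a){2000}"));
  }
}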


tlfeng commented May 28, 2021

During my previous testing, I found that the rough cause of the issue is that the regex engine can take a long time and high CPU before it determines the total number of states for a regex.
As for the maxDeterminizedStates parameter, it works well; query parsing ends up throwing the following error, but it takes a long time to reach that state count.

"caused_by": {
                        "type": "too_complex_to_determinize_exception",
                        "reason": "Determinizing automaton with 52001 states and 84003 transitions would result in more than 10000 states."

And currently there is no server-side limit for maxDeterminizedStates; it can only be set with the search request.
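
For reference, the per-request knob looks roughly like this (reusing the reproduction index from above; 10000 is the default limit):

GET regexp-test1/_search
{
  "query": {
    "regexp": {
      "stringvalue": {
        "value": "(.*a){2000}",
        "max_determinized_states": 10000
      }
    }
  }
}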


rmuir commented May 28, 2021

Worst of all, the problem comes from this getCommonSuffixBytesRef, which should be a silly opto; it's actually totally optional. I can confirm that after 268s on my 2-core laptop it eventually hits TooComplexToDeterminizeException.

But I don't know why it takes so long :)
If I only let this optimization use up to 1000 states it terminates in 4 seconds instead.

To get the common suffix (which is important for infinite DFAs like this, e.g. "leading wildcard"), we probably shouldn't pass through the original maxDeterminizedStates anyway IMO, because we are then gonna do some evil shit like reverse the entire DFA and det() it again, then compute the common prefix.

So in my opinion the fix would be in CompiledAutomaton; at a high level, we'd change this chunk of code: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/automaton/CompiledAutomaton.java#L243-L247

into something like this:

try {
   // this is slow, and just an opto anyway, so don't burn cycles on it for some crazy worst-case.
   // if we don't set this common suffix, the query will just run a bit slower, that's all.
   int limit = Math.min(1000, maxDeterminizedStates);
   BytesRef suffix = Operations.getCommonSuffixBytesRef(binary, limit);
   ... (setting commonSuffixRef)
} catch (TooComplexToDeterminizeException notWorthIt) {
  commonSuffixRef = null;
}

It at least works around the issue, unless @mikemccand has a better idea (he knows the det() better than me).


rmuir commented May 28, 2021

I will open a lucene issue, with the lucene test boiled down from this already-wonderful test case :)

It may give some more visibility, as we have some automata nerds there to look at it.

I think, independent of this issue, something similar to my proposed change is a good idea, maybe even with a smaller limit. This is just an optimization, so we should never make things worse by spending a terrible amount of time computing it; we can just give up.

But I would rather understand why it blows up with the current maxDeterminizedStates limit that we have and not just shove the problem under the rug.


rmuir commented May 29, 2021

See https://issues.apache.org/jira/browse/LUCENE-9981

Thanks again to @oridool for the easy reproducer. We just add a bunch of Java verboseness around it and it becomes a great test for improving the situation.

@mikemccand

What a fun adversarial case!

Maybe we could create a more efficient algorithm to find the common substring of an NFA without having to determinize it... I'll try to comment on the Lucene issue.


nknize commented Jun 1, 2021

Thank you @rmuir and @mikemccand for working on the fix. We'll stand by while this bakes in main. In the meantime, we're working to ensure security and auth are enabled by default so that this query isn't exposed to unauthenticated users.


oridool commented Jun 23, 2021

@nknize, it seems the fix was "backported to 8.10.0".
Is OpenSearch going to use this Lucene version?


AmiStrn commented Jun 23, 2021

> @nknize, it seems the fix was "backported to 8.10.0".
> Is OpenSearch going to use this Lucene version?

@oridool Looks like it is on the roadmap for 2.0 (Lucene 9):
https://github.com/orgs/opensearch-project/projects/1#card-61372121

@minalsha minalsha added v2.0.0 Version 2.0.0 Indexing & Search v1.1.0 Issues, PRs, related to the 1.1.0 release labels Aug 10, 2021
@minalsha minalsha removed the v2.0.0 Version 2.0.0 label Aug 10, 2021
@minalsha minalsha assigned nknize and unassigned tlfeng Aug 10, 2021
@minalsha

@nknize is looking into this issue.


tlfeng commented Aug 23, 2021

Looks like the issue will be resolved in Lucene 8.10 and 9.0 through the JIRA ticket.

Got some information from @nknize that we will keep pace with the Lucene releases and try to upgrade whatever major/minor version we're on to the latest major/minor of Lucene.
So OpenSearch 1.x will upgrade to Lucene 8.10 as soon as it is released.

But Lucene 8.10 hasn't been released yet, and the OpenSearch 1.1 release is very close. I'm afraid we won't have enough time to test Lucene 8.10 in OpenSearch 1.1, so the issue is not likely to be resolved in that release.


nknize commented Sep 13, 2021

Bumping to v1.2.0 per @tlfeng's update.

@nknize nknize added v1.2.0 Issues related to version 1.2.0 and removed v1.1.0 Issues, PRs, related to the 1.1.0 release labels Sep 13, 2021

tlfeng commented Sep 30, 2021

Lucene 8.10 released. 🎉
According to the release notes (https://lucene.apache.org/core/corenews.html#apache-lucenetm-8100-available), an optimization resolved the issue.

> RegexpQuery's detection of adversarial (ReDoS) regular expressions is improved, catching exotic cases that it missed before, and throwing TooComplexToDeterminizeException.

@CEHENKLE

We'll be moving to 8.10 (#1413) with 1.2, so we should be able to close this out :)

@peternied

@nknize @mch2 What is the state of this issue? OpenSearch was marked as feature complete per #1317


nknize commented Nov 17, 2021

This was fixed at the lucene level in the latest release. Closing.

@nknize nknize closed this as completed Nov 17, 2021