Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SearchFilter time grows exponentially by # of search terms #4655

Closed
5 of 6 tasks
cdosborn opened this issue Nov 4, 2016 · 11 comments · Fixed by #5264
Closed
5 of 6 tasks

SearchFilter time grows exponentially by # of search terms #4655

cdosborn opened this issue Nov 4, 2016 · 11 comments · Fixed by #5264
Assignees
Labels
Milestone

Comments

@cdosborn
Copy link

cdosborn commented Nov 4, 2016

Checklist

  • I have verified that that issue exists against the master branch of Django REST framework.
  • I have searched for similar issues in both open and closed tickets and cannot find a duplicate.
  • This is not a usage question. (Those should be directed to the discussion group instead.)
  • This cannot be dealt with as a third party library. (We prefer new functionality to be in the form of third party libraries where possible.)
  • I have reduced the issue to the simplest possible case.
  • I have included a failing test as a pull request. (If you are unable to do so we can still accept the issue.)

Steps to reproduce

Use filters.SearchFilter and include a seach_field which is a many to many lookup.

    filter_backends = (filters.SearchFilter)
    search_fields = ('many__to__many__field')

Make a query against this view with several search terms.

Expected behavior

The search time would increase somewhat linearly with the # of terms.

Actual behavior

The search grows exponentially with each added term. In our application several words (3) resulted in a 30 sec query against a model that only had several hundred entries. It would take several minutes for another term and so on.

Summary

I was able to change a single block in drf and the performance became linear as I would expect. The problem and (a potential) solution are known. I wanted to bring them to your attention.

The culprit

Chaining filters in django on querysets doesn't behave as one would expect when dealing with ManyToMany relations. If you look at the gist below, you'll see that the second bit of sql is quite different from the first bit because of this difference.

https://gist.github.com/cdosborn/cb4bdfd0467feaf987476f4aefdf7ee5

From looking at the sql, you'll notice the first bit generated a bunch of unnecessary joins. These joins result in a multiplicative factor on the number of rows that the query contains. Notice how the bottom query doesn't have the redundant joins. So what we can conclude is that chaining filters can produce unnecessary joins which can dramatically effect the performance.

So there is a bit of code in drf, which chains filter for each term in the search query. This explodes whenever the search_fields contains a ManyToMany.

A solution

Rather than chaining filters in SearchFilter we build up a query first, and call filter once.

diff --git a/rest_framework/filters.py b/rest_framework/filters.py
index 531531e..0e7329b 100644
--- a/rest_framework/filters.py
+++ b/rest_framework/filters.py
@@ -144,13 +144,15 @@ class SearchFilter(BaseFilterBackend):
         ]
 
         base = queryset
+        conditions = []
         for search_term in search_terms:
             queries = [
                 models.Q(**{orm_lookup: search_term})
                 for orm_lookup in orm_lookups
             ]
-            queryset = queryset.filter(reduce(operator.or_, queries))
+            conditions.append(reduce(operator.or_, queries))
 
+        queryset = queryset.filter(reduce(operator.and_, conditions))
         if self.must_call_distinct(queryset, search_fields):
             # Filtering against a many-to-many field requires us to
             # call queryset.distinct() in order to avoid duplicate items

This may not be the fix you want. My guess is that the must_call_distinct was trying to fix this problem, but it's not sufficient. My impression is that this is a pretty serious issue that django needs to resolve.

@rpkilby
Copy link
Member

rpkilby commented Nov 4, 2016

These docs are relevant here. At the end of the day, you're getting two different queries that return two completely different sets of results. Regardless of performance, I'd argue that the proposed changes are more correct.

@cdosborn
Copy link
Author

cdosborn commented Nov 6, 2016

Would you like me to submit a PR?

Some thoughts:
Can self.must_call_distinct be removed?
Should probably search for similar uses of filter.
How should we address that this is something django should fix or provide a workaround for (normally I'd inline a comment including a link to the discussion).

@tomchristie
Copy link
Member

How should we address that this is something django should fix or provide a workaround for

If you believe this represents an issue in Django core then raise a ticket on Trac. It'd be worth reviewing what happens in the admin, and if this is replicable in the the search there too. I'd be surprised if the issue hadn't already come up before if that's the case.

@tomchristie tomchristie added this to the 3.5.3 Release milestone Nov 7, 2016
@tomchristie tomchristie added the Bug label Nov 7, 2016
@tomchristie
Copy link
Member

tomchristie commented Nov 7, 2016

Can self.must_call_distinct be removed?

Start by seeing what tests fail if you do remove it. We can then take the conversation from there.

@vstoykov
Copy link
Contributor

In the past we have similar problem with a pure django project (not using django rest framework at all). We used django-tagging and searched in the tags (which are many to many to the object). We used MySQL for database engine and when query string in our form contained a lot of words then MySQL raised that it can not join more than 40 tables (or 41 I can't remember exactly).

We fixed that by using Q objects and or-ing them instead of the querysets.

@rpkilby yes at the end you have two different SQL queries but you still have the same results set because you are using or and not and. In the django docs that you linked they write about using and with many to many relations.

@cdosborn
About self.must_call_distinct it should not be removed. Everytime when you perform query against reverse relation (many to many is de facto reverse relation from both ends) you should call distinct() or you will have duplicates. The only exception is when the reverse relation is one to one.

@rpkilby
Copy link
Member

rpkilby commented Jan 28, 2017

@rpkilby yes at the end you have two different SQL queries but you still have the same results set because you are using or and not and.

@vstoykov - The search fields per term are grouped together with or, however each group is anded together. For example, take ordering_fields = ('name', 'groups__name') and this query:

GET https://localhost/api/users?search=bob,joe

With the existing implementation, we should get a queryset equivalent to the following:

User.objects \
    .filter(Q(name__icontains='bob') | Q(groups__name__icontains='bob') \
    .filter(Q(name__icontains='joe') | Q(groups__name__icontains='joe')

The proposed changes would result in this query:

User.objects.filter((Q(name__icontains='bob') | Q(groups__name__icontains='bob'))
                  & (Q(name__icontains='joe') | Q(groups__name__icontains='joe')))

I'd have to double check, but this seems to fall under the caveats described in the docs.

@vstoykov
Copy link
Contributor

vstoykov commented Jan 30, 2017

@rpkilby Sorry I totally missed operator.and_ in:

queryset = queryset.filter(reduce(operator.and_, conditions))

This will make the situation complex. On one hand the search need to return as many as possible matching results, on other hand it should not DOS the application.

Probably there should be something that can configure this (SearchFilter's argument, or separate class which developers can use) and mentioning in documentation what are the differences and then each project developer will decide which variant to use.

@cdosborn
Copy link
Author

From one point of view, the current behavior is a bug w.r.t to handling m2m. From the docs:

If multiple search terms are used then objects will be returned in the list only if all the provided terms are matched.

As you mentioned, if we went ahead with the changes, then applications would see fewer results.

@vimarshc
Copy link
Contributor

Hey folks,
Wanted to inquire if these changes have been tested by anyone?
I wanted to override the SearchFilter class and add the changes to speed up M2M searches.

@cdosborn
Copy link
Author

cdosborn commented Jun 21, 2017

@vimarshc You may use this as a reference. We monkey-patched the search filter for the mean time.

@tomchristie
Copy link
Member

If anyone wants to progress this issue, I'd suggest making a pull request so we can look at the effects of this change on the current test suite, which would help highlight any problems it might have.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants