SearchFilter time grows exponentially by # of search terms #4655

cdosborn · 2016-11-04T18:45:00Z

Checklist

I have verified that the issue exists against the master branch of Django REST framework.
I have searched for similar issues in both open and closed tickets and cannot find a duplicate.
This is not a usage question. (Those should be directed to the discussion group instead.)
This cannot be dealt with as a third party library. (We prefer new functionality to be in the form of third party libraries where possible.)
I have reduced the issue to the simplest possible case.
I have included a failing test as a pull request. (If you are unable to do so we can still accept the issue.)

Steps to reproduce

Use filters.SearchFilter and include a search_fields entry that is a many-to-many lookup.

    # Trailing commas make these one-element tuples rather than bare values.
    filter_backends = (filters.SearchFilter,)
    search_fields = ('many__to__many__field',)

Make a query against this view with several search terms.

Expected behavior

The search time would increase somewhat linearly with the # of terms.

Actual behavior

The search time grows exponentially with each added term. In our application, three search terms resulted in a 30-second query against a model with only several hundred entries. Adding another term pushed the query to several minutes, and so on.

Summary

I was able to change a single block in drf and the performance became linear as I would expect. The problem and (a potential) solution are known. I wanted to bring them to your attention.

The culprit

Chaining filters in django on querysets doesn't behave as one would expect when dealing with ManyToMany relations. If you look at the gist below, you'll see that the second bit of sql is quite different from the first bit because of this difference.

https://gist.github.com/cdosborn/cb4bdfd0467feaf987476f4aefdf7ee5

From looking at the SQL, you'll notice that the first query generated a bunch of unnecessary joins. Each join multiplies the number of rows the query has to process. Notice that the bottom query doesn't have the redundant joins. So we can conclude that chaining filters can produce unnecessary joins, which can dramatically affect performance.
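To make the multiplicative effect concrete, here is a minimal sketch using the standard-library sqlite3 module. The `item` and `item_tag` tables are hypothetical stand-ins, and the SQL is a simplification, not the query Django actually generates. It only counts the rows the database has to consider when one to-many join is added per search term.

```python
import sqlite3


def joined_row_count(tags_per_item: int, num_terms: int) -> int:
    """Count the joined rows when one to-many join is added per search term."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one item with `tags_per_item` related tag rows.
    conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE item_tag (item_id INTEGER, name TEXT)")
    conn.execute("INSERT INTO item (id) VALUES (1)")
    conn.executemany(
        "INSERT INTO item_tag (item_id, name) VALUES (1, ?)",
        [(f"tag{i}",) for i in range(tags_per_item)],
    )
    # Chained filters on a to-many lookup add a fresh join for every term.
    joins = " ".join(
        f"JOIN item_tag t{n} ON t{n}.item_id = item.id" for n in range(num_terms)
    )
    (count,) = conn.execute(f"SELECT COUNT(*) FROM item {joins}").fetchone()
    conn.close()
    return count


for terms in (1, 2, 3):
    print(terms, joined_row_count(tags_per_item=10, num_terms=terms))
# 1 10
# 2 100
# 3 1000
```

With ten related rows per object, each extra term multiplies the intermediate row count by ten, which is consistent with the blow-up described above.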

There is a bit of code in DRF that chains a filter for each term in the search query. This explodes whenever search_fields contains a ManyToMany lookup.

A solution

Rather than chaining filters in SearchFilter we build up a query first, and call filter once.

diff --git a/rest_framework/filters.py b/rest_framework/filters.py
index 531531e..0e7329b 100644
--- a/rest_framework/filters.py
+++ b/rest_framework/filters.py
@@ -144,13 +144,15 @@ class SearchFilter(BaseFilterBackend):
         ]
 
         base = queryset
+        conditions = []
         for search_term in search_terms:
             queries = [
                 models.Q(**{orm_lookup: search_term})
                 for orm_lookup in orm_lookups
             ]
-            queryset = queryset.filter(reduce(operator.or_, queries))
+            conditions.append(reduce(operator.or_, queries))
 
+        queryset = queryset.filter(reduce(operator.and_, conditions))
         if self.must_call_distinct(queryset, search_fields):
             # Filtering against a many-to-many field requires us to
             # call queryset.distinct() in order to avoid duplicate items

This may not be the fix you want. My guess is that the must_call_distinct was trying to fix this problem, but it's not sufficient. My impression is that this is a pretty serious issue that django needs to resolve.


rpkilby · 2016-11-04T19:24:15Z

These docs are relevant here. At the end of the day, you're getting two different queries that return two completely different sets of results. Regardless of performance, I'd argue that the proposed changes are more correct.

cdosborn · 2016-11-06T18:14:51Z

Would you like me to submit a PR?

Some thoughts:
Can self.must_call_distinct be removed?
Should probably search for similar uses of filter.
How should we address that this is something django should fix or provide a workaround for (normally I'd inline a comment including a link to the discussion).

tomchristie · 2016-11-07T11:27:27Z

How should we address that this is something django should fix or provide a workaround for

If you believe this represents an issue in Django core then raise a ticket on Trac. It'd be worth reviewing what happens in the admin, and whether this is replicable in the search there too. I'd be surprised if the issue hadn't already come up before if that's the case.

tomchristie · 2016-11-07T11:29:17Z

Can self.must_call_distinct be removed?

Start by seeing what tests fail if you do remove it. We can then take the conversation from there.

vstoykov · 2017-01-27T11:55:28Z

In the past we had a similar problem with a pure Django project (not using Django REST framework at all). We used django-tagging and searched in the tags (which are many-to-many to the object). We used MySQL as the database engine, and when the query string in our form contained a lot of words, MySQL raised an error saying that it cannot join more than 40 tables (or 41, I can't remember exactly).

We fixed that by using Q objects and or-ing them instead of the querysets.

@rpkilby yes at the end you have two different SQL queries but you still have the same results set because you are using or and not and. In the django docs that you linked they write about using and with many to many relations.

@cdosborn
About self.must_call_distinct: it should not be removed. Every time you query against a reverse relation (many-to-many is de facto a reverse relation from both ends) you should call distinct() or you will get duplicates. The only exception is when the reverse relation is one-to-one.
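The duplicate problem can be seen at the SQL level with a small sketch. It uses the standard-library sqlite3 module and hypothetical `person` and `membership` tables, and it is only an illustration of why a join over a to-many relation repeats the parent row, not a reproduction of DRF's query.

```python
import sqlite3


def matching_ids(distinct: bool) -> list:
    """Return person ids whose group names contain 'admin'."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one person who belongs to two matching groups.
    conn.executescript(
        """
        CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE membership (person_id INTEGER, group_name TEXT);
        INSERT INTO person VALUES (1, 'bob');
        INSERT INTO membership VALUES (1, 'admins');
        INSERT INTO membership VALUES (1, 'site administrators');
        """
    )
    select = "SELECT DISTINCT" if distinct else "SELECT"
    rows = conn.execute(
        f"{select} person.id FROM person "
        "JOIN membership ON membership.person_id = person.id "
        "WHERE membership.group_name LIKE '%admin%'"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


print(matching_ids(distinct=False))  # [1, 1]
print(matching_ids(distinct=True))   # [1]
```

The same person comes back once per matching related row unless the query deduplicates, which is the situation distinct() is meant to handle.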

rpkilby · 2017-01-28T11:28:59Z

@rpkilby yes at the end you have two different SQL queries but you still have the same results set because you are using or and not and.

@vstoykov - The search fields per term are grouped together with OR, however each group is ANDed together. For example, take search_fields = ('name', 'groups__name') and this query:

GET https://localhost/api/users?search=bob,joe

With the existing implementation, we should get a queryset equivalent to the following:

User.objects \
    .filter(Q(name__icontains='bob') | Q(groups__name__icontains='bob')) \
    .filter(Q(name__icontains='joe') | Q(groups__name__icontains='joe'))

The proposed changes would result in this query:

User.objects.filter((Q(name__icontains='bob') | Q(groups__name__icontains='bob'))
                  & (Q(name__icontains='joe') | Q(groups__name__icontains='joe')))

I'd have to double check, but this seems to fall under the caveats described in the docs.
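As a rough illustration of that caveat, here is a pure-Python model of the two behaviours. It assumes, as the Django docs describe for multi-valued relationships, that successive filter() calls may each match a different related row, while conditions inside a single filter() call must all hold for the same related row. The data and helper names are made up for the example.

```python
from typing import Optional

# Hypothetical data: each user has a name and a list of group names.
USERS = [
    {"name": "alice", "groups": ["bob fans", "joe fans"]},
    {"name": "carol", "groups": ["bob and joe"]},
    {"name": "bob", "groups": ["joe fans"]},
]


def _contains(value: Optional[str], term: str) -> bool:
    """Case-insensitive substring match, treating None like SQL NULL."""
    return value is not None and term.lower() in value.lower()


def _joined_rows(user: dict) -> list:
    """One (name, group) row per group; a single NULL-group row if none."""
    return [(user["name"], group) for group in (user["groups"] or [None])]


def chained_filters(users: list, terms: list) -> list:
    """Model of one filter() per term: each term may match a different row."""
    return [
        u["name"] for u in users
        if all(
            any(_contains(n, t) or _contains(g, t) for n, g in _joined_rows(u))
            for t in terms
        )
    ]


def single_filter(users: list, terms: list) -> list:
    """Model of one filter() call: a single row must satisfy every term."""
    return [
        u["name"] for u in users
        if any(
            all(_contains(n, t) or _contains(g, t) for t in terms)
            for n, g in _joined_rows(u)
        )
    ]


print(chained_filters(USERS, ["bob", "joe"]))  # ['alice', 'carol', 'bob']
print(single_filter(USERS, ["bob", "joe"]))    # ['carol', 'bob']
```

In this model, alice matches only under chained filters, because "bob" and "joe" are found in two different groups. The single-filter form returns a subset of the chained results.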

vstoykov · 2017-01-30T12:47:27Z

@rpkilby Sorry I totally missed operator.and_ in:

queryset = queryset.filter(reduce(operator.and_, conditions))

This makes the situation more complex. On one hand, the search needs to return as many matching results as possible; on the other hand, it should not DoS the application.

There should probably be a way to configure this (an argument to SearchFilter, or a separate class that developers can use), with the documentation explaining the differences, so that each project's developers can decide which variant to use.

cdosborn · 2017-01-30T17:29:26Z

From one point of view, the current behavior is a bug w.r.t. handling m2m. From the docs:

If multiple search terms are used then objects will be returned in the list only if all the provided terms are matched.

As you mentioned, if we went ahead with the changes, then applications would see fewer results.

vimarshc · 2017-06-21T07:56:40Z

Hey folks,
Wanted to inquire if these changes have been tested by anyone?
I wanted to override the SearchFilter class and add the changes to speed up M2M searches.

cdosborn · 2017-06-21T19:58:35Z

@vimarshc You may use this as a reference. We monkey-patched the search filter in the meantime.

tomchristie · 2017-07-10T13:40:58Z

If anyone wants to progress this issue, I'd suggest making a pull request so we can look at the effects of this change on the current test suite, which would help highlight any problems it might have.

tomchristie added this to the 3.5.3 Release milestone Nov 7, 2016

tomchristie added the Bug label Nov 7, 2016

tomchristie modified the milestones: 3.5.3 Release, 3.5.4 Release Nov 7, 2016

tomchristie modified the milestones: 3.5.4 Release, 3.6.1 Release, 3.6.2 Release, 3.6.3 Release Mar 7, 2017

tomchristie modified the milestones: 3.6.3 Release, 3.6.4 Release May 12, 2017

rpkilby self-assigned this Jul 10, 2017

rpkilby pushed a commit to rpkilby/django-rest-framework that referenced this issue Jul 10, 2017

Add failing test for encode#4655

f02b7f1

rpkilby mentioned this issue Jul 10, 2017

Fix SearchFilter to-many behavior/performance #5264

Merged

jpadilla closed this as completed in #5264 Jul 11, 2017

carltongibson mentioned this issue Dec 19, 2018

Used QuerySet.bulk_create() in the SearchFilterToManyTests #6312

Closed

rpkilby mentioned this issue Dec 18, 2019

Doc/test SearchFilter m2m behavior #7094

Closed
