Doc/test SearchFilter m2m behavior #7094

Pomax · 2019-12-16T23:53:37Z

A change to rest_framework's filters.py, introduced in #5264 and pushed out as part of v3.6.4, appears to have changed how searching with multiple filter conditions works when the fields involved are ManyToMany relations rather than plain model fields or ForeignKey relations.

This PR adds a failing test to test_filters.py that highlights a search involving ManyToMany.

(Note that the single-tag tests are in the same file mostly to verify that it really is only multiple filter conditions that are affected)

I tried to verify this against the commit preceding #5264 (f02b7f1) but I was unable to get the tests to run succesfully, even without any new code - even with django pegged to 1.11 the result of runtests --fast is 54 errors, so I'm quite sure how to even check which historical versions of that various dependencies are necessary to make all the pre-existing tests pass.

…t result

rpkilby · 2019-12-17T00:37:58Z

I usually try to run tests via tox. In order to get this to work, I had to unpin the dependencies in requirements-testing.txt, then I could run:

tox -e py27-django111

What I ended up doing was copying the entire setup for SearchFilterToManyTests to its own file so I could run them without affecting the rest of the filter tests. I also had to rename Tag to Tag2 to avoid a model name conflict.

tox -e py27-django111 -- -k  test_filters_multiple

Contents of tests/tests_filters_multiple.py

import datetime

from django.db import models
from django.test import TestCase

from rest_framework import filters, generics, serializers
from rest_framework.test import APIRequestFactory

factory = APIRequestFactory()


class Blog(models.Model):
    name = models.CharField(max_length=20)


class Tag2(models.Model):
    name = models.CharField(max_length=20)


class Entry(models.Model):
    blog = models.ForeignKey(Blog, on_delete=models.CASCADE)
    tags = models.ManyToManyField(Tag2, related_name='entries', blank=True)
    headline = models.CharField(max_length=120)
    pub_date = models.DateField(null=True)


class BlogSerializer(serializers.ModelSerializer):
    class Meta:
        model = Blog
        fields = '__all__'


class SearchFilterToManyTests(TestCase):

    @classmethod
    def setUpTestData(cls):
        b1 = Blog.objects.create(name='Blog 1')
        b2 = Blog.objects.create(name='Blog 2')

        t1 = Tag2.objects.create(name='tag1')
        t2 = Tag2.objects.create(name='tag2')

        # Multiple entries on Lennon published in 1979 - distinct should deduplicate
        e1 = Entry.objects.create(blog=b1, headline='Something about Lennon', pub_date=datetime.date(1979, 1, 1))
        e1.tags.add(t1)
        e1.save()

        e2 = Entry.objects.create(blog=b1, headline='Another thing about Lennon', pub_date=datetime.date(1979, 6, 1))
        e2.tags.add(t2)
        e2.save()

        # Entry on Lennon *and* a separate entry in 1979 - should not match
        e3 = Entry.objects.create(blog=b2, headline='Something unrelated', pub_date=datetime.date(1979, 1, 1))

        e4 = Entry.objects.create(blog=b2, headline='Retrospective on Lennon', pub_date=datetime.date(1990, 6, 1))
        e4.tags.add(t1)
        e4.tags.add(t2)
        e4.save()

    def setUp(self):
        class SearchListView(generics.ListAPIView):
            queryset = Blog.objects.all()
            serializer_class = BlogSerializer
            filter_backends = (filters.SearchFilter,)
            search_fields = (
                '=name',
                'entry__headline',
                '=entry__pub_date__year',
                'entry__tags__name'
            )
        self.SearchListView = SearchListView

    def test_multiple_filter_conditions(self):
        view = self.SearchListView.as_view()
        request = factory.get('/', {'search': 'Lennon,1979'})
        response = view(request)
        assert len(response.data) == 1

    def test_single_filter_condition_manytomany(self):
        view = self.SearchListView.as_view()

        request = factory.get('/', {'search': 'tag1'})
        response = view(request)
        assert len(response.data) == 2

        request = factory.get('/', {'search': 'tag2'})
        response = view(request)
        assert len(response.data) == 2

    def test_multiple_filter_conditions_manytomany(self):
        view = self.SearchListView.as_view()
        request = factory.get('/', {'search': 'tag1,tag2'})
        response = view(request)
        assert len(response.data) == 1

Test failure output

===================================================================================== FAILURES ======================================================================================
______________________________________________________________ SearchFilterToManyTests.test_multiple_filter_conditions ______________________________________________________________
tests/test_filters_multiple.py:77: in test_multiple_filter_conditions
    assert len(response.data) == 1
E   AssertionError: assert 2 == 1
E    +  where 2 = len([OrderedDict([(u'id', 1), ('name', u'Blog 1')]), OrderedDict([(u'id', 2), ('name', u'Blog 2')])])
E    +    where [OrderedDict([(u'id', 1), ('name', u'Blog 1')]), OrderedDict([(u'id', 2), ('name', u'Blog 2')])] = <Response status_code=200, "text/html; charset=utf-8">.data
________________________________________________________ SearchFilterToManyTests.test_multiple_filter_conditions_manytomany _________________________________________________________
tests/test_filters_multiple.py:94: in test_multiple_filter_conditions_manytomany
    assert len(response.data) == 1
E   AssertionError: assert 2 == 1
E    +  where 2 = len([OrderedDict([(u'id', 1), ('name', u'Blog 1')]), OrderedDict([(u'id', 2), ('name', u'Blog 2')])])
E    +    where [OrderedDict([(u'id', 1), ('name', u'Blog 1')]), OrderedDict([(u'id', 2), ('name', u'Blog 2')])] = <Response status_code=200, "text/html; charset=utf-8">.data

---------- coverage: platform darwin, python 2.7.10-final-0 ----------
Coverage XML written to file coverage.xml

================================================================ 2 failed, 1 passed, 1056 deselected in 4.82 seconds ================================================================

Pomax · 2019-12-17T18:39:08Z

I followed your approach and created a separate file with the test introduced for #5264, which fails on 3.6.3 (as expected) and passes on 3.6.4 (as expected), plus my test rewritten to explicitly fetch Entry objects (renamed to EntryRecord also due to name collision) rather than going through blog posts, and that passes on 3.6.3 (finding 1 entry, as expected) but fails on 3.6.4 (finding 0 entries, which was the problem I ran into in our own code base as well).

I have updated the PR accordingly.

Pomax · 2019-12-17T18:43:54Z

Test result for 3.6.3: single expected failure, for the case that #5264 was intended to fix, but an expected pass for manytomany search:

tests/test_filters_multiple.py F..                                                                                                                                                                    [100%]

================================================================================================= FAILURES ==================================================================================================
__________________________________________________________________________ SearchFilterToManyTests.test_multiple_filter_conditions __________________________________________________________________________
tests/test_filters_multiple.py:94: in test_multiple_filter_conditions
    assert len(response.data) == 1
E   AssertionError: assert 2 == 1
E    +  where 2 = len([OrderedDict([(u'id', 1), ('name', u'Post 1')]), OrderedDict([(u'id', 2), ('name', u'Post 2')])])
E    +    where [OrderedDict([(u'id', 1), ('name', u'Post 1')]), OrderedDict([(u'id', 2), ('name', u'Post 2')])] = <Response status_code=200, "text/html; charset=utf-8">.data

---------- coverage: platform darwin, python 2.7.16-final-0 ----------
Coverage XML written to file coverage.xml

============================================================================ 1 failed, 2 passed, 1056 deselected in 4.65 seconds ============================================================================

Test result for 3.6.4: expected pass for the original #5264 fix, but unexpected failure for manytomany search:

tests/test_filters_multiple.py .F.                                                                                                                                                                    [100%]

================================================================================================= FAILURES ==================================================================================================
____________________________________________________________________ SearchFilterToManyTests.test_multiple_filter_conditions_manytomany _____________________________________________________________________
tests/test_filters_multiple.py:111: in test_multiple_filter_conditions_manytomany
    assert len(response.data) == 1
E   AssertionError: assert 0 == 1
E    +  where 0 = len([])
E    +    where [] = <Response status_code=200, "text/html; charset=utf-8">.data

---------- coverage: platform darwin, python 2.7.16-final-0 ----------
Coverage XML written to file coverage.xml

============================================================================ 1 failed, 2 passed, 1086 deselected in 4.73 seconds ============================================================================

This failure persists all the way up to current master.

rpkilby · 2019-12-17T21:15:04Z

Thanks for putting this together. Looking into the issue, I don't think there's a bug. Rather, the previous behavior was incorrect, but gave you your desired result. Admittedly though, it is a little unintuitive for the m2m case.

Ignoring the m2m component for a second, let's say SearchEntryRecordListView filtered on headline instead of authors__name. Given your test data, ?search=Lennon,unrelated would return 0 results, not 4. This is because the headline would need to contain both Lennon and unrelated, not just one or the other.

For the same reason the m2m search is also failing. You're not asking for entries who have separate authors named either Alice or Bob, you're asking for entries whose author name contains both Alice and Bob. e.g., this would match an author named "Bobby Alice".

This makes more sense if you think about searching against both name and age. For example ?search=Bob,30 would search for entries with 30 year old authors named Bob. In v3.6.3, this would have returned entries who have an author named Bob as well as some other 30 year old author.

So generally, more search terms increases specificity, not breadth.

Pomax · 2019-12-17T22:30:34Z

Then how would one go about matching things the intuitive way, based on the description from the documentation?

By default, searches will use case-insensitive partial matches. The search parameter may contain multiple search terms, which should be whitespace and/or comma separated. If multiple search terms are used then objects will be returned in the list only if all the provided terms are matched.

That does not cover how to deal with the ManyToMany case, which is pretty important to cover: authors__name with multiple terms would suggest that each term provided has to match "an author", not that all the terms have to match in a single author (which is counterintuitive to say the least).

rpkilby · 2019-12-18T01:57:16Z

The core issue is that there are multiple ways to query against a to-many relationship. No one way is always correct, but we have to figure out what makes sense for the general case.

Let's use the Blog/Entry models from those Django docs. Ignoring the to-many relationship for now, let's say I have an EntryView setup with search_fields = ['headline', 'pub_date']. If I were to search /entries?search=Lennon,2008, I'd expect the results to contain entries about Lennon that were published in 2008. I wouldn't want my result set to contain entries about Lennon combined with unrelated entries published in 2008 (I realize that the search would match a headline of "Lennon wore a hat in 2008". For the sake of argument, let's assume there are no headlines containing dates).

Now, let's extend that across the Blog relationship. If I have a similar BlogView with search_fields = ['entry__headline', 'entry__pub_date'], I'd expect /blogs?search=Lennon,2008 to give me the same results, but for the related set of blogs. The current implementation does this, while v3.6.3 instead returns blogs that have an entry about Lennon, as well as an entry published in 2008 (about Lennon or otherwise).

Back to the Author/Entry models, the difference is that we're only searching one field, the author name. Given that each item in the result set must match all terms, /authors?search=bob,alice would be nonsensical, since an author can't both be named Bob and Alice (ignoring "Bobby Alice"). Extending that query across the Entry relationship doesn't change that, which is why no results are returned.

Also, as best I can tell, Django admin's search_fields roughly works the same way. Would be worth it to test.

rpkilby · 2019-12-18T02:39:59Z

As to fixing this.. if you're using postgres, you should be able to annotate the concatenated set of author names with ArrayAgg or StringAgg, searching against that. Something like,

Entry.objects \
    .annotate(author_names=StringAgg('authors__name') \
    .filter(author_names__icontains='bob', author_names__icontains='alice')

Pomax · 2019-12-18T17:01:40Z

The problem here is that the docs you link to talk about specifically using filter, which is not what search is for: https://docs.djangoproject.com/en/3.0/topics/db/queries/#spanning-multi-valued-relationships is talking about the kind of thing that you would use title=lennon&year=1979 for when you're using Django Rest Framework. You're not going to use the general purposes search if what you actually want is to have lennon only ever match inside a title, and 1979 only ever match inside a pubdate. If you use search and you get a result that's from 2003 but has "John Lennon; the 1979 report" as title then that is in fact entirely correct and expected, based on the documentation for search_field with multiple terms, taken together with the fact that DRF distinguishes between "filtering" and "searching" through distinct filter_fields and search_fields. If you use search, you are explicitly saying "I am not looking for matches in specific fields, I'm just looking for anything that can match the intersection of these terms, based on any of the search_fields values I've specified".

(And of course, as you point out: getting results mixed in with entries that only match "lennon" or only match 1979, would be entirely wrong)

So a strong clue for the fact that this was an unplanned change, and 3.6.3 was doing exactly what it should have been doing is that there's nothing in the changelog or release notes for 3.6.4 that this behaviour was intentionally changed, despite breaking backward compatiblity, which this test makes abundantly clear.

On the other hand, if this breaking change was expected and desired then the documentation needs to be updated to explain this, and the release notes for 3.6.4 need to be ammended for anyone who will be doing an uplift across 3.6.3/3.6.4 because a ManyToMany relation is fundamentally different from a plain field or ForeignKey, and you can't have a patch version that breaks someone's API, with docs that describe what should happen while the underlying code does something else =)

(I'd much rather see the original functionality restored, of course, because all evidence points to this being an unplanned but breaking change, so should absolutely be fixed, but I'll take better documentation that explains why using search_fields with ManyToMany field with not do what you'd expect it to do, because you'd expect separate terms to be matched in the ManyToMany set, not "all terms need to be matched inside a single value inside that set". That's just plain weird)

It might make sense to get an extra pair of eyes on this PR, because it's possible that the change was entirely unplanned, and in retrospect should have had this test to ensure that the ManyToMany behaviour didn't change without explicitly meaning to, without anyone discovering that until December 16, 20919, with separate work to change the behaviour with respect to ManyToMany relations, tied to its own issue number, with the associated docs updates.

rpkilby · 2019-12-18T20:43:48Z

The problem here is that the docs you link to talk about specifically using filter, which is not what search is for

Both SearchFilter and Django admin's search_fields use filter under the hood, which is why it's relevant.

You're not going to use the general purposes search if what you actually want is to have lennon only ever match inside a title, and 1979 only ever match inside a pubdate.

Right, I was using a simplified description for expediency, as that's what the example query will typically reduce to. I acknowledge that the search query is more complex:

I realize that the search would match a headline of "Lennon wore a hat in 2008". For the sake of argument, let's assume there are no headlines containing dates

It's the same idea as your "John Lennon; the 1979 report" example.

So a strong clue for the fact that this was an unplanned change, and 3.6.3 was doing exactly what it should have been doing is that there's nothing in the changelog or release notes for 3.6.4 that this behaviour was intentionally changed, despite breaking backward compatiblity, which this test makes abundantly clear.

The change in behavior was discussed in the original ticket (#4655 (comment) and #4655 (comment)), but my PR failed to mention it, so wasn't included in the release notes. Definitely an oversight, but the changes were expected.

and you can't have a patch version that breaks someone's API

I don't want to dive into the timeline, but you can look at the timing of comments and events on the original issue & PR to understand how it was added to a patch release, but this also boils down to an oversight. Either way you're correct, and we typically try to avoid adding breaking changes in patch releases.

Pomax · 2019-12-20T17:59:41Z

The change in behavior was discussed in the original ticket (#4655 (comment) and #4655 (comment)), but my PR failed to mention it, so wasn't included in the release notes. Definitely an oversight, but the changes were expected.

Ah, in that case, can the page with the 3.6.* release information still be edited to mention this, with ideally a bit of text that explains how to "fix" ones API when uplifting past 3.6.3, so future folks who get put on uplifting old API servers have a path forward?

(And probably add a bit to the "search_fields with multiple terms" section of the DRF filtering docs that explains in detail what to expect with manytomany and why, with a link that explains how to do what one might intuitively expect to happen instead?)

rpkilby · 2019-12-21T00:24:42Z

I just noticed that the 3.6.4 release notes do mention the behavior change, although it is terse.

Fix SearchFilter to-many behavior/performance. #5264

We can expand the note, but ideally it should remain somewhat terse, and additional explanation can be added to the original PR description. Happy to review a PR on this.

And probably add a bit to the "search_fields with multiple terms" section of the DRF filtering docs that explains in detail what to expect with manytomany and why

Feel free to open a PR on this too. The subject is complicated though, and we probably don't want to insert a lengthy explanation. Probably more of a general warning of "m2m might not behave as you expect. Make sure you understand queries that span to-many relationships and test expectations etc.."

with a link that explains how to do what one might intuitively expect to happen instead?

Again, I'd reiterate the point that our behavior matches the behavior of the Django admin (I should double check, but a quick look at the code indicates that this is correct). I'm also hesitant to recommend any solutions. The ArrayAgg/StringAgg suggestion is specific to postgres, and reverting the behavior change would give the desired result for the multiple tag names example, but give incorrect results for others. Namely,

(And of course, as you point out: getting results mixed in with entries that only match "lennon" or only match 1979, would be entirely wrong)

Pomax · 2019-12-22T18:25:32Z

I won't be opening a PR, because I clearly don't know enough about DRF here, nor its style of documentation, nor its conventions around what to include or exclude. All I can tell you is that as a user of DRF, the documentation over on https://www.django-rest-framework.org/community/release-notes/#36x-series needs something that mentions that 3.6.4 is a breaking update, and what that breaks. If that means an update to the PR comment of #5264 then I will leave it to you to summarize that, as you clearly know far more about this than I do. Or whoever wrote the original PR even.

And it might even make sense to land this PR as well, but after changing the assert for manytomany to 0, so that it passes. That way, now there's a test in place to catch regressions in future versions of DRF. Right now there simply is no test that asserts what the behaviour is supposed to be, which also means people can't find code that confirms whether it should be 0 or 2.

rpkilby · 2020-01-06T21:47:04Z

Thanks for the reply @Pomax. Reopening to track the issues:

Release notes need to more clearly point out the breaking change
Expected m2m behavior should be tested
We might also want to add that vague note pointing users to Django's to-many queries docs.

Pomax · 2020-02-04T00:04:15Z

Would it make sense to file that as a new issue? Then I can update this PR for you so that it passes based on current behaviour, and get checkbox #2 ticket effectively "for free" =)

(with the added benefit of not using a PR as a tracking issue. PRs close when they get merged in ;)

Pomax · 2020-05-27T18:06:21Z

closing this PR in favour of #7351, so it stops showing up in my outstanding PR list.

multiple filter conditions for ManyToMany relations yield an incorrec…

14e8398

…t result

This was referenced Dec 16, 2019

Fix SearchFilter to-many behavior/performance #5264

Merged

[WIP] DRF uplift mozilla/network-pulse-api#585

Closed

relocated multiple condition filter tests to their own file

ba34e6c

rpkilby closed this Dec 17, 2019

rpkilby changed the title ~~multiple filter conditions for ManyToMany relations yield an incorrect result~~ Doc/test SearchFilter m2m behavior Jan 6, 2020

rpkilby reopened this Jan 6, 2020

Pomax mentioned this pull request Feb 4, 2020

uplift DRF to the latest version (epic) mozilla/network-pulse-api#544

Closed

3 tasks

Pomax mentioned this pull request Feb 20, 2020

Fix how we implement ?search=... filtering for multiple tags mozilla/network-pulse-api#611

Closed

Pomax mentioned this pull request May 27, 2020

Document and add test for SearchFilter m2m behavior #7351

Closed

3 tasks

Pomax closed this May 27, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Doc/test SearchFilter m2m behavior #7094

Doc/test SearchFilter m2m behavior #7094

Pomax commented Dec 16, 2019 •

edited

Loading

rpkilby commented Dec 17, 2019

Pomax commented Dec 17, 2019

Pomax commented Dec 17, 2019 •

edited

Loading

rpkilby commented Dec 17, 2019

Pomax commented Dec 17, 2019 •

edited

Loading

rpkilby commented Dec 18, 2019

rpkilby commented Dec 18, 2019

Pomax commented Dec 18, 2019 •

edited

Loading

rpkilby commented Dec 18, 2019

Pomax commented Dec 20, 2019 •

edited

Loading

rpkilby commented Dec 21, 2019

Pomax commented Dec 22, 2019 •

edited

Loading

rpkilby commented Jan 6, 2020

Pomax commented Feb 4, 2020 •

edited

Loading

Pomax commented May 27, 2020

Doc/test SearchFilter m2m behavior #7094

Doc/test SearchFilter m2m behavior #7094

Conversation

Pomax commented Dec 16, 2019 • edited Loading

rpkilby commented Dec 17, 2019

Pomax commented Dec 17, 2019

Pomax commented Dec 17, 2019 • edited Loading

rpkilby commented Dec 17, 2019

Pomax commented Dec 17, 2019 • edited Loading

rpkilby commented Dec 18, 2019

rpkilby commented Dec 18, 2019

Pomax commented Dec 18, 2019 • edited Loading

rpkilby commented Dec 18, 2019

Pomax commented Dec 20, 2019 • edited Loading

rpkilby commented Dec 21, 2019

Pomax commented Dec 22, 2019 • edited Loading

rpkilby commented Jan 6, 2020

Pomax commented Feb 4, 2020 • edited Loading

Pomax commented May 27, 2020

Pomax commented Dec 16, 2019 •

edited

Loading

Pomax commented Dec 17, 2019 •

edited

Loading

Pomax commented Dec 17, 2019 •

edited

Loading

Pomax commented Dec 18, 2019 •

edited

Loading

Pomax commented Dec 20, 2019 •

edited

Loading

Pomax commented Dec 22, 2019 •

edited

Loading

Pomax commented Feb 4, 2020 •

edited

Loading