Query explosion even on simple models with no included serializers #337

Closed
aidanlister opened this issue Apr 14, 2017 · 79 comments

@aidanlister
Contributor

I love the idea of this module but it seems like it's impossible to build fast APIs with it.

Let's say I have Books and Authors. When I fetch a list of books, even with no "include" specified, I get a query for each author, just to display the FK?

As a broader issue, as Author increases in complexity, developers will add more included_serializers to that serializer, which inadvertently affects my Books viewset with a query explosion.

Could we define a lambda for each included_serializer to alter the query? Anything would be more robust than magic.

@mblayman
Collaborator

@aidanlister thanks for your concern about performance. As a production user of DJA, I want my site to be performant too. The built-in DJA ModelViewSet uses a prefetch_related on querysets to avoid an N+1 query bug when including additional models. Do you have a specific example and/or data to explain the "query explosion" that you're referring to?

FYI, there was a recent bug fix (which I don't think has been released yet) for a bug that was preventing the prefetch_related code from executing. Ideally, a DJA model viewset should only have to do 1 db query for each included model type. If that's not the behavior you're seeing in practice, then maybe there is another bug.
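For readers who are not familiar with the pattern, here is the general Django behaviour being discussed (plain ORM, nothing DJA-specific; a rough sketch that assumes the example app's Comment/Author models and field names):

from example.models import Comment

# N+1 pattern: 1 query for the list, plus 1 extra query per comment
# every time the author FK is followed.
for comment in Comment.objects.all():
    print(comment.author.name)

# Prefetched: 2 queries in total, however many comments there are.
for comment in Comment.objects.all().prefetch_related('author'):
    print(comment.author.name)  # served from the prefetch cache, no extra query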

@aidanlister
Contributor Author

aidanlister commented Apr 14, 2017

@mblayman so load up the example app and generate some fake data:

factories.CommentFactory.bulk_create(50)

Then look at the comments list:

[screenshot, 2017-04-15: the comments list with the debug toolbar showing 5 comments and 15 queries]

We can see there are 5 comments but 15 queries, with no included relationships ... it's doing a query just to display the FK.

This is the latest develop code.

@aidanlister
Contributor Author

aidanlister commented Apr 15, 2017

I've added a failing test:
uptick@66609ae
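For anyone who doesn't want to click through, it's essentially a query-count assertion. A minimal sketch of the idea (the URL, threshold and fixtures here are illustrative; the linked commit has the real test):

import pytest
from django.db import connection
from django.test.utils import CaptureQueriesContext


@pytest.mark.django_db
def test_query_count_no_includes(client):
    # Listing comments without ?include should not cost one query per row.
    with CaptureQueriesContext(connection) as context:
        response = client.get('/comments')
        assert response.status_code == 200
    # A handful of queries (list, paginator count, ...) is fine; 3 * page_size is not.
    assert len(context.captured_queries) <= 5, context.captured_queries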

@mblayman
Collaborator

Thanks @aidanlister, that's really helpful info. I agree that seems wrong for sure.

It seems like the next steps are to track down those excessive queries and find a way to delay evaluation or group things together to reduce the number to where it should be.

@aidanlister
Contributor Author

@mblayman I don't think it's about delaying the evaluation; we don't need a query just to show the FK.

I'll add a series of performance test cases to properly test this, but have you thought about providing a way of specifying the select_related / prefetch_related calls via a lambda or method on the serializer? Forcing it to be explicit seems better than magic.
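To make the idea concrete, something along these lines is what I have in mind (entirely hypothetical: the include_query_hints hook does not exist in DJA today, and the imports assume the example app):

from rest_framework_json_api import serializers
from example.models import Comment


class CommentSerializer(serializers.ModelSerializer):
    included_serializers = {
        'author': 'example.serializers.AuthorSerializer',
    }

    # Hypothetical hook: one callable per relation, so the viewset applies
    # exactly the joins the serializer asks for instead of guessing.
    include_query_hints = {
        'author': lambda qs: qs.select_related('author'),
    }

    class Meta:
        model = Comment
        fields = ('body', 'author', 'entry')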

@mblayman
Collaborator

mblayman commented Apr 15, 2017

Truthfully, I haven't given it a lot of thought. I stepped into a maintainer role less than a couple of months ago. I'm not an expert in all the internal workings of the project, though I would like to understand it better.

Explicit over implicit definitely fits the Python ethos. I suspect that a reasonable default would be needed to maintain backwards compatibility. Adding a method to help control the select_related / prefetch_related aspects of the serializer sounds good.

I'm trying to fire up the example now so I can look at the queries and see where the library is being wasteful.

@mblayman
Collaborator

Ok, I got the example app going, and debug toolbar is showing me that each comment in the comment listing is adding two queries. This is definitely an N+1 bug, because debug toolbar shows the extra queries come from fetching the author and entry of each comment to get the respective IDs. I think the problem is in the example app's viewset code.

With 4 comments, I had 12 queries. Some are unavoidable because of the need to get the count and use the paginator (as you noted in your test). The other 8 came from each of the 4 comments fetching its author and entry.

I went to the CommentViewSet and changed the queryset:

    queryset = Comment.objects.all().select_related('author', 'entry')

After this change, the query count dropped to 5 and stayed there even when I added more comments.

I do see the possibility of optimizing the model serializer to use entry_id instead of entry.id (for example). That would be a nice benefit: it avoids going to the entry db table entirely, since the ID is already stored in the foreign key column of the comment model. I think this pattern could generalize.
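For the record, this is the plain Django behaviour that makes the entry_id approach attractive (a sketch against the example app's Comment model):

comment = Comment.objects.first()  # 1 query, loads the comment row only

comment.entry.id   # follows the FK: loads the whole Entry row first (extra query)
comment.entry_id   # reads the FK column already on the comment (no query at all)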

@aidanlister
Contributor Author

Ahhhh, yes you're right, it's the author.id that's causing it to blow out. If we can fix that, that solves a good chunk of my issue.

@aidanlister
Contributor Author

Looks like line 120 in renderers.py,
resolved, relation_instance = utils.get_relation_instance(resource_instance, source, field.parent)

?

@aidanlister
Contributor Author

Actually I take that back ... the renderer has no effect.

Line 143: the JSON API ModelSerializer has this:
serializer_related_field = ResourceRelatedField

Somewhere in ResourceRelatedField ...

@aidanlister
Contributor Author

So, on ResourceRelatedField, use_pk_only_optimization() is overridden to return False:

class ResourceRelatedField(PrimaryKeyRelatedField):
    def use_pk_only_optimization(self):
        # We need the real object to determine its type...
        return False

This means that this method from the parent class (PrimaryKeyRelatedField) runs:

    def get_attribute(self, instance):
        if self.use_pk_only_optimization() and self.source_attrs:
            # Optimized case, return a mock object only containing the pk attribute.
            try:
                instance = get_attribute(instance, self.source_attrs[:-1])
                value = instance.serializable_value(self.source_attrs[-1])
                if is_simple_callable(value):
                    # Handle edge case where the relationship `source` argument
                    # points to a `get_relationship()` method on the model
                    value = value().pk
                return PKOnlyObject(pk=value)
            except AttributeError:
                pass

        # Standard case, return the object instance.
        return get_attribute(instance, self.source_attrs)

The optimized case is what ordinarily stops the query explosion: it creates a mock object (PKOnlyObject) with only the PK set, so there's no object lookup.
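For reference, PKOnlyObject in DRF is little more than a holder for the pk (paraphrased from rest_framework/relations.py), which is why the optimized path never has to touch the related table:

class PKOnlyObject(object):
    # Mock object used when we only need the pk of a related instance but
    # still want something with a .pk attribute, like a real model instance.
    def __init__(self, pk):
        self.pk = pk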

@jsenecal looks like you made the commit two years ago, any insight?

23d0af6

@aidanlister
Contributor Author

I think I'm getting there ... this brings the query count down, but I had to kill some functionality.

uptick@94f79c4

@aidanlister
Contributor Author

I think I've gone as far as I can go with this ...

This is a simple pattern I've used for managing the queryset based on what is included; I think this makes sense:

class CommentViewSet(viewsets.ModelViewSet):
    queryset = Comment.objects.all()
    serializer_class = CommentSerializer

    def get_queryset(self):
        qs = super().get_queryset()
        if 'author' in self.request.GET.get('include', '').split(','):
            qs = qs.prefetch_related('author')
        return qs
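With that in place, GET /comments stays at the base query count, and GET /comments?include=author only adds a single prefetch query for the authors, regardless of the page size (assuming the per-row FK lookups above are also fixed).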

The other location I mentioned before was also causing an explosion; the simplest fix is removing the get_links functionality:

field_links = field.get_links(resource_instance)

@aidanlister
Contributor Author

aidanlister commented Apr 15, 2017

This is where I've finished up:

https://github.com/django-json-api/django-rest-framework-json-api/compare/develop...ABASystems:hotfix/performance?expand=1

I've eliminated the query explosion for the simple case ... I've done this by switching off the links at the settings level, and by passing the type of the object around rather than expecting the full instance to be available.

Rather than disabling the links at the settings level, we should be able to skip fetching them when the resource type isn't in the include list, but I didn't get that far.

On the more complicated example, comments?include=author, the bug manifests itself again with a query for each AuthorBio just to display the FK ...

For now, I'm giving up, I don't get the feeling that this is easily solvable.

@jerel @jsenecal any insight onto how I might fix this would be much appreciated.

@aidanlister
Contributor Author

@mblayman have you had a chance to take a look at this yet?

@mblayman
Collaborator

No, I stopped looking at it. I thought the conversation reached a conclusion for your case, and you were reaching out to the other devs. I don't see anything directly actionable with this issue. The theme of this issue seems to be "find a way to use fewer queries." While I totally agree with the goal, without a PR and some code that attempts to reduce the query count, I don't see what we could do next with this issue.

@aidanlister
Contributor Author

@mblayman ah, I assumed when you said you were the maintainer that you meant the package was actively maintained.

The package has an m*(n+1) bug, where m is the number of relations a model has and n is the page size, making it completely unsuitable for any production environment or large data sets.

So I guess I'm worried if the package isn't being actively maintained we might need to rethink our API strategy.

@mblayman
Collaborator

As a community effort, this project depends on contributions from the community. It is actively maintained and had a release on Friday for 2.2 which includes many months of contributions from a wide variety of individuals.

I respectfully disagree with your conclusion that it is "completely unsuitable for any production environment or large data sets" (emphasis added). Aside from being hyperbolic (which I don't think is fruitful for technical conversation), I see this claim as false, knowing that I and many others have deployed DJA in a production setting.

IMO, the problem you are reporting is a limitation that is not unique to DJA. The need to select an appropriate queryset applies to DJA, to DRF, and to the Django underneath them. I think this is an area where Django (and Python in general) prefers being explicit over implicit. Rather than trying to intuit the appropriate select_related additions, it is up to the developer to include them in the queryset. This strategy prevents code from over-selecting from other tables and bloating a db query. I thought I demonstrated that as I walked through how to eliminate the N+1 query bug from the example app. If I failed to make that clear enough, I apologize.

I think your comment from a few days ago (#337 (comment)) gets at the heart of the piece that can be solved by a PR. If someone could find a way to make sure the ID is fetched from model_id instead of model.id, that would be awesome. I'd be happy to review a PR fixing this issue; however, I don't have the time to fix it myself.

@jsenecal
Member

jsenecal commented Apr 23, 2017 via email

@jsenecal
Member

jsenecal commented Apr 23, 2017 via email

@aidanlister
Contributor Author

aidanlister commented Apr 26, 2017

@jsenecal I'm not sure who's complaining about the package? Several of our developers have made contributions to this package over the last six months, and I'm very glad it's available. In fact, our product depends on it.

That's why I'm pretty worried that a critical issue like this was met with "I don't see anything directly actionable with this issue" ... perhaps my comment was a bit hyperbolic, but if I re-state it as "completely unsuitable for any production environment with large data sets" you might understand where I'm coming from.

Here's a scenario:

Suppose you have a single model, e.g. Book, which has four relations, e.g. Author, Publisher, CopyrightHolder, and Category.

To display 25 books in DRF without any includes, I would need a single query: SELECT * FROM book.

To display 25 books in DRF-JSONAPI without any includes, I would need either:
a) 1 query, à la SELECT * FROM books LEFT JOIN author LEFT JOIN publisher LEFT JOIN CopyrightHolder LEFT JOIN Category, or
b) 4 additional queries using prefetch_related.

Let's say I have 1M books, 50k authors, 10k categories, and 10k copyright holders. In the select_related scenario, you've just created an in-memory table with something like 1e18 rows ... do this a few times per second and you have melted your database. All to display 25 rows, with no included relationships. So select_related is only going to work if you have a small dataset or a small volume of traffic.

... so maybe we can do the prefetch option. Sure, now we need 4x more queries, but at least they don't cause the database to melt down.

However, this doesn't scale with the app complexity. If you want to include the Author, Category and CopyrightHolder with the books, then you have to prefetch any relationships those models might have! Just to display the FK!

You end up writing your ViewSets like this:

class BookViewSet(ModelViewSet):
    queryset = Book.objects.all().prefetch_related(
        'book__author__tags',
        'book__author__client__tags',
        'book__author__client__primary_contact',
        'book__author__client__billingcard',
        'book__author__client__account',
        'book__copyrightholder__client__account_manager',
        'book__copyrightholder__client__clientgroup__accounts',
        'book__copyrightholder__branch__groups',
        'book__copyrightholder__branch__user_set',
        'book__publisher__address',
        'book__publisher__servicecontract_set',
        'book__publisher__account_manager',
        'book__property__billingcard',
        'book__service_group',
        'book__type__standard',
        'book__category_set__type',
        'book__type__categorytype_set',
    )

This is not a contrived example. That's an actual ModelViewSet we have (though I had to slightly obfuscate the model names).

Once you have even a moderately complex app, it becomes impossible to stop query explosions as new relationships get added -- every API that might include that model needs to be updated.

So while I agree with you that my earlier comment was to some extent hyperbole, it depends on your definition of production, and if this isn't fixable then we're in deep shit and need to rethink our API strategy (which is of course, not your problem).

@aidanlister
Contributor Author

@mblayman you talk about trying to serialize entry_id rather than entry.id, but to do that you need to understand how the serializer works; take a look at how DRF itself solves this issue with use_pk_only_optimization and PKOnlyObject.

@jsenecal have you got any time to help our team fix this? I've gone as far as I can. Surely we can somehow get the type from the relationship itself (not the model) and pass that along to be used by the downstream serializer? I tried here... uptick@94f79c4

Then maybe we can avoid doing work on models that aren't included, e.g. get_links shouldn't be called all the way down the stack ... uptick@94f79c4#diff-e2e58e3e5c345fc3b14c160b3a00a64cL119

If that can't be done easily, would it be reasonable to create a fork of this package as a "fast mode" which aims to implement less of the spec, but does it in a reasonably performant way?

@mblayman
Collaborator

mblayman commented Apr 26, 2017

First, please allow me to apologize. Sometimes text really sucks as a communication format. 😄 While I wrote "I don't see anything directly actionable with this issue," it would have been more accurate to write "I see a few performance problems here and I don't know which of them we can/should focus on specifically."

Second, what is the outcome that you're hoping for? It's pretty clear you're looking for a fix to these performance problems. You have provided a heroic amount of data and I appreciate that. I think the remaining problem is that you're hoping one of the maintainers will fix the performance issues. That's where I think this issue is crashing into the waves of open source. A quick peek at the contributor graph will reveal that the biggest contributors have not made code contributions in over 6 months (jerel's most recent stuff was doing the release work for 2.2). I was trying to be truthful calling the project "active." It is active in the sense that PRs get reviewed and merged and issues get commented on. It is not active in the sense that some person(s) is working on the reported issues with any regularity.

I think that leads us back to a pull request. From what you've written, it seems like your team is acutely feeling the pain of performance issues in your business. You've demonstrated really great knowledge of what the problem is and where it occurs. I will happily review any PRs that start to fix these performance issues. That's as much time as I can personally offer as this is a side project for me and no one is paying me to do any of this. 😄

Thanks for your efforts so far.

@jsenecal
Member

jsenecal commented Apr 26, 2017 via email

@aidanlister
Contributor Author

@mblayman my goals were:

a) have a maintainer see the issue and fix it
b) if there's no maintainer time then get a point in the right direction
c) failing that, ensure that if we do commit the time and money to fixing the issue that it's a sensible investment, e.g. that the package is maintained and the maintainers are reasonable.

We've decided to go ahead and put some more senior developer time into fixing the issue, but I would like to have one of our team added as a maintainer at some point, either @bor3ham, @jarekwg, or @sksavant, once you're comfortable with their contributions and maintainer style.

@sksavant will follow up on this issue if he's able to come up with a solution.

@mblayman
Collaborator

mblayman commented May 2, 2017

Awesome! I'm excited to have other folks involved and look forward to any future contributions.

@emojiface-ops

@aidanlister I appreciate your work on this issue; I'm curious whether you've gotten any traction fixing the performance problems.

@aidanlister
Contributor Author

Yes, actually the work I've done in this thread covers most of the issue; we expect to have a new production version up in the next four weeks.

@emojiface-ops

@aidanlister thanks for the quick response, I'll check out your fork. Are you planning to commit the "production version" on your fork?

@aidanlister
Contributor Author

Yes, it should go through to our fork when it's finished ... and I'll submit a PR when we finish testing.

@aidanlister
Contributor Author

Looks great! Let's get it merged into develop and I'll rebase my WIP to look at the two parts of the query explosion.

@aidanlister
Contributor Author

@jsenecal it would be amusing if we both caused and then later reported this explosion ... that commit you identified came from @jarekwg and @bor3ham (both from Uptick/ABAS). Here's the repository that PR came from:

https://github.com/jarekwg/django-rest-framework-json-api/commits/optimize/relationship-extraction

jsenecal added a commit that referenced this issue Jul 22, 2017
@jsenecal
Member

I will now sign off, but we need to open a separate issue as this is not yet completely resolved (we still have an n*relations query explosion if the queryset does not use prefetch_related).

@aidanlister
Contributor Author

👍

@aidanlister
Contributor Author

Keep this issue open? The PR is good but doesn't actually address this yet :)

@jsenecal
Member

I guess you are right - let's keep track of the discussion here

@aidanlister
Contributor Author

Here's my progress on the explosion now:
develop...ABASystems:wip/find-query-explosions

Note to anyone else playing along: you can run the project with django-admin.py shell --settings=example.settings.dev and run the tests with py.test -s -k test_performance

This code gets the first test (test_query_count_no_includes) to pass, which is great ... but it breaks on a simple include (test_query_count_include_author), which I haven't been able to debug yet (it seems to start from extract_included)

@xgilest

xgilest commented Jul 24, 2017

Hi, @mcanaves is working on a solution for the problem of the renderer querying the database just to check that the ids of foreign keys exist.
As stated earlier, this meant for us that the API was too slow to be operational; hopefully, with the great effort made by the community lately, this will change.

@aidanlister
Contributor Author

How's everyone going on this? Harder than it looks huh!

@jsenecal
Member

jsenecal commented Jul 29, 2017 via email

@gregorypease280
Contributor

Anyone still actively working on this?

@aidanlister
Contributor Author

No, not as far as I am aware. I wasn't able to make progress on it, nor was @sksavant ... it's just whack-a-mole, pushing the problem down the line.

DRF / JSON API has gotten so slow for us we're starting to think about something like Go, https://github.com/bor3ham/reja

@santiavenda2
Contributor

Hi guys,

Some time ago we did this PR trying to fix some of the problems you mentioned during this conversation.

Let me know if the changes make sense and if they are ready to be merged.

All tests are running fine and we have been using this version in production without any problems.

@mblayman
Collaborator

Ack, my bad, @santiavenda2. I missed that the PR posted was related to this. I'll try to get it reviewed soon (hopefully this weekend if I have some free time). After all the discussion that happened here, I certainly did not intentionally overlook the PR. Thanks for the poke.

@aidanlister
Contributor Author

@santiavenda2 neat, we will take a look too. Anything we should watch out for?

@santiavenda2
Contributor

santiavenda2 commented Dec 14, 2017

@aidanlister yes, these are the main changes
https://github.com/django-json-api/django-rest-framework-json-api/pull/374/files#diff-e2e58e3e5c345fc3b14c160b3a00a64cR127
https://github.com/django-json-api/django-rest-framework-json-api/pull/374/files#diff-e2e58e3e5c345fc3b14c160b3a00a64cR259
https://github.com/django-json-api/django-rest-framework-json-api/pull/374/files#diff-8f2fdb5cb1d819dce410d6b74d0b77b6R172

We tried to make two improvements:

  • avoid fetching related objects when we already have the id of the related object and know its type.
  • avoid fetching related objects when the related object is None.

You can see these improvements in this viewset.
As you can see, the select_related and some of the prefetch_for_includes entries are not needed any more.

Other changes are just code quality improvements and fixes for compatibility with older Django versions.
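In other words, the renderer can usually get both the id and the type of a to-one relation without touching the related table. A minimal illustration of the idea against the example app's models (not the actual code in the PR):

from example.models import Comment

comment = Comment.objects.first()                                # 1 query

related_id = comment.entry_id                                    # no extra query
related_model = Comment._meta.get_field('entry').related_model   # the Entry class, no query
# The JSON API "type" can be derived from related_model, so we never have to
# load the Entry row just to find out what it is.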

@mblayman
Collaborator

mblayman commented Jan 9, 2018

With the work done in #374, do we have enough in place to call this issue fixed? Are there areas where DJA still issues too many queries?

@sliverc
Member

sliverc commented Jun 8, 2018

Query explosion on simple models without inclusions now looks fine to me. I do still see issues when using included serializers, though.

views.AutoPrefetchMixin does the lookup for directly included relationships. Such a relationship can itself have reverse relationships, though, and each of those still executes a query. I currently use views.PrefetchForIncludesHelperMixin as a workaround, but AutoPrefetchMixin should prefetch reverse included relationships automatically.
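For anyone hitting the same thing, the workaround looks roughly like this (the include keys and lookup paths are illustrative, based on the example app):

from rest_framework import viewsets
from rest_framework_json_api import views

from example.models import Comment
from example.serializers import CommentSerializer


class CommentViewSet(views.PrefetchForIncludesHelperMixin, viewsets.ModelViewSet):
    queryset = Comment.objects.all()
    serializer_class = CommentSerializer

    prefetch_for_includes = {
        '__all__': ['author', 'entry'],
        'author': ['author', 'author__bio'],
    }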

@timcbaoth
Contributor

I stumbled upon this thread when I noticed a query explosion in a project at work. I found a bug in https://github.com/django-json-api/django-rest-framework-json-api/pull/374/files#diff-e2e58e3e5c345fc3b14c160b3a00a64cR127 which prevented the optimization of not querying forward relations from being applied.

I will provide a PR shortly, if that is OK.

@sliverc
Member

sliverc commented Apr 18, 2021

Most of the concerns in this issue have actually been addressed. There are still ways to reduce the number of queries, but at this point in the thread it is not clear what more needs to be done.

So I am closing this issue and forwarding the conversation to discussion #921; I encourage everyone with an idea for reducing the number of queries to engage there. Out of that discussion, concrete new issues can then be extracted to work on.
