RFC: Only generate ModelSerializer fields once #7093

bluetech · 2019-12-16T10:37:58Z

Problem

We have several large projects which use DRF serializers. ModelSerializer is a great feature which saves a lot of time and prevents mistakes. However, as the projects and their models grow, the slowness of ModelSerializer starts to show. The usual solution for this problem is "use something else", but we prefer to try and improve ModelSerializer's performance first, if possible.

The following is a somewhat realistic benchmark, comparing serialization of one of our models using a ModelSerializer, a regular Serializer (its field declarations are just the output of str(ModelSerializer())), and a custom hand-written serializer function.

Benchmark

#!/usr/bin/env python
import cProfile
import datetime
from time import perf_counter

import django
import django.core.validators  # referenced explicitly by RegularSerializer.field11
from django.db import models
from django.conf import settings
from django.utils.translation import gettext_lazy as _
from rest_framework import serializers


settings.configure(INSTALLED_APPS=['rest_framework'])
django.setup()


class SomeOtherModel(models.Model):
    class Meta:
        app_label = 'some_app'
        verbose_name = _('SomeOtherModel')
        verbose_name_plural = _('SomeOtherModels')

    id = models.AutoField(
        primary_key=True,
        verbose_name=_('ID')
    )


class SomeModel(models.Model):
    class Meta:
        app_label = 'some_app'
        verbose_name = _('SomeModel')
        verbose_name_plural = _('SomeModels')

    id = models.AutoField(
        primary_key=True,
        verbose_name=_('ID')
    )
    field1 = models.DateTimeField(
        editable=False,
        db_index=True,
        verbose_name=_('Field1'),
    )
    field2 = models.DateTimeField(
        verbose_name=_('Field2'),
    )
    field3 = models.ForeignKey(
        to=SomeOtherModel,
        on_delete=models.PROTECT,
        verbose_name=_('Field3'),
    )
    field4 = models.ForeignKey(
        to=SomeOtherModel,
        on_delete=models.PROTECT,
        verbose_name=_('Field4'),
    )
    field5 = models.DateTimeField(
        blank=True,
        null=True,
        verbose_name=_('Field5'),
    )
    field6 = models.FloatField(
        blank=True,
        null=True,
        verbose_name=_('Field6'),
    )
    field7 = models.FloatField(
        blank=True,
        null=True,
        verbose_name=_('Field7'),
    )
    field8 = models.DateTimeField(
        blank=True,
        null=True,
        verbose_name=_('Field8'),
    )
    field9 = models.FloatField(
        blank=True,
        null=True,
        verbose_name=_('Field9'),
    )
    field10 = models.CharField(
        max_length=50,
        blank=True,
        null=True,
        verbose_name=_('Field10'),
    )
    field11 = models.IntegerField(
        choices=(
            (1, _('Choice1')),
            (2, _('Choice2')),
        ),
        verbose_name=_('Field11'),
    )
    field12 = models.BigIntegerField(
        verbose_name=_('Field12'),
    )
    field13 = models.ForeignKey(
        to=SomeOtherModel,
        blank=True,
        null=True,
        db_index=False,
        on_delete=models.PROTECT,
        verbose_name=_('Field13'),
    )
    field14 = models.BooleanField(
        blank=True,
        null=True,
        verbose_name=_('Field14'),
    )
    field15 = models.BooleanField(
        verbose_name=_('Field15'),
        null=True,
    )
    field16 = models.BooleanField(
        verbose_name=_('Field16'),
        help_text=_('Help'),
        null=True,
    )
    field17 = models.ForeignKey(
        to=SomeOtherModel,
        null=True,
        blank=True,
        db_index=False,
        on_delete=models.PROTECT,
        verbose_name=_('Field17'),
    )
    field18 = models.GenericIPAddressField(
        verbose_name=_('Field18'),
    )
    field19 = models.ForeignKey(
        to=SomeOtherModel,
        blank=True,
        null=True,
        db_index=False,
        on_delete=models.PROTECT,
        verbose_name=_('Field19'),
    )
    field20 = models.ImageField(
        blank=True,
        null=True,
        verbose_name=_('Field20'),
    )
    field21 = models.ForeignKey(
        to=SomeOtherModel,
        blank=True,
        null=True,
        on_delete=models.PROTECT,
        verbose_name=_('Field21'),
    )
    field22 = models.PositiveSmallIntegerField(
        blank=True,
        null=True,
        verbose_name=_('Field22'),
        help_text=_('Help'),
    )


class ModelSerializer(serializers.ModelSerializer):
    class Meta:
        model = SomeModel
        fields = (
            'id',
            'field1',
            'field2',
            'field3',
            'field4',
            'field5',
            'field6',
            'field7',
            'field8',
            'field9',
            'field10',
            'field11',
            'field12',
            'field13',
            'field14',
            'field15',
            'field16',
            'field17',
            'field18',
            'field19',
            'field20',
            'field21',
            'field22',
        )


class RegularSerializer(serializers.Serializer):
    id = serializers.IntegerField(label='ID', read_only=True)
    field1 = serializers.DateTimeField(read_only=True)
    field2 = serializers.DateTimeField()
    field3 = serializers.PrimaryKeyRelatedField(queryset=SomeOtherModel.objects.all())
    field4 = serializers.PrimaryKeyRelatedField(queryset=SomeOtherModel.objects.all())
    field5 = serializers.DateTimeField(allow_null=True, required=False)
    field6 = serializers.FloatField(allow_null=True, required=False)
    field7 = serializers.FloatField(allow_null=True, required=False)
    field8 = serializers.DateTimeField(allow_null=True, required=False)
    field9 = serializers.FloatField(allow_null=True, required=False)
    field10 = serializers.CharField(allow_blank=True, allow_null=True, max_length=50, required=False)
    field11 = serializers.ChoiceField(choices=((1, 'Choice1'), (2, 'Choice2')), validators=[django.core.validators.MinValueValidator(1), django.core.validators.MaxValueValidator(2)])
    field12 = serializers.IntegerField(max_value=9223372036854775807, min_value=-9223372036854775808)
    field13 = serializers.PrimaryKeyRelatedField(allow_null=True, queryset=SomeOtherModel.objects.all(), required=False)
    field14 = serializers.BooleanField(allow_null=True, required=False)
    field15 = serializers.BooleanField(allow_null=True, required=False)
    field16 = serializers.BooleanField(allow_null=True, help_text='Help', required=False)
    field17 = serializers.PrimaryKeyRelatedField(allow_null=True, queryset=SomeOtherModel.objects.all(), required=False)
    field18 = serializers.IPAddressField()
    field19 = serializers.PrimaryKeyRelatedField(allow_null=True, queryset=SomeOtherModel.objects.all(), required=False)
    field20 = serializers.ImageField(allow_null=True, max_length=100, required=False)
    field21 = serializers.PrimaryKeyRelatedField(allow_null=True, queryset=SomeOtherModel.objects.all(), required=False)
    field22 = serializers.IntegerField(allow_null=True, help_text='Help', max_value=32767, min_value=0, required=False)


def custom_serializer(obj):
    return {
        'field1': obj.field1.isoformat(),
        'field2': obj.field2.isoformat(),
        'field3': obj.field3_id,
        'field4': obj.field4_id,
        'field5': obj.field5.isoformat(),
        'field6': obj.field6,
        'field7': obj.field7,
        'field8': obj.field8.isoformat(),
        'field9': obj.field9,
        'field10': obj.field10,
        'field11': obj.field11,
        'field12': obj.field12,
        'field13': obj.field13_id,
        'field14': obj.field14,
        'field15': obj.field15,
        'field16': obj.field16,
        'field17': obj.field17_id,
        'field18': obj.field18,
        'field19': obj.field19_id,
        'field20': obj.field20,
        'field21': obj.field21_id,
        'field22': obj.field22,
    }


dt = datetime.datetime(2019, 12, 16, 11, 30, 0, 0, datetime.timezone.utc)
instance = SomeModel(
    field1 = dt,
    field2 = dt,
    field3_id = 1000,
    field4_id = 1000,
    field5 = dt,
    field6 = 10.20,
    field7 = 10.20,
    field8 = dt,
    field9 = 10.20,
    field10 = 'abcde',
    field11 = 1,
    field12 = 10000,
    field13_id = 1000,
    field14 = True,
    field15 = False,
    field16 = True,
    field17_id = 1000,
    field18 = '10.20.30.40',
    field19_id = 1000,
    field20 = None,
    field21_id = 1000,
    field22 = 10,
)


def run_model():
    ModelSerializer(instance).data


def run_regular():
    RegularSerializer(instance).data


def run_custom():
    custom_serializer(instance)


# cProfile.run('''''', sort='tottime')
start = perf_counter()
for i in range(5000): run_model()
print('model:', perf_counter() - start)
start = perf_counter()
for i in range(5000): run_regular()
print('regular:', perf_counter() - start)
start = perf_counter()
for i in range(5000): run_custom()
print('custom:', perf_counter() - start)

The results (before this PR):

model:   8.9562s
regular: 3.8447s
custom:  0.0409s

From the profile, it is clear that the difference between model and regular is due to ModelSerializer re-generating the model fields each time it is instantiated.

Solution

This PR only generates the fields once, when the class is defined, using a metaclass. With this change, the benchmark results are

model:   3.0143s
regular: 3.8290s
custom:  0.0403s
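
For illustration, here is a minimal framework-free sketch of the approach, not the PR's actual code. A metaclass builds the field mapping once, when the class is defined, and each instance receives a deep copy. `Field`, `CachedFieldsMeta`, `build_fields`, and `BUILD_CALLS` are hypothetical names chosen for this example; `base_fields` is the attribute name discussed later in this thread.

```python
import copy


class Field:
    """Stand-in for a serializer field (hypothetical, not a DRF class)."""

    def __init__(self, name):
        self.name = name


BUILD_CALLS = []  # records every run of the expensive field generation


class CachedFieldsMeta(type):
    """Generate fields once, at class definition time."""

    def __new__(mcs, name, bases, attrs):
        cls = super().__new__(mcs, name, bases, attrs)
        meta = attrs.get("Meta")
        if meta is not None:
            # Runs once per class, not once per instance.
            cls.base_fields = cls.build_fields(meta.fields)
        return cls


class Serializer(metaclass=CachedFieldsMeta):
    base_fields = {}

    @classmethod
    def build_fields(cls, names):
        BUILD_CALLS.append(cls.__name__)
        return {n: Field(n) for n in names}

    def __init__(self):
        # Instances still get private copies, so they can mutate fields freely.
        self.fields = copy.deepcopy(self.base_fields)


class UserSerializer(Serializer):
    class Meta:
        fields = ("id", "email")


for _ in range(1000):
    UserSerializer()

print(BUILD_CALLS)  # ['UserSerializer']
```

Field generation runs a single time even though the serializer is instantiated 1000 times; the per-instance cost that remains is the deep copy.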

The PR contains breaking changes, in that all of the code participating in generating the fields now becomes classmethods/class attributes instead of instance (self) attributes/methods. This is necessary in order to indicate that the code only runs once, and must not use anything from self, because whichever instance happens to trigger it is incidental. Specifically, the following attributes must be defined on the class now (they almost always are, already):

url_field_name
serializer_field_mapping
serializer_related_field
serializer_choice_field
serializer_url_field

and the following methods become classmethods (breaking for ModelSerializer subclasses which override them):

get_field_names
get_default_field_names
build_field
build_standard_field
build_relational_field
build_nested_field
build_property_field
build_url_field
build_unknown_field
include_extra_kwargs
get_extra_kwargs
get_uniqueness_extra_kwargs

Additionally, errors in the ModelSerializer definition (e.g. not setting fields = [...]) are now raised at definition time, not on first instantiation.
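
A small sketch of what "raised at definition time" means in practice, using a hypothetical metaclass rather than DRF's own classes. The error type and message are assumptions for the example; the point is only that the `class` statement itself fails, before any instance exists.

```python
class DefinitionCheckMeta(type):
    """Validate `Meta` when the class statement executes (hypothetical sketch)."""

    def __new__(mcs, name, bases, attrs):
        cls = super().__new__(mcs, name, bases, attrs)
        meta = attrs.get("Meta")
        # Only check classes that declare a Meta; the abstract base has none.
        if meta is not None and not hasattr(meta, "fields"):
            raise AssertionError(f"{name}.Meta must set 'fields'")
        return cls


class BaseSerializer(metaclass=DefinitionCheckMeta):
    pass


def define_broken_serializer():
    # The error surfaces here, while the class body is evaluated,
    # before any instance is created.
    class Broken(BaseSerializer):
        class Meta:
            model = object

    return Broken


try:
    define_broken_serializer()
except AssertionError as exc:
    error_message = str(exc)
else:
    error_message = None

print(error_message)  # Broken.Meta must set 'fields'
```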

The DRF tests pass without any changes, except for adapting to when the errors are raised.

Possible further work

On the deserialization side, I believe the validators are generated every time. They can be cached too.

From profiling the benchmark after the changes, the major remaining slowdown is the copy.deepcopy of the fields on each instantiation. For reference, changing to a shallow copy (of each field) instead of a deepcopy brings the model time to 1.3842s (was attempted already before, #4587). The optimal solution would be to make the fields immutable, thus not requiring copy at all -- but that is a larger breaking change presumably.
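
To make the deep versus shallow copy trade-off concrete, here is a rough stand-alone sketch. The `Field` class and its attributes are hypothetical stand-ins, and the timings it prints are not comparable to the benchmark numbers above. It shows that a shallow copy of each field is cheaper but leaves nested mutable state shared between the copies and the class-level originals.

```python
import copy
import timeit


class Field:
    """Hypothetical field holding some nested configuration."""

    def __init__(self, name):
        self.name = name
        self.validators = [object() for _ in range(3)]
        self.error_messages = {"required": "This field is required."}


base_fields = {f"field{i}": Field(f"field{i}") for i in range(1, 23)}


def deep_copy_fields():
    # Recursively copies every field and everything it references.
    return copy.deepcopy(base_fields)


def shallow_copy_fields():
    # Copies each field object, but nested lists and dicts stay shared.
    return {name: copy.copy(field) for name, field in base_fields.items()}


deep = deep_copy_fields()
shallow = shallow_copy_fields()

deep_time = timeit.timeit(deep_copy_fields, number=200)
shallow_time = timeit.timeit(shallow_copy_fields, number=200)
print(f"deepcopy: {deep_time:.4f}s  shallow: {shallow_time:.4f}s")
```

With shallow copies, mutating `shallow["field1"].validators` would also change the class-level field, which is one reason immutable fields would be the cleaner fix.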

auvipy · 2019-12-16T19:40:42Z

thanks for tackling this!

rpkilby

I'm generally in favor of this PR, left a few comments below.

I think the only major concern is if there are any common use cases that rely on any of the serializer instance attributes. e.g., depending on partial or the context.

rpkilby · 2019-12-16T19:54:32Z

rest_framework/serializers.py

    # Methods for determining the set of field names to include...

-    def get_field_names(self, declared_fields, info):
+    @classmethod
+    def get_field_names(cls, declared_fields, info):


Given that these methods are part of the public API, we can expect that users have overridden them as instance methods. Since we're transitioning to classmethods, we should provide a user-friendly warning informing them of the change. Should be fairly straightforward as:

# this will fail if `get_field_names` is an instance method.
assert inspect.ismethod(cls.get_field_names), "<helpful message>"
# or
assert inspect.ismethod(type(self).get_field_names), "<helpful message>"

This should probably be raised sometime during class creation (possibly the aforementioned metaclass).
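
A runnable sketch of that check, placed in a metaclass so it fires during class creation. `HookCheckMeta`, `CLASSMETHOD_HOOKS`, and the error wording are hypothetical; raising `TypeError` instead of using `assert` is a choice made for this example only. It relies on the fact that a classmethod looked up on the class is a bound method, while a plain instance method looked up on the class is a function.

```python
import inspect

CLASSMETHOD_HOOKS = ("get_field_names",)  # illustrative subset of the hooks


class HookCheckMeta(type):
    def __new__(mcs, name, bases, attrs):
        cls = super().__new__(mcs, name, bases, attrs)
        for hook in CLASSMETHOD_HOOKS:
            # Looked up on the class, a classmethod is a bound method;
            # a plain instance method is just a function.
            if not inspect.ismethod(getattr(cls, hook)):
                raise TypeError(
                    f"{name}.{hook} must be a classmethod; "
                    "wrap the override in @classmethod."
                )
        return cls


class BaseSerializer(metaclass=HookCheckMeta):
    @classmethod
    def get_field_names(cls, declared_fields, info):
        return list(declared_fields)


class Good(BaseSerializer):
    @classmethod
    def get_field_names(cls, declared_fields, info):
        return sorted(declared_fields)


def define_legacy_override():
    # Overrides the hook as an instance method, as pre-change user code would.
    class Legacy(BaseSerializer):
        def get_field_names(self, declared_fields, info):
            return list(declared_fields)

    return Legacy


try:
    define_legacy_override()
except TypeError as exc:
    legacy_error = str(exc)
else:
    legacy_error = None

print(legacy_error)
```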

bluetech · 2019-12-17T09:39:48Z

Thanks for the comments @rpkilby.

I remembered from a previous PR that DRF has a requirement to work even before Django is initialized (checked by tests/importable), that's why I set up the cache on the first instantiation. But now I figure that ModelSerializers are exempt from that, because Models themselves cannot be defined before django is initialized. So I changed to use a metaclass. Now the fields are generated when the ModelSerializer is defined, rather than when it's first instantiated. This also means definition errors are raised immediately, so the tests do need some adjustments - I do think it's the better behavior though.

I changed the name to base_fields. Do note that SerializerMetaclass uses _declared_fields for the declared fields, so it's a bit inconsistent - maybe we want _fields for this?

Regarding the compat concerns, I left it out for now, to get feedback on the "end state" first. But if we agree on this approach, I'll come up with a proposal for the transition. I think we'll want at least one deprecation cycle where the self methods still work.

rpkilby · 2019-12-17T21:50:06Z

> I remembered from a previous PR that DRF has a requirement to work even before Django is initialized (checked by tests/importable), that's why I set up the cache on the first instantiation. But now I figure that ModelSerializers are exempt from that, because Models themselves cannot be defined before django is initialized.

One quick comment - DRF itself needs to be importable, however this does not extend to user-defined serializers. Building fields would only happen in user-defined serializer classes, so these changes don't affect DRF importability.

bluetech · 2019-12-20T14:19:25Z

> On the deserialization side, I believe the validators are generated every time. They can be cached too.

Here is a commit which does this (not for this PR): bluetech@067147c

rsiemens · 2020-03-05T01:36:58Z

@rpkilby @bluetech is there anything that needs to be done to move this PR forward? Is there anything I can do to help it along? Thanks for all the work on this!

rpkilby · 2020-03-06T02:54:31Z

Hi @rsiemens. At this point, it's an issue of bandwidth of the maintenance team. While the changes aren't complicated, they are non-trivial (the public-facing serializer API has been updated). We need to consider things like deprecations for users who have overridden these methods, etc.

I'm adding to the 3.12 milestone to help ensure this is looked at, but no guarantees.

rsiemens · 2020-03-06T15:33:45Z

Thanks for the update! Happy to lend a hand if needed.

tomchristie · 2020-03-09T10:30:11Z

My concern would be that the change footprint may be too big on this. It looks pretty risky.
I can easily see this introducing unexpected breakages, or interacting in an unexpected way with some user code that'd already overridden some field generating behavior.

bluetech · 2020-03-11T14:23:04Z

This definitely carries some breakage risk. I figure such risk is not acceptable at this stage of DRF's life. So I'm going to close this now -- thanks for considering!

FWIW, we use an internal fork of DRF with this patch and some others to improve performance, though it also removes some features we don't use which got in the way. My plan is to make the field instances immutable which would also remove the copy.deepcopy overhead, however I haven't got around to that yet. After that, I expect DRF model serializers to be quite speedy.

rpkilby · 2020-03-25T20:29:55Z

> or interacting in an unexpected way with some user code that'd already overridden some field generating behavior.

I think there are generally two cases to consider:

Serializers that have overridden the existing instance methods.
Serializers that have overridden the methods, but also depend on instance attributes.

For the former, I'm not too concerned. We can detect whether the method is a class or an instance method and raise a warning notifying the user of the change. In a lot of these cases, they probably just need to wrap the method in @classmethod. The only downside is that I don't think there's a clean deprecation path *. We'd just need to loudly raise a helpful error message.

For the latter, there isn't really a good option. If their code is dependent on instance variables, then there isn't an easy migration path to the new classmethods. That said, I'm not sure how common this is, and per-instance field modifications could probably be moved to another serializer method.

* I actually do have an idea for a clean deprecation path, but whether or not it's a good idea remains to be seen.
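
For the second case, a sketch of what moving per-instance modifications "to another serializer method" could look like. The classes below are plain-Python stand-ins, not DRF's; the names `get_fields` and `context` mirror DRF's serializer API, while `base_fields`, `EmployeeSerializer`, and the `is_manager` key are assumptions for the example. Field generation stays class-level, and anything that depends on the instance happens after the copy.

```python
import copy


class Field:
    """Hypothetical stand-in for a serializer field."""

    def __init__(self, read_only=False):
        self.read_only = read_only


class Serializer:
    # Built once per class; nothing here may depend on `self`.
    base_fields = {"id": Field(read_only=True), "email": Field(), "salary": Field()}

    def __init__(self, context=None):
        self.context = context or {}
        self.fields = self.get_fields()

    def get_fields(self):
        # Per-instance hook: start from a private copy of the class-level fields.
        return copy.deepcopy(self.base_fields)


class EmployeeSerializer(Serializer):
    def get_fields(self):
        fields = super().get_fields()
        # The instance-dependent tweak lives here, not in the field builders.
        if not self.context.get("is_manager", False):
            fields.pop("salary")
        return fields


public = EmployeeSerializer()
manager = EmployeeSerializer(context={"is_manager": True})
print(sorted(public.fields), sorted(manager.fields))
# ['email', 'id'] ['email', 'id', 'salary']
```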

mohmyo · 2020-07-09T08:30:28Z

Serialization performance in general, and for ModelSerializer especially, has been unsatisfying for a long time. I think going down this road is an important and great step, even if the road isn't the best one. I think most users would accept a few small road bumps for such a great addition.

carltongibson · 2020-07-09T08:57:32Z

The low risk approach here is to begin this as a third-party package. from speedy import serializers and off you go.
There shouldn't be any reason why this wouldn't be feasible.

Once it's shown to work/be stable there's a case to be made for a change in rest_framework, but...

> This definitely carries some breakage risk. I figure such risk is not acceptable at this stage of DRF's life.

I think that's the key right? (We really can't just break things...)

Only generate ModelSerializer fields once

c2843d0

rpkilby reviewed Dec 16, 2019

Use a metaclass, generate fields on definition time, _fields_cache -> base_fields

2f5186b

Handle missing Meta properly

edc05d0

rpkilby added this to the 3.12 Release milestone Mar 6, 2020

bluetech closed this Mar 11, 2020

oxan mentioned this pull request Jan 5, 2021

How about performance ? oxan/djangorestframework-dataclasses#37

Closed
