RFC: Only generate ModelSerializer fields once #7093

Closed
wants to merge 3 commits into from

Contributor

@bluetech bluetech commented Dec 16, 2019

Problem

References #2504, #5614.

We have several large projects which use DRF serializers. ModelSerializers are a great feature which saves a lot of time and mistakes. However, as the projects grow and the models grow, the slowness of ModelSerializer starts to show. The usual solution for this problem is "use something else", but we prefer to try and improve ModelSerializer's performance first, if possible.

The following is a somewhat realistic benchmark comparing serialization of one of our models using a ModelSerializer, a regular Serializer (its fields are just the output of str(ModelSerializer())), and a custom hand-written serializer function.

Benchmark
#!/usr/bin/env python
import cProfile
import datetime
from time import perf_counter

import django
import django.core.validators  # referenced explicitly by RegularSerializer.field11
from django.db import models
from django.conf import settings
from django.utils.translation import gettext_lazy as _
from rest_framework import serializers


settings.configure(INSTALLED_APPS=['rest_framework'])
django.setup()


class SomeOtherModel(models.Model):
    class Meta:
        app_label = 'some_app'
        verbose_name = _('SomeOtherModel')
        verbose_name_plural = _('SomeOtherModels')

    id = models.AutoField(
        primary_key=True,
        verbose_name=_('ID')
    )


class SomeModel(models.Model):
    class Meta:
        app_label = 'some_app'
        verbose_name = _('SomeModel')
        verbose_name_plural = _('SomeModels')

    id = models.AutoField(
        primary_key=True,
        verbose_name=_('ID')
    )
    field1 = models.DateTimeField(
        editable=False,
        db_index=True,
        verbose_name=_('Field1'),
    )
    field2 = models.DateTimeField(
        verbose_name=_('Field2'),
    )
    field3 = models.ForeignKey(
        to=SomeOtherModel,
        on_delete=models.PROTECT,
        verbose_name=_('Field3'),
    )
    field4 = models.ForeignKey(
        to=SomeOtherModel,
        on_delete=models.PROTECT,
        verbose_name=_('Field4'),
    )
    field5 = models.DateTimeField(
        blank=True,
        null=True,
        verbose_name=_('Field5'),
    )
    field6 = models.FloatField(
        blank=True,
        null=True,
        verbose_name=_('Field6'),
    )
    field7 = models.FloatField(
        blank=True,
        null=True,
        verbose_name=_('Field7'),
    )
    field8 = models.DateTimeField(
        blank=True,
        null=True,
        verbose_name=_('Field8'),
    )
    field9 = models.FloatField(
        blank=True,
        null=True,
        verbose_name=_('Field9'),
    )
    field10 = models.CharField(
        max_length=50,
        blank=True,
        null=True,
        verbose_name=_('Field10'),
    )
    field11 = models.IntegerField(
        choices=(
            (1, _('Choice1')),
            (2, _('Choice2')),
        ),
        verbose_name=_('Field11'),
    )
    field12 = models.BigIntegerField(
        verbose_name=_('Field12'),
    )
    field13 = models.ForeignKey(
        to=SomeOtherModel,
        blank=True,
        null=True,
        db_index=False,
        on_delete=models.PROTECT,
        verbose_name=_('Field13'),
    )
    field14 = models.BooleanField(
        blank=True,
        null=True,
        verbose_name=_('Field14'),
    )
    field15 = models.BooleanField(
        verbose_name=_('Field15'),
        null=True,
    )
    field16 = models.BooleanField(
        verbose_name=_('Field16'),
        help_text=_('Help'),
        null=True,
    )
    field17 = models.ForeignKey(
        to=SomeOtherModel,
        null=True,
        blank=True,
        db_index=False,
        on_delete=models.PROTECT,
        verbose_name=_('Field17'),
    )
    field18 = models.GenericIPAddressField(
        verbose_name=_('Field18'),
    )
    field19 = models.ForeignKey(
        to=SomeOtherModel,
        blank=True,
        null=True,
        db_index=False,
        on_delete=models.PROTECT,
        verbose_name=_('Field19'),
    )
    field20 = models.ImageField(
        blank=True,
        null=True,
        verbose_name=_('Field20'),
    )
    field21 = models.ForeignKey(
        to=SomeOtherModel,
        blank=True,
        null=True,
        on_delete=models.PROTECT,
        verbose_name=_('Field21'),
    )
    field22 = models.PositiveSmallIntegerField(
        blank=True,
        null=True,
        verbose_name=_('Field22'),
        help_text=_('Help'),
    )


class ModelSerializer(serializers.ModelSerializer):
    class Meta:
        model = SomeModel
        fields = (
            'id',
            'field1',
            'field2',
            'field3',
            'field4',
            'field5',
            'field6',
            'field7',
            'field8',
            'field9',
            'field10',
            'field11',
            'field12',
            'field13',
            'field14',
            'field15',
            'field16',
            'field17',
            'field18',
            'field19',
            'field20',
            'field21',
            'field22',
        )


class RegularSerializer(serializers.Serializer):
    id = serializers.IntegerField(label='ID', read_only=True)
    field1 = serializers.DateTimeField(read_only=True)
    field2 = serializers.DateTimeField()
    field3 = serializers.PrimaryKeyRelatedField(queryset=SomeOtherModel.objects.all())
    field4 = serializers.PrimaryKeyRelatedField(queryset=SomeOtherModel.objects.all())
    field5 = serializers.DateTimeField(allow_null=True, required=False)
    field6 = serializers.FloatField(allow_null=True, required=False)
    field7 = serializers.FloatField(allow_null=True, required=False)
    field8 = serializers.DateTimeField(allow_null=True, required=False)
    field9 = serializers.FloatField(allow_null=True, required=False)
    field10 = serializers.CharField(allow_blank=True, allow_null=True, max_length=50, required=False)
    field11 = serializers.ChoiceField(choices=((1, 'Choice1'), (2, 'Choice2')), validators=[django.core.validators.MinValueValidator(1), django.core.validators.MaxValueValidator(2)])
    field12 = serializers.IntegerField(max_value=9223372036854775807, min_value=-9223372036854775808)
    field13 = serializers.PrimaryKeyRelatedField(allow_null=True, queryset=SomeOtherModel.objects.all(), required=False)
    field14 = serializers.BooleanField(allow_null=True, required=False)
    field15 = serializers.BooleanField(allow_null=True, required=False)
    field16 = serializers.BooleanField(allow_null=True, help_text='Help', required=False)
    field17 = serializers.PrimaryKeyRelatedField(allow_null=True, queryset=SomeOtherModel.objects.all(), required=False)
    field18 = serializers.IPAddressField()
    field19 = serializers.PrimaryKeyRelatedField(allow_null=True, queryset=SomeOtherModel.objects.all(), required=False)
    field20 = serializers.ImageField(allow_null=True, max_length=100, required=False)
    field21 = serializers.PrimaryKeyRelatedField(allow_null=True, queryset=SomeOtherModel.objects.all(), required=False)
    field22 = serializers.IntegerField(allow_null=True, help_text='Help', max_value=32767, min_value=0, required=False)


def custom_serializer(obj):
    return {
        'field1': obj.field1.isoformat(),
        'field2': obj.field2.isoformat(),
        'field3': obj.field3_id,
        'field4': obj.field4_id,
        'field5': obj.field5.isoformat(),
        'field6': obj.field6,
        'field7': obj.field7,
        'field8': obj.field8.isoformat(),
        'field9': obj.field9,
        'field10': obj.field10,
        'field11': obj.field11,
        'field12': obj.field12,
        'field13': obj.field13_id,
        'field14': obj.field14,
        'field15': obj.field15,
        'field16': obj.field16,
        'field17': obj.field17_id,
        'field18': obj.field18,
        'field19': obj.field19_id,
        'field20': obj.field20,
        'field21': obj.field21_id,
        'field22': obj.field22,
    }


dt = datetime.datetime(2019, 12, 16, 11, 30, 0, 0, datetime.timezone.utc)
instance = SomeModel(
    field1=dt,
    field2=dt,
    field3_id=1000,
    field4_id=1000,
    field5=dt,
    field6=10.20,
    field7=10.20,
    field8=dt,
    field9=10.20,
    field10='abcde',
    field11=1,
    field12=10000,
    field13_id=1000,
    field14=True,
    field15=False,
    field16=True,
    field17_id=1000,
    field18='10.20.30.40',
    field19_id=1000,
    field20=None,
    field21_id=1000,
    field22=10,
)


def run_model():
    ModelSerializer(instance).data


def run_regular():
    RegularSerializer(instance).data


def run_custom():
    custom_serializer(instance)


# cProfile.run('''''', sort='tottime')
start = perf_counter()
for _ in range(5000):
    run_model()
print('model:', perf_counter() - start)
start = perf_counter()
for _ in range(5000):
    run_regular()
print('regular:', perf_counter() - start)
start = perf_counter()
for _ in range(5000):
    run_custom()
print('custom:', perf_counter() - start)

The results (before this PR):

model:   8.9562s
regular: 3.8447s
custom:  0.0409s

From the profile, it is clear that the difference between model and regular is due to ModelSerializer re-generating the model fields each time it is instantiated.

Solution

This PR generates the fields only once, when the class is defined, using a metaclass. With this change, the benchmark results are:

model:   3.0143s
regular: 3.8290s
custom:  0.0403s
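
The idea of generating fields once per class can be sketched in plain Python, without Django or DRF. This is only an illustration of the technique, not the PR's code: `Field`, `CachedFieldsMeta`, `CachedSerializer`, `build_fields` and `build_count` are hypothetical names, and only `base_fields` echoes a name used later in this thread.

```python
import copy


class Field:
    """Stand-in for a serializer field (hypothetical, not DRF's Field)."""

    def __init__(self, source):
        self.source = source


class CachedFieldsMeta(type):
    """Build the field mapping once, when the class statement runs."""

    build_count = 0  # counts how often fields are generated, for illustration

    def __new__(mcs, name, bases, attrs):
        cls = super().__new__(mcs, name, bases, attrs)
        meta = attrs.get('Meta')
        if meta is not None:
            # Expensive model introspection would happen here, once per class.
            cls.base_fields = cls.build_fields(meta.fields)
            CachedFieldsMeta.build_count += 1
        return cls


class CachedSerializer(metaclass=CachedFieldsMeta):
    @classmethod
    def build_fields(cls, names):
        return {name: Field(source=name) for name in names}

    def __init__(self, instance):
        self.instance = instance
        # Each instance still gets its own copy, so it can mutate fields safely.
        self.fields = copy.deepcopy(self.base_fields)


class PointSerializer(CachedSerializer):
    class Meta:
        fields = ('x', 'y')


first = PointSerializer(instance=None)
second = PointSerializer(instance=None)
```

In this sketch, creating two instances triggers only one call to `build_fields`. The remaining per-instance cost is the `copy.deepcopy`.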

The PR contains breaking changes, in that all of the code involved in generating the fields now becomes classmethods/class attributes instead of instance methods/attributes. This is necessary to indicate that the code only runs once and must not use anything from self, because which instance it would run on is incidental. Specifically, the following attributes must now be defined on the class (they almost always are already):

url_field_name
serializer_field_mapping
serializer_related_field
serializer_choice_field
serializer_url_field

and the following methods become classmethods (breaking for ModelSerializer subclasses which override them):

get_field_names
get_default_field_names
build_field
build_standard_field
build_relational_field
build_nested_field
build_property_field
build_url_field
build_unknown_field
include_extra_kwargs
get_extra_kwargs
get_uniqueness_extra_kwargs
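
For subclasses that override one of the methods listed above, the migration would typically look like the following sketch. It uses a hypothetical pure-Python `BaseSerializer` rather than DRF itself, and assumes the override does not read anything from `self`. The `get_field_names(cls, declared_fields, info)` signature is taken from the diff in this PR.

```python
class BaseSerializer:
    """Hypothetical base class mirroring the PR's classmethod signature."""

    @classmethod
    def get_field_names(cls, declared_fields, info):
        return list(info)


class OldStyleOverride(BaseSerializer):
    # Before: an instance method. Calling it on the class raises TypeError.
    def get_field_names(self, declared_fields, info):
        return [name for name in info if name != 'secret']


class NewStyleOverride(BaseSerializer):
    # After: the same logic, wrapped in @classmethod and using cls.
    @classmethod
    def get_field_names(cls, declared_fields, info):
        names = super().get_field_names(declared_fields, info)
        return [name for name in names if name != 'secret']


names = NewStyleOverride.get_field_names({}, ['id', 'secret'])

try:
    OldStyleOverride.get_field_names({}, ['id', 'secret'])
except TypeError:
    old_style_fails = True
else:
    old_style_fails = False
```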

Additionally, errors in the ModelSerializer definition (e.g. not setting fields = [...]) are now raised at definition time, not on first instantiation.

The DRF tests pass without any changes, except for adapting to when the errors are raised.

Possible further work

On the deserialization side, I believe the validators are generated every time. They can be cached too.

From profiling the benchmark after the changes, the major remaining slowdown is the copy.deepcopy of the fields on each instantiation. For reference, changing to a shallow copy (of each field) instead of a deepcopy brings the model time to 1.3842s (this was already attempted in #4587). The optimal solution would be to make the fields immutable, thus not requiring a copy at all -- but that is presumably a larger breaking change.
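
The trade-off between the two copying strategies can be illustrated with a small pure-Python sketch. `Field`, `base_fields` and the two helper functions are hypothetical stand-ins, not DRF code. The point is that a shallow copy of each field creates new field objects but shares their nested state, which is cheaper but only safe if that nested state is never mutated per instance.

```python
import copy
from timeit import timeit


class Field:
    """Stand-in field with some nested state (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.validators = [f'{name}-validator-{i}' for i in range(5)]


base_fields = {f'field{i}': Field(f'field{i}') for i in range(1, 23)}


def deep_copy_fields():
    # New dict, new Field objects, and new nested lists.
    return copy.deepcopy(base_fields)


def shallow_copy_fields():
    # New dict and new Field objects, but nested state (validators) is shared.
    return {name: copy.copy(field) for name, field in base_fields.items()}


deep = deep_copy_fields()
shallow = shallow_copy_fields()

print('deepcopy:', timeit(deep_copy_fields, number=200))
print('shallow: ', timeit(shallow_copy_fields, number=200))
```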

@auvipy
Member

auvipy commented Dec 16, 2019

thanks for tackling this!

Member

@rpkilby rpkilby left a comment


I'm generally in favor of this PR, left a few comments below.

I think the only major concern is if there are any common use cases that rely on any of the serializer instance attributes. e.g., depending on partial or the context.

rest_framework/serializers.py (outdated)
# Methods for determining the set of field names to include...

- def get_field_names(self, declared_fields, info):
+ @classmethod
+ def get_field_names(cls, declared_fields, info):
Member


Given that these methods are part of the public API, we can expect that users have overridden them as instance methods. Since we're transitioning to classmethods, we should provide a user-friendly warning informing them of the change. Should be fairly straightforward as:

# this will fail if `get_field_names` is an instance method.
assert inspect.ismethod(cls.get_field_names), "<helpful message>"
# or
assert inspect.ismethod(type(self).get_field_names), "<helpful message>"

This should probably be raised sometime during class creation (possibly the aforementioned metaclass).
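
A check of this kind at class creation could look roughly like the sketch below. It relies on the fact that, in Python 3, a classmethod looked up on the class is a bound method, while an instance method looked up on the class is a plain function. The metaclass, its `required_classmethods` attribute and the error message are hypothetical, and it raises `TypeError` rather than using `assert`.

```python
import inspect


class EnforceClassmethodMeta(type):
    """Hypothetical metaclass that rejects instance-method overrides."""

    required_classmethods = ('get_field_names',)

    def __new__(mcs, name, bases, attrs):
        cls = super().__new__(mcs, name, bases, attrs)
        for method_name in mcs.required_classmethods:
            # Looked up on the class, a classmethod is a bound method,
            # while an instance method is a plain function.
            if not inspect.ismethod(getattr(cls, method_name)):
                raise TypeError(f'{name}.{method_name} must be a classmethod')
        return cls


class BaseSerializer(metaclass=EnforceClassmethodMeta):
    @classmethod
    def get_field_names(cls, declared_fields, info):
        return list(info)


class GoodSerializer(BaseSerializer):
    @classmethod
    def get_field_names(cls, declared_fields, info):
        return sorted(info)


try:
    class BadSerializer(BaseSerializer):
        def get_field_names(self, declared_fields, info):
            return list(info)
except TypeError as exc:
    error_message = str(exc)
else:
    error_message = None
```

Note that this check would also reject a staticmethod override, since that too is a plain function when looked up on the class.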

rest_framework/serializers.py (outdated)
@bluetech
Contributor Author

Thanks for the comments @rpkilby.

I remembered from a previous PR that DRF has a requirement to work even before Django is initialized (checked by tests/importable), that's why I set up the cache on the first instantiation. But now I figure that ModelSerializers are exempt from that, because Models themselves cannot be defined before Django is initialized. So I changed to use a metaclass. Now the fields are generated when the ModelSerializer is defined, rather than when it's first instantiated. This also means definition errors are raised immediately, so the tests do need some adjustments - I do think it's the better behavior though.

I changed the name to base_fields. Do note that SerializerMetaclass uses _declared_fields for the declared fields, so it's a bit inconsistent - maybe we want _fields for this?

Regarding the compat concerns, I left it out for now, to get feedback on the "end state" first. But if we agree on this approach, I'll come up with a proposal for the transition. I think we'll want at least one deprecation cycle where the self methods still work.

@rpkilby
Member

rpkilby commented Dec 17, 2019

I remembered from a previous PR that DRF has a requirement to work even before Django is initialized (checked by tests/importable), that's why I set up the cache on the first instantiation. But now I figure that ModelSerializers are exempt from that, because Models themselves cannot be defined before Django is initialized.

One quick comment - DRF itself needs to be importable, however this does not extend to user-defined serializers. Building fields would only happen in user-defined serializer classes, so these changes don't affect DRF importability.

@bluetech
Contributor Author

On the deserialization side, I believe the validators are generated every time. They can be cached too.

Here is a commit which does this (not for this PR): bluetech@067147c

@rsiemens
Contributor

rsiemens commented Mar 5, 2020

@rpkilby @bluetech is there anything that needs to be done to move this PR forward? Is there anything I can do to help it along? Thanks for all the work on this!

@rpkilby
Member

rpkilby commented Mar 6, 2020

Hi @rsiemens. At this point, it's an issue of bandwidth of the maintenance team. While the changes aren't complicated, they are non-trivial (the public-facing serializer API has been updated). We need to consider things like deprecations for users who have overridden these methods, etc.

I'm adding to the 3.12 milestone to help ensure this is looked at, but no guarantees.

@rpkilby rpkilby added this to the 3.12 Release milestone Mar 6, 2020
@rsiemens
Contributor

rsiemens commented Mar 6, 2020

Thanks for the update! Happy to lend a hand if needed.

@tomchristie
Member

My concern would be that the change footprint may be too big on this. It looks pretty risky.
I can easily see this introducing unexpected breakages, or interacting in an unexpected way with some user code that'd already overridden some field generating behavior.

@bluetech
Contributor Author

This definitely carries some breakage risk. I figure such risk is not acceptable at this stage of DRF's life. So I'm going to close this now -- thanks for considering!

FWIW, we use an internal fork of DRF with this patch and some others to improve performance, though it also removes some features we don't use which got in the way. My plan is to make the field instances immutable which would also remove the copy.deepcopy overhead, however I haven't got around to that yet. After that, I expect DRF model serializers to be quite speedy.

@bluetech bluetech closed this Mar 11, 2020
@rpkilby
Member

rpkilby commented Mar 25, 2020

or interacting in an unexpected way with some user code that'd already overridden some field generating behavior.

I think there are generally two cases to consider:

  • Serializers that have overridden the existing instance methods.
  • Serializers that have overridden the methods, but also depend on instance attributes.

For the former, I'm not too concerned. We can detect whether the method is a class or an instance method and raise a warning notifying the user of the change. In a lot of these cases, they probably just need to wrap the method in @classmethod. The only downside is that I don't think there's a clean deprecation path *. We'd just need to loudly raise a helpful error message.

For the latter, there isn't really a good option. If their code is dependent on instance variables, then there isn't an easy migration path to the new classmethods. That said, I'm not sure how common this is, and per-instance field modifications could probably be moved to another serializer method.

* I actually do have an idea for a clean deprecation path, but whether or not it's a good idea remains to be seen.
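
For the second case above, moving per-instance modifications out of the field-building code might look like this pure-Python sketch. The `Serializer` class here is a hypothetical stand-in, not DRF's: it keeps class-level `base_fields` built once, and applies instance-dependent changes (here driven by an assumed `is_staff` key in `context`) in an instance method that runs on each instantiation.

```python
import copy


class Serializer:
    """Hypothetical stand-in: fields are built once and stored on the class."""

    base_fields = {'id': 'IntegerField', 'email': 'CharField'}

    def __init__(self, context=None):
        self.context = context or {}
        self.fields = self.get_fields()

    def get_fields(self):
        # Per-instance copy of the class-level fields.
        return copy.deepcopy(self.base_fields)


class UserSerializer(Serializer):
    def get_fields(self):
        # Instance-dependent tweaks live here, not in the class-level builders.
        fields = super().get_fields()
        if not self.context.get('is_staff'):
            fields.pop('email')
        return fields


public = UserSerializer()
staff = UserSerializer(context={'is_staff': True})
```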

@mohmyo
Contributor

mohmyo commented Jul 9, 2020

Serialization performance in general, and for ModelSerializer especially, has been unsatisfying for a long time. I think going down this road is an important and great step, even if it isn't the best possible road. I think most users would accept a few small road bumps for such a great addition.

@carltongibson
Collaborator

The low risk approach here is to begin this as a third-party package. from speedy import serializers and off you go.
There shouldn't be any reason why this wouldn't be feasible.

Once it's shown to work/be stable, there's a case to be made for a change to rest_framework, but...

This definitely carries some breakage risk. I figure such risk is not acceptable at this stage of DRF's life.

I think that's the key, right? (We really can't just break things...)


7 participants