Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

strict_dataclass #20461

Closed
wants to merge 2 commits into from
Closed

strict_dataclass #20461

wants to merge 2 commits into from

Conversation

sryza
Copy link
Contributor

@sryza sryza commented Mar 13, 2024

Summary & Motivation

This one was a doozy.

Introduces a @strict_dataclass decorator, with the aim of replacing our usage of NamedTuple and pydantic.BaseModel.

Why not pydantic.BaseModel?

  1. This would require remembering to use frozen=True, which is annoying and error-prone.
  2. It doesn't work with positional arguments. Even if we declare positional arguments bad for new classes, many of our existing NamedTuples are public APIs that will need to support positional arguments for the forseeable future.

Why not @pydantic.dataclass(frozen=True, config=dict(extra="forbid"))

This would require remembering to use frozen=True, config=dict(extra="forbid"), which is annoying and error-prone.

How I Tested These Changes

Copy link
Contributor

@maximearmstrong maximearmstrong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

Copy link
Member

@schrockn schrockn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on brief googling I'm not persuaded that we should wrap dataclass rather than BaseModel.

BaseModel is the main way that people use pydantic and I think our default position should be reflecting the default way that people use it.

I'm also a little wary of adding the dependency on dataclass_transform here.

@schrockn
Copy link
Member

How many of our namedtuples are public APIs and would need to support positionals?

I also think it would be fairly straightforward to support positional by customizing __init__ our subclass.

@sryza
Copy link
Contributor Author

sryza commented Mar 13, 2024

BaseModel is the main way that people use pydantic and I think our default position should be reflecting the default way that people use it.

Do you have evidence for this? Also, my suspicion is that a lot more people are familiar with dataclasses than pydantic.

I'm also a little wary of adding the dependency on dataclass_transform here.

pydantic.dataclasses.dataclass already uses it FWIW (which is how I discovered it).

@sryza
Copy link
Contributor Author

sryza commented Mar 13, 2024

How many of our namedtuples are public APIs and would need to support positionals?

This is a sample from scanning through the first ~third of our public exports:

  • AssetCheckSpec
  • AssetCheckKey
  • AssetKey
  • AssetIn
  • AssetMaterialization
  • AssetOut
  • DataProvenance
  • DataVersion
  • DependencyDefinition
  • MultiDependencyDefinition
  • NodeInvocation
  • GraphIn
  • GraphOut
  • In
  • Out
  • InputMapping
  • OutputMapping

@schrockn
Copy link
Member

Do you have evidence for this? Also, my suspicion is that a lot more people are familiar with dataclasses than pydantic.

All of their docs lead with it:

Screenshot 2024-03-13 at 5 49 03 PM

All examples of pydantic usage I've ever seen uses BaseModel

Dataclasses are presented as a compatibility layer.

If we want the dataclass API, why not just wrap the out-of-the-box dataclass in python?

@sryza
Copy link
Contributor Author

sryza commented Mar 13, 2024

@schrockn it's certainly not the only benefit of pydantic, but probably the #1 thing that makes me excited about switching from NamedTuples to pydantic is the opportunity to get rid of all our boilerplate constructors. So would be very sad to move to a pydantic approach that requires to keep many of these.

@sryza
Copy link
Contributor Author

sryza commented Mar 13, 2024

All of their docs lead with it:

Fair.

If we want the dataclass API, why not just wrap the out-of-the-box dataclass in python?

It doesn't do runtime type-checking, so it wouldn't be a replacement for all the type-checking code we have.

@schrockn
Copy link
Member

It doesn't do runtime type-checking, so it wouldn't be a replacement for all the type-checking code we have.

Ah yes of course.

@schrockn it's certainly not the only benefit of pydantic, but probably the number 1 thing that makes me excited about switching from NamedTuples to pydantic is the opportunity to get rid of all our boilerplate constructors. So would be very sad to move to a pydantic approach that requires to keep many of these.

Yes that will be nice. However just to manage expectations that we will still need custom initializers for any serdes class that has changed, or for any class where we want to massage arguments (e.g. CoercibleToAssetKey argument that ends up an AssetKey field).

In terms of my comment "I also think it would be fairly straightforward to support positional by customizing init our subclass" I want to be clear I'm talking about supporting this is our own BaseModel class, rather than forcing all subclasses to customize init

@sryza
Copy link
Contributor Author

sryza commented Mar 13, 2024

In terms of my comment "I also think it would be fairly straightforward to support positional by customizing init our subclass" I want to be clear I'm talking about supporting this is our own BaseModel class, rather than forcing all subclasses to customize init

Ah I misunderstood. If we can find a way to do this then I agree that's the best option. However I did a bunch of investigation on this this morning and couldn't figure out a way to do it that preserves static type checking. If you have ideas about avenues to investigate to get this to work, I'd be happy to explore them.

@schrockn
Copy link
Member

schrockn commented Mar 14, 2024

Ah I misunderstood. If we can find a way to do this then I agree that's the best option. However I did a bunch of investigation on this this morning and couldn't figure out a way to do it that preserves static type checking. If you have ideas about avenues to investigate to get this to work, I'd be happy to explore them.

Yeah the version I imagine is probably similar to what you prototyped where it accepts *args and **kwargs and does some reflection-based remapping, which probably breaks static typing. @smackesey may have some additional ideas here.

My recommendation is a dagster-specific class Inheriting from BaseModel (that sets frozen to True, etc) rather than a dataclass wrapper since that is "standard"/"default" pydantic.

In terms of those existing public classes, a good deal of them require custom initialization anyways. Any class that has been serialized and changed at all, and any class that does any manipulation (e.g. a normalize_metadata call or a conversion from CoercibleToAssetKey) will require one anyways, which is a lot (most by my initial sample) of those. We will still be able to eliminate the check calls in most cases and defer to Pydantic's built-in runtime checking.

For new classes we can use the built-in init logic until we need to change serialization or do some transformation on inputs.

This approach also has the advantage of being more minimal.

from dagster._core.definitions.events import AssetKey, CoercibleToAssetKey
from pydantic import BaseModel, Extra


class Pedantic(BaseModel):
    class Config:
        frozen = True
        extra = Extra.forbid


class ThingThatAcceptsCoercibleToAssetKey(Pedantic):
    def __init__(self, asset_key: CoercibleToAssetKey):
        super().__init__(**dict(asset_key=AssetKey.from_coercible(asset_key)))

    asset_key: AssetKey


print(ThingThatAcceptsCoercibleToAssetKey("asset_key"))

I would be surprised if there weren't edge case bugs in the cross product of pydantic versions, python versions, and typing_extension versions in the dataclass wrapper variant. Whereas in the above there is less chance of that.

@sryza
Copy link
Contributor Author

sryza commented Mar 14, 2024

In terms of those existing public classes, a good deal of them require custom initialization anyways.

Of the classes listed in my comment above, I believe the 6 whose names start with "Asset" all require special constructors, and the other 10 do not. When I sampled a few more further down in the file, I got a similar ratio.

My recommendation is a dagster-specific class Inheriting from BaseModel (that sets frozen to True, etc) rather than a dataclass wrapper since that is "standard"/"default" pydantic.

Yeah this was the first thing that I tried and it has a lot of things going for it. But still concerned about the amount of boilerplate it implies.

I would be surprised if there weren't edge case bugs in the cross product of pydantic versions, python versions, and typing_extension versions in the dataclass wrapper variant. Whereas in the above there is less chance of that.

Would definitely be bad if this were the case. FWIW the minimum version of pydantic that we support imports typing_extensions.dataclass_transform: https://github.com/pydantic/pydantic/blob/v1.10.1/pydantic/dataclasses.py#L39

I suspect you won't be super hot on this, but just to enumerate the space of options, another direction to consider is supporting both. E.g. we could have a subclass of BaseModel similar to your example above as our general recommended way of defining new classes, and also a @strict_dataclass_positional_args decorator for the subset with positional args.

@sryza
Copy link
Contributor Author

sryza commented Mar 14, 2024

A point against the dataclass approach, which gives me pause, is that overriding init gets more complicated. Will need to play around with this but I think we might need to override new instead.

@schrockn
Copy link
Member

Of the classes listed in my comment #20461 (comment), I believe the 6 whose names start with "Asset" all require special constructors, and the other 10 do not. When I sampled a few more further down in the file, I got a similar ratio.

Don't think so. Any of them that have dagster_type need one because of:

        return super(In, cls).__new__(
            cls,
            dagster_type=(
                NoValueSentinel
                if dagster_type is NoValueSentinel
                else resolve_dagster_type(dagster_type)
            ),

Yeah this was the first thing that I tried and it has a lot of things going for it. But still concerned about the amount of boilerplate it implies.

Given the above (and the point about serializable classes), I think both approaches have similar amounts of boilerplate in practice and in the cases where we do need a custom constructor we won't need the check calls so it will feel quite a bit better.

@schrockn
Copy link
Member

Serdes change POC: #20470

@sryza sryza closed this Mar 14, 2024
@sryza sryza mentioned this pull request Mar 21, 2024
sryza added a commit that referenced this pull request Mar 25, 2024
## Summary & Motivation

Introduces a `StrictModel` class, which is a subclass of
`pydantic.BaseModel` that's frozen and doesn't allow default constructor
arguments that aren't class members.

The idea is for this to eventually replace all our uses of `NamedTuple`,
`BaseModel`, and `@dataclass`.

This is a replacement for the unmerged
#20461.

Downstream PRs that demonstrate its usage:
- #20641
- #20643
- #20638

## How I Tested These Changes
PedramNavid pushed a commit that referenced this pull request Mar 28, 2024
## Summary & Motivation

Introduces a `StrictModel` class, which is a subclass of
`pydantic.BaseModel` that's frozen and doesn't allow default constructor
arguments that aren't class members.

The idea is for this to eventually replace all our uses of `NamedTuple`,
`BaseModel`, and `@dataclass`.

This is a replacement for the unmerged
#20461.

Downstream PRs that demonstrate its usage:
- #20641
- #20643
- #20638

## How I Tested These Changes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants