Losslessly represent omitted fields #344

Closed
SyntaxColoring opened this issue Mar 19, 2023 · 4 comments · Fixed by #350

SyntaxColoring commented Mar 19, 2023

Description

I sometimes have JSON objects where the presence or absence of a field is semantically distinct from whether that field's value is null.

In keeping with msgspec's principles of strictness and correctness, I'd like for the library to be able to losslessly and reversibly encode and decode these objects.

Example use case

Suppose you have a database of software bugs. Each bug has a title, and, optionally, a single assignee.

You want to use JSON to represent search filters.

This filter would mean "find all issues titled 'App crashes' that are assigned to @JohnDoe":

{
    "title": "App crashes",
    "assignee": "@JohnDoe"
}

And this would mean "find all issues titled 'App crashes' that don't have an assignee":

{
    "title": "App crashes",
    "assignee": null
}

And this would mean "find all issues titled 'App crashes' regardless of assignee":

{
    "title": "App crashes"
}
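
To make the three cases concrete, here is a minimal sketch using only Python's standard `json` module. The helper name `describe_assignee_filter` is hypothetical and is not part of msgspec. It shows that a plain dict keeps the distinction (key missing vs. value `None`), which is the information a typed struct would need to preserve:

```python
import json


def describe_assignee_filter(raw: str) -> str:
    """Classify the assignee constraint carried by a JSON search filter."""
    data = json.loads(raw)
    if "assignee" not in data:
        # Key absent: the filter places no constraint on the assignee.
        return "any assignee"
    if data["assignee"] is None:
        # Key present with null: match only unassigned bugs.
        return "no assignee"
    return f"assigned to {data['assignee']}"


print(describe_assignee_filter('{"title": "App crashes", "assignee": "@JohnDoe"}'))
print(describe_assignee_filter('{"title": "App crashes", "assignee": null}'))
print(describe_assignee_filter('{"title": "App crashes"}'))
```

Note that `data.get("assignee")` would return `None` for both of the last two inputs, which is exactly the conflation described above.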

In JSON Schema, I would represent this like this:

{
    "$ref": "#/$defs/SearchFilter",
    "$defs": {
        "SearchFilter": {
            "title": "SearchFilter",
            "type": "object",
            "properties": {
                "title": {
                    "type": "string"
                },
                "assignee": {
                    "anyOf": [
                        { "type": "string" },
                        { "type": "null" }
                    ]
                }
            },
            "required": []
        }
    }
}

Proposed API design

Using msgspec, I would want to implement the example above something like this:

class SearchFilter(msgspec.Struct):
    # OMITTED and OMITTED_TYPE do not exist today.
    title: str | msgspec.OMITTED_TYPE = msgspec.OMITTED
    assignee: str | None | msgspec.OMITTED_TYPE = msgspec.OMITTED


def filter(all_bugs: list[Bug], filter: SearchFilter) -> list[Bug]:
    def matches_filter(bug: Bug) -> bool:
        title_matches = filter.title == msgspec.OMITTED or filter.title == bug.title
        assignee_matches = filter.assignee == msgspec.OMITTED or filter.assignee == bug.assignee
        return title_matches and assignee_matches
    return [bug for bug in all_bugs if matches_filter(bug)]
  • Two new symbols are added to msgspec: OMITTED and OMITTED_TYPE.
    • OMITTED is a unique sentinel value, distinct from None.
    • OMITTED_TYPE is the type of OMITTED, probably an alias of typing.Literal[OMITTED].
    • Perhaps values of OMITTED evaluate to falsey, like None does.
  • When encode() encounters a value of OMITTED, it omits the entire key-value pair from the output.
  • The decode() behavior is unchanged. When it decodes a message that has a certain field missing, it returns that field's default value. In this case, that value can happen to be the sentinel value OMITTED.

Prior art in other libraries

PEP 655

PEP 655 discusses the same problem for TypedDicts.

They solve it in a different way. Instead of having a special sentinel value like I'm proposing, they introduce the wrapping types typing.Required[T] and typing.NotRequired[T]. (These appear to be pass-throughs to T at run time. I guess you're supposed to use the in operator to gate any potentially unsafe accesses, but mypy doesn't enforce this today.)

They also explicitly reject the name "omittable."

It feels to me like a lot of their arguments don't make sense when applied to msgspec.Structs, as opposed to dicts. But I haven't spent the time to try these things out, or to read the mailing lists and dig into their thinking. Maybe they're right.

Pydantic

Pydantic has historically badly conflated these concepts. Planned changes for v2.0 look like they'll bring Pydantic to parity with msgspec as it exists today, but they won't address the problem I'm describing here.

jcrist (Owner) commented Mar 19, 2023

Thanks for opening this. It seems like a well-thought-out feature and shouldn't be too hard to implement. I think the only open question here is the naming.

I don't like using the name Omitted here, since it may conflict with how we spell omitted fields whenever that's implemented (#199). My initial instinct is to use the name UNSET. This would unfortunately require a breaking change for existing internal consumers of the msgspec.UNSET sentinel, but I doubt anyone is using it directly outside msgspec since it was only recently added, and is a fringe API.

For typing, we could do something like the following to make spelling these fields easier:

# This type annotation helper would be in `msgspec`:
# MaybeUnset = Union[T, UnsetType]

from msgspec import Struct, MaybeUnset, UNSET

class SearchFilter(Struct):
    title: MaybeUnset[str] = UNSET
    assignee: MaybeUnset[str | None] = UNSET

I don't love the name MaybeUnset, but Unsettable sounds more like a field that can never be set.

Alternatively, we could use UNDEFINED and MaybeUndefined (or some other equivalent name) for the same concepts. This may better mirror JavaScript semantics, but I'm not sure if a Python programmer would find the meaning/usage of UNDEFINED here as clear. This option would also avoid the need for a breaking change. I'm not against breaking changes now, since this library still isn't at 1.0, but if they can be avoided that's always a plus.

Thoughts?

jcrist (Owner) commented Mar 19, 2023

The more I think about this, the more I like using "undefined" here. This would also let us mirror JSON.stringify behavior, where an undefined in a non-object-value location is encoded as null instead of erroring (so [1, undefined] would encode as [1, null]).
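
A rough sketch of the JSON.stringify-style behavior described here, written with the standard library only. `UndefinedType`, `UNDEFINED`, and `to_jsonable` are hypothetical names for illustration and are not msgspec APIs. A top-level `UNDEFINED` is deliberately not handled.

```python
import json


class UndefinedType:
    """Hypothetical sentinel type, used only for this illustration."""

    def __repr__(self) -> str:
        return "UNDEFINED"


UNDEFINED = UndefinedType()


def to_jsonable(value):
    """Drop UNDEFINED object values; turn UNDEFINED list items into None."""
    if isinstance(value, dict):
        return {k: to_jsonable(v) for k, v in value.items() if v is not UNDEFINED}
    if isinstance(value, list):
        return [None if v is UNDEFINED else to_jsonable(v) for v in value]
    return value


print(json.dumps(to_jsonable([1, UNDEFINED])))
print(json.dumps(to_jsonable({"title": "App crashes", "assignee": UNDEFINED})))
```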

Still not sure about the singleton/type naming/casing. Thoughts here would be very welcome.

Possible singleton names: msgspec.UNDEFINED, msgspec.undefined, msgspec.Undefined, ?

Possible type names: msgspec.UndefinedType, msgspec.Undefined, ?

# Just playing around with spellings here
class Example(msgspec.Struct):
    field_1: str | msgspec.UndefinedType = msgspec.UNDEFINED
    field_2: str | msgspec.Undefined = msgspec.undefined
    field_3: str | msgspec.UndefinedType = msgspec.Undefined

SyntaxColoring (Author) commented Mar 20, 2023

Thanks for your quick response!

I don't like using the name Omitted here [...] My initial instinct is to use the name UNSET [...]

For typing, we could do something like the following to make spelling these fields easier:

# This type annotation helper would be in `msgspec`:
# MaybeUnset = Union[T, UnsetType]

I don't love the name MaybeUnset, but Unsettable sounds more like a field that can never be set.

👍 Agreed with all of this, if OMITTED is unavailable.


Alternatively, we could use UNDEFINED and MaybeUndefined (or some other equivalent name) for the same concepts. This may better mirror JavaScript semantics, but I'm not sure if a Python programmer would find the meaning/usage of UNDEFINED here as clear.

I worry that calling it "undefined" could lure library users and contributors into pursuing JavaScript semantics, where those semantics aren't necessarily appropriate for a strict and fast serialization and validation library in Python. JSON isn't JavaScript, basically.

For instance, doing my_struct.field_that_does_not_exist will (mercifully) raise AttributeError in Python, not return undefined.

The more I think about this, the more I like using "undefined" here. This would also let us mirror JSON.stringify behavior, where an undefined in a non-object-value location is encoded as null instead of erroring (so [1, undefined] would encode as [1, null]).

Another case in point: this sounds confusing to me. (At least, that's my knee-jerk reaction.) If I wanted nulls in my output, I would have declared the list as list[int | None] and inserted None elements. I'd expect using the special sentinel there to be an error, as if I had tried to encode any other nonsensical object, like a file or whatever.
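
The stricter alternative argued for here could look roughly like the following standard-library sketch. `SentinelType`, `SENTINEL`, `strip_sentinel_fields`, and `encode_strict` are hypothetical names, not msgspec APIs. The sentinel is removed only when it appears as an object value; anywhere else it is left in place so that `json.dumps` rejects it like any other unsupported object.

```python
import json


class SentinelType:
    """Hypothetical sentinel type, used only for this illustration."""


SENTINEL = SentinelType()


def strip_sentinel_fields(value):
    """Remove key-value pairs whose value is SENTINEL; touch nothing else."""
    if isinstance(value, dict):
        return {
            k: strip_sentinel_fields(v)
            for k, v in value.items()
            if v is not SENTINEL
        }
    if isinstance(value, list):
        # A sentinel inside a list is kept on purpose so that encoding fails.
        return [strip_sentinel_fields(v) for v in value]
    return value


def encode_strict(value) -> str:
    return json.dumps(strip_sentinel_fields(value))


print(encode_strict({"title": "App crashes", "assignee": SENTINEL}))

try:
    encode_strict([1, SENTINEL])
except TypeError as exc:
    print("rejected:", exc)
```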


Still not sure about the singleton/type naming/casing. Thoughts here would be very welcome.

msgspec.FooType is consistent with NoneType, so that's nice.

msgspec.Foo would be consistent with None, but msgspec.FOO would be consistent with PEP 8 constants. I guess I'd personally find msgspec.FOO less surprising; msgspec.Foo looks like an instantiable class.


Despite all of the above, I'd be totally happy to try out any naming scheme. Like you said, the library's still not at v1.0, so names can change later if practical experience shows that our initial choices were confusing.

Thanks again for being responsive to this!

jcrist (Owner) commented Mar 23, 2023

I've pushed up #350 to fix this. The semantics are pretty much what you describe above, with the singleton named msgspec.UNSET and the type named msgspec.UnsetType. If you have some time, I'd appreciate a once over on the docs in that PR to make sure usage makes sense to you. Thanks!
