Losslessly represent omitted fields #344

SyntaxColoring · 2023-03-19T18:11:54Z

Description

I sometimes have JSON objects where the presence or absence of a field is semantically distinct from whether that field's value is null.

In keeping with msgspec's principles of strictness and correctness, I'd like for the library to be able to losslessly and reversibly encode and decode these objects.

Example use case

Suppose you have a database of software bugs. Each bug has a title, and, optionally, a single assignee.

You want to use JSON to represent search filters.

This filter would mean "find all issues titled 'App crashes' that are assigned to @JohnDoe":

{
    "title": "App crashes",
    "assignee": "@JohnDoe"
}

And this would mean "find all issues titled 'App crashes' that don't have an assignee":

{
    "title": "App crashes",
    "assignee": null
}

And this would mean "find all issues titled 'App crashes' regardless of assignee":

{
    "title": "App crashes"
}
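
To make the distinction concrete, here is a minimal sketch that uses only Python's standard json module. The helper name describe_assignee_filter is hypothetical; it only illustrates that a missing key and a null value remain distinguishable once the filter has been parsed into a dict.

```python
import json


def describe_assignee_filter(raw: str) -> str:
    """Classify how a JSON search filter constrains the assignee."""
    obj = json.loads(raw)
    if "assignee" not in obj:
        # Key absent: the filter places no constraint on the assignee.
        return "any assignee"
    if obj["assignee"] is None:
        # Key present with null: only unassigned bugs match.
        return "no assignee"
    return f"assigned to {obj['assignee']}"


print(describe_assignee_filter('{"title": "App crashes", "assignee": "@JohnDoe"}'))
print(describe_assignee_filter('{"title": "App crashes", "assignee": null}'))
print(describe_assignee_filter('{"title": "App crashes"}'))
```

A plain dict keeps this information, but a struct whose missing fields default to None would collapse the last two cases into one.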

In JSON Schema, I would represent it like this:

{
    "$ref": "#/$defs/SearchFilter",
    "$defs": {
        "SearchFilter": {
            "title": "SearchFilter",
            "type": "object",
            "properties": {
                "title": {
                    "type": "string"
                },
                "assignee": {
                    "anyOf": [
                        { "type": "string" },
                        { "type": "null" }
                    ]
                }
            },
            "required": []
        }
    }
}

Proposed API design

Using msgspec, I would want to implement the example above something like this:

import msgspec


class SearchFilter(msgspec.Struct):
    # OMITTED and OMITTED_TYPE do not exist today.
    title: str | msgspec.OMITTED_TYPE = msgspec.OMITTED
    assignee: str | None | msgspec.OMITTED_TYPE = msgspec.OMITTED


# Bug is assumed to be a record with `title` and `assignee` attributes.
def filter_bugs(all_bugs: list[Bug], search_filter: SearchFilter) -> list[Bug]:
    def matches_filter(bug: Bug) -> bool:
        # Compare against the sentinel by identity, as one would with None.
        title_matches = (
            search_filter.title is msgspec.OMITTED
            or search_filter.title == bug.title
        )
        assignee_matches = (
            search_filter.assignee is msgspec.OMITTED
            or search_filter.assignee == bug.assignee
        )
        return title_matches and assignee_matches
    return [bug for bug in all_bugs if matches_filter(bug)]

Two new symbols are added to msgspec: OMITTED and OMITTED_TYPE.
- OMITTED is a unique sentinel value, distinct from None.
- OMITTED_TYPE is the type of OMITTED, probably an alias of typing.Literal[OMITTED].
- Perhaps OMITTED evaluates as falsy, like None does.

When encode() encounters a field whose value is OMITTED, it omits the entire key-value pair from the output.

The decode() behavior is unchanged. When it decodes a message in which a field is missing, it returns that field's default value. Here, that default can simply be the sentinel value OMITTED.
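
The proposed semantics can be sketched without msgspec at all. The following is a rough standard-library illustration, not how msgspec would implement it: OmittedType, OMITTED, encode, and decode are all hypothetical stand-ins, and no type validation is performed.

```python
import json
from dataclasses import dataclass, fields
from typing import Union


class OmittedType:
    """Hypothetical singleton sentinel type (not part of msgspec)."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __bool__(self) -> bool:
        return False  # falsy, like None

    def __repr__(self) -> str:
        return "OMITTED"


OMITTED = OmittedType()


@dataclass
class SearchFilter:
    title: Union[str, OmittedType] = OMITTED
    assignee: Union[str, None, OmittedType] = OMITTED


def encode(obj: SearchFilter) -> str:
    # Drop every key-value pair whose value is the sentinel.
    data = {
        f.name: getattr(obj, f.name)
        for f in fields(obj)
        if getattr(obj, f.name) is not OMITTED
    }
    return json.dumps(data)


def decode(raw: str) -> SearchFilter:
    # Missing keys fall back to the field default, which is OMITTED here.
    return SearchFilter(**json.loads(raw))


print(encode(SearchFilter(title="App crashes")))
print(encode(SearchFilter(title="App crashes", assignee=None)))
print(decode('{"title": "App crashes"}'))
```

Because decode() needs no special handling, the round trip decode(encode(x)) == x holds for any SearchFilter in this sketch, which is the lossless property the proposal is after.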

Prior art in other libraries

PEP 655

PEP 655 discusses the same problem for TypedDicts.

They solve it in a different way. Instead of having a special sentinel value like I'm proposing, they introduce the wrapping types typing.Required[T] and typing.NotRequired[T]. (These appear to be pass-throughs to T at run time. I guess you're supposed to use the in operator to gate any potentially unsafe accesses, but mypy doesn't enforce this today.)

They also explicitly reject the name "omittable."

It feels to me like a lot of their arguments don't make sense when applied to msgspec.Structs, as opposed to dicts. But I haven't spent the time to try these things out, or to read the mailing lists and dig into their thinking. Maybe they're right.

Pydantic

Pydantic has historically badly conflated these concepts. Planned changes for v2.0 look like they'll bring Pydantic to parity with msgspec as it exists today, but they won't address the problem I'm describing here.


jcrist · 2023-03-19T19:43:58Z

Thanks for opening this. It seems like a well-thought-out feature and shouldn't be too hard to implement. I think the only open question here would be the naming.

I don't like using the name Omitted here, since it may conflict with how we spell omitted fields whenever that's implemented (#199). My initial instinct is to use the name UNSET. This would unfortunately require a breaking change for existing internal consumers of the msgspec.UNSET sentinel, but I doubt anyone is using it directly outside msgspec since it was only recently added, and is a fringe API.

For typing, we could do something like the following to make spelling these fields easier:

# This type annotation helper would be in `msgspec`:
# MaybeUnset = Union[T, UnsetType]

from msgspec import Struct, MaybeUnset, UNSET

class SearchFilter(Struct):
    title: MaybeUnset[str] = UNSET
    assignee: MaybeUnset[str | None] = UNSET

I don't love the name MaybeUnset, but Unsettable sounds more like a field that can never be set.

Alternatively, we could use UNDEFINED and MaybeUndefined (or some other equivalent name) for the same concepts. This may better mirror JavaScript semantics, but I'm not sure if a Python programmer would find the meaning/usage of UNDEFINED here as clear. This option would also avoid the need for a breaking change. I'm not against breaking changes now, since this library still isn't at 1.0, but if they can be avoided that's always a plus.

Thoughts?

jcrist · 2023-03-19T20:04:46Z

The more I think about this, the more I like using "undefined" here. This would also let us mirror JSON.stringify behavior, where an undefined in a non-object-value location is encoded as null instead of erroring (so [1, undefined] would encode as [1, null]).
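
As a rough illustration of the JSON.stringify-like rule described above (not msgspec code), the sketch below drops the sentinel when it is an object value and turns it into null when it is an array element. UndefinedType, UNDEFINED, and to_jsonable are hypothetical names, and a top-level UNDEFINED is deliberately left unhandled.

```python
import json
from typing import Any


class UndefinedType:
    """Hypothetical sentinel type for this sketch only."""

    def __repr__(self) -> str:
        return "UNDEFINED"


UNDEFINED = UndefinedType()


def to_jsonable(value: Any) -> Any:
    if isinstance(value, dict):
        # Object-value position: the whole key-value pair is dropped.
        return {k: to_jsonable(v) for k, v in value.items() if v is not UNDEFINED}
    if isinstance(value, (list, tuple)):
        # Array-element position: the sentinel is replaced by null.
        return [None if v is UNDEFINED else to_jsonable(v) for v in value]
    return value


print(json.dumps(to_jsonable([1, UNDEFINED])))
print(json.dumps(to_jsonable({"a": 1, "b": UNDEFINED})))
```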

Still not sure about the singleton/type naming/casing. Thoughts here would be very welcome.

Possible singleton names: msgspec.UNDEFINED, msgspec.undefined, msgspec.Undefined, ?

Possible type names: msgspec.UndefinedType, msgspec.Undefined, ?

# Just playing around with spellings here
class Example(msgspec.Struct):
    field_1: str | msgspec.UndefinedType = msgspec.UNDEFINED
    field_2: str | msgspec.Undefined = msgspec.undefined
    field_3: str | msgspec.UndefinedType = msgspec.Undefined

SyntaxColoring · 2023-03-20T13:39:37Z

Thanks for your quick response!

I don't like using the name Omitted here [...] My initial instinct is to use the name UNSET [...]

For typing, we could do something like the following to make spelling these fields easier:
# This type annotation helper would be in `msgspec`:
# MaybeUnset = Union[T, UnsetType]
I don't love the name MaybeUnset, but Unsettable sounds more like a field that can never be set.

👍 Agreed with all of this, if OMITTED is unavailable.

Alternatively we could use UNDEFINED and MaybeUndefined (or some other equivalent name) for the same concepts. This may better mirror javascript semantics, but I'm not sure if a Python programmer would find the meaning/usage of UNDEFINED here as clear.

I worry that calling it "undefined" could lure library users and contributors into pursuing JavaScript semantics, where those semantics aren't necessarily appropriate for a strict and fast serialization and validation library in Python. JSON isn't JavaScript, basically.

For instance, doing my_struct.field_that_does_not_exist will (mercifully) raise AttributeError in Python, not return undefined.

The more I think about this, the more I like using "undefined" here. This would also let us mirror JSON.stringify behavior, where an undefined in a non-object-value location is encoded as null instead of erroring (so [1, undefined] would encode as [1, null]).

Another case in point—this sounds confusing to me. (At least, that's my knee-jerk reaction). If I wanted nulls in my output, I would have declared the list as list[int | None] and inserted None elements. I'd expect using the special sentinel there to be an error, as if I had tried to encode any other nonsensical object, like a file or whatever.
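
A sketch of the stricter behavior I have in mind, with hypothetical names (SentinelType, SENTINEL, to_jsonable_strict) and plain Python containers standing in for msgspec's encoder: the sentinel is dropped when it is an object value and rejected anywhere else.

```python
from typing import Any


class SentinelType:
    """Hypothetical sentinel type for this sketch only."""

    def __repr__(self) -> str:
        return "SENTINEL"


SENTINEL = SentinelType()


def to_jsonable_strict(value: Any) -> Any:
    if isinstance(value, dict):
        # Object-value position: the key-value pair is omitted.
        return {
            k: to_jsonable_strict(v)
            for k, v in value.items()
            if v is not SENTINEL
        }
    if isinstance(value, (list, tuple)):
        return [to_jsonable_strict(v) for v in value]
    if value is SENTINEL:
        # Anywhere else, the sentinel is as unencodable as a file object.
        raise TypeError("sentinel is only valid as an object field value")
    return value


print(to_jsonable_strict({"a": 1, "b": SENTINEL}))
try:
    to_jsonable_strict([1, SENTINEL])
except TypeError as exc:
    print(f"rejected: {exc}")
```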

Still not sure about the singleton/type naming/casing. Thoughts here would be very welcome.

msgspec.FooType is consistent with NoneType, so that's nice.

msgspec.Foo would be consistent with None, but msgspec.FOO would be consistent with PEP 8 constants. I guess I'd personally find msgspec.FOO less surprising; msgspec.Foo looks like an instantiable class.

Despite all of the above, I'd be totally happy to try out any naming scheme. Like you said, the library's still not at v1.0, so names can change later if practical experience shows that our initial choices were confusing.

Thanks again for being responsive to this!

jcrist · 2023-03-23T02:56:55Z

I've pushed up #350 to fix this. The semantics are pretty much what you describe above, with the singleton named msgspec.UNSET and the type named msgspec.UnsetType. If you have some time, I'd appreciate a once over on the docs in that PR to make sure usage makes sense to you. Thanks!

jcrist mentioned this issue Mar 23, 2023

Use msgspec.UNSET for tracking unset fields #350

Merged

jcrist closed this as completed in #350 Mar 23, 2023
