-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Explicit type specification for unions #36
Comments
Polyfield looks interesting. I think it'd be worthwhile to make sure we're not duplicating ideas that may have already been well worked out in, say, marshmallow-jsonschema. |
To get it on the record and out of my tab list... |
|
As I think about looking at my own files I do really like the lack of an extra layer and the explicit type specification of every item. So I'm thinking that perhaps both options would be good. Of course, two ways has plenty of cost so... I suspect the wrapper-layer approach is only going to be of significant interest to people that do not want types indicated in the serialized data but that also want functional unions. The wrapper layer is the least intrusive to the serialized form. This is a property of a field. Just union fields? Maybe, I haven't thought about any others. The identifer/type relationship is local. The integrated form is nice when you do it everywhere. It is only one extra line per object and no indentation. This is more of a property of the class to be serialized. The identifier/type relationship is more 'global'. It doesn't have to literally be global but it isn't part of a field or a class. I think the two may be able to cleanly coexist not only as features in desert but even in single serialized data blobs. I suppose there could be some silliness if you used different names for the two methods. {
"type": "MyClass",
"value": {
"_type": "color_stuff",
"red": 0,
"green": 1,
"blue": 0.5
}
} So... meh. Maybe if all classes to be serialized in the union are internally typed then the union automatically drops the external typing layer? Too implicit? Maybe there are hazards in some scenarios such as serializing with one set of classes then adding a new option that isn't internally typed and the layers are expected but not present? The developer must choose but we give warnings or otherwise provide a means of testing that if their choice isn't obviously sensible? There's also the matter of subclasses that I put in the WIP list over in polyfield but haven't thought through yet. IIRC, the types registered with the union are expecting to dict-lookup match the types of the actual objects. No conclusions here I suppose, just some thoughts to revisit later. |
Where's the indication that 1 is a float? Do we only want that if the defining schema has a union field for "green" (which it presumably doesn't here)? |
Correct, no union for green. It was meant to just be a float. The indication is in the schema (unless marshmallow does 1.0, I don't know). For people that want the integrated type form for native JSON (or more general) types, we could provide fields that have integrated types and a way to trigger them. Or I suppose the same for wrapped-layer. For reference, I use |
If we are assuming the loader's schema is the same as the dumpers, and key ordering is canonicalized or maintained, there could be a For example, {
"data": {"things": [{"color": "red"}, {"color": "green"}, {"color": "blue"}]},
"union_indexes": [0,0,1],
"schema_hash": "01abef8" # or at least "schema_name"
} |
Hi all, I'm new to this project 👋Very interested in seeing this issue (and #89) get resolved. I've tried both I happened to come across a good writeup of the various options for explicit type specification for unions, courtesy of the I think supporting a choice of explicit representation (and if possible, compatibility with Serde for cross-language applications) would really set this library apart from the competition, and if that's the direction chosen by the maintainers, I'd love to be part of that journey. |
Hey @obi1kenobi, welcome! That Serde link gives a great rundown of our options. @altendky's polyfield link above looks like what Serde calls "adjacently tagged", and marshmallow-union looks like the "untagged" option. I'm +1 on providing built-in support for all four options discussed there. |
So let's see if my original example can still be used to show the four options here. Class Definitionsimport typing
import attr
import desert
@attr.dataclass
class A:
color: str
@attr.dataclass
class B:
color: str
@attr.dataclass
class C:
things: typing.List[typing.Union[A, B]] Sample Datainstances = C(
things=[
A(color='red'),
A(color='green'),
B(color='blue'),
],
) I'll use some 'non-standard' formatting and hopefully it'll end up more readable including considerations of fitting on the screen. I am also only applying it to the members of the attribute list. I think that may be an orthogonal question. 1) How to tag when being explicit and 2) whether to tag even when there isn't any ambiguity. Externally Tagged{
"things": [
{"A": {"color": "red"}},
{"A": {"color": "green"}},
{"B": {"color": "blue"}}
]
} Internally Tagged{
"things": [
{"type": "A", "color": "red"},
{"type": "A", "color": "green"},
{"type": "B", "color": "blue"}
]
} Adjacently Tagged{
"things": [
{"t": "A", "c": {"color": "red"}},
{"t": "A", "c": {"color": "green"}},
{"t": "B", "c": {"color": "blue"}}
]
} Untagged{
"things": [
{"color": "red"},
{"color": "green"},
{"color": "blue"}
]
} If something looks wrong I think it will be worth updating in place here so just let me know and I'll tweak it. Are the specific strings As noted in the provided link (thanks), internal tagging doesn't apply to things serialized as JSON arrays rather than objects. If we wanted we could make the first element be the tag I suppose. Side note, should we talk more about marshalling to Python dict/list/int/float/str/bool/None rather than serializing to JSON object/array/number/true/false/null? If we go big (instead of going home) and try to allow for multiple backends etc this seems maybe more relevant? I dunno, I don't have this big architecture plan in my head yet. Going big might also suggest it would be useful to implement the disambiguation in desert itself rather than delegating to a new implementation specific to each backend. Maybe... |
I'm guessing they come from the |
I think that's what we're doing already, since |
I may just mean that I've barely used desert yet, yeah. |
Edit: moved this idea into a new issue. |
Will it be easiest for us to use polyfield as a dependency to get this feature or should we do it independently? |
After a brief look, I think polyfield is focused on unioning over schemas, not other types. So while that'd work fine in the example above, it'd be awkward in the case of int, float, or other non-struct types. |
If I'm thinking about this right we need class AdjacentlyTaggedUnion(marshmallow.fields.Field): ...
class ExternallyTaggedUnion(marshmallow.fields.Field): ...
class InternallyTaggedUnion(marshmallow.fields.Field): ...
class UntaggedUnion(marshmallow.fields.Field): ... and to pick a default for what |
I think this is a feature which is missing from desert, so I really like where this is going, but I can't help thinking that this is blurring the lines between desert and marshmallow. Or to put it another way: Shouldn't really these fields belong in marshmallow or a variant thereof? My impression is that desert exists as a shim on top of marshmallow to allow for dataclass and attr, but rely on marshmallow for the actual serialization. Won't putting type fields on unions into desert make desert and marshmallow incompatible? What is the intended relationship between desert and marshmallow? |
I'm agnostic to the final decision on where this code should live — I'd be happy with anything that makes for working code. I just wanted to share this link, where as of about a year ago it seems to say that marshmallow has no plans to introduce a union field of any kind in the near future: I think @python-desert's suggested approach would produce field types that are compatible with marshmallow, and in principle could even be contributed upstream (or extracted into a separate marshmallow extension package) if everyone involved is amenable to that. So I think the first priority is getting to a working and stable implementation, since the paths from there onward don't seem to be too bad.
This looks great to me, and I've given the choice of default some thought. I'd like to make a case for adjacently-tagged to be the default. I think the default choice should always work, with minimum tweaking of configuration and other "if"s and "but"s.
The object inside This situation isn't possible with adjacently-tagged unions. There, the "type" key can only have one value, and the "value" key only holds the data for one value. While there may be additional, unexpected keys in the dict, this is also true of externally-tagged unions, so the same error-handling code has to exist in both cases. But the "multiple-variants" error case is impossible by design, and I'd argue that's a win we'd like for our default choice. So long as we pick an always-safe choice for the default, and allow users to consciously change that choice (if the new choice is also safe for their dataclasses) to get advantages like more compact representation, I think that will result in happy users. My team and I definitely will be happy 😄 |
It seemed that if I finished it up that Bachmann1234/marshmallow-polyfield#34, which added adjacently tagged as an explicit union option, would get accepted. I would expect they would accept the rest. adamboche/python-marshmallow-union#27 refers over to |
It could make sense to put these changes in a more narrowly marshmallow-focused library.
Nothing particularly jumps out at me. What's the incompatibility you're thinking of? |
Let's talk interface. thing: Union[A, B] will be treated like thing: Union[A, B] = AdjacentlyTaggedUnion(...) What's the signature? Edit: removed comments that didn't make sense. |
So yeah, there's the super flexible 'callable that decides how to deserialize' but that feels possibly excessively flexible to me. Though perhaps it is just a case of giving a default implementation that can be passed that doesn't involve that flexibility. I suspect that most of the time the user should have a registry of classes and their tags that are used to go back and forth between the two. That could be specific to the particular union but I would think that often it could just be 'global'. In graham I specify that in the class decorator. The registry could even allow for colliding names and only have an issue if the colliding classes end up used in an ambiguous place. Might want to have that be an opt-in loosening of the rules and error by default though. So signature? Let's see if refreshing myself as to what I did over in polyfield ends up creating a useful summary for others. You need (with nominal defaults as reference):
I reduced this to a class to schema mapping and a class to tag mapping that acted as an override for the default So if there's a registry, a parameter with it would simply be the signature for the fields? Or maybe leave it as a class-to-tag and a tag-to-class callable where normally people would specify |
There may be some room for reasonable defaults and avoiding registries (especially global ones) in many common cases. While classes' I have a mild preference for using |
As I mentioned above, I think we want to diverge from polyfield in that all of those "schema"s in @altendky's bullet list should be fields for us. That way we can support non-dataclass union entries. Do we just trust that python type<->field, python type <-> tag, field <-> tag are all one-to-one correspondences (bijective)? If they are, then it doesn't matter whether the tag comes from the python type or the field, but otherwise maybe it does. There are a bunch of "what if there are two _ for a given _" cases which I haven't thought through. Is it the type's job to determine the tag or the field's job? |
If my understanding of the bijective correspondences question is correct, it seems to me that it may be possible to locally (i.e. on a per-union-field basis) at schema construction time verify the bijective property for the default (type-provided) mappings, or request that the user provide a custom (i.e. field-provided + also verified) one otherwise. Does that sound reasonable, or am I missing some subtlety in the question? I admit I'm a newbie when it comes to marshmallow and especially desert — apologies for my ignorance. |
Suppose we're doing this for marshmallow independent of the dataclass logic. Do we allow AdjacentlyTaggedUnion([AField(), BField()], ...) or only AdjacentlyTaggedUnion({'A': AField(), 'B': BField()}, ...) |
Eh let's start with the latter, we can always add more stuff if we want to. |
A concrete implementation of |
So we do have one over in the polyfield PR, but yes, I was considering needing some code to help me think. If someone else wants to take a stab, go for it. I don't know offhand if it will be hours or days or... until I get to it. Otherwise, yeah, I'll probably get to it at some point. As to the registry and having a default, a piece of that idea is keeping all of the control in one place. If you can pass in a thing that has some logic and then the union field also provides some default logic then you have to go two places to figure out what is going on. If you put the default logic in a default registry then you only look at the registry. Also, the 'this set of "mappings" is consistent' check would be the responsibility of the registry object so the two callables passed to the field wouldn't be a thing the field would validate. It either works well or doesn't. |
Mmm, not the 'default registry' being the 'one place' but that using a registry allows it to implement any logic and any validity checks it wants and it is the registry's responsibility to do the mapping. But yeah, I can't completely convince myself that a global registry is an evil thing for someone to want to use. If we don't provide one they'll just make it themselves. But yeah, using a registry for this purpose and having a global one are still two separate discussions I think. |
I guess there's also the fact that we'll have multiple fields with, for 'normal' cases, the same needs in converting between the tag and the field. So that mapping functionality and any default behavior should in some way be separated from the fields. Of course, a registry class is only one of the ways to do that. |
Haven't reflected on #94 yet myself but it's code and it passes two whole tests. I am tempted to try to do this without custom stuff. Might be able to pull it off with pluck or such. |
I tried to go through #94 but I suspect my unfamiliarity with Since I understand this is early WIP code, I don't think I got a good sense of what the "real world" use of this code would look like. I found it difficult to untangle the test fixtures and test setup from the "the user would write this in the real world" bits that would allow me to have an informed opinion about the API and implementation. |
Yeah, this is both very early WIP and also a thing that may well end up in another library that desert would use. There is only one type presently tested and it was just making sure that when serialized it added the adjacently tagged layer and stripped it when deserializing. The registry is more or less the 'union', in a sense. It contains the type/tag->field mappings. I do feel like this implementation requires us knowing about some marshmallow 'internals' (that I'm presumably not handling completely now) and as such I do plan to spend some time trying to use 'higher level' marshmallow mechanisms. Maybe pluck, maybe method, maybe function... I need to look over them. https://marshmallow.readthedocs.io/en/stable/custom_fields.html |
@altendky |
I hit up against the question of how to handle objects to be serialized that have the same type but different contents. The committed examples are a pair of lists, one with strings and another with integers. The registry I wrote is dual keyed by the type and the tag with the field used for serialization as the value. Not a So, custom classes are relatively straightforward. You don't usually want to differentiate the serialization of |
Apologies for being MIA for the last week. If I understand correctly, the concern is that we may not know how to properly deserialize a list because we aren't sure what the type inside it is? I'm having a bit of trouble visualizing a situation where this arises. In my mind, we always have a schema, so if we are deserializing a
Similarly, you mention the concern around differentiating between lists and tuples, even though they are serialized the same. In my mind, the schema tells us whether the type we are trying to deserialize is Here's the top-level thing I'd like to make sure we're on the same page about: in the API we're all talking about, when deserializing we already expect to have the schema of the object we are attempting to deserialize, right? Put differently, we're aiming for something more akin to |
I'm a bit tired at the moment but I think you are thinking about deserialization, which seems like obviously the troublesome part, but I was talking about serialization. The schemas don't tell us what we have when deserializing, they tell us what we might have. The data has to tell us what we do have and will do so via the tags. But yes, we will have a field (not schema... I think...) for each tag that we need to deserialize 'from' a union. If we don't, we would raise an exception. This is what being explicit gets us and is certainly nice. But for serialization... it is maybe a bit different. What I have implemented presently doesn't look 'inside' anything when serializing. As such, As to the tuple vs. list question, that is again about serialization. If we have an attribute that is hinted But again, these are the most trivial of cases. Certainly we don't have to support every corner but I am trying to express the ones I think of so we can consider if there is a clean universal solution. Or at least decide on a clear and understandable way to handle them even if just raising an exception in some cases. |
|
For the most part Marshmallow doesn't validate anything on serialization. >>> marshmallow.fields.List(marshmallow.fields.Int()).serialize('x', {'x': (1,2,3)})
[1, 2, 3] |
But we need to choose how to serialize it. Via the field for |
Ok yeah. |
We can do this with two functions.
Alternatively, we could skip the "union member type" step altogether and just map directly "python object -> marshmallow field type/instance". I think this is harder but might be useful in some situations. Focusing on the first option, in the simple case where
it can be just: foo: Union[int, List[str]] This suggests we might want to do the registry thing to add more type -> field pairs. Assuming we won't always be able to automatically determine the best field type, user needs a way to explicitly specify (what amounts to) a function This could be done a few ways, e.g. in "parallel lists" style, using the default field type for foo: Union[int, List[str]] = \
Explicit.from_list([desert.DefaultField(), fields.List(fields.Str())]) In mapping style, using the defaults without the need for placeholders: foo: Union[int, List[str]] = Explicit.from_mapping({int: fields.Int()}) In function style: def dispatch(x):
# XXX Really the fields should be instantiated only once.
return fields.Int() if isinstance(x, int) else fields.List(fields.Str())
foo: Union[int, List[str]] = Explicit.from_function(dispatch) |
@altendky you are completely right, I was thinking about deserialization — now this all makes sense. Thank you for explaining 👍 @python-desert your "two functions" suggestion sounds good to me. I like the mapping style best, and the parallel-lists approach second best. I'm a bit worried that the function style might provide too much freedom, in a sense, and may end up being a foot-gun. Besides, the mapping style technically allows arbitrary functions via a custom I'd maybe even argue that unions where types are not disjoint (i.e. it's not clear which union variant a particular value belongs to) could even be a definition-time error and blow up long before serializing anything. However, I acknowledge that I might not have a full understanding of all use cases, so I think the proposed "user provides a function to disambiguate" approach is sound. |
So the existing registry in #94 ( I had started another registry to attempt to resolve the issues I mentioned, but got distracted. The |
I believe "object to hint" can be done by a predefined mapping with fallback to the if x in mapping.keys():
return mapping[x]
[hint] = {t for t in types if pytypes.is_of_type(x, t)}
return hint |
Predefined Are you thinking changing to the following by breaking class AdjacentlyTaggedUnion(marshmallow.fields.Field):
def __init__(
self,
*,
hint_from_object: typing.Callable[[typing.Any], Hint],
field_from_hint: typing.Callable[[Hint], TypeTagField],
from_tag: typing.Callable[[str], TypeTagField],
**kwargs,
): |
https://github.com/marshmallow-code/marshmallow-oneofschema Maybe a relevant thing to look at. Just glanced quickly. |
If you poke at pytypes, be wary. Present PyPI doesn't support 3.7+. Git master does better. |
Next funny business to consider. A |
So I think at some level it's simply going to be possible to have non-distinguishing hints. I guess it's up to the registry (the thing that figures out the hint to use etc, whatever) to decide if it should distinguish or not and even though we ought to provide a good option that'll be entirely configurable. The above could be handled by using an |
So, where are we at and what do we need to decide? #94 presently has a I think some above discussion was considering that maybe there should be three callables with an extra intermediate step? The next layer around The tests give each of those tagged union callables with methods from a It is presently failing due to So, where are we at and what do we need to decide? (sorry, couldn't help being a bit WET here |
I might work on this a bit. I've at least reread the discussion so far. Just mentioning in case anyone has updates on the topic from the past year. |
Sometimes it would be good to store a type identifier in union fields. This would be especially true for sparsely populated data and unions containing
str
(described below).Consider:
We then create some instances:
Theoretically that would serialize to (it doesn't seem to so there will presumably be another issue report):
Which leaves us wondering what any of the things are.
On the flip side, if we take an example in the tests where there is a field
t.Union[int, str]
then adjust it to include a custom class afterstr
such ast.Union[int, str, MyClass]
thenstr
will 'always' work andMyClass
instances will be serialized to astr
such as'MyClass()'
(well, for anattrs
class with such arepr()
anyways).tl;dr,
marshmallow-union
has some pretty significant limitations. I got myself confused and started creating a solution over inmarshmallow-polyfield
Bachmann1234/marshmallow-polyfield#34 which would introduce an extra layer to the serialization and include a string to designate the type of that element. No more guessing.Perhaps
desert
is interested in usingmarshmallow-polyfield
for that feature? Perhapsdesert
would rather the feature be added tomarshmallow-union
?In the mean time I'll try to find time to work my way through reporting and possibly fixing some other issues.
The text was updated successfully, but these errors were encountered: