Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid lossy buffering in #[serde(flatten)] #2186

Open
sfackler opened this issue Mar 9, 2022 · 12 comments
Open

Avoid lossy buffering in #[serde(flatten)] #2186

sfackler opened this issue Mar 9, 2022 · 12 comments

Comments

@sfackler
Copy link
Contributor

sfackler commented Mar 9, 2022

The Deserialize implementation of a type with a field annotated #[serde(flatten)] currently buffers up all of the unknown map entries in memory, and then deserializes the flattened value from them at the end. This approach has a couple of downsides:

  1. It causes the deserializer to allocate when that would normally not be the case.
  2. It confuses "smart" deserializers - things like serde_ignored_value or serde_json's error position tracking don't work properly.
  3. It can lose out on special casing in deserializers based on hinting.

Fortunately, there's alternate approach that avoids all of these issues! Instead of buffering, we can create a custom MapAccess which intercepts fields that we want to handle in the top-level type, and then returning the remainder to the nested deserialize implementation: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=96546d7ba882b70042268a0b48de6286

@jonasbb
Copy link
Contributor

jonasbb commented Mar 9, 2022

This is a very interesting idea.
I have a couple of remarks / incompatibility observations. Maybe some of them can be resolved.

Currently, the Foo type ultimately decides if a field name is known or unknown. This leads to an incompatibility currently, if deny_unknown_fields is added to Foo. With the normal flatten extra fields will be ignored, since ExtendedFoo allows them.
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=4e249bb1961ff44945606ed6bdcdab7d

I think that can be fixed by adding an Ignore variant to ExtendedFooFields, similar to the current derive. An unknown field error from the seed of ExtendedFooFieldsVisitor must be converted into the Ignore variant, but all other error likely should remain. Inspecting the errors might be complicated, since they are very opaque.

It is unclear to me how this could be extended to more than one flattened field.
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=7451b4980894d2ddda33b7ce3fc0aa9c
An additional complication here is the fact, that the same value can be deserialized multiple times. A map visitor (i.e., Value) will not consume the fields from the internal buffer, but a struct visitor (i.e., Sub) will. This leads the second Value to share all fields with the first except those consumed by the Sub visitor.

Output
[src/main.rs:27] d = Data {
    f0: Object({
        "x": String(
			"foo",
        ),
        "y": String(
            "bar",
        ),
        "z": String(
            "baz",
        ),
    }),
    a: 123,
    f1: Sub {
        y: "bar",
    },
    b: 456,
    f2: Object({
        "x": String(
            "foo",
        ),
        "z": String(
            "baz",
        ),
    }),
}

@sfackler
Copy link
Contributor Author

sfackler commented Mar 9, 2022

An unknown field error from the seed of ExtendedFooFieldsVisitor must be converted into the Ignore variant, but all other error likely should remain. Inspecting the errors might be complicated, since they are very opaque.

We should be able to handle this by making a custom error type used by the MapAccess implementation that intercepts unknown field errors and delegates the rest to the inner MapAccess's error type.

It is unclear to me how this could be extended to more than one flattened field.

Wow, I was not aware that was a thing you could do! I think you are stuck buffering there to allow the key value to be deserialized multiple times unfortunately. We could continue using the old logic in that case.

@sfackler
Copy link
Contributor Author

sfackler commented Mar 9, 2022

Hmm - we can't actually filter the errors since the error will terminate the deserialization of the inner value entirely...

EDIT: Ah - we may be able to take an approach something like what FlatStructAccess does of remembering the delegate's field list and checking that in our ExtendedFooFieldsSeed.

@sfackler
Copy link
Contributor Author

sfackler commented Mar 9, 2022

On the other hand, the docs on serde.rs do explicitly state that flatten isn't compatible with deny_unknown_fields: https://serde.rs/field-attrs.html#flatten.

@sfackler
Copy link
Contributor Author

Here's a working implementation of that approach: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=b9b5b874c6f4c93d6d82c1bed4e6451e. If you did want to allow unknown fields in ExtendedFoo, we'd codegen the extra Ignored variant for ExtendedFooFields and return that instead of an error in ExtendedFooFieldsSeed.

@Mingun
Copy link
Contributor

Mingun commented Mar 10, 2022

I think, that using a json is a bad example to demonstrate absence of "lossy deserialization", because JSON contains some information about value type. Try some untyped format, like XML (but there are no good XML serde adapter, so using some toy format would be better)

@Lucretiel
Copy link

Lucretiel commented Mar 10, 2022

I coincidentally made a fully working implementation of this this evening before even seeing this thread! You don't need to depend at all on looking for particular deserialize errors / missing_field notifications from the flattened inner type, because you already know ahead of time what the names of the fields are that you're interested in. You can minimize codegen by creating a new, private trait (I called mine KeyCapture):

pub enum KeyOutcome<T> {
    Accepted(T),
    Rejected,
}

pub trait KeyCapture<'de> {
    // Returned from `try_send_key` to communicate state to `send_value`.
    type Token;

    fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result;

    fn try_send_key<E>(&mut self, key: &[u8]) -> Result<KeyOutcome<Self::Token>, E>
    where
        E: de::Error;

    fn send_value<D>(&mut self, token: Self::Token, value: D) -> Result<(), D::Error>
    where
        D: de::Deserializer<'de>;
}

In the struct macro codegen, you implement this trait such that it "rejects" keys that are not part of the struct; these keys can then be transparently forwarded to the type being flattened. Once you have this trait, you can use a series of private, reusable adapter types to deserialize the inner flattened field while simultaneously sending data into the KeyCapture. Some snippets:

// From MapAccess:
    fn next_key_seed<K>(&mut self, mut seed: K) -> Result<Option<K::Value>, Self::Error>
    where
        K: de::DeserializeSeed<'de>,
    {
        // We're extracting a key for the type. Try to send keys to the `KeyCapture`,
        // and the first key that is rejected by the `KeyCapture` are is instead to the
        // `seed`
        loop {
            seed = match self.map.next_key_seed(ExtractKeySeed {
                seed,
                capture: &mut self.capture,
            })? {
                None => {
                    self.drained = true;
                    return Ok(None);
                }
                Some(ExtractKeySeedResult::Deserialized(value)) => return Ok(Some(value)),
                Some(ExtractKeySeedResult::Captured { seed, token }) => {
                    self.map.next_value_seed(ExtractValueSeed {
                        token,
                        capture: &mut self.capture,
                    })?;

                    seed
                }
            };
        }
    }

// From Visitor:
    fn visit_str<E>(self, v: &str) -> Result<Self::Value, E>
    where
        E: de::Error,
    {
        match self.capture.try_send_key(v.as_bytes())? {
            KeyOutcome::Accepted(token) => Ok(ExtractKeySeedResult::Captured {
                seed: self.seed,
                token,
            }),
            KeyOutcome::Rejected => self
                .seed
                .deserialize(v.into_deserializer())
                .map(ExtractKeySeedResult::Deserialized),
        }
    }

Here's what an example codegen might look like for a type called Data2:

struct Data2 {
    key3: String,

    // #[serde(flatten)]
    inner: Data1,
}

impl<'de> de::Deserialize<'de> for Data2 {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: serde::Deserializer<'de>,
    {
        enum Field {
            key3
        }

        #[derive(Default)]
        struct Data2Capture {
            key3: Option<String>,
        }

        impl<'de, 'a> KeyCapture<'de> for &'a mut Data2Capture {
            type Token = Field;

            fn expecting(&self, formatter: &mut std::fmt::Formatter) -> std::fmt::Result {
                write!(formatter, "a Data2 struct")
            }

            #[inline]
            fn try_send_key<E>(&mut self, key: &[u8]) -> Result<KeyOutcome<Field>, E>
            where
                E: de::Error,
            {
                Ok(match key {
                    b"key3" => KeyOutcome::Accepted(Field::key3),
                    _ => KeyOutcome::Rejected,
                })
            }

            #[inline]
            fn send_value<D>(&mut self, token: Self::Token, value: D) -> Result<(), D::Error>
            where
                D: de::Deserializer<'de>,
            {
                match token {
                    Field::key3 => de::Deserialize::deserialize(value).map(|value| {
                        self.key3 = Some(value);
                    }),
                }
            }
        }

        let mut capture = Data2Capture::default();

        let inner = Data1::deserialize(ExtractDeserializer {
            deserializer,
            capture: &mut capture,
        })?;

        let key3 = match capture.key3 {
            Some(key3) => key3,
            None => return Err(de::Error::missing_field("key3")),
        };

        Ok(Self { inner, key3 })
    }

I was additionally able to determine that it is outright impossible (without buffering) to have more than one flattened field. In order to pull that off, you'd need to implement Deserialize for this primitive in a way that spreads the keys out to both fields:

struct MapPair<T1, T2> { pub map1: T1, pub map2: T2 }

You can put together a series of wrapper types and get pretty far, but eventually you end up at this point:

    // We have a pair of visitors, visitor1 and visitor2. We build our own sequence type and 
    // visit it to sequence 1:
    fn next_key_seed<K>(&mut self, mut seed: K) -> Result<Option<K::Value>, Self::Error>
    where
        K: de::DeserializeSeed<'de>
    {
        // `seed` is the receiver for `map1`'s value, and we have a `visitor2` lying around.
        // At this point, we need to loop over the internal sequence type to try to send keys
        // to `visitor2`, and detect an `unrecognized_field` error to feed to the seed.
        // Unfortunately the only interface we have to `visitor2` is `visit_seq`, which must
        // build a completed `map2`. We're at an impasse.
    }

It's basically the same reason that you can't unpack an Iterator of (T1, T2) into a pair of types that are FromIterator<T1>, FromIterator<T2>.

@Lucretiel
Copy link

Completed implementation for consideration: https://github.com/Lucretiel/serde-bufferless

@Mingun
Copy link
Contributor

Mingun commented Mar 12, 2022

I think it is better to fix this generically, not just for the case with one flattened field, because in that case it creates a friction when users realising that small change in definitions of their types will broke deserialization.

To achieve that, changes only on deserializer side is not enough. I work on this problem, and you can see my unfinished work at master...Mingun:flatten. The actual changes begins from 1071b39, the other is a prepare work.

The key idea in that #[derive(Deserialize)] implements additional trait DeserializeEmbed on types that could be flattened. This approach solves many problems:

  • you no longer can add #[serde(flatten)] on fields of non-flattanable types, such as String, because DeserializeEmbed is not implemented for strings (see eac1566)
  • you now deserialize each field only once (not consuming flattened fields from deserializer is a mistake)
  • you can pass flattened field to any level (when you flatten in flatten in flatten...)
  • you no longer need a hack for support Option in untagged enums:

    serde/serde/src/de/mod.rs

    Lines 1672 to 1679 in c3ce2c9

    // Used when deserializing a flattened Option field. Not public API.
    #[doc(hidden)]
    fn __private_visit_untagged_option<D>(self, _: D) -> Result<Self::Value, ()>
    where
    D: Deserializer<'de>,
    {
    Err(())
    }
  • I expect that almost all flatten problems will be solved by this

I do not remember the exact status of readiness of this, but I remember that only one of the enum representation options remained to be implemented (seems untagged representation).

Chosen approach has downsides, but their are not so big:

  • the main downside is that flattenable/non-flattenable is a trait of a type, not a deserialize operation, but DeserializeEmbed derivered on operation basis. This is the fundamental restriction, because I trying to make this invasive as minimal as I possible and I almost approached that: except two scenarios, described below the end-user don't need to change in their code nothing
  • strictly speaking, this is a breaking changes, because your code could stop to compile after this change. But serde anyway doesn't follow semver, so the users should be prepare for that (of course, I supposed to include this only in serde 2 and start to follow semver).

The two scenarios is following:

  1. User implement Deserialize manually and that type is flattened into another type. I think this is very rare case
  2. User try to flatten field of type that cannot be flattened, such of String example above. This is in any case an error in the user code that anyway should be fixed, because currently such types will always deserialized with an error which, obviously, not what the user want. Unfortunately, because DeserializeEmbed implemented only in #[derive(Deserialize)], having only #[derive(Serialize)] on the type still allow you to create an erroneous type. I don't know, is it possible to automatically implement some trait twice from different derives without ambiguity, which is by good required here

@RReverser
Copy link

Ha, reinvented the same wheel (deserializing with just one flat field without buffers) recently and only now discovered this thread. I suppose I should try to switch to serde-bufferless linked above instead.

@RReverser
Copy link

@Lucretiel Any chance you'd want to add a proc-macro to your repo and publish it to crates.io?

@palant
Copy link

palant commented Jun 1, 2024

I’ve taken a similar approach to @Mingun, this issue cannot really be solved without API changes. The deserializers need to expose what fields they can take, and they need to accept fields individually rather than an entire map. I’ve added a DeserializeMap trait extending Deserialize for map-based types as well as the corresponding MapVisitor trait which actually collects the values:

pub trait DeserializeMap<'de>: Deserialize<'de> {
    type Visitor: MapVisitor<'de, Value = Self>;
    fn visitor() -> Self::Visitor;
}

pub trait MapVisitor<'de> {
    type Value;
    fn accepts_field(field: &str) -> bool;
    fn list_fields(list: &mut Vec<&'static str>);
    fn visit_field<D>(self, field: &str, deserializer: D) -> Result<Self, D::Error>
    where
        Self: Sized,
        D: serde::de::Deserializer<'de>;
    fn finalize<E>(self) -> Result<Self::Value, E>
    where
        E: serde::de::Error;
}

Note: While list_fields method doesn’t follow the zero allocation paradigm, it’s only there for nicer error messages.

The Deserialize trait can be implemented generically (as in: blanket implementation would have been possible if not for foreign types) for any data structure implementing DeserializeMap. And a DeserializeMap implementation can delegate processing of individual keys to other DeserializeMap implementations as needed.

Implementation here: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=ce2619418043d821e8bc67d1fc8609a4

I’ve got derive macros implemented as well but these are specialized for my use case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

6 participants