RUST-1132 Implement `DeserializeSeed` for owned and borrowed raw documents #433

isabelatkinson · 2023-09-28T22:41:02Z

This PR adds a new SeededVisitor type to deserialize into raw documents/arrays using a single buffer. There are no tests added as the existing corpus/serde tests already provide coverage for deserializing into these types. I did update the test runners in this library and in the driver (for which I will make a separate PR) to deserialize test JSON into raw documents for good measure, which, in addition to adding a bit more testing of this implementation, should be a small performance improvement for spec tests.

Some very basic benchmarking shows a 50% speed improvement when deserializing a JSON object with 80 layers of object nesting into a RawDocumentBuf. I will follow up with some more thorough benchmarking.

EDIT: Further benchmarking against the deep_bson.json and flat_bson.json files generated for the BSON microbenchmarks consistently shows that this new implementation is about three times faster than the existing implementation when deserializing those files into a RawDocumentBuf.

isabelatkinson · 2023-09-28T22:44:05Z

src/extjson/models.rs

@@ -370,3 +372,32 @@ impl Undefined {
        }
    }
 }
+
+
+#[derive(Debug, Deserialize)]


these are copied as-is from the original raw bson visitor

isabelatkinson · 2023-09-28T22:47:32Z

src/raw/serde/bson_visitor.rs

+};
+
+/// A visitor used to deserialize types backed by raw BSON.
+pub(crate) struct OwnedOrBorrowedRawBsonVisitor;


The contents in this file are mostly copied as-is from src/raw/serde.rs; I will point out the differences below.

isabelatkinson · 2023-09-28T22:49:30Z

src/raw/serde/bson_visitor.rs

+        .into())
+    }
+
+    fn visit_seq<A>(self, seq: A) -> Result<Self::Value, A::Error>


This method now uses the seeded visitor to deserialize a sequence.

isabelatkinson · 2023-09-28T22:50:24Z

src/raw/serde/bson_visitor.rs

+                let doc = RawDocument::from_bytes(bson).map_err(SerdeError::custom)?;
+                Ok(RawBsonRef::Array(RawArray::from_doc(doc)).into())
+            }
+            _ => {


This case also uses the seeded deserializer. The extjson logic above is the same (except for the models being moved to a different module).

isabelatkinson · 2023-09-29T20:13:24Z

src/raw/serde/seeded_visitor.rs

+    fn finish_document(&mut self, index: usize) -> Result<(), String> {
+        self.buffer.push_byte(0);
+
+        let length_bytes = match i32::try_from(self.buffer.len() - index) {


I opted only to validate the cast to i32 here despite the fact that some nested types also calculate/append length bytes. If any of those nested lengths overflow, the larger length that includes them here will also overflow, so no need to double-check.

isabelatkinson · 2023-09-29T20:15:39Z

src/raw/serde/seeded_visitor.rs

+                Some(element_type) => element_type,
+                None => {
+                    // Remove the additional key and padding for the element that was not present.
+                    self.buffer.drain(element_type_index..self.buffer.len());


This isn't ideal but SeqAccess does not contain any has_next or required length method, so there's no way to know whether there's another element until next_element_seed is called, which has to happen after the key and element type byte are already appended.

The only way around this I can think of would be to create a temporary buffer for the element type tag, key string, and value, and only append that temp to the main buffer if there actually is one. I'm pretty sure that the way you have it will be better performance-wise, though.

src/raw/serde/seeded_visitor.rs

isabelatkinson · 2023-10-10T23:02:57Z

I've done some investigation into merging the visitors (which I think is a good idea), but I've run into a few problems that make this more complicated than expected. The crux of the issue is that the RawBsonVisitor and SeededVisitor are doing slightly different things: the former can take any value that can be turned into a RawBsonRef variant and represent it as such, but the latter (as currently designed) is only intended to visit maps and sequences at the top-level and treats any other type as a nested value. This difference can be bridged in most cases, but borrowed values present a challenge because of the metadata required in their byte representation within a raw document.

Take, for example, an attempt to deserialize a borrowed string into RawBsonRef. The RawBsonVisitor implementation has no problem doing this because it just creates a RawBsonRef::String with a reference to the string upon the call to visit_borrowed_str. However, if the SeededVisitor is used instead, a call to visit_borrowed_str will append length bytes and a null-terminator byte, which requires converting the string into an owned value.

This is definitely not an intractable problem. The SeededVisitor could differentiate between top-level and nested values for variants that can borrow data (as it currently does for binary) and return bytes/byte slices as such, and then the byte parser that interprets the bytes as RawBson/RawBsonRef could have branching logic based on whether the given value is top-level or nested within a document. My instinct here, though, is that implementing all of that special logic (plus the indirection we'd be adding by needing to go through bytes) isn't much of an improvement on the duplicated visitor situation that exists in this PR as-is. I'm also generally hesitant to add more logic in the BSON library that constructs or picks apart bytes for the sake of correctness and maintainability.

Wanted to sanity check here before sinking another day or two into this work, which has already taken me longer than expected 🙂 @abr-egn WDYT about not unifying the visitors right now and possibly doing so in the future as part of a larger cleanup of this library's internals?

abr-egn · 2023-10-11T14:50:58Z

Wanted to sanity check here before sinking another day or two into this work, which has already taken me longer than expected 🙂 @abr-egn WDYT about not unifying the visitors right now and possibly doing so in the future as part of a larger cleanup of this library's internals?

That makes sense to me. Thank you for taking the time to investigate this path!

abr-egn

LGTM!

abr-egn · 2023-10-11T15:30:03Z

src/raw/serde/seeded_visitor.rs

+                Some(element_type) => element_type,
+                None => {
+                    // Remove the additional key and padding for the element that was not present.
+                    self.buffer.drain(element_type_index..self.buffer.len());


The only way around this I can think of would be to create a temporary buffer for the element type tag, key string, and value, and only append that temp to the main buffer if there actually is one. I'm pretty sure that the way you have it will be better performance-wise, though.

kevinAlbs

LGTM.

I left a suggestion to consider replacing CowByteBuffer to simplify. But I do not consider it a necessary change to merge.

src/raw/serde/seeded_visitor.rs

isabelatkinson added 9 commits September 28, 2023 16:39

strings and maps

5bb1d53

numbers

22e8bad

basic arrays

6014b38

extjson

cacb377

integrate with raw bson visitor

d2e464e

format

c9b7b67

cleanup

a29f1da

fix bug, update test runner

beb3df6

fix extjson

bf0686a

isabelatkinson marked this pull request as ready for review October 2, 2023 20:09

isabelatkinson commented Oct 2, 2023

View reviewed changes

isabelatkinson requested review from abr-egn and kevinAlbs October 2, 2023 20:42

abr-egn approved these changes Oct 11, 2023

View reviewed changes

isabelatkinson mentioned this pull request Oct 11, 2023

Stackoverflow via raw_deserialize fuzz target found by oss-fuzz #385

Open

kevinAlbs approved these changes Oct 13, 2023

View reviewed changes

src/raw/serde/seeded_visitor.rs Show resolved Hide resolved

isabelatkinson merged commit 283ecb3 into mongodb:main Oct 18, 2023
5 of 7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RUST-1132 Implement `DeserializeSeed` for owned and borrowed raw documents #433

RUST-1132 Implement `DeserializeSeed` for owned and borrowed raw documents #433

isabelatkinson commented Sep 28, 2023 •

edited

Loading

isabelatkinson Sep 28, 2023

isabelatkinson Sep 28, 2023

isabelatkinson Sep 28, 2023

isabelatkinson Sep 28, 2023

isabelatkinson Sep 29, 2023

isabelatkinson Sep 29, 2023

abr-egn Oct 11, 2023

isabelatkinson commented Oct 10, 2023 •

edited

Loading

abr-egn commented Oct 11, 2023

abr-egn left a comment

abr-egn Oct 11, 2023

kevinAlbs left a comment

RUST-1132 Implement DeserializeSeed for owned and borrowed raw documents #433

RUST-1132 Implement DeserializeSeed for owned and borrowed raw documents #433

Conversation

isabelatkinson commented Sep 28, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

isabelatkinson commented Oct 10, 2023 • edited Loading

abr-egn commented Oct 11, 2023

abr-egn left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kevinAlbs left a comment

Choose a reason for hiding this comment

RUST-1132 Implement `DeserializeSeed` for owned and borrowed raw documents #433

RUST-1132 Implement `DeserializeSeed` for owned and borrowed raw documents #433

isabelatkinson commented Sep 28, 2023 •

edited

Loading

isabelatkinson commented Oct 10, 2023 •

edited

Loading