Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RUST-1132 Implement DeserializeSeed for owned and borrowed raw documents #433

Merged
merged 9 commits into from
Oct 18, 2023

Conversation

isabelatkinson
Copy link
Contributor

@isabelatkinson isabelatkinson commented Sep 28, 2023

This PR adds a new SeededVisitor type to deserialize into raw documents/arrays using a single buffer. There are no tests added as the existing corpus/serde tests already provide coverage for deserializing into these types. I did update the test runners in this library and in the driver (for which I will make a separate PR) to deserialize test JSON into raw documents for good measure, which, in addition to adding a bit more testing of this implementation, should be a small performance improvement for spec tests.

Some very basic benchmarking shows a 50% speed improvement when deserializing a JSON object with 80 layers of object nesting into a RawDocumentBuf. I will follow up with some more thorough benchmarking.

EDIT: Further benchmarking against the deep_bson.json and flat_bson.json files generated for the BSON microbenchmarks consistently shows that this new implementation is about three times faster than the existing implementation when deserializing those files into a RawDocumentBuf.

@isabelatkinson isabelatkinson marked this pull request as ready for review October 2, 2023 20:09
@@ -370,3 +372,32 @@ impl Undefined {
}
}
}


#[derive(Debug, Deserialize)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are copied as-is from the original raw bson visitor

};

/// A visitor used to deserialize types backed by raw BSON.
pub(crate) struct OwnedOrBorrowedRawBsonVisitor;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The contents in this file are mostly copied as-is from src/raw/serde.rs; I will point out the differences below.

.into())
}

fn visit_seq<A>(self, seq: A) -> Result<Self::Value, A::Error>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method now uses the seeded visitor to deserialize a sequence.

let doc = RawDocument::from_bytes(bson).map_err(SerdeError::custom)?;
Ok(RawBsonRef::Array(RawArray::from_doc(doc)).into())
}
_ => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case also uses the seeded deserializer. The extjson logic above is the same (except for the models being moved to a different module).

fn finish_document(&mut self, index: usize) -> Result<(), String> {
self.buffer.push_byte(0);

let length_bytes = match i32::try_from(self.buffer.len() - index) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opted only to validate the cast to i32 here despite the fact that some nested types also calculate/append length bytes. If any of those nested lengths overflow, the larger length that includes them here will also overflow, so no need to double-check.

Some(element_type) => element_type,
None => {
// Remove the additional key and padding for the element that was not present.
self.buffer.drain(element_type_index..self.buffer.len());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't ideal but SeqAccess does not contain any has_next or required length method, so there's no way to know whether there's another element until next_element_seed is called, which has to happen after the key and element type byte are already appended.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only way around this I can think of would be to create a temporary buffer for the element type tag, key string, and value, and only append that temp to the main buffer if there actually is one. I'm pretty sure that the way you have it will be better performance-wise, though.

src/raw/serde/seeded_visitor.rs Show resolved Hide resolved
@isabelatkinson
Copy link
Contributor Author

isabelatkinson commented Oct 10, 2023

I've done some investigation into merging the visitors (which I think is a good idea), but I've run into a few problems that make this more complicated than expected. The crux of the issue is that the RawBsonVisitor and SeededVisitor are doing slightly different things: the former can take any value that can be turned into a RawBsonRef variant and represent it as such, but the latter (as currently designed) is only intended to visit maps and sequences at the top-level and treats any other type as a nested value. This difference can be bridged in most cases, but borrowed values present a challenge because of the metadata required in their byte representation within a raw document.

Take, for example, an attempt to deserialize a borrowed string into RawBsonRef. The RawBsonVisitor implementation has no problem doing this because it just creates a RawBsonRef::String with a reference to the string upon the call to visit_borrowed_str. However, if the SeededVisitor is used instead, a call to visit_borrowed_str will append length bytes and a null-terminator byte, which requires converting the string into an owned value.

This is definitely not an intractable problem. The SeededVisitor could differentiate between top-level and nested values for variants that can borrow data (as it currently does for binary) and return bytes/byte slices as such, and then the byte parser that interprets the bytes as RawBson/RawBsonRef could have branching logic based on whether the given value is top-level or nested within a document. My instinct here, though, is that implementing all of that special logic (plus the indirection we'd be adding by needing to go through bytes) isn't much of an improvement on the duplicated visitor situation that exists in this PR as-is. I'm also generally hesitant to add more logic in the BSON library that constructs or picks apart bytes for the sake of correctness and maintainability.

Wanted to sanity check here before sinking another day or two into this work, which has already taken me longer than expected 🙂 @abr-egn WDYT about not unifying the visitors right now and possibly doing so in the future as part of a larger cleanup of this library's internals?

@abr-egn
Copy link
Contributor

abr-egn commented Oct 11, 2023

Wanted to sanity check here before sinking another day or two into this work, which has already taken me longer than expected 🙂 @abr-egn WDYT about not unifying the visitors right now and possibly doing so in the future as part of a larger cleanup of this library's internals?

That makes sense to me. Thank you for taking the time to investigate this path!

Copy link
Contributor

@abr-egn abr-egn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Some(element_type) => element_type,
None => {
// Remove the additional key and padding for the element that was not present.
self.buffer.drain(element_type_index..self.buffer.len());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only way around this I can think of would be to create a temporary buffer for the element type tag, key string, and value, and only append that temp to the main buffer if there actually is one. I'm pretty sure that the way you have it will be better performance-wise, though.

Copy link
Collaborator

@kevinAlbs kevinAlbs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

I left a suggestion to consider replacing CowByteBuffer to simplify. But I do not consider it a necessary change to merge.

src/raw/serde/seeded_visitor.rs Show resolved Hide resolved
@isabelatkinson isabelatkinson merged commit 283ecb3 into mongodb:main Oct 18, 2023
5 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants