Change ByteString to use memory and support unsafe create without copy #7645
Interesting idea. I think we should come up with a more complete design that also covers how to efficiently parse messages that contain lots of (potentially large) byteStrings. Btw, Java has a concept of "aliasing" when parsing
Before doing that I want to make sure that we're ok with the concept of a non-copied ByteString. I saw a comment from @jskeet on a PR that made "Unsafe" internal that content should always be copied. |
I'm pretty nervous about this, because it means I'd definitely want the protobuf team to weigh in on this, e.g. @haberman and @anandolee. |
What scenarios is an immutable API enabling and who are we trying to protect here? When it comes to security, the POV on the .NET team is if someone can execute code on the server there it is already game over. For example, when we expose I view |
All the normal benefits of immutability. I can reason about my code with confidence that the [
Whereas I view it more like This isn't a matter of whether someone can do things maliciously - it's more about what I can rely on. If I call a method from a library that returns a |
The documentation on
If someone is modifying the memory after giving the |
Again, would you feel happy introducing
They may be misusing it deliberately, or they may be doing so accidentally. That's the problem: currently, I can trust that a |
Your trust is based on people using the public API and not using reflection. What about trusting that UnsafeFromBytes will be used correctly 😄 I can have an API that returns I'm not familiar with all the arguments why string must always be immutable. That it is commonly used as a key in hashtables perhaps? If you'd like I could ask a BCL expert to see whether those same restrictions need to apply to something like ByteString.
And again, that's true for strings as well. You can use unsafe code to mutate a string - but it's obvious when you do that, and it would be malicious. The public API protects against accidental mutation at the moment. Strings being immutable makes them much, much easier to work with. I can validate them as a precondition, and know that the precondition holds later on. Immutable types are simply easier to reason about. Sure, it also means they can be used as keys in hashtables etc, but I regard that as less important than the ability to reason about them. (As an example of how it's acknowledged to be a problem, note how
Again, I'd say that @haberman and @anandolee are the ultimate arbiters here, but the more comments I've written here, the more strongly opposed to it I am. |
We're only talking about giving the creator of the |
Agreed, but that's still pretty significant IMO. Anyway, I've made it pretty clear I'd prefer not to have this. I don't think we're going to agree on this, so let's see what @haberman and @anandolee think. |
Java Protobuf has this - https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/UnsafeByteOperations.html I'm happy renaming |
Okay, if there's precedent in Java then I'm significantly less worried by it. I still think it can lead to really nasty and hard-to-diagnose situations (as described in that Javadoc) but I don't expect it to be any worse in .NET than in Java. |
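The aliasing concern debated above can be sketched in a few lines. This is an illustration only, assuming a Google.Protobuf package version that ships `UnsafeByteOperations.UnsafeWrap`; the class and method names (`AliasingSketch`, `MutateAfterCreate`) are hypothetical and not part of the PR.

```csharp
using System;
using Google.Protobuf; // assumption: a version that includes UnsafeByteOperations

public static class AliasingSketch
{
    // Hypothetical helper: returns the first byte of each ByteString
    // after the source array has been modified.
    public static (byte Copied, byte Wrapped) MutateAfterCreate()
    {
        byte[] source = { 1, 2, 3 };

        // CopyFrom allocates a new array, so later writes to 'source' are not visible.
        ByteString copied = ByteString.CopyFrom(source);

        // UnsafeWrap keeps a reference to 'source': no allocation and no copy.
        ByteString wrapped = UnsafeByteOperations.UnsafeWrap(source);

        // The caller breaks the "do not modify after wrapping" contract.
        source[0] = 42;

        return (copied[0], wrapped[0]);
    }
}
```

The copied instance still reads 1 while the wrapped one reads 42, which is the loss of the immutability guarantee that the comments above are weighing against the saved allocation.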
Updated to mirror Java's |
Benchmark:
I think the code changes look fine (I added some comments), but I'm concerned about whether this is actually useful. Along the lines of what's mentioned here #4206 (comment), there are basically these situations:
- NOT USEFUL: deserializing from an IBufferReader / Span. This is not useful because the buffer is not owned by us and we don't have control over when it gets released.
- NOT USEFUL AS IS: deserializing from a buffer we own: if the buffer we own is represented by a ByteString, we could potentially use aliasing to reference the underlying buffer of the top level ByteString and create instances of ByteString without copying. The problem is that we'd probably need a change to the parser to accommodate this. This PR by itself wouldn't suffice.
- USEFUL BUT LEADS TO QUESTIONABLE USAGE PATTERNS: serializing a "manually created" byte string: we can create byte strings from a rented buffer when creating messages we want to serialize, and then make sure the message is serialized before we release the rented buffer. The issue is that this pattern feels clumsy and relies on operations being done in the right order, otherwise we can easily end up with corrupt data. While the performance gain from doing this might be quite tempting in some scenarios, it feels more like a hack than a good practice, so I don't think this is something I would recommend to users (and if the technique enabled by this PR is not something we feel is worth recommending, then why do it at all?).
I think overall I'm against accepting this PR unless we can come up with a list of use cases that lead to clear performance gains and can be implemented with "clean" code snippets that don't feel too hacky or fragile and which we would feel confident recommending to regular users (I know this is relative but the code example presented in the PR description as "After" doesn't feel that way to me). I don't think this is worth the work if the goal is to only unlock performance gains for a small group of power users that are willing to resort to hacks in order to get there. Instead I'd like to get to some solution that allows most users to benefit from a performance gain and achieve that without making their code significantly more complex/fragile.
Btw, in connection with gRPC, the "After: rented array" pattern from the PR description would be even more complicated because when writing messages, one needs to first await the WriteAsync operation (or the call) before releasing the rented buffer, which further complicates the logic.
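For readers following the "rented buffer" scenario, here is a minimal sketch of the ordering constraint it imposes. It is not the code from the PR description; it assumes a Google.Protobuf version that includes `UnsafeByteOperations`, uses the well-known `BytesValue` message as a stand-in for a generated message, and the names `RentedBufferSketch` and `SerializeFromRentedBuffer` are hypothetical.

```csharp
using System;
using System.Buffers;
using Google.Protobuf;                // assumption: version with UnsafeByteOperations
using Google.Protobuf.WellKnownTypes; // BytesValue stands in for a generated message

public static class RentedBufferSketch
{
    // Hypothetical helper: wraps a rented buffer, serializes, then returns the buffer.
    public static byte[] SerializeFromRentedBuffer(int payloadLength)
    {
        byte[] rented = ArrayPool<byte>.Shared.Rent(payloadLength);
        try
        {
            // Placeholder payload: fill only the part of the buffer that is used.
            rented.AsSpan(0, payloadLength).Fill(0x7F);

            var message = new BytesValue
            {
                // No copy: the ByteString references the rented array directly.
                Value = UnsafeByteOperations.UnsafeWrap(rented.AsMemory(0, payloadLength))
            };

            // Serialization must complete before the buffer goes back to the pool.
            return message.ToByteArray();
        }
        finally
        {
            // After this point the message must not be read or serialized again.
            ArrayPool<byte>.Shared.Return(rented);
        }
    }
}
```

The `try`/`finally` keeps the order explicit in the synchronous case; with an asynchronous write, the return to the pool would have to wait until the write has been awaited, which is the extra complication noted above.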
public sealed class ByteString : IEnumerable<byte>, IEquatable<ByteString>
{
    private static readonly ByteString empty = new ByteString(new byte[0]);

#if GOOGLE_PROTOBUF_SUPPORT_SYSTEM_MEMORY
As mentioned in the comments, looks like GOOGLE_PROTOBUF_SUPPORT_SYSTEM_MEMORY is now always true.
I think the right order of actions is to first perform cleanup and then add new functionality.
Can we first have a PR that removes GOOGLE_PROTOBUF_SUPPORT_SYSTEM_MEMORY for the C# codebase altogether (I think now it's the right time to do that) and only then modify this PR so we only have one version of the logic in ByteString?
Also this PR is now rebased on top of it.
/// </list>
/// </remarks>
[SecuritySafeCritical]
public static class UnsafeByteOperations
nit: I feel like we don't necessarily have to adopt the naming used in protobuf Java (IMHO the naming Java uses isn't very intuitive), but I don't feel strongly about this.
The problem I have with the Java naming is that "UnsafeByteOperations" sounds very general (and sounds more like it would contain functionality for casting bytes into values of a given type etc.), but in reality it's only about unsafe byte string operations (which seems to be a very narrow subset of "byte operations"). But as said, I'm not feeling too strongly about this and perhaps I'm misunderstanding the spirit of the "UnsafeByteOperations" name.
I believe that moving the UnsafeWrap method to the previously existing Unsafe class inside ByteString is a good way to go.
Nested classes aren't typical in most public APIs.
Having advanced methods on their own type is a common pattern in .NET APIs. For example, MemoryMarshal and Unsafe in .NET Core have advanced methods for working with types and memory.
I think this name is fine and the consistency with Java, which also has ByteString, is nice.
/// </summary>
internal static class Unsafe
internal static ByteString AttachBytes(byte[] bytes)
Why rename this method and remove the "Unsafe" class? Creating a ReadOnlyMemory from the byte array doesn't make this method any safer, because the caller can still modify the original array and we need to trust them not to do so, which is why the method was marked as "Unsafe" (and it should continue being so).
I didn't rename the method. I completely removed ByteString.Unsafe.FromBytes. It was internal and nothing used it.
ByteString.AttachBytes(byte[]) already existed.
}

[Benchmark]
public ByteString UnsafeWrap()
TBH I don't think this benchmark class brings much value the way it is. What it does is basically measure how much time it takes to copy a byte array of a given size (which takes a long time and allocates memory, but everyone knows that) and the time it takes to create a ReadOnlyMemory from bytes. Comparing these two numbers isn't very useful beyond the obvious "first one is very slow and the latter is very fast".
What would be more interesting is comparison of the original ByteString.Unsafe.FromBytes vs new ByteString(ReadOnlyMemory<>) and the original CreateFrom vs the new CreateFrom (which has now a different implementation) - at least we would have a comparison of whether this PR causes any regressions (if so, they would probably be minor, but we should check anyways).
Comparing these two numbers isn't very useful beyond the obvious "first one is very slow and the latter is very fast".
Well, the change is basically from doing lots of work to doing nothing. Before, creating a byte string involved creating a new array, copying all the data, and later garbage collecting the array. Now a byte string can be created from data that you already have available: it just references the existing data.
All this change impacts is creating a ByteString that you then want to serialize. This is an alternative to ByteString.CopyFrom. Comparing ByteString.CopyFrom and UnsafeByteOperations.UnsafeWrap is the most accurate comparison. Slow to fast is why people want this.
What would be more interesting is comparison of the original ByteString.Unsafe.FromBytes vs new ByteString(ReadOnlyMemory<>) and the original CreateFrom vs the new CreateFrom (which has now a different implementation)
Nothing used ByteString.Unsafe.FromBytes. I don't know what you mean by CreateFrom. There isn't an API called that.
I could update the benchmark to create a ByteString and then serialize it if you want, but the time to serialize it will be the same. It will just add noise to the original number that compares creating a ByteString.
@@ -411,23 +471,55 @@ public bool Equals(ByteString other)
/// </summary>
internal void WriteRawBytesTo(CodedOutputStream outputStream)
{
#if GOOGLE_PROTOBUF_SUPPORT_SYSTEM_MEMORY
    if (MemoryMarshal.TryGetArray(bytes, out ArraySegment<byte> segment))
To avoid possible redundant allocations I suggest adding a span-based WriteRawBytes to CodedOutputStream (internally ref Span<byte> is being used).
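The allocation concern can be illustrated with a small sketch that uses only base class library types. It is not the `CodedOutputStream` code from the PR: a plain `Stream` stands in for the output, and the names `WriteMemorySketch` and `WriteTo` are hypothetical. The fast path mirrors the `MemoryMarshal.TryGetArray` check in the hunk above; the fallback is where a redundant allocation would occur.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

public static class WriteMemorySketch
{
    // Hypothetical helper: writes the bytes to 'output', copying only when
    // the memory is not backed by a managed array.
    public static void WriteTo(Stream output, ReadOnlyMemory<byte> bytes)
    {
        if (MemoryMarshal.TryGetArray(bytes, out ArraySegment<byte> segment))
        {
            // Array-backed memory: write the underlying array directly, no copy.
            output.Write(segment.Array, segment.Offset, segment.Count);
        }
        else
        {
            // Memory backed by something else (for example a custom MemoryManager):
            // this path allocates a temporary array.
            byte[] copy = bytes.ToArray();
            output.Write(copy, 0, copy.Length);
        }
    }
}
```

A span-based write method, as suggested in the comment above, would presumably make the fallback branch unnecessary because a span can be read regardless of what backs the memory.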
This method is internal and nothing used it. I've removed WriteRawBytesTo entirely.
Left a few comments but otherwise I think this is looking fine.
I'm not a fan of this new API, but it seems there's demand for it and there's no impact on users that don't want to use this.
LGTM.
@jtattermusch - when can we expect a gRPC release with this feature? |
In the next release (we missed the 3.14 release branch cut) - unfortunately I'm not sure when that is going to be. @acozzette does the protobuf team have a release schedule for the open source releases yet?
We don't have a set schedule for releases, but they usually happen once every quarter or so. |
Glad to see the unsafe helper is making its way back. @jskeet slapped me for asking for it a few years ago :) |
+1, I would love to know if there is a release date for this feature!
@jonso4, 3.15.0-rc1 is out with the |
gRPC won't use this API. It is used by you in your apps when instantiating your Protobuf generated messages. If you're creating large ByteStrings then this might be something you can use to avoid allocations and copies. |
This does seem to break those apps (For example: https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry.Exporter.OpenTelemetryProtocol/Implementation/ActivityExtensions.cs#L346) they will have to update their code (that line creates a |
And that's why you don't use reflection to call private APIs. The bad news is they're broken, the good news is they can use this new public API. |
Fixes #4206
Change ByteString to use a ReadOnlyMemory<byte> field, and provide an overload that allows ByteString to be created without allocating and copying memory.
Before:
After:
I used ByteString.Unsafe.FromBytes(bytes) because it was already there. It was internal and never used. Happy to get rid of the nested class and change to ByteString.UnsafeFromBytes(bytes). Unsafe prefix is the standard in .NET runtime, e.g. https://source.dot.net/#System.Private.CoreLib/Assembly.cs,8536a7569220f81f,references
I went ahead and changed it to UnsafeByteOperations.UnsafeWrap(bytes).
I noticed GOOGLE_PROTOBUF_SUPPORT_SYSTEM_MEMORY is now always true. #ifdefs can be removed.
@jtattermusch @jskeet