Change ByteString to use memory and support unsafe create without copy #7645

JamesNK · 2020-06-24T04:35:23Z

Change ByteString to use ReadOnlyMemory<byte> field, and provide overload that allows ByteString to be created without allocating and copying memory.

Before:

var array = new byte[1024 * 1024]; // Large allocation
LoadData(array);
var payload = ByteString.FromBytes(array); // Large allocation and copy
var response = new MyResponse { Payload = payload };
await responseStream.WriteAsync(response);

After:

var array = ArrayPool<byte>.Shared.Rent(minimumLength: 1024 * 1024); // Note the pool can return a larger array
try
{
    LoadData(array);
    var payload = UnsafeByteOperations.UnsafeWrap(array.AsMemory(0, 1024 * 1024));
    var response = new MyResponse { Payload = payload };
    await responseStream.WriteAsync(response);
}
finally
{
    ArrayPool<byte>.Shared.Return(array);
}

I used ByteString.Unsafe.FromBytes(bytes) because it was already there. It was internal and never used. Happy to get rid of the nested class and change to ByteString.UnsafeFromBytes(bytes). Unsafe prefix is the standard in .NET runtime, e.g. https://source.dot.net/#System.Private.CoreLib/Assembly.cs,8536a7569220f81f,references

I went ahead and changed it to UnsafeByteOperations.UnsafeWrap(bytes).

I noticed GOOGLE_PROTOBUF_SUPPORT_SYSTEM_MEMORY is now always true. #ifdefs can be removed.

@jtattermusch @jskeet

jtattermusch · 2020-06-29T17:02:58Z

Interesting idea. I think we should come up with a more complete design that also covers how to efficiently parse messages that contain lots of (potentially large) byteStrings.

Btw, Java has a concept of "aliasing" when parsing

protobuf/java/core/src/main/java/com/google/protobuf/CodedInputStream.java

Line 378 in e667bf6

public abstract void enableAliasing(boolean enabled);

- similar ideas might be applicable to C# as well.

JamesNK · 2020-06-30T00:13:07Z

Before doing that I want to make sure that we're ok with the concept of a non-copied ByteString. I saw a comment from @jskeet on a PR that made "Unsafe" internal that content should always be copied.

jskeet · 2020-06-30T05:56:32Z

I'm pretty nervous about this, because it means ByteString can't really be trusted to be immutable any more. Previously, it would be immutable unless there was a bug in Google.Protobuf. With this change, it's immutable unless whoever created it ignores the documentation.

I'd definitely want the protobuf team to weigh in on this, e.g. @haberman and @anandolee.

JamesNK · 2020-06-30T06:11:01Z

What scenarios is an immutable API enabling and who are we trying to protect here?

When it comes to security, the POV on the .NET team is if someone can execute code on the server there it is already game over.

For example, when we expose IReadOnlyList<string> property on a public API the actual instance is usually List<string>. If someone is trying to be malicious, writing some reflection to change a value in the collection is only slightly more work than casting to List<string> and setting an indexer.

I view ByteString the same way. If someone wants to modify it, writing code that uses reflection to modify the underlying byte array is not difficult.

jskeet · 2020-06-30T06:17:31Z

What scenarios is an immutable API enabling and who are we trying to protect here?

All the normal benefits of immutability. I can reason about my code with confidence that the ByteString won't change content between calls. Put it this way: if someone suggested to the .NET team that they should make System.String mutable, I don't think that would go down well. This is like the same, but at a much (much) smaller scale.

[IReadOnlyList<string> example]

I view ByteString the same way.

Whereas I view it more like string - because at the moment it really is like string. (Mind you, I tend to make sure that properties returning IReadOnlyList<string> really are backed by a read-only collection, unless I want them to be internally mutable at which point I document that heavily.)

This isn't a matter of whether someone can do things maliciously - it's more about what I can rely on. If I call a method from a library that returns a ByteString, can I rely on the content not changing? At the moment, I can - with this change, I can't.

JamesNK · 2020-06-30T06:44:14Z

The documentation on ByteString.UnsafeFromBytes says when you can and can't modify the underlying memory. And the Unsafe prefix is a big "here be dragons" warning sign for the average developer.

/// <summary>
/// Constructs a new <see cref="ByteString" /> from the given bytes. The bytes are not copied,
/// and must not be modified while the <see cref="ByteString" /> is in use.
/// </summary>
public static ByteString UnsafeFromBytes(ReadOnlyMemory<byte> bytes)
{
    // ...
}

If someone is modifying the memory after giving the ByteString to someone else then they're misusing it, and shouldn't be surprised if there are unexpected errors.

jskeet · 2020-06-30T06:50:12Z

The documentation on ByteString.UnsafeFromBytes says when you can and can't modify the underlying memory. And the Unsafe prefix is a big "here be dragons" warning sign for the average developer.

Again, would you feel happy introducing String.UnsafeFromChars into the BCL? Any objection you'd naturally have to that is an objection I'd raise here.

If someone is modifying the memory after giving the ByteString to someone else then they're misusing it, and shouldn't be surprised if there are unexpected errors.

They may be misusing it deliberately, or they may be doing so accidentally. That's the problem: currently, I can trust that a ByteString is immutable if there aren't bugs in Google.Protobuf. This is increasing the scope of code I have to trust. And sure, I have to trust code I'm calling to some extent anyway - but making ByteString "only sometimes immutable" feels like a step too far in that direction.

JamesNK · 2020-06-30T07:08:08Z

That's the problem: currently, I can trust that a ByteString is immutable if there aren't bugs in Google.Protobuf.

Your trust is based on people using the public API and not using reflection. What about trusting UnsafeFromBytes will be used correctly 😄

I can have an API that returns ReadOnlyMemory<byte> to the caller, and I assume it won't be modified by the caller. But if the caller wanted to they could use an advanced API like MemoryMarshal.TryGetArray, and modify the underlying array.

I'm not familiar with all the arguments why string must always be mutable. That it is commonly used as a key in hashtables perhaps? If you'd like I could ask a BCL expert to see whether those same restrictions need to apply to something like ByteString.

jskeet · 2020-06-30T07:43:21Z

Your trust is based on people using the public API and not using reflection. What about trusting UnsafeFromBytes will be used correctly 😄

And again, that's true for strings as well. You can use unsafe code to mutate a string - but it's obvious when you do that, and it would be malicious. The public API protects against accidental mutation at the moment.

Strings being immutable makes them much, much easier to work with. I can validate them as a precondition, and know that the precondition holds later on. Immutable types are simply easier to reason about. Sure, it also means they can be used as keys in hashtables etc, but I regard that as less important than the ability to reason about them. (As an example of how it's acknowledge to be a problem, note how java.util.Date is mutable, and everyone took defensive copies and it was a pain to work with - so the java.time API was made immutable instead.)

ByteString is intended to be a binary equivalent of String, and to me that includes the immutability aspect.

Again, I'd say that @haberman and @anandolee are the ultimate arbiters here, but the more comments I've written here, the more strongly opposed to it I am.

JamesNK · 2020-06-30T09:52:56Z

We're only talking about giving the creator of the ByteString the possibility to change it here. The PR isn't adding something like ByteString.Append. There is no need to defensively copy ByteString every time you passed it to method or returned it as a property.

jskeet · 2020-06-30T09:54:43Z

We're only talking about giving the creator of the ByteString the possibility to change it here.

Agreed, but that's still pretty significant IMO. Anyway, I've made it pretty clear I'd prefer not to have this. I don't think we're going to agree on this, so let's see what @haberman and @anandolee think.

JamesNK · 2020-06-30T10:49:07Z

Java Protobuf has this - https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/UnsafeByteOperations.html

I'm happy renaming UnsafeCopyFrom to UnsafeWrap and moving the method to an UnsafeByteOperations helper class.

jskeet · 2020-06-30T10:54:09Z

Okay, if there's precedent in Java then I'm significantly less worried by it. I still think it can lead to really nasty and hard-to-diagnose situations (as described in that Javadoc) but I don't expect it to be any worse in .NET than in Java.

JamesNK · 2020-07-05T23:02:42Z

Updated to mirror Java's UnsafeByteOperations type. I copied Java's more detailed warning.

JamesNK · 2020-08-18T01:18:10Z

Benchmark:

|     Method | BytesToParse |             Mean |          Error |         StdDev |    Gen 0 |    Gen 1 |    Gen 2 |  Allocated |
|----------- |------------- |-----------------:|---------------:|---------------:|---------:|---------:|---------:|-----------:|
|   CopyFrom |            0 |        78.533 ns |      1.0087 ns |      0.9436 ns |   0.0134 |        - |        - |       56 B |
| UnsafeWrap |            0 |         6.998 ns |      0.1038 ns |      0.0867 ns |   0.0076 |        - |        - |       32 B |
|   CopyFrom |         1024 |       174.939 ns |      3.5665 ns |      4.9998 ns |   0.2759 |        - |        - |     1080 B |
| UnsafeWrap |         1024 |         7.541 ns |      0.2258 ns |      0.2416 ns |   0.0076 |        - |        - |       32 B |
|   CopyFrom |       131072 |    10,768.779 ns |    163.2602 ns |    136.3296 ns |  41.6565 |  41.6565 |  41.6565 |   131128 B |
| UnsafeWrap |       131072 |         7.672 ns |      0.2267 ns |      0.2227 ns |   0.0076 |        - |        - |       32 B |
|   CopyFrom |      1048576 |   200,117.367 ns |  3,626.6715 ns |  4,841.5023 ns | 193.3594 | 193.1152 | 193.1152 |  1048778 B |
| UnsafeWrap |      1048576 |         7.755 ns |      0.1604 ns |      0.1422 ns |   0.0076 |        - |        - |       32 B |
|   CopyFrom |     10485760 | 2,328,834.339 ns | 45,607.1246 ns | 85,661.2350 ns | 109.3750 | 109.3750 | 109.3750 | 10485815 B |
| UnsafeWrap |     10485760 |         7.552 ns |      0.2230 ns |      0.5124 ns |   0.0076 |        - |        - |       32 B |

jtattermusch

I think the code changes look fine (I added some comments), but I'm concerned about whether this is actually useful. Along the lines of what's mentioned here #4206 (comment), there are basically these situations:

NOT USEFUL: deserializing from a IBufferReader / Span- useful because the buffer is not owned by us and we don't have control over when it gets released
NOT USEFUL AS IS: deserializing from a buffer we own: if the buffer we own is represented by a ByteString, we could use potentially use aliasing to reference the underlying buffer of the top level ByteString and create instances of ByteString without copying - the problem is that we'd probably need a change to the parse to accommodate this. This PR by itself wouldn't suffice.
USEFUL BUT LEADS TO QUESTIONABLE USAGE PATTERNS: serializing a "manually created" byte string: we can create byte strings from a rented buffer when creating messages we want to serialize, and then make sure the message is serialized before we release the rented buffer. The issue is that this pattern feels clumsy and relies on operations being done in the right order, otherwise we can easily end up with corrupt data. While the performance gain from doing this might be quite tempting in some scenarios, it feels more like a hack than a good practice to do this so I don't think this is something I would recommend to the users. (and if the technique enabled by this PR is not something we feel is worth recommending, then why do it at all?).

I think overall I'm against accepting this PR unless we can come up with a list of use cases that lead to clear performance gains and can be implemented with "clean" code snippets that don't feel too hacky or fragile and which we would feel confident recommending to regular users (I know this is relative but the code example presented in the PR description as "After" doesn't feel that way to me). I don't think this is worth the work if the goal is to only unlock performance gains for a small group of power users that are willing to resort to hacks in order to get there. Instead I'd like to get to some solution that allows most users to benefit from a performance gain and achieve that without making their code significantly more complex/fragile.

Btw, in connection with gRPC, the "After: rented array" pattern from the PR description would be even more complicated because when writing messages, one needs to first await the WriteAsync operation (or the call) before releasing the rented buffer which further complicated the logic.

jtattermusch · 2020-10-13T12:40:57Z

csharp/src/Google.Protobuf/ByteString.cs

    public sealed class ByteString : IEnumerable<byte>, IEquatable<ByteString>
    {
        private static readonly ByteString empty = new ByteString(new byte[0]);

+#if GOOGLE_PROTOBUF_SUPPORT_SYSTEM_MEMORY


As mentioned in the comments, looks like GOOGLE_PROTOBUF_SUPPORT_SYSTEM_MEMORY is now always true.
I think the right order of actions is to first perform cleanup and then add new functionality.

Can we first have a PR that removes GOOGLE_PROTOBUF_SUPPORT_SYSTEM_MEMORY for the C# codebase altogether (I think now it's the right time to do that) and only then modify this PR so we only have one version of the logic in ByteString?

#8011

Also this PR is now rebased on top of it.

csharp/src/Google.Protobuf/ByteString.cs

jtattermusch · 2020-10-13T13:10:21Z

csharp/src/Google.Protobuf/UnsafeByteOperations.cs

+    /// </list>
+    /// </remarks>
+    [SecuritySafeCritical]
+    public static class UnsafeByteOperations


nit: I feel like we don't have to necessarily adopt the naming used in protobuf Java (IMHO the naming java uses isn't very intuitive), but I don't feel strongly about this.

The problem I have with java naming is that "UnsafeByteOperations" sounds very general (and sounds more like it would contain functionality for casting bytes into values of a given type etc.), but in reality it's only about unsafe byte string operations (which seems to be a very narrow subset of "byte operations"). But as said, I'm not feeling too strongly about this and perhaps I'm misunderstanding the spririt of the "UnsafeByteOperations" name.

I belive, that moving UnsafeWrap method to previously existing Unsafe class inside ByteString is good way to go.

Nested classes aren't typical in most public APIs.

Having advanced methods on their own type is a common pattern in .NET APIs. For example MemoryMarshal and Unsafe in .NET Core have advanced methods for working with types and memory.

I think this name is fine and the consistency with Java, which also has ByteString, is nice.

jtattermusch · 2020-10-13T13:17:45Z

csharp/src/Google.Protobuf/ByteString.cs

        /// </summary>
-        internal static class Unsafe
+        internal static ByteString AttachBytes(byte[] bytes)


why rename this method and remove the "Unsafe" class? creating a ReadOnlyMemory from the byte array doesn't make this method any safer
because the caller can still modify the original array and we need to trust him not to do so which is why the method was marked as "Unsafe" (and it should continue being so).

I didn't rename the method. I completely removed ByteString.Unsafe.FromBytes. It was internal and nothing used it.

ByteString.AttachBytes(byte[]) already existed.

jtattermusch · 2020-10-13T13:23:46Z

csharp/src/Google.Protobuf.Benchmarks/ByteStringBenchmark.cs

+        }
+
+        [Benchmark]
+        public ByteString UnsafeWrap()


TBH I don't think this benchmark class brings much value the way it is. What is does is it basically measures how much time it takes to copy a byte array of given size (which takes long time and does allocate memory but everyone knows that) and the time it takes to create a ReadOnlyMemory from bytes. Comparing these two numbers isn't very useful beyond the obvious "first one is very slow and the latter is very fast".
What would be more interesting is comparison of the original ByteString.Unsafe.FromBytes vs new ByteString(ReadOnlyMemory<>) and the original CreateFrom vs the new CreateFrom (which has now a different implementation) - at least we would have a comparison of whether this PR causes any regressions (if so, they would probably be minor, but we should check anyways).

Comparing these two numbers isn't very useful beyond the obvious "first one is very slow and the latter is very fast".

Well the change is basically from doing lots of work to doing nothing. Before creating a byte string involved creating new array, copying all data, garbage collecting array. Now a byte string can be created by using data that you already have available: just reference the existing data.

All this change impacts is creating a ByteString that you then want to serialize. This is an alternative to ByteString.CopyFrom. Comparing ByteString.CopyFrom and UnsafeByteOperation.UnsafeWrap is the most accurate comparison. Slow to fast is why people want this.

What would be more interesting is comparison of the original ByteString.Unsafe.FromBytes vs new ByteString(ReadOnlyMemory<>) and the original CreateFrom vs the new CreateFrom (which has now a different implementation)

Nothing used ByteString.Unsafe.FromBytes. I don't know what you mean by CreateFrom. There isn't an API called that.

I could update the benchmark to create a ByteString and then serialize it if you want, but the time to serialize it will the the same. It will just add noise onto the original number that compares creating ByteString.

kbegiedza · 2020-10-13T14:35:47Z

csharp/src/Google.Protobuf/UnsafeByteOperations.cs

+    /// </list>
+    /// </remarks>
+    [SecuritySafeCritical]
+    public static class UnsafeByteOperations


I belive, that moving UnsafeWrap method to previously existing Unsafe class inside ByteString is good way to go.

kbegiedza · 2020-10-13T14:46:21Z

csharp/src/Google.Protobuf/ByteString.cs

@@ -411,23 +471,55 @@ public bool Equals(ByteString other)
        /// </summary>
        internal void WriteRawBytesTo(CodedOutputStream outputStream)
        {
+#if GOOGLE_PROTOBUF_SUPPORT_SYSTEM_MEMORY
+            if (MemoryMarshal.TryGetArray(bytes, out ArraySegment<byte> segment))


To avoid possible redundant allocations I suggest adding span-based WriteRawBytes to CodedOutputStream (internally ref Span<byte> is being used).

This method is internal and nothing used it. I've remove WriteRawBytesTo entirely.

jtattermusch

Left a few comments but otherwise I think this is looking fine.
I'm not a fan of this new API, but it seems there's demand for it and there's no impact on users that don't want to use this.

csharp/src/Google.Protobuf/UnsafeByteOperations.cs

csharp/src/Google.Protobuf/ByteStringAsync.cs

csharp/src/Google.Protobuf.Test/ByteStringTest.cs

csharp/src/Google.Protobuf/ByteStringAsync.cs

jtattermusch

LGTM.

smflorentino · 2020-11-28T22:43:23Z

@jtattermusch - when can we expect a gRPC release with this feature?

jtattermusch · 2020-11-30T14:39:21Z

@jtattermusch - when can we expect a gRPC release with this feature?

In the next release (we missed the 3.14 release branch cut±) - unfortunately I'm not sure when that is going to be.

@acozzette does protobuf team have a release schedule for the open source releases yet?

acozzette · 2020-11-30T19:03:12Z

We don't have a set schedule for releases, but they usually happen once every quarter or so.

aaronhudon · 2021-02-18T22:40:54Z

Glad to see the unsafe helper is making its way back. @jskeet slapped me for asking for it a few years ago :)

jonso4 · 2021-02-20T00:52:35Z

+1, I would love to know is there is a release date for this feature!

smflorentino · 2021-02-20T00:59:47Z

@jonso4, 3.15.0-rc1 is out with the UnsafeByteOperations.UnsafeWrap API. I should know this but don't, @jtattermusch, does either grpc or grpc-dotnet need any changes to use this API to reduce allocations, or were the changes to use Span<T> sufficient?

JamesNK · 2021-02-20T08:37:23Z

gRPC won't use this API. It is used by you in your apps when instantiating your Protobuf generated messages. If you're creating large ByteStrings then this might be something you can use to avoid allocations and copies.

EraYaN · 2021-02-26T09:53:02Z

This does seem to break those apps (For example: https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry.Exporter.OpenTelemetryProtocol/Implementation/ActivityExtensions.cs#L346) they will have to update their code (that line creates a null result).

JamesNK · 2021-02-26T10:48:05Z

And that's why you don't use reflection to call private APIs.

The bad news is they're broken, the good news is they can use this new public API.

googlebot added the cla: yes label Jun 24, 2020

jtattermusch added the kokoro:run label Jun 29, 2020

protobuf-kokoro removed the kokoro:run label Jun 29, 2020

jtattermusch added the kokoro:run label Jul 14, 2020

protobuf-kokoro removed the kokoro:run label Jul 14, 2020

JamesNK force-pushed the jamesnk/bytestring-memory branch from 5821555 to f823caa Compare August 18, 2020 01:17

JamesNK mentioned this pull request Sep 3, 2020

Protobuf scalar default values and ByteString detail dotnet/AspNetCore.Docs#19769

Merged

JamesNK mentioned this pull request Oct 6, 2020

C# ByteString, CodedInputStream performance improvements desired for messages with large byte array field #4206

Closed

jtattermusch added release notes: yes c# labels Oct 13, 2020

protobuf-kokoro removed the kokoro:run label Oct 13, 2020

jtattermusch reviewed Oct 13, 2020

View reviewed changes

kbegiedza reviewed Oct 13, 2020

View reviewed changes

JamesNK mentioned this pull request Nov 5, 2020

Remove GOOGLE_PROTOBUF_SUPPORT_SYSTEM_MEMORY #8011

Merged

JamesNK force-pushed the jamesnk/bytestring-memory branch 2 times, most recently from 99bda0d to 9b4fb0b Compare November 8, 2020 08:28

jtattermusch added the kokoro:run label Nov 17, 2020

protobuf-kokoro removed the kokoro:run label Nov 17, 2020

jtattermusch approved these changes Nov 17, 2020

View reviewed changes

JamesNK added 2 commits November 18, 2020 21:58

Change ByteString to use memory and support unsafe create without copy

79f5bad

PR feedback

e794919

JamesNK force-pushed the jamesnk/bytestring-memory branch from 9b4fb0b to e794919 Compare November 18, 2020 09:54

jtattermusch added the kokoro:run label Nov 18, 2020

protobuf-kokoro removed the kokoro:run label Nov 18, 2020

jtattermusch approved these changes Nov 18, 2020

View reviewed changes

jtattermusch merged commit d7a2643 into protocolbuffers:master Nov 18, 2020

bill-poole mentioned this pull request Jan 11, 2021

Provide support for UUID type (a.k.a. GUID) #2224

Open

This was referenced Feb 1, 2021

Any plans to use MemoryPool when deserializing ByteString? #8238

Closed

C#, add ByteString.FromStream(MemoryStream stream) overload, to avoid allocating twice #7191

Closed

EraYaN mentioned this pull request Feb 26, 2021

Google.Protobuf 3.15.0 breaks the Otlp Exporter open-telemetry/opentelemetry-dotnet#1854

Closed

jtattermusch mentioned this pull request May 10, 2021

Too many full GCs when receive big responses grpc/grpc#13236

Closed

jtattermusch mentioned this pull request Jun 18, 2021

[csharp] ByteString.CreateCodedInput should use ArraySegment offset and count #8740

Merged

deannagarcia mentioned this pull request Jul 21, 2022

Use pooled memory for ToByteArray in C# #8321

Closed

Change ByteString to use memory and support unsafe create without copy #7645

Change ByteString to use memory and support unsafe create without copy #7645

Conversation

JamesNK commented Jun 24, 2020 • edited Loading

jtattermusch commented Jun 29, 2020 • edited Loading

JamesNK commented Jun 30, 2020

jskeet commented Jun 30, 2020

JamesNK commented Jun 30, 2020 • edited Loading

jskeet commented Jun 30, 2020

JamesNK commented Jun 30, 2020

jskeet commented Jun 30, 2020

JamesNK commented Jun 30, 2020

jskeet commented Jun 30, 2020

JamesNK commented Jun 30, 2020

jskeet commented Jun 30, 2020

JamesNK commented Jun 30, 2020 • edited Loading

jskeet commented Jun 30, 2020

JamesNK commented Jul 5, 2020

JamesNK commented Aug 18, 2020

jtattermusch left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JamesNK Nov 5, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jtattermusch left a comment

Choose a reason for hiding this comment

jtattermusch left a comment

Choose a reason for hiding this comment

smflorentino commented Nov 28, 2020

jtattermusch commented Nov 30, 2020

acozzette commented Nov 30, 2020

aaronhudon commented Feb 18, 2021

jonso4 commented Feb 20, 2021

smflorentino commented Feb 20, 2021 • edited Loading

JamesNK commented Feb 20, 2021

EraYaN commented Feb 26, 2021

JamesNK commented Feb 26, 2021

JamesNK commented Jun 24, 2020 •

edited

Loading

jtattermusch commented Jun 29, 2020 •

edited

Loading

JamesNK commented Jun 30, 2020 •

edited

Loading

JamesNK commented Jun 30, 2020 •

edited

Loading

JamesNK Nov 5, 2020 •

edited

Loading

smflorentino commented Feb 20, 2021 •

edited

Loading