Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change ByteString to use memory and support unsafe create without copy #7645

Merged

Conversation

JamesNK
Copy link
Contributor

@JamesNK JamesNK commented Jun 24, 2020

Fixes #4206

Change ByteString to use ReadOnlyMemory<byte> field, and provide overload that allows ByteString to be created without allocating and copying memory.

Before:

var array = new byte[1024 * 1024]; // Large allocation
LoadData(array);
var payload = ByteString.FromBytes(array); // Large allocation and copy
var response = new MyResponse { Payload = payload };
await responseStream.WriteAsync(response);

After:

var array = ArrayPool<byte>.Shared.Rent(minimumLength: 1024 * 1024); // Note the pool can return a larger array
try
{
    LoadData(array);
    var payload = UnsafeByteOperations.UnsafeWrap(array.AsMemory(0, 1024 * 1024));
    var response = new MyResponse { Payload = payload };
    await responseStream.WriteAsync(response);
}
finally
{
    ArrayPool<byte>.Shared.Return(array);
}

I used ByteString.Unsafe.FromBytes(bytes) because it was already there. It was internal and never used. Happy to get rid of the nested class and change to ByteString.UnsafeFromBytes(bytes). Unsafe prefix is the standard in .NET runtime, e.g. https://source.dot.net/#System.Private.CoreLib/Assembly.cs,8536a7569220f81f,references

I went ahead and changed it to UnsafeByteOperations.UnsafeWrap(bytes).

I noticed GOOGLE_PROTOBUF_SUPPORT_SYSTEM_MEMORY is now always true. #ifdefs can be removed.

@jtattermusch @jskeet

@jtattermusch
Copy link
Contributor

jtattermusch commented Jun 29, 2020

Interesting idea. I think we should come up with a more complete design that also covers how to efficiently parse messages that contain lots of (potentially large) byteStrings.

Btw, Java has a concept of "aliasing" when parsing

public abstract void enableAliasing(boolean enabled);
- similar ideas might be applicable to C# as well.

@JamesNK
Copy link
Contributor Author

JamesNK commented Jun 30, 2020

Before doing that I want to make sure that we're ok with the concept of a non-copied ByteString. I saw a comment from @jskeet on a PR that made "Unsafe" internal that content should always be copied.

@jskeet
Copy link
Contributor

jskeet commented Jun 30, 2020

I'm pretty nervous about this, because it means ByteString can't really be trusted to be immutable any more. Previously, it would be immutable unless there was a bug in Google.Protobuf. With this change, it's immutable unless whoever created it ignores the documentation.

I'd definitely want the protobuf team to weigh in on this, e.g. @haberman and @anandolee.

@JamesNK
Copy link
Contributor Author

JamesNK commented Jun 30, 2020

What scenarios is an immutable API enabling and who are we trying to protect here?

When it comes to security, the POV on the .NET team is if someone can execute code on the server there it is already game over.

For example, when we expose IReadOnlyList<string> property on a public API the actual instance is usually List<string>. If someone is trying to be malicious, writing some reflection to change a value in the collection is only slightly more work than casting to List<string> and setting an indexer.

I view ByteString the same way. If someone wants to modify it, writing code that uses reflection to modify the underlying byte array is not difficult.

@jskeet
Copy link
Contributor

jskeet commented Jun 30, 2020

What scenarios is an immutable API enabling and who are we trying to protect here?

All the normal benefits of immutability. I can reason about my code with confidence that the ByteString won't change content between calls. Put it this way: if someone suggested to the .NET team that they should make System.String mutable, I don't think that would go down well. This is like the same, but at a much (much) smaller scale.

[IReadOnlyList<string> example]

I view ByteString the same way.

Whereas I view it more like string - because at the moment it really is like string. (Mind you, I tend to make sure that properties returning IReadOnlyList<string> really are backed by a read-only collection, unless I want them to be internally mutable at which point I document that heavily.)

This isn't a matter of whether someone can do things maliciously - it's more about what I can rely on. If I call a method from a library that returns a ByteString, can I rely on the content not changing? At the moment, I can - with this change, I can't.

@JamesNK
Copy link
Contributor Author

JamesNK commented Jun 30, 2020

The documentation on ByteString.UnsafeFromBytes says when you can and can't modify the underlying memory. And the Unsafe prefix is a big "here be dragons" warning sign for the average developer.

/// <summary>
/// Constructs a new <see cref="ByteString" /> from the given bytes. The bytes are not copied,
/// and must not be modified while the <see cref="ByteString" /> is in use.
/// </summary>
public static ByteString UnsafeFromBytes(ReadOnlyMemory<byte> bytes)
{
    // ...
}

If someone is modifying the memory after giving the ByteString to someone else then they're misusing it, and shouldn't be surprised if there are unexpected errors.

@jskeet
Copy link
Contributor

jskeet commented Jun 30, 2020

The documentation on ByteString.UnsafeFromBytes says when you can and can't modify the underlying memory. And the Unsafe prefix is a big "here be dragons" warning sign for the average developer.

Again, would you feel happy introducing String.UnsafeFromChars into the BCL? Any objection you'd naturally have to that is an objection I'd raise here.

If someone is modifying the memory after giving the ByteString to someone else then they're misusing it, and shouldn't be surprised if there are unexpected errors.

They may be misusing it deliberately, or they may be doing so accidentally. That's the problem: currently, I can trust that a ByteString is immutable if there aren't bugs in Google.Protobuf. This is increasing the scope of code I have to trust. And sure, I have to trust code I'm calling to some extent anyway - but making ByteString "only sometimes immutable" feels like a step too far in that direction.

@JamesNK
Copy link
Contributor Author

JamesNK commented Jun 30, 2020

That's the problem: currently, I can trust that a ByteString is immutable if there aren't bugs in Google.Protobuf.

Your trust is based on people using the public API and not using reflection. What about trusting UnsafeFromBytes will be used correctly 😄

I can have an API that returns ReadOnlyMemory<byte> to the caller, and I assume it won't be modified by the caller. But if the caller wanted to they could use an advanced API like MemoryMarshal.TryGetArray, and modify the underlying array.

I'm not familiar with all the arguments why string must always be mutable. That it is commonly used as a key in hashtables perhaps? If you'd like I could ask a BCL expert to see whether those same restrictions need to apply to something like ByteString.

@jskeet
Copy link
Contributor

jskeet commented Jun 30, 2020

Your trust is based on people using the public API and not using reflection. What about trusting UnsafeFromBytes will be used correctly 😄

And again, that's true for strings as well. You can use unsafe code to mutate a string - but it's obvious when you do that, and it would be malicious. The public API protects against accidental mutation at the moment.

Strings being immutable makes them much, much easier to work with. I can validate them as a precondition, and know that the precondition holds later on. Immutable types are simply easier to reason about. Sure, it also means they can be used as keys in hashtables etc, but I regard that as less important than the ability to reason about them. (As an example of how it's acknowledge to be a problem, note how java.util.Date is mutable, and everyone took defensive copies and it was a pain to work with - so the java.time API was made immutable instead.)

ByteString is intended to be a binary equivalent of String, and to me that includes the immutability aspect.

Again, I'd say that @haberman and @anandolee are the ultimate arbiters here, but the more comments I've written here, the more strongly opposed to it I am.

@JamesNK
Copy link
Contributor Author

JamesNK commented Jun 30, 2020

We're only talking about giving the creator of the ByteString the possibility to change it here. The PR isn't adding something like ByteString.Append. There is no need to defensively copy ByteString every time you passed it to method or returned it as a property.

@jskeet
Copy link
Contributor

jskeet commented Jun 30, 2020

We're only talking about giving the creator of the ByteString the possibility to change it here.

Agreed, but that's still pretty significant IMO. Anyway, I've made it pretty clear I'd prefer not to have this. I don't think we're going to agree on this, so let's see what @haberman and @anandolee think.

@JamesNK
Copy link
Contributor Author

JamesNK commented Jun 30, 2020

Java Protobuf has this - https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/UnsafeByteOperations.html

I'm happy renaming UnsafeCopyFrom to UnsafeWrap and moving the method to an UnsafeByteOperations helper class.

@jskeet
Copy link
Contributor

jskeet commented Jun 30, 2020

Okay, if there's precedent in Java then I'm significantly less worried by it. I still think it can lead to really nasty and hard-to-diagnose situations (as described in that Javadoc) but I don't expect it to be any worse in .NET than in Java.

@JamesNK
Copy link
Contributor Author

JamesNK commented Jul 5, 2020

Updated to mirror Java's UnsafeByteOperations type. I copied Java's more detailed warning.

@JamesNK
Copy link
Contributor Author

JamesNK commented Aug 18, 2020

Benchmark:

|     Method | BytesToParse |             Mean |          Error |         StdDev |    Gen 0 |    Gen 1 |    Gen 2 |  Allocated |
|----------- |------------- |-----------------:|---------------:|---------------:|---------:|---------:|---------:|-----------:|
|   CopyFrom |            0 |        78.533 ns |      1.0087 ns |      0.9436 ns |   0.0134 |        - |        - |       56 B |
| UnsafeWrap |            0 |         6.998 ns |      0.1038 ns |      0.0867 ns |   0.0076 |        - |        - |       32 B |
|   CopyFrom |         1024 |       174.939 ns |      3.5665 ns |      4.9998 ns |   0.2759 |        - |        - |     1080 B |
| UnsafeWrap |         1024 |         7.541 ns |      0.2258 ns |      0.2416 ns |   0.0076 |        - |        - |       32 B |
|   CopyFrom |       131072 |    10,768.779 ns |    163.2602 ns |    136.3296 ns |  41.6565 |  41.6565 |  41.6565 |   131128 B |
| UnsafeWrap |       131072 |         7.672 ns |      0.2267 ns |      0.2227 ns |   0.0076 |        - |        - |       32 B |
|   CopyFrom |      1048576 |   200,117.367 ns |  3,626.6715 ns |  4,841.5023 ns | 193.3594 | 193.1152 | 193.1152 |  1048778 B |
| UnsafeWrap |      1048576 |         7.755 ns |      0.1604 ns |      0.1422 ns |   0.0076 |        - |        - |       32 B |
|   CopyFrom |     10485760 | 2,328,834.339 ns | 45,607.1246 ns | 85,661.2350 ns | 109.3750 | 109.3750 | 109.3750 | 10485815 B |
| UnsafeWrap |     10485760 |         7.552 ns |      0.2230 ns |      0.5124 ns |   0.0076 |        - |        - |       32 B |

Copy link
Contributor

@jtattermusch jtattermusch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the code changes look fine (I added some comments), but I'm concerned about whether this is actually useful. Along the lines of what's mentioned here #4206 (comment), there are basically these situations:

  • NOT USEFUL: deserializing from a IBufferReader / Span- useful because the buffer is not owned by us and we don't have control over when it gets released

  • NOT USEFUL AS IS: deserializing from a buffer we own: if the buffer we own is represented by a ByteString, we could use potentially use aliasing to reference the underlying buffer of the top level ByteString and create instances of ByteString without copying - the problem is that we'd probably need a change to the parse to accommodate this. This PR by itself wouldn't suffice.

  • USEFUL BUT LEADS TO QUESTIONABLE USAGE PATTERNS: serializing a "manually created" byte string: we can create byte strings from a rented buffer when creating messages we want to serialize, and then make sure the message is serialized before we release the rented buffer. The issue is that this pattern feels clumsy and relies on operations being done in the right order, otherwise we can easily end up with corrupt data. While the performance gain from doing this might be quite tempting in some scenarios, it feels more like a hack than a good practice to do this so I don't think this is something I would recommend to the users. (and if the technique enabled by this PR is not something we feel is worth recommending, then why do it at all?).

I think overall I'm against accepting this PR unless we can come up with a list of use cases that lead to clear performance gains and can be implemented with "clean" code snippets that don't feel too hacky or fragile and which we would feel confident recommending to regular users (I know this is relative but the code example presented in the PR description as "After" doesn't feel that way to me). I don't think this is worth the work if the goal is to only unlock performance gains for a small group of power users that are willing to resort to hacks in order to get there. Instead I'd like to get to some solution that allows most users to benefit from a performance gain and achieve that without making their code significantly more complex/fragile.

Btw, in connection with gRPC, the "After: rented array" pattern from the PR description would be even more complicated because when writing messages, one needs to first await the WriteAsync operation (or the call) before releasing the rented buffer which further complicated the logic.

public sealed class ByteString : IEnumerable<byte>, IEquatable<ByteString>
{
private static readonly ByteString empty = new ByteString(new byte[0]);

#if GOOGLE_PROTOBUF_SUPPORT_SYSTEM_MEMORY
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in the comments, looks like GOOGLE_PROTOBUF_SUPPORT_SYSTEM_MEMORY is now always true.
I think the right order of actions is to first perform cleanup and then add new functionality.

Can we first have a PR that removes GOOGLE_PROTOBUF_SUPPORT_SYSTEM_MEMORY for the C# codebase altogether (I think now it's the right time to do that) and only then modify this PR so we only have one version of the logic in ByteString?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#8011

Also this PR is now rebased on top of it.

csharp/src/Google.Protobuf/ByteString.cs Outdated Show resolved Hide resolved
/// </list>
/// </remarks>
[SecuritySafeCritical]
public static class UnsafeByteOperations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I feel like we don't have to necessarily adopt the naming used in protobuf Java (IMHO the naming java uses isn't very intuitive), but I don't feel strongly about this.

The problem I have with java naming is that "UnsafeByteOperations" sounds very general (and sounds more like it would contain functionality for casting bytes into values of a given type etc.), but in reality it's only about unsafe byte string operations (which seems to be a very narrow subset of "byte operations"). But as said, I'm not feeling too strongly about this and perhaps I'm misunderstanding the spririt of the "UnsafeByteOperations" name.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I belive, that moving UnsafeWrap method to previously existing Unsafe class inside ByteString is good way to go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nested classes aren't typical in most public APIs.

Having advanced methods on their own type is a common pattern in .NET APIs. For example MemoryMarshal and Unsafe in .NET Core have advanced methods for working with types and memory.

I think this name is fine and the consistency with Java, which also has ByteString, is nice.

/// </summary>
internal static class Unsafe
internal static ByteString AttachBytes(byte[] bytes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why rename this method and remove the "Unsafe" class? creating a ReadOnlyMemory from the byte array doesn't make this method any safer
because the caller can still modify the original array and we need to trust him not to do so which is why the method was marked as "Unsafe" (and it should continue being so).

Copy link
Contributor Author

@JamesNK JamesNK Nov 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't rename the method. I completely removed ByteString.Unsafe.FromBytes. It was internal and nothing used it.

ByteString.AttachBytes(byte[]) already existed.

}

[Benchmark]
public ByteString UnsafeWrap()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH I don't think this benchmark class brings much value the way it is. What is does is it basically measures how much time it takes to copy a byte array of given size (which takes long time and does allocate memory but everyone knows that) and the time it takes to create a ReadOnlyMemory from bytes. Comparing these two numbers isn't very useful beyond the obvious "first one is very slow and the latter is very fast".
What would be more interesting is comparison of the original ByteString.Unsafe.FromBytes vs new ByteString(ReadOnlyMemory<>) and the original CreateFrom vs the new CreateFrom (which has now a different implementation) - at least we would have a comparison of whether this PR causes any regressions (if so, they would probably be minor, but we should check anyways).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparing these two numbers isn't very useful beyond the obvious "first one is very slow and the latter is very fast".

Well the change is basically from doing lots of work to doing nothing. Before creating a byte string involved creating new array, copying all data, garbage collecting array. Now a byte string can be created by using data that you already have available: just reference the existing data.

All this change impacts is creating a ByteString that you then want to serialize. This is an alternative to ByteString.CopyFrom. Comparing ByteString.CopyFrom and UnsafeByteOperation.UnsafeWrap is the most accurate comparison. Slow to fast is why people want this.

What would be more interesting is comparison of the original ByteString.Unsafe.FromBytes vs new ByteString(ReadOnlyMemory<>) and the original CreateFrom vs the new CreateFrom (which has now a different implementation)

Nothing used ByteString.Unsafe.FromBytes. I don't know what you mean by CreateFrom. There isn't an API called that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could update the benchmark to create a ByteString and then serialize it if you want, but the time to serialize it will the the same. It will just add noise onto the original number that compares creating ByteString.

/// </list>
/// </remarks>
[SecuritySafeCritical]
public static class UnsafeByteOperations

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I belive, that moving UnsafeWrap method to previously existing Unsafe class inside ByteString is good way to go.

@@ -411,23 +471,55 @@ public bool Equals(ByteString other)
/// </summary>
internal void WriteRawBytesTo(CodedOutputStream outputStream)
{
#if GOOGLE_PROTOBUF_SUPPORT_SYSTEM_MEMORY
if (MemoryMarshal.TryGetArray(bytes, out ArraySegment<byte> segment))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid possible redundant allocations I suggest adding span-based WriteRawBytes to CodedOutputStream (internally ref Span<byte> is being used).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is internal and nothing used it. I've remove WriteRawBytesTo entirely.

@JamesNK JamesNK force-pushed the jamesnk/bytestring-memory branch 2 times, most recently from 99bda0d to 9b4fb0b Compare November 8, 2020 08:28
Copy link
Contributor

@jtattermusch jtattermusch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments but otherwise I think this is looking fine.
I'm not a fan of this new API, but it seems there's demand for it and there's no impact on users that don't want to use this.

csharp/src/Google.Protobuf/UnsafeByteOperations.cs Outdated Show resolved Hide resolved
csharp/src/Google.Protobuf/ByteStringAsync.cs Outdated Show resolved Hide resolved
csharp/src/Google.Protobuf/ByteStringAsync.cs Show resolved Hide resolved
Copy link
Contributor

@jtattermusch jtattermusch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@jtattermusch jtattermusch merged commit d7a2643 into protocolbuffers:master Nov 18, 2020
@smflorentino
Copy link

@jtattermusch - when can we expect a gRPC release with this feature?

@jtattermusch
Copy link
Contributor

@jtattermusch - when can we expect a gRPC release with this feature?

In the next release (we missed the 3.14 release branch cut±) - unfortunately I'm not sure when that is going to be.

@acozzette does protobuf team have a release schedule for the open source releases yet?

@acozzette
Copy link
Member

We don't have a set schedule for releases, but they usually happen once every quarter or so.

@aaronhudon
Copy link

Glad to see the unsafe helper is making its way back. @jskeet slapped me for asking for it a few years ago :)

@jonso4
Copy link

jonso4 commented Feb 20, 2021

+1, I would love to know is there is a release date for this feature!

@smflorentino
Copy link

smflorentino commented Feb 20, 2021

@jonso4, 3.15.0-rc1 is out with the UnsafeByteOperations.UnsafeWrap API. I should know this but don't, @jtattermusch, does either grpc or grpc-dotnet need any changes to use this API to reduce allocations, or were the changes to use Span<T> sufficient?

@JamesNK
Copy link
Contributor Author

JamesNK commented Feb 20, 2021

gRPC won't use this API. It is used by you in your apps when instantiating your Protobuf generated messages. If you're creating large ByteStrings then this might be something you can use to avoid allocations and copies.

@EraYaN
Copy link

EraYaN commented Feb 26, 2021

This does seem to break those apps (For example: https://github.com/open-telemetry/opentelemetry-dotnet/blob/main/src/OpenTelemetry.Exporter.OpenTelemetryProtocol/Implementation/ActivityExtensions.cs#L346) they will have to update their code (that line creates a null result).

@JamesNK
Copy link
Contributor Author

JamesNK commented Feb 26, 2021

And that's why you don't use reflection to call private APIs.

The bad news is they're broken, the good news is they can use this new public API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

C# ByteString, CodedInputStream performance improvements desired for messages with large byte array field