-
Notifications
You must be signed in to change notification settings - Fork 4.9k
Add Utf8JsonWriter along with unit tests #34425
Conversation
callers behalf instead of throwing.
|
||
public const int MaximumInt64Length = 20; // 19 + sign (i.e. -9223372036854775808) | ||
public const int MaximumUInt64Length = 20; // i.e. 18446744073709551615 | ||
public const int MaximumDoubleLength = 128; // default (i.e. 'G'), using 128 (rather than say 32) to be future-proof. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tannergooding, I set these higher than necessary (i.e. 128) as per our discussion. Please review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this used for parsing or formatting?
For formatting, you only need 17 digits to guarantee roundtripping (unless the user asks for more). For parsing, you may have to track up to 768 significant digits in order to return the correctly rounded result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this used for parsing or formatting?
This is for formatting only. I will add a comment (and update the names).
What should be the max required length for formatting a double using the Utf8Formatter
by default, i.e. when the user doesn't pass a StandardFormat
? Should it be 17 (i.e. the smallest amount that guarantees round tripping) where the implementation of Utf8Formatter
is fixed to behave accordingly (breaking change?)? The Utf8JsonWriter
, by default, calls the default implementation of Utf8Formatter
(whatever default it sets for different .NET types). I want to make sure the max value I am using here will work for that case. Should I therefore keep it at 128? If we can guarantee 17, then it would be great if I can reduce the MaximumDoubleLength
down.
OR
Do you think the Utf8JsonWriter
should be formatting the JSON text in a round-trippable manner by default, anyway, regardless of the default behavior of Utf8JsonWriter
? Then I could set it to something like 17, but it would mean the writer becomes slower.
src/System.Text.Json/src/System/Text/Json/Writer/JsonWriterHelper.cs
Outdated
Show resolved
Hide resolved
// would result in an inconsistent state. Therefore, calling Flush before getting the current state. | ||
if (_buffered != 0) | ||
{ | ||
Flush(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we Flush here on the user's behalf or change this to a method and throw an exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@KrzysztofCwalina earlier was talking about Flush vs FlushAsync and such. If the required usage is to call Flush(Async) before being able to snap the state, it seems like a method is the way to go. It also stops a heisen-bug of the debugger ends up calling Flush for you every time you hit F10.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We end up losing parity with the Utf8JsonReader.CurrentState
property if we change it to a method, which is why I was flushing for the caller. Are we fine with it diverging here by making it a method that throws?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's still time to change the reader version to a method; too. But the property having a side effect of calling Flush seems to me to violate property rules; so this one needs to change.
Properties behaving like properties is more important than "property on the reader => property on the writer".
src/System.Text.Json/src/System/Text/Json/Writer/JsonWriterHelper.cs
Outdated
Show resolved
Hide resolved
src/System.Runtime.Serialization.Formatters/tests/BinaryFormatterTestData.cs
Outdated
Show resolved
Hide resolved
/// <param name="bytesConsumed">On exit, contains the number of bytes that were consumed from the <paramref name="source"/>.</param> | ||
/// <param name="bytesWritten">On exit, contains the number of bytes written to <paramref name="destination"/></param> | ||
/// <returns>A <see cref="OperationStatus"/> value representing the state of the conversion.</returns> | ||
public unsafe static OperationStatus ToUtf8(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesConsumed, out int bytesWritten) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this method really necessary compared to the normal Encoder methods? If yes, maybe move it to an extra file? This is a very specific implementation, with important comments and should not be changed individually without good cause, putting it into a different file might help to make sure this is never touched.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: source
=> utf16source
, destination
=> utf8destination
, or something like that. In 99.999% of the rest of corefx, ReadOnlySpan<byte>
in the context of text means UTF8, so the fact that the source here is UTF16 is surprising.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I also agree with @Tornhoof: it'd be really good if this could use the existing UTF8 encoder in some way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should really be done as code reuse (API or shared source), not code copying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should really be done as code reuse (API or shared source), not code copying.
I agree but this is a point-in-time concern.
I brought in this API (along with the supporting helpers) from corefxlab (https://github.com/dotnet/corefxlab/blob/0abb70efee949cff78a6b485f022dcd2e7896644/src/System.Text.Primitives/System/Text/Encoders/Utf16.cs#L108) to remove the dependency on https://github.com/dotnet/corefx/issues/34094. Ideally, once the Utf8 class is made public, we would replace this implementation (along with having to convert char -> byte while keeping it as UTF-16). I need the operation status based transcoding.
https://github.com/dotnet/corefxlab/blob/0abb70efee949cff78a6b485f022dcd2e7896644/src/System.Text.Primitives/System/Text/Encoders/Utf16.cs#L108
I will add a comment to replace this in the (near?) future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need the operation status based transcoding.
Per #34425 (comment), do we really?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we really?
I need to be able to do partial writes to the destination buffer since it is possible that the destination is too small for the given input.
I can't do something like the following since it would throw if _buffer
was too small.
int written = Encoding.UTF8.GetBytes(escapedValue, _buffer.Slice(idx));
I would have to use the pointer overload and calculate how many characters will fit into the leftover destination buffer, up front.
Maybe something like:
fixed (char* charPtr = temp1)
{
fixed (byte* bytePtr = destination)
{
int written = Encoding.UTF8.GetBytes(charPtr, destination.Length / 3, bytePtr, destination.Length);
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these implementations have diverged with all the intrinsic work put into corelib version and comments are outdated https://github.com/dotnet/runtime/blob/eeef4b14c5a137c2f665a9be4f30e4383d1038da/src/libraries/System.Text.Json/src/System/Text/Json/Writer/JsonWriterHelper.Transcoding.cs#L28
@eiriktsarpalis can it be replaced now or is it permanent?
btw OperationStatus can be replaced with value tuple throughout this project (even for .net framework).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, could you create a new issue please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteProperties.DateTime.cs
Outdated
Show resolved
Hide resolved
src/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteProperties.String.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the specific Write partial classes for the individual supported types, are those auto-generated or manually created?
src/System.Text.Json/src/System/Text/Json/Writer/UnicodeScalar.cs
Outdated
Show resolved
Hide resolved
src/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteProperties.DateTime.cs
Show resolved
Hide resolved
src/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteProperties.Guid.cs
Show resolved
Hide resolved
src/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteProperties.Helpers.cs
Outdated
Show resolved
Hide resolved
src/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteProperties.Helpers.cs
Outdated
Show resolved
Hide resolved
// and exclude characters that need to be escaped by adding a backslash: '\n', '\r', '\t', '\\', '/', '\b', '\f' | ||
// | ||
// non-zero = allowed, 0 = disallowed | ||
private static ReadOnlySpan<byte> AllowList => new byte[byte.MaxValue + 1] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@GrabYourPitchforks, please review the escaping logic here. Note that this is a temporary workaround and we should consider replacing this implementation with a publicly shipping API once we have something suitable.
} | ||
|
||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
public static void ValidateProperty(ref ReadOnlySpan<byte> propertyName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a lot of places that pass Span as ref
. It makes the code very hard to reason about because of you can never tell whether the methods change the argument. Could you please change the Spans to be passed as in
where the value is not changed by the caller and passing by reference is just used as an optimization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and passing by reference is just used as an optimization?
Is it actually a meaningful optimization? We don't do it in pretty much any other place we pass around spans, and if it was meaningful, I'd expect it to be something the JIT should do instead of forcing the developer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had the thought that in
was probably meant, too; then wondered if it was because the compiler makes defensive copies when Slice
is called, making in
an anti-optimization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it actually a meaningful optimization?
In certain cases, it improves performance by 10-20% (in
or ref
rather than pass by value)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In certain cases, it improves performance by 10-20% (in or ref rather than pass by value)
This call site makes a 10-20% difference? On what test? You can prove me wrong, but I find that really hard to believe. And if it's true, to me that suggests a codegen problem that should be fixed rather than contorting our code.
I could believe that there are a few methods where for one reason or another the extra word argument pushes it over some threshold in some optimization, maybe some argument that could have been passed in a register is now passed on the stack, or some such thing. But I'm very skeptical that it then translates into benefiting from passing around all of these by ref or in.
I don't like this practice for a few reasons:
- It causes complications when trying to do things like stackalloc spans due to compiler errors when passing around multiple ref structs, some by ref. And in many cases, the compiler is flagging very real problems; as Jeremy highlighted, the hack to use unsafe to stackalloc a pointer and create a span from it in order to silence the compiler error actually resulted in leaking garbage pointers up the stack.
- We do not believe customers should have to write code like this, and applying such patterns across a codebase (and one that we're distributing in a source package no less) will suggest to folks "this is how you write code with spans", when it's not actually what we recommend.
- If it's really, truly necessary to get good performance, then that suggests we have much deeper rooted issues that should be addressed at their core rather than working around it in this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still believe it is in our collective best interest to pass small structs
Follow up question:
Should we pass by value even if the span is being passed around in a deep call graph?
For example, something like this:
WriteStringValue(ReadOnlySpan<char> value, bool suppressEscaping = false)
=> WriteStringSuppressFalse(in ReadOnlySpan<char> value)
=> WriteStringByOptions(in ReadOnlySpan<char> value)
=> WriteStringMinimized(in ReadOnlySpan<char> escapedValue)
=> WriteStringValue(in ReadOnlySpan<char> escapedValue, ref int idx)
Similarly, when we have 2 spans as arguments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is example of callstack where it would be ok to use some amount of AggresiveInlining to reduce its depth.
For example, WriteStringSuppressFalse has just a single callsite, so it would ok to inline it into the caller - either using AggresiveInlining; or by just not having the separate method at all and just inlining the code in manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so it would ok to inline it into the caller - either using AggresiveInlining; or by just not having the separate method at all and just inlining the code in manually.
Makes sense. I don't want to remove the separate method since it is easier to read what the code is doing when its split up into smaller methods. I will keep # of callsites in mind and mark methods as AI if they have a single reference (and I notice a performance improvement when doing so).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, yes.
Whether in
actually helps depends on a lot of factors: what each level does with the span, the relative frequencies of operations on the span, the impact of other code in each method, whether the intermediate layers get inlined, the target ABI, and so on. And the perf characteristics will change over time as the jit improves handling of structs in general.
If -- as your example implies -- you simply pass the span down to a single callee then in
and by-value (implicit by-ref, on windows) will generate identical code on Windows x64 as the jit can recognize that in the apparent by-value case no copy is needed.
src/System.Text.Json/src/System/Text/Json/Writer/JsonWriterHelper.cs
Outdated
Show resolved
Hide resolved
src/System.Text.Json/src/System/Text/Json/Writer/JsonWriterHelper.cs
Outdated
Show resolved
Hide resolved
src/System.Text.Json/src/System/Text/Json/Writer/UnicodeScalar.cs
Outdated
Show resolved
Hide resolved
Any other feedback? I would like to merge this as soon as possible, unless there is blocking feedback. I have addressed (and resolved) all the actionable feedback that's within the scope of this PR.
@bartonjs, I am no longer passing spans by ref, so this issue has been resolved. |
@dotnet-bot test Windows x64 Debug Build |
src/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteProperties.DateTime.cs
Outdated
Show resolved
Hide resolved
src/System.Text.Json/src/System/Text/Json/Writer/Utf8JsonWriter.WriteProperties.DateTime.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, preferably with ReadOnlySpan<byte>
method parameters (things in the input role that ideally would have been char8/utf8string) given the utf8 prefix as recently discussed :smile;
@dotnet-bot test Windows x64 Debug Build |
@dotnet-bot test Windows x64 Debug Build |
* Move JsonReader related files to the Reader directory. * Update S.T.Json ref to include new JsonWriter APIs. * Port initial JsonWriter files from corefxlab. * Auto-generate System.Text.Json ref to update the ordering for consistency. * Fixed some TODOs, delete dead code, and formatting cleanup. * Make use of self-descriptive constants where possible. * Fix leftover TODOs and update throw helper/exception messages. * Add xml comments/docs to the public surface area. * Add JsonWriter unit tests. * Change GetCurrentState back to a property and explicitly Flush on the callers behalf instead of throwing. * Save current depth as part of state and update tests. * Fix constant names, account for quotes in bytes needed, and add fixed buffer writer tests. * Fix inconsistent use of braces by adding them for single line ifs * Update parameter name to exclude encoding. * Remove JsonWriterException and use InvalidOperationException instead. * Use Rune and Utf8Formatter/TryFormat in more places and remove UnicodeScalar. * Fix nits, typos, and reorder field assignment and method calls. * Pass spans by in (or by value) instead of by ref. * Update comments and remove unnecessary test. * Remove some aggressive inlining and pass spans by value rather than in * Update comments, dont compute bytes needed, and use if instead of loop before advancing. * Reduce code bloat by removing duplciate calls to ValidateX. * Add details on how .NET types are formatted to comments. * Reduce code duplication when writing values. * Change the StandardFormat used for DateTime(Offset) to 'O' * Refactor calculating the maximum escaped length to a helper. * Remove unnecessary checks and rename locals to be more descriptive. * Rename suppressEscaping to escape and flip default from false to true. * Comment cleanup, add debug.asserts, and move transcoding helpers to a separate file. * Increase the deicmal max size to account for sign and add tests. * Rename ROS<byte> property name and value params to include utf8 in the name. * Remove redundant code where idx is set to 0 unnecessarily. * Remove dead code (dont escape forward slash) and make tests culture invariant. Commit migrated from dotnet/corefx@f84927d
Implements the priority-0 APIs proposed in https://github.com/dotnet/corefx/issues/33552
~90% line and branch coverage
TODO:
PipeWriter
and customIBufferWriter
implementations that have different semantics (such asFixedSizedBufferWriter
)). Issue to improve code coverage: https://github.com/dotnet/corefx/issues/34570WriteNumberArray
andWriteStringArray
"accelerator" APIs to the future (i.e. outside of this PR) as they are not crucial and mainly there for convenience/performance.cc @joshfree, @KrzysztofCwalina, @bartonjs, @steveharter, @stephentoub, @GrabYourPitchforks, @davidfowl, @tannergooding, @Tornhoof