Add ability to stream large strings in Utf8JsonWriter/Utf8JsonReader #67337
Tagging subscribers to this area: @dotnet/area-system-io

Issue Details

EDIT: See #67337 (comment) for an API proposal.

Background and motivation

I have a requirement to write large binary content in JSON. To do this, I need to encode it to base64 before writing, so the resultant JSON looks like this: { "data": "<base64-encoded bytes>" }

I have a PipeReader from which I read bytes in a loop, appending them to a list of bytes. I then convert the list into a byte array, convert that to a base64 string, and use WriteStringValue to write it:

public void WriteBinaryContent(PipeReader reader, Utf8JsonWriter writer)
{
    writer.WriteStartObject();
    writer.WritePropertyName("data");
    byte[] byteArray = ReadBinaryData(reader);
    string base64data = Convert.ToBase64String(byteArray);
    writer.WriteStringValue(base64data);
    writer.WriteEndObject();
}

public byte[] ReadBinaryData(PipeReader reader)
{
    List<byte> bytes = new List<byte>();
    while (reader.TryRead(out ReadResult result))
    {
        ReadOnlySequence<byte> buffer = result.Buffer;
        bytes.AddRange(buffer.ToArray());
        // Tell the PipeReader how much of the buffer has been consumed.
        reader.AdvanceTo(buffer.End);
        // Stop reading if there's no more data coming.
        if (result.IsCompleted)
        {
            break;
        }
    }
    // Mark the PipeReader as complete.
    reader.Complete();
    return bytes.ToArray();
}

The problem with this approach is excessive memory consumption: we need to keep the whole binary content in memory, convert it to base64, and then write it. Memory consumption is critical when using Utf8JsonWriter in the override of the JsonConverter.Write() method in a web application. Instead, I am proposing a way to stream large binary content.

API Proposal

namespace System.Text.Json
{
    public class Utf8JsonWriter : IAsyncDisposable, IDisposable
    {
        // Returns a stream to which binary data can be written.
        // The writer base64-encodes the data and writes it to the output stream.
        public Stream CreateBinaryWriteStream();
    }
}

API Usage

public void WriteBinaryContent(PipeReader reader, Utf8JsonWriter writer)
{
    writer.WriteStartObject();
    writer.WritePropertyName("data");
    using (Stream binaryStream = writer.CreateBinaryWriteStream())
    {
        StreamBinaryData(reader, binaryStream);
        binaryStream.Flush();
    }
    writer.WriteEndObject();
}

public void StreamBinaryData(PipeReader reader, Stream stream)
{
    while (reader.TryRead(out ReadResult result))
    {
        ReadOnlySequence<byte> buffer = result.Buffer;
        byte[] byteArray = buffer.ToArray();
        stream.Write(byteArray, 0, byteArray.Length);
        stream.Flush();
        // Tell the PipeReader how much of the buffer has been consumed.
        reader.AdvanceTo(buffer.End);
        // Stop reading if there's no more data coming.
        if (result.IsCompleted)
        {
            break;
        }
    }
    // Mark the PipeReader as complete.
    reader.Complete();
}

Alternative Designs

No response

Risks

No response
|
Tagging subscribers to this area: @dotnet/area-system-text-json, @gregsdennis
|
I think it should be an |
Personally, I prefer the idea of using |
That's also an option, since this is the type's only purpose. |
What's the reason for this constraint? Wouldn't it be more efficient to serve it up directly instead of embedding in a JSON payload? |
We want to serve multiple pieces of binary content in the JSON, along with some additional fields.
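For illustration (field names invented, not from the original report), the desired payload shape is roughly:

{
  "id": "item-1",
  "createdOn": "2022-03-30",
  "image": "<base64-encoded bytes>",
  "attachment": "<base64-encoded bytes>"
}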
|
@anhadi2 wouldn't it make more sense to append that data after the JSON is complete? I.e.:
|
This will not work for us. We want the response to be a valid JSON. |
Something you could do is put a URL to the data in your JSON instead of the data themselves. The data will then be transmitted in binary, much more efficiently than base64, unless you want to persist this JSON file.
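A hypothetical shape (URL and field names invented for illustration):

{ "name": "scan.pdf", "dataUrl": "https://example.com/blobs/123" }
|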
Thanks for the suggestion. Yes, there might be other ways to do this. |
This issue has been marked needs-author-action and awaits a response from the author. |
This issue has been automatically marked no-recent-activity because it has not had any recent activity. |
This issue will now be closed since it had been marked no-recent-activity but received no further activity. |
Such an API would be useful for us as well. We are sending data to a third-party API that we cannot change, and that API expects files to be sent base64-encoded as part of a JSON object. The documents can be up to 100 MB in size, and we'd like to avoid having to load the complete document into memory. |
We are in the same position of sending large base64 data in JSON to a 3rd party and being unable to change the contract. @krwq You added the needs-author-action tag which seems to have led to this issue being closed. Since a couple other people have responded now requesting this feature, can it be reopened? To answer your question about #68223, it would not solve the issue that all the data needs to be present in memory at once.

In the meantime I'll say to anyone else searching for a solution: if you have access to the underlying output Stream, then this ugly workaround should be viable:

using System.Security.Cryptography;
using System.Text;

static void WriteStreamValue(Stream outputStream, Utf8JsonWriter writer, Stream inputStream)
{
    // Flush any pending JSON so we can write to the output stream directly.
    writer.Flush();
    // Open the JSON string by hand.
    outputStream.Write(Encoding.UTF8.GetBytes("\""));
    // Base64-encode the input in chunks straight onto the output stream.
    using (CryptoStream base64Stream = new CryptoStream(outputStream, new ToBase64Transform(), CryptoStreamMode.Write, leaveOpen: true))
    {
        inputStream.CopyTo(base64Stream);
    }
    // Close the string through the writer so its internal state records that a value was written.
    writer.WriteRawValue("\"", skipInputValidation: true);
}
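A sketch of how one might call this, assuming the writer was constructed directly over the same output stream (inputStream stands for any readable source):

using var output = new MemoryStream();
using var writer = new Utf8JsonWriter(output);
writer.WriteStartObject();
writer.WritePropertyName("data");
WriteStreamValue(output, writer, inputStream);
writer.WriteEndObject();
writer.Flush();
|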
Reopening. |
@eiriktsarpalis This wouldn't be usable with custom converters in its current form; it works only for types the JSON serializer knows about. |
We have a similar issue here with a chain of custom converters and large amounts of data ( |
The proposed APIs are necessary but not sufficient to support large string streaming in custom converters. The second missing ingredient is making resumable converters public, tracked by #63795. |
namespace System.Text.Json;
public partial class Utf8JsonWriter
{
// Existing
// public void WriteStringValue(ReadOnlySpan<char> value);
// public void WriteStringValue(ReadOnlySpan<byte> value);
// public void WriteBase64String(ReadOnlySpan<byte> value);
public void WriteStringValueSegment(ReadOnlySpan<char> value, bool isFinalSegment);
public void WriteStringValueSegment(ReadOnlySpan<byte> value, bool isFinalSegment);
public void WriteBase64StringSegment(ReadOnlySpan<byte> value, bool isFinalSegment);
}
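To make the intended usage concrete, here is a minimal sketch of chunked base64 writing with the proposed segment methods (it assumes an empty final segment is a valid terminator, which the proposal does not spell out):

static void WriteBase64Streaming(Utf8JsonWriter writer, Stream input)
{
    byte[] buffer = new byte[8192]; // arbitrary chunk size
    int read;
    while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        // Each call appends more encoded bytes to the same JSON string value.
        writer.WriteBase64StringSegment(buffer.AsSpan(0, read), isFinalSegment: false);
    }
    // Terminate the string value with an empty final segment.
    writer.WriteBase64StringSegment(ReadOnlySpan<byte>.Empty, isFinalSegment: true);
}
|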
I came across this issue when trying to create a
No need to pass a flag as the Stream is self-describing. It makes for efficient CopyTo, chaining, etc., and is memory-efficient as long as all the participating Stream types are chunking. |
@eiriktsarpalis @terrajobst We're facing similar issues with OData libraries for some customers who have large text or byte[] fields in their payloads. I like the |
Back in the day, #68223 (comment) removed the public void WriteRawValue(ReadOnlySequence<char> json, bool skipInputValidation = false); overload due to the surrogate-pair handling necessary for it. This new API here would require that anyway, and it would also be the building block for that method's implementation, so it might be a good time to reintroduce it. It would also tie in nicely with #97570, for example. |
AI-related use case: when receiving a response from an LLM in JSON format, e.g. with response format

{ "confidence": 1, "citations": [...], "answer_text": "A really long string goes here, so long that you want to see it arrive incrementally in the UI" }

you may want to stream the long string as it arrives. Currently that's not really viable to do with System.Text.Json. While we don't have an API design in mind, here are some possibilities:
Evidence required

TBH we're still short on evidence that people really need to do this when working with an LLM, because:
Here's one bit of evidence that people will want to parse large strings in streaming JSON: https://www.boundaryml.com/blog/nextjs-rag-streaming-example. Again, it's not the only way to do this, but it suggests some people will want to. If anyone reading this has clear examples of LLM-related cases where they find it desirable to process a JSON response in a streaming way, please let us know!
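As a minimal sketch of today's limitation (not from the thread): with Utf8JsonReader and isFinalBlock: false, an incomplete string token is never surfaced, so a long answer_text cannot be observed while it is still arriving:

using System.Text;
using System.Text.Json;

// A deliberately truncated payload, as it might arrive mid-stream.
byte[] received = Encoding.UTF8.GetBytes("{\"answer_text\":\"a very long answer that has not fin");
var reader = new Utf8JsonReader(received, isFinalBlock: false, new JsonReaderState());
while (reader.Read()) // returns false at the incomplete string token
{
    // Only StartObject and PropertyName are surfaced; the partial string
    // value stays invisible until the rest of it arrives.
    Console.WriteLine(reader.TokenType);
}
|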
@eiriktsarpalis I'm interested in contributing to this; our codebase still relies on less-than-ideal workarounds. I see that there was another PR that attempted to address this and was closed. I'd like to get some context on why the PR was closed so I'm well aligned with expectations. Also wanted to confirm that the scope of approved APIs only covers Utf8JsonWriter and not JsonSerializer or JsonConverters or Utf8JsonReader, is that correct? Also, the different proposed methods for |
@habbes can you point out the PR you're referring to? I couldn't immediately find it skimming through the issue history. |
@eiriktsarpalis this one: #101356 |
Seems like it was auto-closed due to lack of activity after receiving some comments and being marked as draft. |