
Add ability to stream large strings in Utf8JsonWriter/Utf8JsonReader #67337

Open
anhadi2 opened this issue Mar 30, 2022 · 38 comments
Labels
api-approved API was approved in API review, it can be implemented area-System.Text.Json tenet-performance Performance related issue
Milestone

Comments

@anhadi2

anhadi2 commented Mar 30, 2022

EDIT: See #67337 (comment) for an API proposal.

Background and motivation

I have a requirement to write large binary content in JSON.
To do this, I need to encode it to Base64 before writing.
The resulting JSON looks like this:

{
  "data": "large_base64_encoded_string"
}

I have a PipeReader from which I read bytes in a loop, appending them to a list of bytes.
I then convert the list into a byte array, convert that to a Base64 string, and use WriteStringValue to write it.

public void WriteBinaryContent(PipeReader reader, Utf8JsonWriter writer)
{
    writer.WriteStartObject();
    writer.WritePropertyName("data");
    byte[] byteArray = ReadBinaryData(reader);
    string base64data = Convert.ToBase64String(byteArray);
    writer.WriteStringValue(base64data);
    writer.WriteEndObject();
}

public byte[] ReadBinaryData(PipeReader reader)
{
    List<byte> bytes = new List<byte>();
    while (reader.TryRead(out ReadResult result))
    {
        ReadOnlySequence<byte> buffer = result.Buffer;
        bytes.AddRange(buffer.ToArray());

        // Tell the PipeReader how much of the buffer has been consumed.
        reader.AdvanceTo(buffer.End);

        // Stop reading if there's no more data coming.
        if (result.IsCompleted)
        {
            break;
        }
    }

    // Mark the PipeReader as complete.
    reader.Complete();
    byte[] byteArray = bytes.ToArray();
    return byteArray;
}

The problem with this approach is excessive memory consumption: we must hold the entire binary content in memory, convert it to Base64, and then write it.

Memory consumption is critical when using Utf8JsonWriter in an override of the JsonConverter.Write() method in a web application.

Instead, I am proposing a way to stream large binary content.

API Proposal

namespace System.Text.Json
{
    public class Utf8JsonWriter : IAsyncDisposable, IDisposable
    {
        // Returns a stream to which binary data can be written.
        // The data is Base64-encoded and written to the output stream.
        public Stream CreateBinaryWriteStream();
    }
}

API Usage

public void WriteBinaryContent(PipeReader reader, Utf8JsonWriter writer)
{
    writer.WriteStartObject();
    writer.WritePropertyName("data");
    using (Stream binaryStream = writer.CreateBinaryWriteStream())
    {
        StreamBinaryData(reader, binaryStream);
        binaryStream.Flush();
    }
    writer.WriteEndObject();
}

public void StreamBinaryData(PipeReader reader, Stream stream)
{
    while (reader.TryRead(out ReadResult result))
    {
        ReadOnlySequence<byte> buffer = result.Buffer;
        byte[] byteArray = buffer.ToArray();
        stream.Write(byteArray, 0, byteArray.Length);
        stream.Flush();
        // Tell the PipeReader how much of the buffer has been consumed.
        reader.AdvanceTo(buffer.End);

        // Stop reading if there's no more data coming.
        if (result.IsCompleted)
        {
            break;
        }
    }

    // Mark the PipeReader as complete.
    reader.Complete();
}

Alternative Designs

No response

Risks

No response

@anhadi2 anhadi2 added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Mar 30, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.IO untriaged New issue has not been triaged by the area owner labels Mar 30, 2022
@ghost

ghost commented Mar 30, 2022

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.


@ghost

ghost commented Mar 30, 2022

Tagging subscribers to this area: @dotnet/area-system-text-json, @gregsdennis
See info in area-owners.md if you want to be subscribed.


@teo-tsirpanis
Contributor

I think it should be an IBufferWriter<byte> instead of a Stream. Even better, a struct implementing IBufferWriter, with a Complete method that tells the Utf8JsonWriter that writing the big string finished.
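
A rough sketch of what that could look like (hypothetical names and shape; nothing here has been reviewed):

namespace System.Text.Json
{
    public partial class Utf8JsonWriter
    {
        // Hypothetical: returns a buffer writer that Base64-encodes whatever
        // is written to it and emits the result into the JSON output.
        public Base64BufferWriter CreateBase64BufferWriter();
    }

    // Hypothetical struct-based IBufferWriter<byte>.
    public struct Base64BufferWriter : IBufferWriter<byte>
    {
        public Memory<byte> GetMemory(int sizeHint = 0);
        public Span<byte> GetSpan(int sizeHint = 0);
        public void Advance(int count);

        // Tells the Utf8JsonWriter that the large string is finished, so it can
        // flush any partial Base64 triplet and emit the closing quote.
        public void Complete();
    }
}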

@FiniteReality

with a Complete method that tells the Utf8JsonWriter that writing the big string finished.

Personally, I prefer the idea of using IDisposable here and then having the Dispose method act as the Complete method you're talking about. This is similar to how other types work like log scopes.

@teo-tsirpanis
Contributor

That's also an option, since this is the type's only purpose.

@ericstj
Member

ericstj commented Mar 30, 2022

I have a requirement to write large binary content in json.

What's the reason for this constraint? Wouldn't it be more efficient to serve it up directly instead of embedding in a JSON payload?

@anhadi2
Author

anhadi2 commented Mar 31, 2022

I have a requirement to write large binary content in json.

What's the reason for this constraint? Wouldn't it be more efficient to serve it up directly instead of embedding in a JSON payload?

We want to serve multiple binary contents in the JSON, along with some additional fields.
The actual JSON will contain some more fields; the response looks like:

{
  "value": [
    {
      "field1": "some_small_string",
      "data": "large_base64_encoded_string"
    },
    {
      "field1": "some_small_string",
      "data": "large_base64_encoded_string"
    }
  ]
}

@krwq
Member

krwq commented Mar 31, 2022

@anhadi2 wouldn't it make more sense to append that data after the JSON is complete? I.e.:

{
  "value": [
    {
      "field1": "some_small_string",
      "data": "$binary_payload"
      //or "binary_payload": true or some other marker to tell "I'll be writing rest later"
    },
    {
      "field2": "some_small_string",
      "data": "$binary_payload"
    }
  ]
}
<base64 of payload for field 1, this could even be 4 bytes of length + data directly>
<base64 of payload for field 2>

@anhadi2
Author

anhadi2 commented Apr 1, 2022

@anhadi2 wouldn't it make more sense to append that data after the JSON is complete? I.e.:

{
  "value": [
    {
      "field1": "some_small_string",
      "data": "$binary_payload"
      //or "binary_payload": true or some other marker to tell "I'll be writing rest later"
    },
    {
      "field2": "some_small_string",
      "data": "$binary_payload"
    }
  ]
}
<base64 of payload for field 1, this could even be 4 bytes of length + data directly>
<base64 of payload for field 2>

This will not work for us. We want the response to be a valid JSON.

@teo-tsirpanis
Contributor

Something you could do is put a URL to the data in your JSON instead of the data itself. The data would then be transmitted in binary, much more efficiently than with Base64.

Unless you want to persist this JSON file.

@anhadi2
Author

anhadi2 commented Apr 1, 2022

Something you could do is put a URL to the data in your JSON instead of the data themselves. The data will be transmitted in binary and much more efficiently than Base64.

Unless you want to persist this JSON file.

Thanks for the suggestion. Yes, there might be other ways to do this.
However, the response format is already decided and we cannot change it at this stage.
Hence I am looking for an efficient way to transfer the Base64-encoded binary payload as part of the JSON response body.

@krwq
Member

krwq commented Jun 7, 2022

@anhadi2 would #68223 help for your use case?

@krwq krwq added needs-author-action An issue or pull request that requires more info or actions from the author. and removed untriaged New issue has not been triaged by the area owner labels Jun 7, 2022
@krwq krwq added this to the Future milestone Jun 7, 2022
@ghost

ghost commented Jun 7, 2022

This issue has been marked needs-author-action and may be missing some important information.

@ghost ghost added the no-recent-activity label Jun 21, 2022
@ghost

ghost commented Jun 21, 2022

This issue has been automatically marked no-recent-activity because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove no-recent-activity.

@ghost

ghost commented Jul 5, 2022

This issue will now be closed since it had been marked no-recent-activity but received no further activity in the past 14 days. It is still possible to reopen or comment on the issue, but please note that the issue will be locked if it remains inactive for another 30 days.

@ghost ghost closed this as completed Jul 5, 2022
@teo-tsirpanis teo-tsirpanis closed this as not planned Won't fix, can't repro, duplicate, stale Jul 5, 2022
@oocx

oocx commented Jul 21, 2022

Such an API would be useful for us as well. We are sending data to a 3rd-party API that we cannot change, and that API expects files to be sent Base64-encoded as part of a JSON object. The documents can be up to 100 MB in size, and we'd like to avoid having to load a complete document into memory.

@ghost ghost removed the no-recent-activity label Jul 21, 2022
@mriehm

mriehm commented Jul 30, 2022

We are in the same position of sending large base64 data in JSON to a 3rd party and being unable to change the contract.

@krwq You added the needs-author-action tag which seems to have led to this issue being closed. Since a couple other people have responded now requesting this feature, can it be reopened?

To answer your question about #68223, it would not solve the issue that all the data needs to be present in memory at once.

In the meantime, for anyone else searching for a solution: if you have access to the underlying output Stream, this ugly workaround should be viable:

static void WriteStreamValue(Stream outputStream, Utf8JsonWriter writer, Stream inputStream)
{
    // Flush any pending JSON so it is safe to write to the underlying stream directly.
    writer.Flush();

    // Emit the opening quote of the string value straight to the stream.
    outputStream.Write(Encoding.UTF8.GetBytes("\""));

    // Base64-encode the input directly into the output stream, chunk by chunk.
    using (CryptoStream base64Stream = new CryptoStream(outputStream, new ToBase64Transform(), CryptoStreamMode.Write, leaveOpen: true))
    {
        inputStream.CopyTo(base64Stream);
    }

    // Emit the closing quote through the writer so its internal state
    // records that a complete value was written.
    writer.WriteRawValue("\"", skipInputValidation: true);
}

@teo-tsirpanis
Contributor

Reopening.

@davidfowl
Member

@eiriktsarpalis This wouldn't be usable with custom converters in its current form. This works for types the JSON serializer knows about.

@benaadams
Member

We have a similar issue here with a chain of custom converters and large amounts of data (0xHEX rather than Base64). An additional problem is that even doing an await JsonSerializer.SerializeAsync while inside the converters builds up the entire JSON tree for that property in memory, rather than flushing to the underlying Stream.

@eiriktsarpalis
Member

The proposed APIs are necessary but not sufficient to support large string streaming in custom converters. The second missing ingredient is making resumable converters public, tracked by #63795.

@terrajobst
Member

terrajobst commented Dec 12, 2023

  • Looks good as proposed
  • The methods have to track the state necessary to know
    • when to put an open quote
    • how to parse code points across span boundaries
  • This allows writing large strings, but this doesn't support reading them yet. Do we need to add this as well?
    • The Utf8JsonReader will fetch more buffers until the next token is contained in a single contiguous buffer, which wouldn't be efficient for large strings
    • We can solve this by letting the reader return early with a new token type (such as StringSegment). This would break consumers that aren't handling the new token type, but we could make it opt-in on the reader
    • Ideally that opt-in isn't state on the reader but is a function of which API is called (e.g. TryGetStringSegment / GetStringSegment()), because that allows multiple parties to opt in individually (e.g. a converter that doesn't opt in continues to work, whereas a new token type would make everyone throw)
namespace System.Text.Json;

public partial class Utf8JsonWriter
{
    // Existing
    // public void WriteStringValue(ReadOnlySpan<char> value);
    // public void WriteStringValue(ReadOnlySpan<byte> value);
    // public void WriteBase64String(ReadOnlySpan<byte> value);

    public void WriteStringValueSegment(ReadOnlySpan<char> value, bool isFinalSegment);
    public void WriteStringValueSegment(ReadOnlySpan<byte> value, bool isFinalSegment); 
    public void WriteBase64StringSegment(ReadOnlySpan<byte> value, bool isFinalSegment);
}
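
For illustration, a minimal usage sketch of the approved shape (the buffer size, property name, and use of an empty final segment are my assumptions, not part of the approved API):

static void WriteLargeBase64(Utf8JsonWriter writer, Stream input)
{
    writer.WritePropertyName("data");

    byte[] buffer = new byte[16 * 1024];
    int bytesRead;
    while ((bytesRead = input.Read(buffer, 0, buffer.Length)) > 0)
    {
        // isFinalSegment: false keeps the JSON string open; the writer is
        // expected to carry any partial Base64 triplet across segments.
        writer.WriteBase64StringSegment(buffer.AsSpan(0, bytesRead), isFinalSegment: false);
        writer.Flush(); // bound the writer's internal buffer growth
    }

    // Assumes an empty final segment is permitted to close the string.
    writer.WriteBase64StringSegment(ReadOnlySpan<byte>.Empty, isFinalSegment: true);
}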

@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Dec 12, 2023
@JustDre

JustDre commented Jan 22, 2024

I came across this issue when trying to create a JsonConverter<Stream>. It seems to me the simplest API would just be

public void WriteBase64String(Stream value);

No need to pass a flag as the Stream is self-describing. It makes for efficient CopyTo, chaining, etc., and is memory-efficient as long as all the participating Stream types are chunking.

@habbes
Contributor

habbes commented Jan 25, 2024

@eiriktsarpalis @terrajobst We're facing similar issues with OData libraries for some customers who have large text or byte[] fields in their payloads. I like the WriteStringValueSegment API. We would want to be able to write large string/byte[] values in chunks and flush periodically to avoid resizing the buffer. We would like to flush/write to the stream asynchronously to avoid synchronous I/O.

@eiriktsarpalis eiriktsarpalis removed their assignment Jan 25, 2024
@eiriktsarpalis eiriktsarpalis changed the title Add ability to stream large binary content to Utf8JsonWriter Add ability to stream large strings to Utf8JsonWriter Feb 15, 2024
@eiriktsarpalis eiriktsarpalis changed the title Add ability to stream large strings to Utf8JsonWriter Add ability to stream large strings in Utf8JsonWriter/Utf8JsonReader Apr 4, 2024
@Tragetaschen
Contributor

Back in the day, #68223 (comment) removed the

public void WriteRawValue(ReadOnlySequence<char> json, bool skipInputValidation = false);

overload due to the surrogate-pair handling it requires. This new API would need that handling anyway, and it would also be the building block for that method's implementation. It might be a good time to reintroduce it.

It would also tie in nicely with #97570 for example

@SteveSandersonMS
Member

SteveSandersonMS commented Jun 28, 2024

AI-related use case:

When receiving a response from an LLM in JSON format (e.g., with response format = json_object for OpenAI), it might represent something you want to process in a streaming way, e.g., if it's a chatbot answering a question and you want it to show up in realtime in the UI. For example it might be returning a JSON object like:

{ "confidence": 1, "citations": [...], "answer_text": "A really long string goes here, so long that you want to see it arrive incrementally in the UI" }

Currently that's not really viable to do with System.Text.Json.

While we don't have an API design in mind, here are some possibilities:

  • We could support this at the JsonReader level only. It would have to be able to read from a Stream and have some methods like ReadStringChunkAsync that give you the next block from a string you're currently reading (a rough sketch of this option follows after this list).
  • Or we could support it in the deserializer by mapping to properties of type Stream. However it's unclear how the developer could know which of the output object properties have been populated by the time the stream starts arriving, as the response object properties could appear in any order.
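
To make the first option concrete, a purely hypothetical shape (all names invented here; this is not a proposal):

public class StreamingJsonReader
{
    public StreamingJsonReader(Stream utf8Json);

    // Advances to the next token, analogous to Utf8JsonReader.Read.
    public ValueTask<bool> ReadAsync(CancellationToken cancellationToken = default);

    // When positioned on a string token, copies the next decoded chunk into
    // 'destination' and returns the number of chars written; 0 means the
    // string value has been fully consumed.
    public ValueTask<int> ReadStringChunkAsync(Memory<char> destination, CancellationToken cancellationToken = default);
}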

Evidence required

TBH we're still short on evidence that people really need to do this when working with an LLM, because:

  1. As far as we know, the main reason to want to do streaming at all when working with an LLM is to present the output in realtime in a chat-style UI. But in most cases a chat response will be done as plain text, not JSON. Even if the developer is thinking of using JSON because they want structured info (e.g., an array of citations, or a flag to indicate something about the response type), there are usually better alternatives. Examples:
    • If you want flags indicating the nature of the response, either have that determined by a JSON-returning planning phase that executes before you ask the LLM for its final response, or tell the LLM to embed the flag in its response in some format you decide (e.g., <is_low_certainty>).
    • If you want structured citations, tell the LLM to return them inline in the text in a known format you can find with a regex, e.g., <cite id="123">text</cite>.
  2. In non-chat scenarios, non-UI code will simply wait until the response completes before it takes action on it, so it doesn't need to parse in a streaming way.
  3. Even if you could do this via Utf8JsonReader, it would result in extremely painful code since you couldn't use a regular deserialization API and would instead have to write some kind of state machine that interprets a single specific JSON format.

Here's one bit of evidence that people will want to parse large strings in streaming JSON: https://www.boundaryml.com/blog/nextjs-rag-streaming-example. Again, it's not the only way to do this, but it suggests some people will want to.

If anyone reading this has clear examples of LLM-related cases where they find it desirable to process a JSON response in a streaming way, please let us know!

@habbes
Contributor

habbes commented Oct 16, 2024

@eiriktsarpalis I'm interested in contributing to this; our codebase still relies on less-than-ideal workarounds. I see that there was another PR that attempted to address this and was closed. I'd like to get some context on why that PR was closed so I'm well aligned with expectations. I also wanted to confirm that the scope of the approved APIs only covers Utf8JsonWriter, and not JsonSerializer, JsonConverters, or Utf8JsonReader, is that correct? And can the different proposed methods for Utf8JsonWriter be implemented in separate PRs (for ease of review and workload management)?

@eiriktsarpalis
Member

@habbes can you point out the PR you're referring to? I couldn't immediately find it skimming through the issue history.

@habbes
Contributor

habbes commented Oct 16, 2024

@eiriktsarpalis this one: #101356

@habbes
Contributor

habbes commented Oct 16, 2024

Seems like it was auto-closed due to lack of activity after receiving some comments and being marked as draft.
