
Kestrel always outputs data by chunks of 4096 (HTTP 1.1) and 65536 (HTTP 2) bytes maximum #30545

Closed
smourier opened this issue Mar 1, 2021 · 29 comments
Labels
area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions Perf

smourier commented Mar 1, 2021

I'm using .NET 5 (5.0.3) on Windows 10 20H2 (build 19042.844).

Whatever I try, my ASP.NET Core Kestrel server always writes packets to the wire with a maximum size of 4096 bytes (HTTP/1.1) or 65536 bytes (HTTP/2).

The issue is that the overall "download" performance is bad (for a set of machines and a network that would support a bigger chunk size), especially with HTTP/1.1. Sometimes the packet size is only a few bytes.

Here is code that reproduces the issue:

class Program
{
    static void Main(string[] args) => CreateHostBuilder(args).Build().Run();

    public static IHostBuilder CreateHostBuilder(string[] args) =>
        Host.CreateDefaultBuilder(args)
            .ConfigureWebHostDefaults(webBuilder =>
            {
                // uncomment for HTTP/2
                //webBuilder.ConfigureKestrel(serverOptions =>
                //{
                //    serverOptions.ConfigureEndpointDefaults(listenOptions =>
                //        listenOptions.Protocols = HttpProtocols.Http2);
                //});
                webBuilder.UseStartup<Startup>();
            });
}

class Client
{
    public async Task Download()
    {
        var client = new HttpClient();
        var url = "http://localhost:5000/Test/Download";

        var req = new HttpRequestMessage(HttpMethod.Get, url)
        {
          	// uncomment for HTTP 2
            //Version = HttpVersion.Version20,
            //VersionPolicy = HttpVersionPolicy.RequestVersionOrHigher
        };
        
        var resp = await client.SendAsync(req, HttpCompletionOption.ResponseHeadersRead).ConfigureAwait(false);
        var completed = 0L;
        using (var stream = await resp.Content.ReadAsStreamAsync().ConfigureAwait(false))
        {
            var buffer = new byte[0x100000];
            do
            {
                var read = await stream.ReadAsync(buffer.AsMemory(0, buffer.Length)).ConfigureAwait(false);
                if (read == 0)
                    break;

				// read here is always 4096 or 65536!
                completed += read;
                Console.WriteLine("Read: " + read);
            }
            while (true);
        }

        Console.WriteLine("Completed: " + completed);
    }
}

class Startup
{
    public Startup(IConfiguration configuration)
    {
        Configuration = configuration;
    }

    public IConfiguration Configuration { get; }

    public void ConfigureServices(IServiceCollection services)
    {
        services.AddControllers();
    }

    public void Configure(IApplicationBuilder app, IWebHostEnvironment env)
    {
        app.UseHttpsRedirection();
        app.UseRouting();
        app.UseEndpoints(endpoints =>
        {
            endpoints.MapControllers();
        });

		// run a client in another thread
        Task.Run(() => new Client().Download());
    }
}

[ApiController]
[Route("[controller]")]
public class TestController : ControllerBase
{
    [HttpGet]
    [Route("download")]
    public IActionResult Download()
    {
        // create some temp file
        var path = Path.Combine(Path.GetTempPath(), "MyTempFile.bin");
        if (!System.IO.File.Exists(path))
        {
            using (var file = new FileStream(path, FileMode.Create))
            {
                file.SetLength(0x400000); // 4MB
            }
        }

        // stream it out
        var stream = System.IO.File.OpenRead(path);
        return new FileStreamResult(stream, "application/octet-stream");
    }
}

This is my launchSettings.json:

{
  "iisSettings": {
	"windowsAuthentication": false,
	"anonymousAuthentication": true,
	"iisExpress": {
	  "applicationUrl": "http://localhost:54611/",
	  "sslPort": 44346
	}
  },
  "profiles": {
	"IIS Express": {
	  "commandName": "IISExpress",
	  "environmentVariables": {
		"ASPNETCORE_ENVIRONMENT": "Development"
	  }
	},
	"KTest": {
	  "commandName": "Project",
	  "environmentVariables": {
		"ASPNETCORE_ENVIRONMENT": "Development"
	  },
	  "applicationUrl": "https://localhost:5001;http://localhost:5000"
	}
  }
}    	

Things I've tried:

  • configured Kestrel with every KestrelServerOptions.Limits setting at its maximum
  • written a custom FileStreamResultExecutor (because the default one seems unconfigurable and uses a 65536-byte buffer) with a buffer size > 65536 (obviously only relevant for HTTP/2 in my test); it's used, but it doesn't change anything about the 65536 limit
  • tested other web servers just to make sure, and they don't behave this way
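
For reference, a minimal sketch of what configuring those limits looks like (the property names below are real Kestrel options, but the values are illustrative, not recommendations):

```csharp
public static IHostBuilder CreateHostBuilder(string[] args) =>
    Host.CreateDefaultBuilder(args)
        .ConfigureWebHostDefaults(webBuilder =>
        {
            webBuilder.ConfigureKestrel(serverOptions =>
            {
                // Raise the write-behind buffering limit (null = unlimited).
                serverOptions.Limits.MaxResponseBufferSize = 1024 * 1024;

                // HTTP/2 flow-control windows (values are illustrative).
                serverOptions.Limits.Http2.InitialConnectionWindowSize = 1024 * 1024;
                serverOptions.Limits.Http2.InitialStreamWindowSize = 1024 * 1024;
            });
            webBuilder.UseStartup<Startup>();
        });
```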

I think this comes from the internal ConcurrentPipeWriter, but I'm not sure why.

Is this expected? Or is this a configuration problem? Is it an OS problem? How can I get bigger chunks?

@davidfowl (Member)

I've also written a custom FileStreamResultExecutor (because the default one seems unconfigurable and uses a 65536-byte buffer) with a buffer size > 65536 (obviously only relevant for HTTP/2 in my test); it's used, but it doesn't change anything about the 65536 limit.

Kestrel internally uses a 4K buffer size, but that doesn't necessarily mean you'll see 4K buffers on the client side. There's also no buffering by default, so things will be written to the output as fast as they can be. My guess is that this combination of things is why you're seeing 4K chunks on the client side in this particular case.

  • There's no latency because client and server are on the same machine
  • And there's no buffering

I've tested other web servers just to make sure, and they don't do that

Which ones, and did you run the same test where client and server were in the same process? If so, can I see the test?

Is this expected? Or is this a configuration problem? Is it an OS problem? How can I get bigger chunks?

You may be able to see bigger chunks in your scenario by using the BodyWriter (PipeWriter) directly. You can manually buffer bigger chunks and then try flushing them all at once.
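
A sketch of that idea, assuming a hypothetical action that streams the same temp file (the route name, chunk size, and flush cadence are illustrative):

```csharp
[HttpGet("download-pipe")]
public async Task DownloadViaPipe()
{
    var path = Path.Combine(Path.GetTempPath(), "MyTempFile.bin");
    using var file = System.IO.File.OpenRead(path);

    Response.ContentLength = file.Length;
    await Response.StartAsync();

    var writer = Response.BodyWriter;
    while (true)
    {
        // Ask the writer for a large buffer and read straight into it.
        var memory = writer.GetMemory(256 * 1024);
        var read = await file.ReadAsync(memory);
        if (read == 0)
            break;

        writer.Advance(read);
        // Flush once per large chunk instead of once per 4K buffer.
        await writer.FlushAsync();
    }
}
```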

PS: I can reproduce it with your sample, I'm going to do some digging and get back to you.


Tratcher commented Mar 1, 2021

Note that for HTTP/2 the client can limit the max frame size with the SETTINGS_MAX_FRAME_SIZE connection setting. The default is 16 KB.
https://httpwg.org/specs/rfc7540.html#SETTINGS_MAX_FRAME_SIZE
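
On the server side, the knob Kestrel exposes is the max frame size it advertises for frames it receives, under Limits.Http2; a sketch with an illustrative value:

```csharp
webBuilder.ConfigureKestrel(serverOptions =>
{
    // SETTINGS_MAX_FRAME_SIZE that Kestrel advertises to its peer.
    // RFC 7540 allows values from 2^14 (16,384) to 2^24 - 1 (16,777,215).
    serverOptions.Limits.Http2.MaxFrameSize = 64 * 1024;
});
```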


smourier commented Mar 1, 2021

Hi, thanks for your quick answer.

For the other servers, my real code is more complicated and runs out-of-process (I built the in-process version only for the reproducing code). I actually spent quite some time searching for the reason.

Other tests I've done:

  • From a local IIS Express server, getting a static file: the response chunks are somewhere around 1 to 3 MB for HTTP/1.1 and a bit under 10 MB (near the input buffer size) for HTTP/2.
  • From remote servers on the web (HTTP/1.1): most send chunks of various sizes, but usually bigger than 4096 bytes.

Beyond the fact that I'd be interested in getting good throughput without any special code, I've also tried to use a custom feature deriving from IHttpResponseBodyFeature, but it seemed to me the Writer property (PipeWriter) wasn't set and was null (the Stream property isn't). I created a custom PipeWriter and returned it from my IHttpResponseBodyFeature, but it's never used/called (contrary to the Stream), so I'm unsure how I'm supposed to use a custom writer. The docs are a bit sparse on all this, IMHO.

I was wondering if this could be related to this: dotnet/runtime#43480

Note I've also tested the "old" WebClient, just in case, and it does the same.


davidfowl commented Mar 1, 2021

@smourier when I use the built in FileStreamResult on HTTP/1.1 I get this:

Read: 3958
Read: 1048576
Read: 1048576
Read: 520330
Read: 65536
Read: 65536
Read: 65536
Read: 65536
Read: 65536
Read: 851968
Read: 65536
Read: 65536
Read: 65536
Read: 65475
Read: 61
Read: 65536
Read: 65536
Completed: 4194304

Can you share your custom FileStreamResultExecutor implementation?

When using the BodyWriter, we wrap each 4K buffer chunk with a chunked-encoding prefix and suffix (the size and a newline). That might be what you're seeing.

// stream it out
await Response.StartAsync();

var stream = System.IO.File.OpenRead(path);

await stream.CopyToAsync(Response.BodyWriter);

That ends up looking like this:

Read: 3988
Read: 4089
Read: 4089
Read: 4089
Read: 4089
Read: 4089
Read: 4089
Read: 4089
Read: 4089
...
Completed: 4194304
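
The 4089-byte reads are consistent with that framing: a 4096-byte output buffer also has to carry the chunk-size line and the trailing CRLF, which for a 0xFF9-byte chunk costs 7 bytes. A quick sanity check of the arithmetic:

```csharp
// Chunked framing for one 4096-byte output buffer:
int payload = 0xFF9;                                 // 4089 bytes of data
string sizePrefix = payload.ToString("x") + "\r\n";  // "ff9\r\n" -> 5 bytes
int suffix = 2;                                      // trailing "\r\n"
int total = sizePrefix.Length + payload + suffix;
Console.WriteLine(total);                            // 4096
```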

To solve this, set the content length (if you have it):

var stream = System.IO.File.OpenRead(path);

Response.ContentLength = stream.Length;

await Response.StartAsync();

await stream.CopyToAsync(Response.BodyWriter);

This will turn off chunking and work as expected. You can also write a more controlled copy if you want to control the buffer size, but that's up to your implementation.
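
A sketch of such a "more controlled" copy with an explicit buffer size (the 256K figure is arbitrary, for illustration only):

```csharp
var stream = System.IO.File.OpenRead(path);

Response.ContentLength = stream.Length;
await Response.StartAsync();

// Copy manually so the read/write granularity is under our control.
var buffer = new byte[256 * 1024];
int read;
while ((read = await stream.ReadAsync(buffer)) > 0)
{
    await Response.Body.WriteAsync(buffer.AsMemory(0, read));
}
```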


smourier commented Mar 1, 2021

If I use your code w/o content length I get this:

Read: 3988 // first
Read: 4089
Read: 4089
Read: 4089
...
Read: 3180 // last

If I use your code with content length I get this:

Read: 3998 // first
Read: 4096
Read: 4096
Read: 4096
...
Read: 98 // last

I never saw anything like the result at the top of your last answer. That kind of result would be fine for me.

I didn't include the custom FileStreamResultExecutor code because it doesn't change anything (HTTP/1.1 or HTTP/2); I use the code as I posted it, with the built-in FileStreamResult (the one from the .NET 5 Microsoft.AspNetCore.Mvc.Core.dll). But I thought you said you could reproduce it.

@davidfowl (Member)

@smourier I was wrong. I tweaked the code slightly, and it doesn't reproduce unless you use the PipeWriter directly.

@davidfowl (Member)

This is what I see with your sample and no code changes:

Read: 3958
Read: 1048576
Read: 1048576
Read: 979082
Read: 65536
Read: 786432
Read: 65536
Read: 65536
Read: 65536
Read: 65536
Completed: 4194304

@davidfowl (Member)

OK, turns out I'm not telling the truth again. Your code does reproduce the problem. I'm looking at what I changed...

@davidfowl (Member)

AHHH I removed the HTTPS redirect!


smourier commented Mar 1, 2021

Yes, that does it! The docs say it's recommended to use it: https://docs.microsoft.com/en-us/aspnet/core/security/enforcing-ssl

If you have some spare cycles, I'd be interested in your thoughts on this; otherwise it's all good for me :-)

@davidfowl (Member)

This isn't good 😢 . I'm still debugging.

@davidfowl davidfowl added the Perf label Mar 1, 2021
@BrennanConroy BrennanConroy added this to the Next sprint planning milestone Mar 1, 2021

ghost commented Mar 1, 2021

Thanks for contacting us.
We're moving this issue to the Next sprint planning milestone for future evaluation / consideration. We will evaluate the request when we are planning the work for the next milestone. To learn more about what to expect next and how this issue will be handled you can read more about our triage process here.

@davidfowl (Member)

I think I've narrowed it down to the client side. Checking to see if I can reproduce it with SslStream directly.

@davidfowl (Member)

I have a simple repro just writing a 64K byte[] to the client side. There's good news and there's bad news... 😄

This is what it looks like in Wireshark in one of the runs:

[Wireshark screenshot]
The payload here gets sent in 2 TCP packets.

The TLS frames are 4120 bytes in size:

[Wireshark screenshot]

There are 2 inefficiencies here:

  1. Scatter/gather IO isn't supported on Streams (see "Consider adding Scatter/Gather IO on Stream", runtime#25344). This means that when the server buffers data into 4K chunks, it ends up writing them individually to the underlying Stream (SslStream in this case). That's why we end up with 4K TLS records.
  2. SslStream reads a single TLS record at a time; even though a 65K buffer is passed in, it doesn't eagerly try to fill it.

That's why you're seeing 4K reads on the client side when HTTPS is turned on. The good news is the data is all buffered on the client side; the bad news is you end up going through more reads to get the same amount of data.

TLS records are 16K max, so it doesn't make sense on the server side to issue a single write bigger than that (though the implementation will chunk those writes for you). On the client side today, the best you can do with SslStream connections is read 16K at a time. I'm not sure if that helps your scenario, though.

This is what Wireshark looks like when you do 16K writes from the server side:

[Wireshark screenshot]

cc @wfurt @stephentoub @scalablecory

@davidfowl (Member)

@wfurt we should be bulk reading TLS records so we can consume as much as possible to fill the provided buffer. I’ll open an issue.

@davidfowl (Member)

The issue is here dotnet/runtime#49000


wfurt commented Mar 2, 2021

We may. I did some of it for the handshake, but the encrypt/decrypt path is still open for improvements. I'm not sure if we would be able to process partial frames, or really process more than one frame at a time. But we may be able to loop internally and process as much data as we can. For reading, the trade-off is memory: we can read more at once, but that would need more memory per connection. That is probably cheap enough for the client side, but the server may see a bigger aggregate impact. Of course, Kestrel can also read as much as it can from the kernel, and SslStream could consume it without context switches.

@davidfowl (Member)

SslStream can reduce memory usage by mimicking what WebSockets does here: only keep enough internal buffer to represent the message header, then use the user-provided buffer to hold the payload and decrypt in place. That way you'd only need enough buffer to do the handshake and to read TLS record headers.
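
A conceptual sketch of that idea (the helpers ReadExactlyAsync, ParseRecordHeader, and DecryptInPlace are hypothetical names for illustration; real TLS record processing is considerably more involved):

```csharp
// Hypothetical read path: keep only a 5-byte TLS record header buffer
// internally, and decrypt the payload directly in the caller's buffer.
byte[] headerBuffer = new byte[5];                   // TLS record header
await ReadExactlyAsync(innerStream, headerBuffer);   // hypothetical helper
int payloadLength = ParseRecordHeader(headerBuffer); // hypothetical helper

// Read the encrypted payload straight into the user-provided buffer...
await ReadExactlyAsync(innerStream, userBuffer.Slice(0, payloadLength));
// ...then decrypt in place, avoiding an internal buffer and a copy.
int plaintextLength = DecryptInPlace(userBuffer.Slice(0, payloadLength));
```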


wfurt commented Mar 2, 2021

SslStream does in-place encrypt/decrypt, but only for one frame. This is certainly something I can look at. Do you have a good repro from your experiment, @davidfowl?
I'll need to think about it more, but I think it would still be better on large payloads to encrypt multiple 16K frames and do only a single write to the kernel. That can all be sent out as fast as the TCP windows allow. For loopback the MTU is ~65K.

@davidfowl (Member)

SslStream does in-place encrypt/decrypt, but only for one frame

SslStream currently decrypts into its own internal buffer and then copies into the user-provided buffer. You might be able to skip that altogether and just use the user buffer directly.

Do you have some good repro from your experiment @davidfowl?

Yep the issue here dotnet/runtime#49000 has a self contained runnable sample.

I'll need to think about it more, but I think it would still be better on large payloads to encrypt multiple 16K frames and do only a single write to the kernel. That can all be sent out as fast as the TCP windows allow. For loopback the MTU is ~65K.

I think there are different things to solve. You can do a single write to the underlying transport, but it would mean adding a new virtual method to Stream to allow passing an array of buffers (see the other linked issue).

We need to solve both issues because they do affect both server and client performance (upload vs download).


Drawaes commented Mar 2, 2021

I think one of the issues for SslStream with decrypting in place, when I originally looked at it, was "breaking" the Stream contract.

The reason? No other stream that is "read" destroys the buffer you passed into it, e.g.

FileStream.Read(byteBuffer, 0, byteBuffer.Length);

after the call, byteBuffer is not what you passed in.

I will take a look at my notes.

@davidfowl (Member)

@Drawaes I don't know if that's true here since the purpose of the caller passing in a buffer is to have it written into.


Tratcher commented Mar 2, 2021

The concern was around encrypting writes in place and destroying data in the write buffer. Mutating the read buffer should be fine.

@davidfowl (Member)

Right, that isn't the problem here.

@davidfowl (Member)

The issue is that the overall "download" performance is bad (for a set of machines and a network that would support a bigger chunk size), especially with HTTP/1.1. Sometimes the packet size is only a few bytes.

@smourier Do you have any measurements? We understand the fix here but don't have a good benchmark to verify if it solves the problem.

@davidfowl (Member)

We're going to use your sample as a starting point.


wfurt commented Mar 11, 2021

BTW, loopback is somewhat weird. I'm not sure about Windows (as there is no real interface), but on Linux the MTU is 64K, so either side can write that much in a single step. The question is how much that really impacts throughput, especially when latency and round-trips are not in play like they would be on a real network.

I guess doing this on a large file (10G+) and measuring download time would be a good approximation, right @davidfowl?
And perhaps comparing that to IIS (Express)?


davidfowl commented Mar 11, 2021

I guess doing this on large file (10G+) and measuring download time would be good approximation, right @davidfowl?

Yep! Let me know if you need help setting up this scenario.

Also YARP had a really interesting scenario we were looking at this morning:

Client <-> YARP <-> gRPC streaming

In the above scenario, the gRPC stream messages were 100 bytes in size, so the backend writes 100 bytes over the network in potentially bigger chunks (Kestrel does write-behind buffering in order to batch writes). As a result, YARP tries to copy 65K response chunks to the client, but because of this issue it ends up reading 100-byte TLS frames and writing those to the client instead.

The cost here is that when the async read yields, we do more work (writing to the client) in much smaller frames than we could if SslStream tried to fill our buffer.

@davidfowl (Member)

Closing this because the root cause is dotnet/runtime#49000

@ghost ghost locked as resolved and limited conversation to collaborators Apr 29, 2021
@amcasey amcasey added area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions and removed area-runtime labels Aug 24, 2023