Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reading large data (binary, text) asynchronously is extremely slow #593

Open
roji opened this issue Jun 5, 2020 · 187 comments
Open

Reading large data (binary, text) asynchronously is extremely slow #593

roji opened this issue Jun 5, 2020 · 187 comments
Labels
📈 Performance Issues that are targeted to performance improvements.

Comments

@roji
Copy link
Member

roji commented Jun 5, 2020

Reading a 10MB VARBINARY(MAX) value asynchronously takes around 5 seconds my machine, while doing the same synchronously takes around 20ms. Increasing the data size to 20MB increases the running time to around 52 seconds.

These numbers were with Microsoft.Data.SqlClient 1.1.3, but 2.0.0-preview4.20142.4 has the same issue (it actually seems slightly slower). Note that I'm not posting final BDN results because the benchmark ran extremely slowly.

Note that this looks very similar to #245, but that issue was fixed for 1.1.0. The difference may be that here we're writing binary data - or possibly the bigger size.

Benchmark code
[MemoryDiagnoser]
public class Benchmarks
{
    const string ConnectionString = "Server=localhost;Database=test;User=SA;Password=Abcd5678;Connect Timeout=60;ConnectRetryCount=0";


    [GlobalSetup]
    public void Setup()
    {
        using var conn = new SqlConnection(ConnectionString);
        conn.Open();

        using var cmd = conn.CreateCommand();
        cmd.CommandText = @"
IF OBJECT_ID('dbo.data', 'U') IS NOT NULL
DROP TABLE data; 
CREATE TABLE data (id INT, foo VARBINARY(MAX))
";
        cmd.ExecuteNonQuery();

        cmd.CommandText = "INSERT INTO data (id, foo) VALUES (@id, @foo)";
        cmd.Parameters.AddWithValue("id", 1);
        cmd.Parameters.AddWithValue("foo", new byte[1024 * 1024 * 10]);
        cmd.ExecuteNonQuery();
    }

    [Benchmark]
    public async ValueTask<int> Async()
    {
        using var conn = new SqlConnection(ConnectionString);
        using var cmd = new SqlCommand("SELECT foo FROM data", conn);
        await conn.OpenAsync();

        return ((byte[])await cmd.ExecuteScalarAsync()).Length;
    }

    [Benchmark]
    public async ValueTask<int> Sync()
    {
        using var conn = new SqlConnection(ConnectionString);
        using var cmd = new SqlCommand("SELECT foo FROM data", conn);
        conn.Open();

        return ((byte[])cmd.ExecuteScalar()).Length;
    }

    static void Main(string[] args)
        => BenchmarkRunner.Run<Benchmarks>();
}

Originally raised in dotnet/efcore#21147

@roji roji added the 📈 Performance Issues that are targeted to performance improvements. label Jun 5, 2020
@ErikEJ
Copy link
Contributor

ErikEJ commented Jun 5, 2020

Cannot see much relation to #245 really.

@DavoudEshtehari
Copy link
Contributor

Here is the result of the test at NetCore & Netfx:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18362.657 (1903/May2019Update/19H1)
Intel Core i7-8665U CPU 1.90GHz (Coffee Lake), 1 CPU, 8 logical and 4 physical cores
  [Host]     : .NET Framework 4.8 (4.8.4121.0), X64 RyuJIT  [AttachedDebugger]
  Job-TUQHOY : .NET Framework 4.8 (4.8.4121.0), X64 RyuJIT

IterationCount=10  LaunchCount=1  WarmupCount=10  
Method Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
Async 3,170.53 ms 222.636 ms 132.487 ms 635000.0000 634000.0000 633000.0000 13155.71 MB
Sync 37.88 ms 2.139 ms 1.415 ms 333.3333 333.3333 333.3333 20 MB
  • ManagedSNI = false
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18362.657 (1903/May2019Update/19H1)
Intel Core i7-8665U CPU 1.90GHz (Coffee Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.202
  [Host]     : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.22101), X64 RyuJIT
  Job-LXIUHK : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.22101), X64 RyuJIT

IterationCount=10  LaunchCount=1  WarmupCount=10  
Method Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
Async 4,731.62 ms 228.407 ms 151.077 ms 3000.0000 2000.0000 1000.0000 30.5 MB
Sync 36.68 ms 2.590 ms 1.713 ms 500.0000 500.0000 500.0000 20 MB
  • ManagedSNI = true
BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18362.657 (1903/May2019Update/19H1)
Intel Core i7-8665U CPU 1.90GHz (Coffee Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.202
  [Host]     : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.22101), X64 RyuJIT
  Job-OTAQAT : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.22101), X64 RyuJIT

IterationCount=10  LaunchCount=1  WarmupCount=10  
Method Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
Async 4,054.54 ms 474.155 ms 282.162 ms 3000.0000 2000.0000 1000.0000 30.5 MB
Sync 34.23 ms 1.365 ms 0.903 ms 357.1429 357.1429 357.1429 20 MB

@Wraith2
Copy link
Contributor

Wraith2 commented Jun 9, 2020

Working from memory here. Reading a MAX defined field will mean using chunked returns and the complete length of the field may not be known ahead of time so the optimization I put in place for reading won't help. If that's right then you'll see the same horrific escalating performance with increased data sizes as the reader continuously allocates and releases the temporary buffers, those GC numbers seem to support that.

It might be possible to do adapt the cached buffer we've got for multipacket known lengths to work like a list and auto expand as we go. What I can't do is prevent the repeated rescanning of the async packet queue.

@Wraith2
Copy link
Contributor

Wraith2 commented Jun 10, 2020

I've had a look into the profiles and I was wrong. The optimization from #245 is being used because it's reading as PLP data. The excessive memory usage is caused mostly by input packet buffers on the TDS side being lost into the snapshot packet chain. The time escalation is cause by the snapshot chain repeatedly being replayed.

The only currently available mitigations are to increase the packet size to as close to 16K as you can get or to use a stream reader mode which should work on a per-packet basis and not buffer the entire contents (I'm a bit hazy on that for async, it probably needs checking).

The long term fix would be to rewrite async reads to avoid replaying the entire packet chain on every receipt. This is akin to brain surgery. I've looked into this while reviewing other performance issues around async and it isn't something to attempt lightly and if I managed it solo and dropped a PR for it the team they may all justifiably quit in protest.

The

@Wraith2
Copy link
Contributor

Wraith2 commented Jun 10, 2020

Oh, and this in System.Data.Common causes about 1/3 of the allocations and really isn't helpful.

https://github.com/dotnet/runtime/blob/7a57b927c3bdc31b91b7605f25f8598b8e6e58b8/src/libraries/System.Data.Common/src/System/Data/SQLTypes/SQLBinary.cs#L61-L74

We already use type workarounds with an overlay type to access the single value field directly to create SqlBinary instances so we could do the same to get it out but we'd then be handing the user a direct reference to a mutable array instead of a copy an I've no doubt that some idiot somewhere in the world is relying on the copy semantic to make their code work.

@roji
Copy link
Member Author

roji commented Jun 11, 2020

@Wraith2

Oh, and this in System.Data.Common causes about 1/3 of the allocations and really isn't helpful.

You're referring to the fact that the binary data is copied out into an array that's given to the user, right? If so, isn't that the same between sync and async?

In any case, that's indeed a lot of allocations, but the alternative would be to return memory that's owned by SqlClient, right? If I'm understanding things right, that would mean that if the user reads the next row, the reference they get would suddenly start pointing to a totally different value... I did think about this at some point as an advanced/high-perf/"unsafe" API in dotnet/runtime#24899 - but that's tricky because of the lifetime issue, and also because the entire binary data isn't necessarily in memory if CommandBehavior.Sequential is used (as it should if values are big).

Anyway, all that stuff is less important than making async at least perform similar to sync...

@Wraith2
Copy link
Contributor

Wraith2 commented Jun 11, 2020

You're referring to the fact that the binary data is copied out into an array that's given to the user, right? If so, isn't that the same between sync and async?

Yes, and yes.

In any case, that's indeed a lot of allocations, but the alternative would be to return memory that's owned by SqlClient, right?

Each row of data is stored in an array of SqlBuffer instances stored in the reader, The SqlBuffer type is a discriminated union and it'll hold a reference to an SqlBinary. When you move row or move field in sequential mode the buffer instance is dropped to the GC. The backing byte[] array inside SqlBinary is not re-used.

If we returned the array directly it would only have an effect on multiple reads of the same field in the same row and the only way to observe a change of behaviour would be to change the contents of the array and then re-request it from the reader. It's a pretty clear optimization that I'd like to be able to make but I'm pretty sure I can't because someone will have relied on it returning a new array each time.

As you suggest the ideal approach would be to use a ReadOnlyMemory then it'd all be safe but that'll require a runtime api change and for all providers to consume it.

also because the entire binary data isn't necessarily in memory if CommandBehavior.Sequential is used (as it should if values are big).

The entire array is skipped in Sequential mode if the field is not requested, as you say.
In standard mode it will be read into memory and kept until Read() is called to move to the next row if it or any field after it in the column order is read.

Anyway, all that stuff is less important than making async at least perform similar to sync...

I can sort of understand how it might be done but it's daunting and difficult.

@roji
Copy link
Member Author

roji commented Jun 11, 2020

As you suggest the ideal approach would be to use a ReadOnlyMemory then it'd all be safe but that'll require a runtime api change and for all providers to consume it.

Wouldn't it be possible for a provider to simply support GetFieldValue<ReadOnlyMemory<byte>>(), without introducing a new API?

I can sort of understand how it might be done but it's daunting and difficult.

I can certainly understand (and appreciate) that, and I'm definitely not suggesting anyone do anything specific (was just reporting the problem). Along with #547 and possibly other async-related issues, it seems that at least documenting these limitations could help users avoid pitfalls.

BTW do you think there are any workarounds to this, besides switching to sync?

@Wraith2
Copy link
Contributor

Wraith2 commented Jun 11, 2020

Wouldn't it be possible for a provider to simply support GetFieldValue<ReadOnlyMemory>(), without introducing a new API?

I hadn't considered that, nice idea.

Currently the only workaround I can think of is to use a stream read overload because I think those drive the parser per-packet directly so you don't have to replay the packet queue, that'll need verifying.

@roji
Copy link
Member Author

roji commented Jun 11, 2020

Currently the only workaround I can think of is to use a stream read overload because I think those drive the parser per-packet directly so you don't have to replay the packet queue, that'll need verifying.

That at least sounds like a good workaround if it works! Definitely worth benchmarking that.

@Wraith2
Copy link
Contributor

Wraith2 commented Jun 11, 2020

Unless I've got something very wrong it won't work because there's a bug.

            using var conn = new SqlConnection(ConnectionString);
            using var cmd = new SqlCommand("SELECT foo FROM data", conn);
            await conn.OpenAsync();
            using var reader = await cmd.ExecuteReaderAsync(
                System.Data.CommandBehavior.SequentialAccess
            );
            await reader.ReadAsync();
            using var stream =  reader.GetStream(0);
            using var memory = new MemoryStream(16 * 1024);

            await stream.CopyToAsync(memory);
            return (int)memory.Length;

This should work but freezes in a task wait after a single read cycle, the second read never completed. IF you change it to standard mode it'll work but it does so by fetching the entire field and giving you a reader over the byte[]. So no workaround, this needs fixing if the team agree that it's a bug.

@roji
Copy link
Member Author

roji commented Jun 11, 2020

Thanks for testing this... Definitely looks like a bug to me.

IF you change it to standard mode it'll work but it does so by fetching the entire field and giving you a reader over the byte[]

Is this at least fast - in the sense that it resembles the sync perf rather than the async? If so it's at least a workaround, even if not ideal...

@Wraith2
Copy link
Contributor

Wraith2 commented Jun 11, 2020

Method Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
Async 2,181.93 ms 42.299 ms 39.567 ms 2000.0000 1000.0000 1000.0000 30.5 MB
StreamAsync 2,162.94 ms 30.055 ms 28.114 ms 2000.0000 1000.0000 1000.0000 40.52 MB
Sync 18.40 ms 0.359 ms 0.414 ms 406.2500 406.2500 406.2500 20 MB

So no. The only way to get the speed benefit Is to put the reader in sequential mode.

// For non-variant types with sequential access, we support proper streaming

There's also an interesting comment elsewhere in the file that says streaming isn't supported when using encryption. I think that makes streaming unusable for a generic consumer like EFCore because you'll have to special case it per driver.

We need to fix the sequential stream issue and then we need to rewrite async reads.

@roji
Copy link
Member Author

roji commented Jun 11, 2020

Thanks for the details @Wraith2.

@Wraith2
Copy link
Contributor

Wraith2 commented Jun 14, 2020

With the fix in #603 the numbers are better:

Method Mean Error StdDev Gen 0 Gen 1 Gen 2 Allocated
Async 2,096.68 ms 14.390 ms 13.460 ms 2000.0000 1000.0000 1000.0000 30.5 MB
StreamAsync 25.42 ms 0.499 ms 0.534 ms 2031.2500 1968.7500 1968.7500 32.36 MB
Sync 16.67 ms 0.041 ms 0.036 ms 406.2500 406.2500 406.2500 20 MB

@roji
Copy link
Member Author

roji commented Jun 17, 2020

That looks quite good, at the very least you can point people to a workaround.

@Wraith2
Copy link
Contributor

Wraith2 commented Oct 11, 2024

This is FOSS contribution for me. The PR is stuck because I can't replicate the test issue, if you can help replicate (or even better, resolve) the test issue then the fix can get merged sooner and then I can get part 3 started which is what will fix this issue.

@tibitoth
Copy link

@SimonCropp Note that @Wraith2 is not a Microsoft employee, just an OSS contributor.

@markholst
Copy link

Is Microsoft not expected to take some responsibility?

@weltkante

This comment has been minimized.

@David-Engel
Copy link
Contributor

This issue is a priority for MS and is very visible. It's just been a difficult one to address. I'm certainly appreciative of @Wraith2 's contributions to try and address it. I try to get attention on it whenever I can.

The workaround is to use sync for large types.

@SimonCropp
Copy link
Contributor

@David-Engel

The workaround is to use sync for large types.

that is only a valid workaround if the top level app owns all the sql related code. It is much more difficult to convince sql related libraries to apply the same workaround

@markholst
Copy link

markholst commented Oct 14, 2024

And, to be fair, it is a completely reasonable work-around that has been getting me by for 4 years now in our core framework.

I feel the async method implementation should for now simply wrap the sync method implementation. Move the dangerous code to the side and swap out the implementation when it's ready. It is what I would do. I doubt we'd even be having these conversations if this was done.

Alternatively, if it hasn't already been done, document the heck out of it. Use XML documentation and online documentation making it extremely clear to consumers the dangers/limitations of the current implementation.

For me the journey started in the distributed sql cache framework provided by Microsoft, which I have no control over. It wasn't obvious where the problem was and I remember literally spending hours on this issue chasing an unstable production system.

@Wraith2
Copy link
Contributor

Wraith2 commented Oct 15, 2024

That would be async over sync which would cause thread exhaustion problems.

Performance on long values was already increased greatly by #2164 so if you're on a version of SqlClient from the last year you're on the cm32 line on this graph where you used to be on the cm31 line
Image

It will eventually get to be mostly equal to sync in terms of speed but as I've already said I'm having difficulty with a complex test failure I can't reproduce and no-one else seems to be able to debug.

@rogerfar
Copy link

@Wraith2 which issue is that? I'm running into this issue quite frequently so I could potentially help out.

@markholst
Copy link

Wow, that's certainly better.

So I'm picturing both options not great for web servers handling large traffic loads.

I can't imagine the thread exhaustion problem being more significant than the CPU/memory bottlenecks that are introduced?

@Wraith2
Copy link
Contributor

Wraith2 commented Oct 15, 2024

I can't imagine the thread exhaustion problem being more significant than the CPU/memory bottlenecks that are introduced?

Thread exhaustion can be quite dramatic. Everything stops including other async calls not related to any sql because their callbacks aren't schedulable, thread exhaustion is a process wide problem. Slow queries only affects code actively running queries so it's much better.

@markholst
Copy link

Fair enough.

I certainly have a lot of experience with optimising CPU/memory bottlenecks, and very little with threadpool starvation, although I have been using async/await since the very early days so may have dodged it.

Which brings us back to square one, there is no perfect solution at this point.

Honestly this is the type of problem I would genuinely to spend hours solving, but it'd be a hard sell running past my boss.

@SeanFeldman
Copy link

i think @SeanFeldman had to do this just a few weeks ago.

We had to deal with it for a few months. What misled us was our assumption that the async issue only applies to reading large data. As it appears, that's not the case. Here are some details of our use case.

The application performs extremely frequent short calls to read a small amount of data via a stored procedure. SPROC duration is, on average, 3 msec. We had a limit of 100 connections, which later was divided into 20 dedicated connections for writes and 80 for reads. Our trouble was always with the reads, which were timing out badly. When more investigation took place on the SQL Server side, it was noticed that we're running out of connections.
Image
This explained the failure to retrieve data as requests were simply timing out.

Once the code was modified solely to use synchronous querying instead of asynchronous, the connection issue disappeared, and performance was restored.
Image

It seems like there's an issue with asynchronous API and connection pooling. I'm not an expert in claiming that, but that's what it looks like. Hopefully, this information helps.

@inloox-dev
Copy link

inloox-dev commented Oct 17, 2024

@SeanFeldman Is your app running on Azure App Service? It looks like it could be SNAT exhaustion. The maximum SNAT connection limit depends on the VM size. In our case, it was 128. We resolved the issue by integrating our App Service into a VNet.

@Wraith2 Great job on async-cm32. We had numerous issues and downtimes with our app, and we had to switch to synchronous. Sinc 5.2.0 we'Re back on async. It's working much better now. Are there any improvements expected in this area for version 6?

@David-Engel
Copy link
Contributor

Are there any improvements expected in this area for version 6?

Watch Wraith's #2714 PR. Just trying to work out the bugs over there. There seem to be some edge case failures that are difficult to pin down.

@SeanFeldman
Copy link

Is your app running on Azure App Service?

@inloox-dev it doesn't. It's doesn't. However, the issue is not the number of connections. Notice that the connections went down when we switched from async to sync API.

@JelleHissink
Copy link

Finally it seems to bubble up: https://www.youtube.com/watch?v=7iuUh0BIAuE
I hope this issue gets some more love and attention because of this.

@Wraith2
Copy link
Contributor

Wraith2 commented Nov 22, 2024

We're in the process of fixing the issue. What more attention do you want?

@JelleHissink
Copy link

@Wraith2 I'm sorry, I'm not complaining about anybody who helps with this issue. I am grateful that with the modifications you made sync and async are getting more equal in terms of performance.

However I had a number of jobs and somehow it is a constant struggle convincing people to not do async blob related work. The main thing is, MSSql is a Microsoft product, the .Net stack is driven by Microsoft, but somehow there exists a huge performance issue that takes years to plan/resolve... I think there is an issue with not recognizing the impact of this issue.

That youtube post at least gives me something where I can redirect colleagues to.

@Wraith2
Copy link
Contributor

Wraith2 commented Nov 23, 2024

I could write up a blog post with the history and details if you want.

@KLuuKer
Copy link

KLuuKer commented Nov 24, 2024

I could write up a blog post with the history and details if you want.

I'd rather you just keep working on the (super duper amazing) pull's to fix the perf. 😄
(but you do you, and if it gets posted i will still happily read it)

@edwardneal
Copy link
Contributor

What's the consensus around backporting cm32 to v5.1? It feels like a lot of the times this issue is re-raised come from using v5.1 because it's the most recent LTS release (either directly, or via EF Core.) Even assuming that v6.0 will be an LTS release, there's still another year of support left for v5.1.

I'm happy to perform that backport, but I can see a chain of nine PRs which look like they contributed to (or were prerequisites of) the v5.2 fix. That's a larger set of changes than v5.1's been serviced with before, and I'm conscious that soon the situation might change and people could update to v6.0 without stepping off the LTS train.

The PRs I'm looking at are #1866; #1544; #2121; #2122; #2126; #2132; #2144; #2157; #2164. Would this safely cover cm32 and its prerequisites @Wraith2 ?

@ErikEJ
Copy link
Contributor

ErikEJ commented Nov 28, 2024

Don't do it. Users expect 5.1.x to be stable. Spend effort on 6.0 and 6.1 / 7.0 instead.

Users can easily opt in to use 5.2.2 or 6.0

@Wraith2
Copy link
Contributor

Wraith2 commented Nov 28, 2024

What users need is a stable version with the cm32 changes and 6 seems like the best way to do that. Unless there is some reason user may not be able to migrate from 5.1 to 6.0 I would suggest letting the changes go stable with 6.0.

@edwardneal
Copy link
Contributor

That makes sense, thanks. I can't think of any supportable speedbumps between 5.1 and 6.0, and an upgrade between two LTS versions seems like a fair enough way to get the changes...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
📈 Performance Issues that are targeted to performance improvements.
Projects
None yet
Development

No branches or pull requests