[BigQuery] High Number of Memory Allocations when migrating from legacy insertAll to Storage Write API. #10034

indy-singh · 2023-03-14T16:24:48Z

Environment details

OS: Microsoft Windows Server 2019 Standard 10.0.17763 N/A Build 17763
.NET version: <TargetFramework>net6.0</TargetFramework>
Package name and version:-

<PackageReference Include="Google.Cloud.BigQuery.Storage.V1" Version="3.6.0" />
<PackageReference Include="Google.Cloud.BigQuery.V2" Version="3.2.0" />
<PackageReference Include="Google.Protobuf" Version="3.22.1" />
<PackageReference Include="Google.Protobuf.Tools" Version="3.22.1" />

Steps to reproduce

https://github.com/indy-singh/bigquery_storage_write_api_high_memory_allocation - note that I can't get this to leak, but it isn't 1:1 with production code. Just a test harness in which to try different scenarios.

Additional context: protocolbuffers/protobuf#12217

We have recently migrated from the legacy insertAll to the Storage Write API and we've found that a marked increased in memory (see screenshot in link above). The migration is 1:1 as close as can be.

I've diff'ed the methods ExpensiveStreamingInsert and StorageWriteApi from this file https://github.com/indy-singh/bigquery_storage_write_api_high_memory_allocation/blob/c3b50881b61ccaa8fddfea01945d7f67563cab15/bigquery_storage_write_api_high_memory_allocation/WatchTowerBigQuerySaver.cs (this is the actual production code) and nothing looks super suspicious. Online diff: https://www.diffchecker.com/gAJLZUKz/

Which leads to me believe we are using the Storage Write APIs incorrectly (at construction).

We basically keep one AppendRowsStream open/alive for the duration of the application lifetime (which is typically 12+ hours)

var bigQueryWriteClientBuilder = new BigQueryWriteClientBuilder
{
    Credential = googleCredential,
}.Build();

_appendRowsStream = bigQueryWriteClientBuilder.AppendRows();

Is this incorrect? https://cloud.google.com/bigquery/docs/write-api-best-practices seems to imply what we are doing is correct.

Thanks for any help.

Cheers,
Indy

The text was updated successfully, but these errors were encountered:

jskeet · 2023-03-14T16:46:57Z

That looks fine to me - I can't see anywhere that we'd end up holding on to memory there, modulo my thought below.

I'd be tempted to create a new stream somewhat more often than once every 12 hours, but that's a different matter.

One aspect I hadn't considered before is that your code doesn't appear to be reading from the responses stream from the gRPC stream (see here in the original sample code). I wonder whether that means that data is getting buffered... I don't know enough about the BigQuery Storage Write API, but that might explain things.

Please could you try including some code to read from the response stream in the background, and see if that makes any difference? (If this is the problem, we'll definitely want to update the sample to emphasize it more.)

indy-singh · 2023-03-14T16:56:29Z

Hi @jskeet ,

Happy to do that but I've just trying a slightly different tack; deployed a minor change in which we fetch the default stream every time we want to do a write. The saver pulses every 20 seconds or so, so every 20 seconds it would fetch the default stream, write, and then mark it as complete.

So we basically do:-

var appendRowsStream = _bigQueryWriteClientBuilder.AppendRows();

await appendRowsStream.WriteAsync(new AppendRowsRequest
{
    ProtoRows = protoData,
    WriteStream = _writeStreamName,
});

await appendRowsStream.WriteCompleteAsync();

Inside the while loop.

I'm wondering like you if the missing call to WriteCompleteAsync means that all the data I've ever written stays in a buffer somewhere and then eventually causes a "leak".

Cheers,
Indy

jskeet · 2023-03-14T17:02:49Z

We try to write the data as you go - we definitely don't wait for a WriteComplete method. But gRPC only allows a single request in a stream to be in flight at a time. We use BufferedClientStreamWriter to buffer requests. But actually, as you're awaiting the task returned by WriteAsync, that shouldn't be the problem (unless there's a bug in BufferedClientStreamWriter).

It's a shame that your test code doesn't reproduce the problem at the moment... but I'll give it a try anyway and see if I can see anything odd going on in the BufferedClientStreamWriter in terms of it retaining objects when it shouldn't, for example.

indy-singh · 2023-03-14T17:15:59Z

I think there is an element of time at play here, the migration to the Storage Write API clearly leaks (need to find a better word) from our monitoring tooling POV. But it's not immediate for example, I left the code overnight on DEMO and one of the boxes jumped at 21:00 for no apparent reason I'm the only one in this app company-wide so it has to be the recent changes. I'll let you know what my different tack reveals if anything.

Cheers,
Indy

jskeet · 2023-03-14T17:23:15Z

One interesting idea: in your real code, do you have multiple threads calling WriteAsync? Could the memory jump be explained by all of them effectively blocking, with the underlying linked list of requests just getting longer and longer?

We wouldn't see that in the sample code, as you're waiting for one write to complete before creating another one... but if your real code is more event-based, so you could have parallel threads writing and waiting, that could explain it.

(When running your test code in a debugger, _appendRowsStream._writeBuffer.BufferedWriteCount is always 0 for me.)

indy-singh · 2023-03-14T17:40:40Z

We don't have multiple threads calling WriteAsync but it is a very hot code path on our LIVE environment and the while loop will do the write to BQ under two OR conditions:

If it has over the max batch size (10,000)
Or if it has been x amount of time since the last write (e.g. 20 seconds) and we have at least one thing to write

I can try and reflect over that (_appendRowsStream._writeBuffer.BufferedWriteCount) and output it in the logging tomorrow.

Cheers,
Indy

jskeet · 2023-03-14T17:46:01Z

We don't have multiple threads calling WriteAsync but it is a very hot code path on our LIVE environment

Right. In your real code, do you await the task immediately, or queue it somewhere? (Even if you await it, could another job use the same thread to write again?) In the commented-out code, it does look like you're awaiting it directly, and that only that loop will write to it...

I can try and reflect over that (_appendRowsStream._writeBuffer.BufferedWriteCount) and output it in the logging tomorrow.

That would be really useful, thanks.

indy-singh · 2023-03-14T17:47:26Z

In your real code, do you await the task immediately, or queue it somewhere?

It is awaited directly, and only that loop invokes the write to BQ.

Cheers,
Indy

indy-singh · 2023-03-14T18:46:22Z

Sorry, missed this bit.

Even if you await it, could another job use the same thread to write again?

No. Anything and everything is allowed to call WatchTowerBigQuerySaver.Instance.Enqueue inside this app https://github.com/indy-singh/bigquery_storage_write_api_high_memory_allocation/blob/c3b50881b61ccaa8fddfea01945d7f67563cab15/bigquery_storage_write_api_high_memory_allocation/WatchTowerBigQuerySaver.cs#L486 but in reality only one place does call it.

But all that does is add it to the concurrent queue inside of WatchTowerBigQuerySaver which is then read from inside of the while loop invocation:

Cheers,
Indy

jskeet · 2023-03-14T18:57:36Z

Right - thanks for all the extra detail. I don't see how we could end up buffering anything unless there's a bug in BufferedClientStreamWriter - but that's always a possibility.

It might be worth something logging each time you call WriteAsync, with the number of items and how long it takes for the returned task to complete. Not because I expect anything in particular - just because it looks like "something" happens, and it'd be interesting to see whether the stats for the call change at the same time.

indy-singh · 2023-03-14T22:43:06Z

So I managed to get _appendRowsStream._writeBuffer.BufferedWriteCount to be something other than 0 (it is 1 after the call to WriteComplete) in this commit. I'm not entirely what that means yet. But I know the row(s) are already in BQ before I call WriteComplete.

So I guess that is just a signal to that instance of the stream that "I'm done with you please clean yourself up".

Cheers,
Indy

jskeet · 2023-03-15T08:46:21Z

Yes, the "it's complete" part is added as an extra entry in the queue here, which is spotted in the processing here.

The fact that it's never removed from the queue (because there's no continuation) may or may not be intentional - it's been 6 years since I wrote the first version of this code... but it shouldn't do any harm, and definitely wouldn't explain the high memory usage you're seeing.

It could be that this queuing is a red herring, of course... but it would be nice to rule it out.

Just to check my own thought process here:

You're only converting from your internal format into the API message using ByteStrings just before sending
You've used a profiler to show that a lot of the memory is taken up by ByteStrings

For the second part, of that, was that profiler running against a "real" system (sending messages) or just the early benchmark which did nothing but create the API messages? If it's the latter, it's understandable - but if you're getting a lot of ByteStrings building up in the real system (whether live or demo) then that would suggest it has to be something building up late on, rather than in your own internal queue.

indy-singh · 2023-03-15T10:47:25Z

You're only converting from your internal format into the API message using ByteStrings just before sending

Correct. See here for legacy insert and here for modern insert. I did note that this:-

new ProtoSchema
{
	ProtoDescriptor = WatchtowerRecord.Descriptor.ToProto(),
};

Also caused allocations on every invocation hence why I hoisted it up into the constructor.

You've used a profiler to show that a lot of the memory is taken up by ByteStrings
For the second part, of that, was that profiler running against a "real" system (sending messages) or just the early benchmark which did nothing but create the API messages?

Correct and it was just an early benchmark which did nothing but create the API messages. The reason behind this is legacy insert is well battle tested it has been pushing ~50k total_requests per day / ~150 million total_rows per day / ~100GB total_input_bytes per day since 05 July 2021 10:44:15 I'm using this table to gather that information.

It was a simple diff of the legacy insert vs modern insert, see what actually had changed after which a console app to reproduce and put through dotMemory.

That being said, it's clear now that ToByteString was a red herring (it still does a lot of allocations in Gen0 - ArrayPool?). I now think it's slightly more complicated that as this change has been overnight on our DEMO environment and so far is stable.

My guess is that having a long lived default stream and writing to it over a period of time (lets say hours) without ever calling WriteCompleteAsync can eventually cause a memory pressure/leak situation. I've been trying to get a faster feedback loop so we can remove the time element of this, but it's proving difficult to pin down.

Please could you try including some code to read from the response stream in the background, and see if that makes any difference?

I still want to do this, but I don't want to change too much too quickly when faced with a new problem. Perhaps one more night.

Cheers,
Indy

jskeet · 2023-03-15T10:55:34Z

I did note that this: (ProtoSchema initialization) Also caused allocations on every invocation hence why I hoisted it up into the constructor.

Yes, that was something I was going to suggest - although it wouldn't be doing a lot.

My guess is that having a long lived default stream and writing to it over a period of time (lets say hours) without ever calling WriteCompleteAsync can eventually cause a memory pressure/leak situation

It might cause issues in the BigQuery Storage API, but it shouldn't cause any local memory issues - because we don't hold on to the messages locally.

I don't know much about the BigQuery Storage API, but it sounds like it should be fine to fetch and write to the default stream frequently. If you wanted to move to use application streams, I suspect your "create stream, write, complete request" every 20 seconds might cause issues, and instead you'd want to keep it open for some time - but I'd probably guess at an hour being a reasonable time rather than "the whole lifetime of the process".

I still really don't understand why you were getting local memory issues though... next time you're making a change, if you could add the logging as per this earlier comment, that would be good to know:

It might be worth something logging each time you call WriteAsync, with the number of items and how long it takes for the returned task to complete. Not because I expect anything in particular - just because it looks like "something" happens, and it'd be interesting to see whether the stats for the call change at the same time.

(If you're not already logging the length of your own queue, that would be good to know too.)

indy-singh · 2023-03-15T17:39:44Z

Hi @jskeet,

Apologies, I've been AFK. I now have the additional logging (see attached zip). The logs are from our DEMO environment which is low traffic, so where beforeAggr ramps up to 10000 that is me load testing with good old fashioned curl.

This is the code that I logged out:-

var sw = Stopwatch.StartNew();

try
{
    sw.Restart();
    var protoData = new AppendRowsRequest.Types.ProtoData
    {
        WriterSchema = _writerSchema,
        Rows = new ProtoRows
        {
            SerializedRows = { list.Select(x => x.ToByteString()) },
        },
    };

    var timeToCreateRows = sw.Elapsed.TotalMilliseconds;
    
    sw.Restart();
    var appendRowsStream = _bigQueryWriteClientBuilder.AppendRows();
    var timeToFetchDefaultStream = sw.Elapsed.TotalMilliseconds;

    sw.Restart();
    await appendRowsStream.WriteAsync(new AppendRowsRequest
    {
        ProtoRows = protoData,
        WriteStream = _writeStreamName,
    });
    var timeToWriteAsync = sw.Elapsed.TotalMilliseconds;

    sw.Restart();
    await appendRowsStream.WriteCompleteAsync();
    var timeToWriteCompleteAsync = sw.Elapsed.TotalMilliseconds;

    if (Settings.Config.IsEnabled("OtherBigQueryStatsSaverLogsOn"))
    {
        Log.Warn(new LogDetails($"Aggregated {length} individual saves to be {localConcurrentDictionary.Count} saves")
        {
            ObjectDump = new
            {
                count = list.Count,
                duration = sw.Elapsed.TotalMilliseconds,
                remaining = _concurrentQueue.Count,
                beforeAggr = length,
                afterAggr = localConcurrentDictionary.Count,
                sizeOfRequest = list.Sum(x => x.CalculateSize()),
                timeToCreateRows,
                timeToFetchDefaultStream,
                timeToWriteAsync,
                timeToWriteCompleteAsync
            }
        });
    }
}

logs.zip

Cheers,
Indy

jskeet · 2023-03-15T17:45:37Z

Thanks, I'll have a look at the logs tomorrow.

indy-singh · 2023-03-15T17:48:51Z

There's not many, I'll need to rinse and repeat on production either later on tonight or first thing in the morning. Unfortunately, because it logs a lot of production it's not something I can leave on unattended.

jskeet · 2023-03-16T08:32:18Z

Thanks for those logs. The log entry at 2023-03-15 17:19:25.312785 which seems to take 20 seconds is slightly worrying - but I wonder whether that was while the server was busy with other initialization tasks. I'm glad you mentioned the load testing for the period where "remaining" didn't stay at 0.

But otherwise, this doesn't seem to show anything particularly concerning. Given the short timeframe, I assume you didn't spot any memory issues during this run either? Will look forward to further logs - sorry that it's taking a while to get to the bottom of this.

indy-singh · 2023-03-16T11:31:14Z

Hi @jskeet ,

I'm not sure I understand where you are getting 20 seconds from? All those time are in milliseconds (e.g. sw.Elapsed.TotalMilliseconds).

I've switched on the logs for prod since 2023-03-16T09:16:52 please find them attached I'll keep them switched on throughout the day.

prod_logs.zip

The good news is I don't see any memory issues right now, so we might have fixed by fetching the default stream and completing it around a write.

Cheers,
Indy

indy-singh · 2023-03-16T11:42:22Z

Will look forward to further logs - sorry that it's taking a while to get to the bottom of this.

Ah it's no big deal fella, it's being investigated and it's not uncommon for me to spend a few days on platform issues like this. It's actually pretty fun! 👍🏽

Cheers,
Indy

jskeet · 2023-03-16T11:55:24Z

Ah, whoops - for some reason I was assuming seconds. 20ms really shouldn't be a problem :)

Glad to hear completing the write looks like it may have fixed things. As I mentioned before, I don't know whether it's really appropriate to stream just a single request each time - but it does sound like you're well within the limits. You might want to consider streaming for somewhat longer (e.g. 10 minutes or an hour) but this isn't an area of expertise for me.

(I still don't understand why it would change the memory usage though....)

indy-singh · 2023-03-16T12:47:33Z

I still think there is a very gentle incline in memory consumption. It's hard to see over a day, but if we graph it over two weeks you can see the upward slopes. You can ignore the spikes entirely, those are post-deploy spikes which I think are to do with NET6 itself they recover after a period of time.

The real test will be the weekend on our DEMO environment. You can see that 2023-03-10 through to 2023-03-13 is flat. And after 2023-03-13 is when I started the migration to Storage Write API.

Cheers,
Indy

indy-singh · 2023-03-16T13:16:58Z

I've commandeered a DEMO box and I'm going to let:-

dotnet-counters collect --process-id my_pid --refresh-interval 30 --format csv

Run for an hour or so, to see if that will tell us anything useful.

https://learn.microsoft.com/en-us/dotnet/core/diagnostics/dotnet-counters#examples

So that googleapis/google-cloud-dotnet#10034 is easier to diagnose

indy-singh · 2023-03-16T14:11:29Z

Not sure how useful this is going to be, but at least it gives us a baseline.

demo_counters.zip

indy-singh · 2023-03-16T15:41:52Z

I rolled out the reflection code to log out bufferedWriteCount at 2023-03-16 15:26:07.47691+00 and across 495 (that's all of them so far) samples it is always 1.

Cheers,
Indy

This seems to "leak" I use that word very loosely, as it is more than likely going onto Gen0. For googleapis/google-cloud-dotnet#10034

indy-singh · 2023-03-16T16:15:38Z

So I can get this to happen:-

0893/5000       35,939 kb
0894/5000       35,969 kb
0895/5000       35,998 kb
0896/5000       36,014 kb
0897/5000       36,058 kb
0898/5000       36,092 kb
0899/5000       36,098 kb
0900/5000       36,149 kb
0901/5000       36,169 kb
0902/5000       36,211 kb
0903/5000       36,237 kb
0904/5000       36,271 kb
0905/5000       36,287 kb
0906/5000       36,339 kb
0907/5000       36,373 kb
0908/5000       36,399 kb
0909/5000       36,425 kb
0910/5000       36,458 kb
0911/5000       36,466 kb
0912/5000       36,518 kb
0913/5000       36,555 kb
0914/5000       36,586 kb
0915/5000       36,620 kb
0916/5000       36,646 kb
0917/5000       36,671 kb
0918/5000       36,696 kb
0919/5000       36,768 kb
0920/5000       36,786 kb
0921/5000       36,837 kb
0922/5000       36,871 kb
0923/5000       36,905 kb
0924/5000       36,931 kb
0925/5000       36,963 kb
0926/5000       36,970 kb
0927/5000       37,012 kb
0928/5000       37,056 kb
0929/5000       37,064 kb

With this commit. But I think that is sort of expected as there is a lot of byte[] allocation churn when adding to SerializedRows.

Cheers,
Indy

jskeet · 2023-03-30T15:53:45Z

Yes, but I want to give the chunk change from 400 -> 200 a few hours before I make another change.

Absolutely! I'm very much a fan of "one change at a time" :)

Btw, GAX doesn't use MemoryStream at all. Google.Protobuf does, but only in ByteString.FromStream/FromStreamAsync, which I don't think will be in use here. Admittedly my MemoryStream theory is purely speculative at the moment anyway :)

indy-singh · 2023-03-30T16:23:46Z

So while we wait for hours to pass by, I'm poking about the Google.Protobuf src and looking for factors of 131072.

And I can only really find:-

https://github.com/protocolbuffers/protobuf/blob/main/csharp/src/Google.Protobuf/CodedInputStream.cs#L82

https://github.com/protocolbuffers/protobuf/blob/main/csharp/src/Google.Protobuf/CodedOutputStream.cs#L63

Neither of which are multiplied at all from what I can see.

indy-singh · 2023-03-30T16:47:47Z

Poking around the leaky box memory dump using dotnet-dump and running finalizequeue

finalizequeue.log

I have no idea if any of that is relevant, or if it is just confirmation bias. But do we expect Grpc.Core.Internal.BatchContextSafeHandle to be at the top?

indy-singh · 2023-03-30T16:51:28Z

gchandleleaks.log

I'm on the edge of my knowledge here, so take all of this with a truck load of salt. But I find it sus that it finds 802 handles, and then can't find them. Weird.

jskeet · 2023-03-30T17:20:47Z

I have no idea if any of that is relevant

Me either! Weirdly enough, I can't even find that (or CallSafeHandle) in the new code. They're both here though for 2.46: https://github.com/grpc/grpc/tree/v1.46.x/src/csharp/Grpc.Core/Internal

Just to check - was this memory dump taken when using Grpc.Core, or Grpc.Net.Client (the default)? (Don't be fooled by the name of Grpc.Core.Internal - that can be seen with both implementations.)

indy-singh · 2023-03-30T21:37:37Z

The 400 -> 200 batch size change was deployed to our DEMO environment from 16:15.

The slow leak is still there, it doesn't look it has changed at all. I've no idea why the shape - I'm calling them sawtooth's - have elongated after 18:00.

Are you easily able to try consuming the response?

Yes, I've made this change, and only enabled it on one box. So I'm leaving 104 as completely disabled, 103 is consuming the response, 102 & 101 are left as is.

Just to check - was this memory dump taken when using Grpc.Core, or Grpc.Net.Client (the default)?

Taken with <PackageReference Include="Grpc.Core" Version="2.46.6" /> which I've also now ripped out as it made no discernible difference.

Cheers,
Indy

jskeet · 2023-03-30T21:42:07Z

Just to check, when you added the Grpc.Core reference, you did also set the GrpcAdapter when building the client, right?

indy-singh · 2023-03-30T21:47:21Z

Just to check, when you added the Grpc.Core reference, you did also set the GrpcAdapter when building the client, right?

Yes.

Another thing which would be useful if you could do it would be to change the "off" box to still think it's writing the records - but use a fake gRPC implementation which just discards all requests. I can help write that code if it would be useful. (There are a couple of different options there.) If we see the memory go up even with a fake, then we should be able to work on something that reproduces the issue without ever having to touch GCP, which would really help. My expectation is that you'd be able to have the same code on all machines, and just a change to how you construct the BigQueryWriteClientBuilder for the "off" machine.

Totally missed this - this sounds like a great thing to try tomorrow.

Cheers,
Indy

indy-singh · 2023-03-31T07:36:50Z

I think we have a winner;

101 - as normal
102 - as normal
103 - consume grpc response
104 - BQ saver disabled.

Next steps are to bump back up to chunks of 400 and roll out response consumption to all boxes.

It's rather black box at the moment, I'd love to know where this slow leak is actually occurring.

Cheers,
Indy

jskeet · 2023-03-31T07:54:33Z

Okay, so presumably our working theory is that if we don't consume the responses, then in the real implementation they gradually build up and are never garbage collected. In your fake implementation presumably the response never actually gets populated, or is eligible for garbage collection.

I suspect it should be possible to repro this with just a local gRPC client and server (with a simple streaming API definition) - but that's definitely not something I'd expect you to look into. If just consuming the gRPC response works, I'd stick with that :) I can ask my colleagues in the gRPC team to take a look.

indy-singh · 2023-03-31T08:23:31Z

In your fake implementation presumably the response never actually gets populated, or is eligible for garbage collection.

No, as far as I can tell the response is there and can be read. But the main factor here is time, none of the benchmarks we ran have been over six hours. I would like to fire and forget, I don't necessarily care about the response in this circumstance, It's been interesting!

Cheers,
Indy

indy-singh · 2023-03-31T12:47:34Z

Oh, I just noticed that appendRowsStream.GetResponseStream() is a disposable.

So currently I'm doing this to consume the response stream:-

await foreach (var appendRowsResponse in appendRowsStream.GetResponseStream())
{

}

I'm wondering if I should actually be doing this?

await using (var asyncResponseStream = appendRowsStream.GetResponseStream())
{
    await foreach (var appendRowsResponse in asyncResponseStream)
    {
        
    }
}

I'm also wondering if the simple act of foreach'ing over it is implicitly disposing of it?

jskeet · 2023-03-31T13:08:38Z

Hmm... I don't think actually disposing of it, but I suspect that iterating over everything has the same effect. The latter approach is probably a better approach until I can confirm for sure that the former approach is okay... although I'd really hope it is, just from a usability perspective.

indy-singh · 2023-03-31T14:51:07Z

This is with all the boxes consuming the response:-

I've just made the change to wrap the consumption in a using too.

Cheers,
Indy

jskeet · 2023-03-31T14:53:23Z

That's very good to see :) I'm planning on trying to reproduce the problem with a local server and local client next week - will you have graphs from over the weekend just to confirm that it's stayed this way?

indy-singh · 2023-03-31T14:55:52Z

Yeah mate, we have graphs going back four years 😄 I'm confident that this is the fix as it normally only required 4-6 hours for the memory baseline to start drifting.

Cheers for your help 👍🏽,
Indy

indy-singh · 2023-04-03T06:04:11Z

Morning - this is the weekend.

DEMO:

LIVE:

You can see the slow ramp at the very beginning but once the response consumption fix was deployed you can see that the slow ramp was eliminated. There is a lot of churn in Gen0 with all the byte arrays so it is still more spiky than insertAll. But the leak is fixed as far as I can tell! 👌

Cheers,
Indy

jskeet · 2023-04-03T07:10:33Z

Excellent! I'm glad that the immediate issue is fixed - now we're a little clearer about it, I'll work on reproducing it for the gRPC team.

jskeet · 2023-04-03T13:54:51Z

I've managed to reproduce this remarkably easily. I'm talking with the gRPC team about what the next steps are - but fundamentally, reading the responses is worth doing. Disposing of the underlying gRPC call prevents the leak, but appears to be "unfriendly" to the connection in some way. I'd recommend code like this:

var stream = client.AppendRows();
// With this in place, the call will be disposed even if something fails.
using var grpcCall = stream.GrpcCall;

// Make your writes here
await stream.WriteAsync(request);
await stream.WriteCompleteAsync();

// And read the responses at the end
await using var responses = stream.GetResponseStream();
await foreach (var response in responses) {}

That's making the assumption that writes can succeed without previous responses having been read. I expect that's the case, but I don't have concrete evidence for that.

indy-singh · 2023-04-03T14:10:51Z

I think perhaps there is dependent on use case; in this case (and actually all cases where we use BQ and gRPC) we don't care about the response at all. It is very much fire and forget.

await using var responses = stream.GetResponseStream();
await foreach (var response in responses) {}

I'm fine to do that, but maybe the docs need to be updated to explicitly warn about what happens if response consumption is absent?

Alternatively, if I could configure the BQ client to be "fire-and-forget" that would be ideal too.

jskeet · 2023-04-03T14:41:38Z

We'll absolutely be doing some combination of documentation and code changes.

Ideally gRPC would be able to clean up in the background aside from anything else, but if we at least make the type returned by AppendRows implement IDisposable, then it would be an improvement.

indy-singh · 2023-04-03T14:50:29Z

using var appendRowsStream = _bigQueryWriteClientBuilder.AppendRows(); would be perfect as it would fall into the path of success with static code analyzers and other tooling that warns about undisposed resources.

Yeah I gotta admit I was of the opinion that the gRPC in the background would clean itself up after the method invocation had finished. Ah well.

Do you mind tagging me when the documentation is updated? No rush, just keen to understand more of how we should be using BQ gRPC calls.

jskeet · 2023-04-03T14:55:50Z

Do you mind tagging me when the documentation is updated? No rush, just keen to understand more of how we should be using BQ gRPC calls.

Sure. It may well be in multiple stages though, as each individual component or piece of documentation is updated.

indy-singh · 2023-04-09T14:04:26Z

Hmmm, the mystery continues! It's been rock solid for a few days and I'm just checking the graphs post-easter-eggs and the slow growth has re-appeared again. To be clear, the code around consuming the response and disposal is still in and enabled.

I'll dig into it more post-lamb :)

Hope you are having a nice Easter.

Cheers,
Indy

indy-singh · 2023-04-09T21:17:17Z

It looks like all the growth is in gen 2 - we've only just rolled out querying dotnet counters so the graphs are a little patchy.

indy-singh · 2023-04-09T21:39:36Z

Ahhh, it seems like BQ was "down" and the exception path in that scenario somehow leaks? We have a few logs on our DEMO environment that seem to line up at first glance. I'm used to the legacy insertAll method logging a lot and timing out (5 seconds). The gRPC stuff has been relatively rock solid and highly performant thus far.

I've attached the raw logs.

logs.zip

Looks like all the exceptions fall into one the following seven:-

"Grpc.Core.RpcException: Status(StatusCode=""Aborted"", Detail=""Closing the stream because server is restarted. This is expected and client is advised to reconnect. Entity: projects/indy-test-watchtower/datasets/DEMO_watchtower/tables/watchtower_v2/streams/_default"")
"Grpc.Core.RpcException: Status(StatusCode=""Cancelled"", Detail=""Call canceled by the client."")
"Grpc.Core.RpcException: Status(StatusCode=""Cancelled"", Detail=""CANCELLED"", DebugException=""Grpc.Core.Internal.CoreErrorDetailException: ""CANCELLED"""")
"Grpc.Core.RpcException: Status(StatusCode=""Internal"", Detail=""Internal error encountered. Entity: projects/indy-test-watchtower/datasets/DEMO_watchtower/tables/watchtower_v2/streams/_default"")
"Grpc.Core.RpcException: Status(StatusCode=""Unavailable"", Detail=""502:Bad Gateway"")
"Grpc.Core.RpcException: Status(StatusCode=""Unavailable"", Detail=""Error starting gRPC call. HttpRequestException: An error occurred while sending the request. IOException: The request was aborted. IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.. SocketException: An existing connection was forcibly closed by the remote host."", DebugException=""System.Net.Http.HttpRequestException: An error occurred while sending the request.
"Grpc.Core.RpcException: Status(StatusCode=""Unavailable"", Detail=""Request 'AppendRows' from role 'cloud-dataengine-globalrouting' throttled: Task is overloaded (memory-protection) go/tr-t."")

I guess in the case of any RpcException I should be over writing the existing _bigQueryWriteClientBuilder with a new instance.

Cheers,
Indy

jskeet · 2023-04-11T07:03:14Z

I guess in the case of any RpcException I should be over writing the existing _bigQueryWriteClientBuilder with a new instance.

I wouldn't expect that to help, although I guess it could.

Do you definitely have the code that disposes the underlying gRPC call, even if there's an error while reading the responses? This line of code should do it:

using var grpcCall = stream.GrpcCall;

If you haven't already got that, I'd add it. (We'll make stream itself implement IDisposable soon.)

If you do already have that, I don't know where the leak has come from...

indy-singh · 2023-04-11T08:08:54Z

Do you definitely have the code that disposes the underlying gRPC call, even if there's an error while reading the responses?

No we don't. I've added it now 👍🏽

This is what we have post-add:-

using var cancellationTokenSource = new CancellationTokenSource(TimeSpan.FromSeconds(5));
var appendRowsStream = _bigQueryWriteClientBuilder.AppendRows(CallSettings.FromCancellationToken(cancellationTokenSource.Token));
// as per https://github.com/googleapis/google-cloud-dotnet/issues/10034#issuecomment-1502784359
using var grpcCall = appendRowsStream.GrpcCall;
var sizes = new List<int>();

// ToByteString creates a lot of memory churn on Gen0
// The average watchtower record when protobuf is around 162.4876265161838294 bytes (sample size of 115215).
// so we now chunk over the list in chunks of 400 which would put us at around 64,800 bytes for this list in it's entirety.
// previously the average list size is 453047.083047893235 bytes (sample size of 115319)
// which was going onto the LOH
foreach (var records in list.Chunk(400))
{
    sizes.Add(records.Sum(x => x.CalculateSize()));

    var protoData = new AppendRowsRequest.Types.ProtoData
    {
        WriterSchema = _writerSchema,
        Rows = new ProtoRows
        {
            SerializedRows = { records.Select(x => x.ToByteString()) },
        },
    };

    await appendRowsStream.WriteAsync(new AppendRowsRequest
    {
        ProtoRows = protoData,
        WriteStream = _writeStreamName,
    }).ConfigureAwait(false);
}

await appendRowsStream.WriteCompleteAsync().ConfigureAwait(false);

// this is disposable and recommended here: https://github.com/googleapis/google-cloud-dotnet/issues/10034#issuecomment-1491902289
await using (var asyncResponseStream = appendRowsStream.GetResponseStream())
{
    // need to consume the response even if we don't do anything with it
    // other slow ass memory leak: https://github.com/googleapis/google-cloud-dotnet/issues/10034#issuecomment-1491449182
    await foreach (var appendRowsResponse in asyncResponseStream)
    {

    }
}

Cheers,
Indy

indy-singh mentioned this issue Mar 14, 2023

[C#] Memory Leak in ToByteString() protocolbuffers/protobuf#12217

Closed

jskeet self-assigned this Mar 14, 2023

indy-singh added a commit to indy-singh/bigquery_storage_write_api_high_memory_allocation that referenced this issue Mar 16, 2023

Added the latest copy of the prod saver

51589c0

So that googleapis/google-cloud-dotnet#10034 is easier to diagnose

indy-singh mentioned this issue Jun 10, 2024

[BigQuery/Storage Write API] Default GrpcAdapter Memory Leak on Server 2022 #13150

Closed

etrandafir93 mentioned this issue Oct 21, 2024

BigQuery Storage Write API v1: Request 'AppendRows' from role 'cloud-dataengine-globalrouting' throttled: Task is overloaded (memory-protection) go/tr-t googleapis/java-bigquerystorage#2725

Open

[BigQuery] High Number of Memory Allocations when migrating from legacy insertAll to Storage Write API. #10034

[BigQuery] High Number of Memory Allocations when migrating from legacy insertAll to Storage Write API. #10034

Comments

indy-singh commented Mar 14, 2023 • edited Loading

Environment details

Steps to reproduce

jskeet commented Mar 14, 2023 • edited Loading

indy-singh commented Mar 14, 2023

jskeet commented Mar 14, 2023

indy-singh commented Mar 14, 2023

jskeet commented Mar 14, 2023

indy-singh commented Mar 14, 2023

jskeet commented Mar 14, 2023

indy-singh commented Mar 14, 2023

indy-singh commented Mar 14, 2023

jskeet commented Mar 14, 2023

indy-singh commented Mar 14, 2023

jskeet commented Mar 15, 2023

indy-singh commented Mar 15, 2023

jskeet commented Mar 15, 2023

indy-singh commented Mar 15, 2023 • edited Loading

jskeet commented Mar 15, 2023

indy-singh commented Mar 15, 2023

jskeet commented Mar 16, 2023

indy-singh commented Mar 16, 2023

indy-singh commented Mar 16, 2023

jskeet commented Mar 16, 2023

indy-singh commented Mar 16, 2023

indy-singh commented Mar 16, 2023

indy-singh commented Mar 16, 2023

indy-singh commented Mar 16, 2023

indy-singh commented Mar 16, 2023

jskeet commented Mar 30, 2023

indy-singh commented Mar 30, 2023

indy-singh commented Mar 30, 2023 • edited Loading

indy-singh commented Mar 30, 2023 • edited Loading

jskeet commented Mar 30, 2023

indy-singh commented Mar 30, 2023

jskeet commented Mar 30, 2023

indy-singh commented Mar 30, 2023

indy-singh commented Mar 31, 2023

jskeet commented Mar 31, 2023

indy-singh commented Mar 31, 2023

indy-singh commented Mar 31, 2023

jskeet commented Mar 31, 2023

indy-singh commented Mar 31, 2023

jskeet commented Mar 31, 2023

indy-singh commented Mar 31, 2023

indy-singh commented Apr 3, 2023

jskeet commented Apr 3, 2023

jskeet commented Apr 3, 2023

indy-singh commented Apr 3, 2023

jskeet commented Apr 3, 2023

indy-singh commented Apr 3, 2023

jskeet commented Apr 3, 2023

indy-singh commented Apr 9, 2023

indy-singh commented Apr 9, 2023

indy-singh commented Apr 9, 2023

jskeet commented Apr 11, 2023

indy-singh commented Apr 11, 2023

indy-singh commented Mar 14, 2023 •

edited

Loading

jskeet commented Mar 14, 2023 •

edited

Loading

indy-singh commented Mar 15, 2023 •

edited

Loading

indy-singh commented Mar 30, 2023 •

edited

Loading

indy-singh commented Mar 30, 2023 •

edited

Loading