
Google.Cloud.Diagnostics: ISpan is doing blocking I/O #2791

Closed
erdalsivri opened this issue Jan 7, 2019 · 23 comments
Labels
api: cloudtrace (Issues related to the Cloud Trace API.)
priority: p2 (Moderately-important priority. Fix may not be included in next release.)
type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)

Comments

@erdalsivri

We are using the Google.Cloud.Diagnostics.{Common,AspNetCore} packages for our Asp.Net Core web application. The Google Cloud tracer seems to be doing a blocking flush to send traces to the server. In production we've identified several hundred threads with Google.Cloud.Diagnostics.Common.FlushableConsumerBase in their stack trace.

We are using CloudTraceMiddleware, which calls Dispose on the span object. Here is the complete call chain:

CloudTraceMiddleware.Invoke -> Span.Dispose -> SimpleManagedTracer.EndSpan -> SimpleManagedTracer.Flush -> IConsumer<TraceProto>.Receive

There is an async version of the Receive method but it is not used by the Span class. Is there a way to make spans make non-blocking calls?

Environment details

  • Windows
  • .Net Core 2.1
  • Google.Cloud.Diagnostics.AspNetCore 3.0.0-beta06

Thanks!

@amanda-tarafa
Contributor

I'll take a look at this tomorrow.

@amanda-tarafa amanda-tarafa self-assigned this Jan 7, 2019
@amanda-tarafa amanda-tarafa added priority: p2 Moderately-important priority. Fix may not be included in next release. status: investigating The issue is under investigation, which is determined to be non-trivial. labels Jan 7, 2019
@erdalsivri
Author

My colleagues pointed out that the library does timed buffering by default. We use the default settings, so the interval seems to be 5 seconds. TimedBufferingConsumer starts a timer and calls Flush every 5 seconds. Flush acquires a semaphore, does blocking I/O to send the traces, and releases the semaphore. I don't think sending the traces collected over a 5-second period should take long, so even though the I/O is blocking, since it happens only once every 5 seconds it doesn't explain the 500 blocked threads we've seen in production. However, since we are using the default settings, exceptions are being ignored, so maybe when the tracer starts hitting errors and long delays for some reason, the timer keeps ticking and new threads block on the semaphore. This is just a wild guess though.

TimedBufferingConsumer wraps the Flush call inside a try-catch block and ignores all exceptions, so it is difficult to tell whether there is in fact a problem communicating with the Google trace servers. I am following feature request #2182.
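
For reference, here is a minimal sketch of the timed-buffering pattern described above, with illustrative names only (this is not the library's actual code): a timer flushes a shared buffer periodically, and both Receive and Flush contend for the same SemaphoreSlim, with the send happening while the semaphore is held.

using System;
using System.Collections.Generic;
using System.Threading;

class TimedBufferingConsumerSketch<T>
{
    private readonly SemaphoreSlim _semaphore = new SemaphoreSlim(1, 1);
    private readonly Action<IReadOnlyList<T>> _send; // blocking send to the trace server
    private readonly Timer _timer;
    private List<T> _items = new List<T>();

    public TimedBufferingConsumerSketch(Action<IReadOnlyList<T>> send, TimeSpan interval)
    {
        _send = send;
        // The callback runs on a thread-pool thread every interval, even if a
        // previous flush is still in progress.
        _timer = new Timer(_ => Flush(), null, interval, interval);
    }

    public void Receive(IEnumerable<T> items)
    {
        _semaphore.Wait(); // blocks if a slow Flush currently holds the semaphore
        try { _items.AddRange(items); }
        finally { _semaphore.Release(); }
    }

    public void Flush()
    {
        _semaphore.Wait();
        try
        {
            if (_items.Count == 0) return;
            var batch = _items;
            _items = new List<T>();
            _send(batch); // blocking I/O performed while the semaphore is held
        }
        finally { _semaphore.Release(); }
    }
}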

@amanda-tarafa
Contributor

Hi @erdalsivri, here are some notes on your questions:

CloudTraceMiddleware.Invoke -> Span.Dispose -> SimpleManagedTracer.EndSpan -> SimpleManagedTracer.Flush -> IConsumer<TraceProto>.Receive

There is an async version of the Receive method but it is not used by the Span class. Is there a way to make spans make non-blocking calls?

The problem with having the Span class use IConsumer.ReceiveAsync is that IDisposable does not support async Dispose, so we would have to do some fire-and-forgetting, which I'm not sure we want to do. I think we could implement a CloudTraceAsyncMiddleware, but I'll have to discuss this with the rest of the team and this won't happen before next week.
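
To illustrate the constraint (the names here are hypothetical, not the library's API): a class that only implements IDisposable has to either block inside Dispose or fire-and-forget, and anything else needs a separate, explicitly-called async method.

using System;
using System.Threading.Tasks;

// Hypothetical span wrapper, only to illustrate the Dispose constraint.
class SpanSketch : IDisposable
{
    private readonly Func<Task> _flushAsync;
    public SpanSketch(Func<Task> flushAsync) => _flushAsync = flushAsync;

    public void Dispose()
    {
        // Option 1: block the calling thread until the async flush completes.
        _flushAsync().GetAwaiter().GetResult();

        // Option 2 (fire-and-forget): don't observe the task at all;
        // exceptions are then unobserved and completion order is not guaranteed.
        // _ = _flushAsync();
    }

    // Option 3: a non-standard DisposeAsync-style method that a hypothetical
    // async middleware would have to call explicitly instead of Dispose.
    public Task DisposeAsync() => _flushAsync();
}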

That said:

Google Cloud tracer seems to be doing a blocking flush to send traces to the server. On production we've identified several hundreds of threads having Google.Cloud.Diagnostics.Common.FlushableConsumerBase in their stack trace.

This shouldn't be blocking to send traces to the server unless you have configured CloudTraceMiddleware in a very specific way. By default we use TimedBufferingConsumer, which on Receive only adds the trace to a local collection. The collection is flushed to the server whenever the timer is up, but that is done on another thread, "the timer" one.

Can you share how you have configured CloudTraceMiddleware? You can see some samples here.

@amanda-tarafa
Contributor

amanda-tarafa commented Jan 8, 2019

Oh, it seems we replied to the thread at almost the same time.
#2182 is currently on the backlog (i.e. no date planned for doing this), mainly because we don't have a good solution for it: if tracing/logging is failing, then we don't have a "good" place to log those failures.

I'm taking a closer look to see if there is a bug with the TimedBufferingConsumer that might be blocking some threads. I'll update here.

@erdalsivri
Author

erdalsivri commented Jan 8, 2019

I think one improvement would be fixing #2182, and the other would be calling FlushAsync instead of Flush inside the timer callback. Timer doesn't support an async callback, so it would need to be async void (a bit dangerous), or maybe SequentialThreadingTimer could be used instead of a Timer. Of course it would need to be modified to take a Func<Task> instead of an Action.
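
A rough sketch of those two options, with illustrative names (this is not the library's code): a System.Threading.Timer forces an async void callback, whereas a sequential loop awaits each flush before scheduling the next one.

using System;
using System.Threading;
using System.Threading.Tasks;

class AsyncFlushTimerSketch
{
    private readonly Func<Task> _flushAsync;
    public AsyncFlushTimerSketch(Func<Task> flushAsync) => _flushAsync = flushAsync;

    // Option A: System.Threading.Timer with an async void callback. Ticks keep
    // firing even if a previous flush is still running, and exceptions escape
    // onto the thread pool ("a bit dangerous").
    public Timer StartWithTimer(TimeSpan interval) =>
        new Timer(async _ => await _flushAsync(), null, interval, interval);

    // Option B: a sequential loop that awaits each flush before waiting for the
    // next interval, so flushes can never overlap or pile up.
    public async Task RunSequentiallyAsync(TimeSpan interval, CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            try
            {
                await Task.Delay(interval, token);
                await _flushAsync();
            }
            catch (OperationCanceledException) { /* shutting down */ }
            catch { /* swallow flush errors, mirroring the current behaviour */ }
        }
    }
}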

I will try to use SizedBufferingConsumer and try to log exceptions to get a better idea of what is happening. Since CloudTraceMiddleware calls Dispose, writes will then definitely be blocking, because ReceiveWithSemaphoreHeld calls FlushWithSemaphoreHeld when the buffer is full. Unless we keep the buffer small this shouldn't be a big problem, but I think it is better to make the writes non-blocking.

CloudTraceMiddleware explicitly calls Dispose on the ISpan object (it is not using a using block), so it could just as well call a non-standard DisposeAsync equivalent. This is just a suggestion; I don't have data to back the claim that blocking writes are hurting the app significantly.

@amanda-tarafa
Contributor

I've submitted #2792, which I think might help and which fixes a bug anyway. As for your last comments:

I think one improvement would be fixing #2182.

This is non-trivial, and that's why it's currently in the backlog. That doesn't mean it is forgotten, just that it's not prioritized or in the active pipeline. But honestly, I don't see how swallowing the exceptions as we do now would make a thread block, and even if the timer thread is blocked by this, your application thread wouldn't be blocked (unless the exception happened when releasing the semaphore, which is highly unlikely), so your application would be responding normally. The symptom would be fewer traces logged.

and the other would be calling FlushAsync instead of Flush inside the timer callback

Again, even if calling Flush and not FlushAsync is blocking the timer thread, I don't see how that can block your application thread (which is what I understand is being blocked when you say blocking I/O). The symptom here would be fewer traces logged, not blocked application threads, I think.

I will try to use SizedBufferingConsumer and try to log exceptions to get a better idea about what is happening.

Yes, that seems like a way to find if there are exceptions being thrown at some point.

CloudTraceMiddleware is explicitly calling Dispose on the ISpan object (it is not using using block) so it could as well call a non-standard DisposeAsync equivalent.

Yes, that would be similar to what I was proposing with implementing CloudTraceAsyncMiddleware. But take into account that even in CloudTraceMiddleware the Invoke method is async Task, so even if we were to call some DisposeAsync equivalent, the pipeline (or what I've been calling your application thread) would await the Invoke method, which in turn would await the DisposeAsync equivalent, and in the end you would have the same blocking behaviour. The key to having tracing not interfere with your application thread is to have both the receiving and the writing to Stackdriver done in a fire-and-forget way, which I'm not sure we want to do right now. But the closest you can get to that is using the TimedBufferingConsumer, which writes to Stackdriver on the timer thread, except for the write performed when disposing.
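
As an illustration of that fire-and-forget idea (the names below are stand-ins, not the library's types): the request path only queues the work and returns, and any failure has to be swallowed or logged elsewhere.

using System;
using System.Threading.Tasks;

static class FireAndForgetSketch
{
    // receiveAsync is a stand-in for sending the finished span to Stackdriver.
    public static void EndSpanFireAndForget(Func<Task> receiveAsync)
    {
        // The request thread returns immediately; the write continues in the
        // background, so its latency and failures never touch the request.
        _ = Task.Run(async () =>
        {
            try { await receiveAsync(); }
            catch { /* nowhere good to report this, see #2182 */ }
        });
    }
}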

Could you send the following info so we can keep looking at this in more detail:

  • When you say blocking I/O, can you confirm that your application threads are blocking, that is, threads that are handling a user request? What are the symptoms in your application, i.e. are responses taking too long or timing out?
  • You mentioned that you could see FlushableConsumerBase on the stack trace of the blocked threads, can you see in which method exactly are they being blocked? Can you send the whole stack trace?
  • How old is the oldest blocked thread, a few seconds, a few minutes, hours...?
  • How does the volume of blocked threads compare with the volume of requests your application receives?
  • And if you could send the simplest application possible that reproduces this, that would be great.

@erdalsivri
Author

I've submitted #2792, which I think might help and which fixes a bug anyway. As for your last comments:

I think one improvement would be fixing #2182.

This is non-trivial, and that's why it's currently in the backlog. That doesn't mean it is forgotten, just that it's not prioritized or in the active pipeline. But honestly, I don't see how swallowing the exceptions as we do now would make a thread block, and even if the timer thread is blocked by this, your application thread wouldn't be blocked (unless the exception happened when releasing the semaphore, which is highly unlikely), so your application would be responding normally. The symptom would be fewer traces logged.

I agree fixing #2182 will in no way affect threads. I just thought it could be useful to see whether exceptions were being eaten up. I understand it may be difficult and it is definitely not urgent for us. You can consider this a "me too" for that bug.

and the other would be calling FlushAsync instead of Flush inside the timer callback

Again, even if calling Flush and not FlushAsync is blocking the timer thread, I don't see how that can block your application thread (which is what I understand is being blocked when you say blocking I/O). The symptom here would be fewer traces logged, not blocked application threads, I think.

Sorry, I wasn't clear when I said threads were getting blocked. Our application request threads are not blocked, but we see many blocked threads in the dump. I assume Timer executes the callback on a thread from the ThreadPool, and those threads can be blocked waiting on the Semaphore if some other thread is holding it at the same time. This is quite unlikely, assuming it doesn't take many seconds to send traces to the server, but maybe that's what happened in our application when the thread count got high. Maybe a temporary network problem caused flushes to be delayed and Timer threads piled up blocking on Semaphore.Wait, which could be changed to Semaphore.WaitAsync to yield the thread back to the thread pool. I am not saying this is what happened in production, but it is not impossible.
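
A minimal before/after of that suggestion, with illustrative names: the same buffer guard, but awaited, so no thread-pool thread is parked while another flush holds the semaphore.

using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class ReceiveSketch<T>
{
    private readonly SemaphoreSlim _semaphore = new SemaphoreSlim(1, 1);
    private readonly List<T> _items = new List<T>();

    public void Receive(IEnumerable<T> items)
    {
        _semaphore.Wait(); // parks an entire thread while contended
        try { _items.AddRange(items); }
        finally { _semaphore.Release(); }
    }

    public async Task ReceiveAsync(IEnumerable<T> items)
    {
        await _semaphore.WaitAsync(); // the waiting continuation holds no thread
        try { _items.AddRange(items); }
        finally { _semaphore.Release(); }
    }
}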

I will try to use SizedBufferingConsumer and try to log exceptions to get a better idea about what is happening.

Yes, that seems like a way to find if there are exceptions being thrown at some point.

CloudTraceMiddleware is explicitly calling Dispose on the ISpan object (it is not using using block) so it could as well call a non-standard DisposeAsync equivalent.

Yes, that would be similar to what I was proposing with implementing CloudTraceAsyncMiddleware. But take into account that even in CloudTraceMiddleware the Invoke method is async Task, so even if we were to call some DisposeAsync equivalent, the pipeline (or what I've been calling your application thread) would await the Invoke method, which in turn would await the DisposeAsync equivalent, and in the end you would have the same blocking behaviour. The key to having tracing not interfere with your application thread is to have both the receiving and the writing to Stackdriver done in a fire-and-forget way, which I'm not sure we want to do right now. But the closest you can get to that is using the TimedBufferingConsumer, which writes to Stackdriver on the timer thread, except for the write performed when disposing.

I don't mind if one out of a thousand requests gets a bit delayed posting the traces (assuming we're not using TimedBufferingConsumer). I am concerned when the I/O, or any other waits (on the Semaphore, for example), are not asynchronous.

Could you send the following info so we can keep looking at this in more detail:

  • When you say blocking I/O, can you confirm that your application threads are blocking, that is, threads that are handling a user request? What are the symptoms in your application, i.e. are responses taking too long or timing out?

Responses are fine. As I mentioned above, I didn't explain myself very well before. We experienced a high thread count in our application and saw Google.Cloud.Diagnostics.Common.FlushableConsumerBase in the stack traces of 500-600 blocked threads. It is highly unlikely to get 600 blocked threads using TimedBufferingConsumer with a 5-second interval, but maybe for some reason the Semaphore is not released (honestly, I can't see how from the code) and each time the Timer ticks we get another thread blocking on the Semaphore.

  • You mentioned that you could see FlushableConsumerBase on the stack trace of the blocked threads, can you see in which method exactly are they being blocked? Can you send the whole stack trace?
  • How old is the oldest blocked thread, a few seconds, a few minutes, hours...?
  • How does the volume of blocked threads compare with the volume of requests your application receives?
  • And if you could send the simplest application possible that reproduces this, that would be great.

I will try to collect more data or even try to reproduce it. Feel free to close the bug until then (I hope I can). Thanks!

@amanda-tarafa
Contributor

OK, I think I see your concern now, which is "mostly" about not having that many threads blocked. Yes, "asyncing" everything will have the effect of threads being returned to the pool, but those blocked tasks (tasks, not threads) will still be "blocked" waiting on the semaphore (if that is indeed what is happening), and we won't know why it's happening either. Admittedly, the amount of blocked tasks (and threads at the moment) might be a natural occurrence due to some network errors, or even potentially due to a surge in requests (as per your original post, the blocking is happening on the middleware, which triggers per request). That's why I was asking about the ratio of blocked threads vs requests and the age of the oldest blocked thread. But the amount of blocked tasks might also be due to a bug (I haven't been able to find anything suspicious though; the semaphore seems to be released properly). If there's a bug I'd really like to find it.
I'll leave this issue open at least until I've had a chance to chat with the rest of the team about the possibility of the DisposeAsync equivalent and the async timer. In the meantime, if you could get at least part of the info I requested, that'd be great; it is that info that might help us catch the bug, if there is one, and not just release the threads back to the pool. Does this make sense to you, or do you think I'm missing something?

@erdalsivri
Author

Makes total sense, thanks! In the meantime, I will try to collect some data.

@erdalsivri
Author

I have more information. Below is a sample stack trace extracted from a dump. We have about 600 stack traces like this. FlushableConsumerBase.Receive seems to be blocked on the Semaphore. The same Semaphore is also used by the FlushableConsumerBase.Flush method. I guess one of the flushes is taking a bit longer and thus blocking receives too. It makes sense for both methods to block on the same Semaphore, as it is basically synchronizing access to the _items variable.

It looks like CloudTracer cannot keep up with the amount of traces we produce, so flushing them is taking a long time (not sure, just guessing). Maybe the solution is to add a timeout to Semaphore.Wait, so that it gives up adding new items if it can't keep up with the amount of data.
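
A sketch of that timeout idea, with illustrative names: give up (and drop the batch) if the semaphore can't be taken quickly, instead of queueing yet another blocked thread.

using System;
using System.Collections.Generic;
using System.Threading;

class TimeoutReceiveSketch<T>
{
    private readonly SemaphoreSlim _semaphore = new SemaphoreSlim(1, 1);
    private readonly List<T> _items = new List<T>();

    public bool TryReceive(IEnumerable<T> items, TimeSpan timeout)
    {
        if (!_semaphore.Wait(timeout))
        {
            return false; // the consumer can't keep up; drop this batch of traces
        }
        try { _items.AddRange(items); return true; }
        finally { _semaphore.Release(); }
    }
}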

For context, I should probably mention that our application may have a configuration error or other bugs. We are trying to migrate an Asp.Net application to Asp.Net Core and are running canaries, which explode because of a high number of threads or sometimes because of other issues.

OS Thread Id: 0x3608 (160)
        Child SP               IP Call Site
000000ebd87d7b80 00007ff9b69d6654 [GCFrame: 000000ebd87d7b80] 
000000ebd87d7cc8 00007ff9b69d6654 [HelperMethodFrame_1OBJ: 000000ebd87d7cc8] System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object)
000000ebd87d7df0 00007ff99084442d System.Threading.SemaphoreSlim.WaitUntilCountOrTimeout(Int32, UInt32, System.Threading.CancellationToken)
000000ebd87d7e40 00007ff9908441bd System.Threading.SemaphoreSlim.Wait(Int32, System.Threading.CancellationToken)
000000ebd87d7ee0 00007ff937dbbc0e Google.Cloud.Diagnostics.Common.FlushableConsumerBase`1[[System.__Canon, System.Private.CoreLib]].Receive(System.Collections.Generic.IEnumerable`1)
000000ebd87d7f20 00007ff937d3fea4 Google.Cloud.Diagnostics.Common.SimpleManagedTracer.EndSpan(Span)
000000ebd87d7f90 00007ff937d3fd80 Google.Cloud.Diagnostics.Common.SimpleManagedTracer+Span.Dispose()
000000ebd87d7fd0 00007ff937cbdf9b Google.Cloud.Diagnostics.AspNetCore.CloudTraceMiddleware+d__3.MoveNext()
000000ebd87d80c0 00007ff99074cda9 
...

We also have threads blocked on Flush:

        Child SP               IP Call Site
000000ebcb47def0 00007ff9b69d6654 [GCFrame: 000000ebcb47def0] 
000000ebcb47e038 00007ff9b69d6654 [HelperMethodFrame_1OBJ: 000000ebcb47e038] System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object)
000000ebcb47e160 00007ff99084442d System.Threading.SemaphoreSlim.WaitUntilCountOrTimeout(Int32, UInt32, System.Threading.CancellationToken)
000000ebcb47e1b0 00007ff9908441bd System.Threading.SemaphoreSlim.Wait(Int32, System.Threading.CancellationToken)
000000ebcb47e250 00007ff936fbe98a Google.Cloud.Diagnostics.Common.FlushableConsumerBase`1[[System.__Canon, System.Private.CoreLib]].Flush()
000000ebcb47e290 00007ff936fbe913 Google.Cloud.Diagnostics.Common.TimedBufferingConsumer`1[[System.__Canon, System.Private.CoreLib]].b__3_0(System.Object)
000000ebcb47e2d0 00007ff99074cda9 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
000000ebcb47e350 00007ff99083f40c System.Threading.TimerQueueTimer.CallCallback()
000000ebcb47e3b0 00007ff99083f20a System.Threading.TimerQueueTimer.Fire()
000000ebcb47e420 00007ff990836a8c System.Threading.ThreadPoolWorkQueue.Dispatch()
000000ebcb47e8a0 00007ff99443f643 [DebuggerU2MCatchHandlerFrame: 000000ebcb47e8a0] 

There are 132 stack traces with Flush and 462 stack traces with Receive.

@amanda-tarafa
Contributor

OK, that makes sense, the fact that it is blocking on both Receive and Flush, I mean, because the Semaphore is the same for both methods. Now, my big question is: if it is blocking on Receive, which definitely happens on a user request, and you are not seeing responses affected, could this just be normal behaviour? I mean, could it be that you are just getting lots of requests and the competition for the Semaphore is simply due to the volume of requests? In that case it would make even more sense to try and have all semaphore waiting done asynchronously so as not to create and block so many threads. Knowing the age of the oldest blocked thread, or the average time these threads stay blocked, would be a good indicator of whether this is natural or due to a bug.

@erdalsivri
Author

It definitely affected the requests, but I think we just didn't notice it. Our canary failed miserably for other reasons (related to db connections), so CloudTracer delaying a few requests didn't catch our attention.

My guess is that receives are blocking on the semaphore because of flushes: receive is very simple, just adding a new item to a list (it shouldn't really cause contention), whereas flush is actually sending the traces to the server. And there are in fact threads blocked on receive in the dump. I don't know how many QPS we were receiving at that time, but flushes happen every 5 seconds regardless of QPS and they seem to be blocking the receives (just guessing).

Making the semaphore blocking async (my initial request) could definitely help with the thread explosion, but as you mentioned, maybe there is a bigger issue here. I think flush should release the semaphore after taking a copy of the items and hold a separate _flushSemaphore to avoid blocking requests; alternatively, receive should do the semaphore wait asynchronously and pass a timeout to the semaphore, to avoid blocking the request for very long at the cost of dropping some traces. Similarly, flush should also have a timeout on the semaphore to avoid trying to send traces while another thread is already busy sending the data.

I am speculating a bit based on the stack traces. My speculation is that flushes are taking a long time and keep piling up. I will try to find a stack trace in which a flush is blocked on actually sending the data (I/O); the ones I've seen so far are just blocked on the semaphore. I will update the bug if I find anything interesting.

@erdalsivri
Author

Update: I've finally found a thread with a Flush blocked on I/O. Of course this doesn't tell us anything about how long the block has been going on. Metrics on our GCP dashboard don't indicate anything wrong with the volume. We are under our quota and PatchTraces QPS seems to be 10/s at most. We were running 10 nodes at that time, if I am not mistaken, so that is about 1/s per node.

        Child SP               IP Call Site
000000ebeaa7da80 00007ff9b69d9274 [InlinedCallFrame: 000000ebeaa7da80] 
000000ebeaa7da80 00007ff935b91eec [InlinedCallFrame: 000000ebeaa7da80] 
000000ebeaa7da50 00007ff935b91eec DomainBoundILStubClass.IL_STUB_PInvoke(Grpc.Core.Internal.CompletionQueueSafeHandle, IntPtr)
000000ebeaa7db20 00007ff937076674 Grpc.Core.Internal.AsyncCall`2[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]].UnaryCall(System.__Canon)
000000ebeaa7dd20 00007ff9370760f8 Grpc.Core.DefaultCallInvoker.BlockingUnaryCall[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]](Grpc.Core.Method`2, System.String, Grpc.Core.CallOptions, System.__Canon)
000000ebeaa7dea8 00007ff9944401a6 [StubHelperFrame: 000000ebeaa7dea8] 
000000ebeaa7df00 00007ff937075fe8 Grpc.Core.Interceptors.InterceptingCallInvoker.b__3_0[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]](System.__Canon, Grpc.Core.Interceptors.ClientInterceptorContext`2)
000000ebeaa7e000 00007ff937075eae Grpc.Core.ClientBase+ClientBaseConfiguration+ClientBaseConfigurationInterceptor.BlockingUnaryCall[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]](System.__Canon, Grpc.Core.Interceptors.ClientInterceptorContext`2, BlockingUnaryCallContinuation`2)
000000ebeaa7e148 00007ff9944401a6 [StubHelperFrame: 000000ebeaa7e148] 
000000ebeaa7e1a0 00007ff9370759d9 Grpc.Core.Interceptors.InterceptingCallInvoker.BlockingUnaryCall[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]](Grpc.Core.Method`2, System.String, Grpc.Core.CallOptions, System.__Canon)
000000ebeaa7e348 00007ff9944401a6 [StubHelperFrame: 000000ebeaa7e348] 
000000ebeaa7e3a0 00007ff937dc1662 Google.Cloud.Trace.V1.TraceService+TraceServiceClient.PatchTraces(Google.Cloud.Trace.V1.PatchTracesRequest, Grpc.Core.CallOptions)
000000ebeaa7e3f0 00007ff937072bc9 Google.Api.Gax.Grpc.ApiCall+GrpcCallAdapter`2[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]].CallSync(System.__Canon, Google.Api.Gax.Grpc.CallSettings)
000000ebeaa7e4b0 00007ff93707290f Google.Api.Gax.Grpc.ApiCallRetryExtensions+c__DisplayClass1_0`2[[System.__Canon, System.Private.CoreLib],[System.__Canon, System.Private.CoreLib]].b__0(System.__Canon, Google.Api.Gax.Grpc.CallSettings)
000000ebeaa7e550 00007ff937dc15df Google.Cloud.Trace.V1.TraceServiceClientImpl.PatchTraces(Google.Cloud.Trace.V1.PatchTracesRequest, Google.Api.Gax.Grpc.CallSettings)
000000ebeaa7e580 00007ff937dc14ae Google.Cloud.Diagnostics.Common.GrpcTraceConsumer.Receive(System.Collections.Generic.IEnumerable`1)
000000ebeaa7e5e0 00007ff937dc1347 Google.Cloud.Diagnostics.Common.RpcRetryConsumer`1[[System.__Canon, System.Private.CoreLib]].Receive(System.Collections.Generic.IEnumerable`1)
000000ebeaa7e630 00007ff936fbea73 Google.Cloud.Diagnostics.Common.TimedBufferingConsumer`1[[System.__Canon, System.Private.CoreLib]].FlushWithSemaphoreHeld()
000000ebeaa7e690 00007ff936fbe99d Google.Cloud.Diagnostics.Common.FlushableConsumerBase`1[[System.__Canon, System.Private.CoreLib]].Flush()
000000ebeaa7e6d0 00007ff936fbe913 Google.Cloud.Diagnostics.Common.TimedBufferingConsumer`1[[System.__Canon, System.Private.CoreLib]].b__3_0(System.Object)
000000ebeaa7e710 00007ff99074cda9 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
000000ebeaa7e790 00007ff99083f40c System.Threading.TimerQueueTimer.CallCallback()
000000ebeaa7e7f0 00007ff99083f20a System.Threading.TimerQueueTimer.Fire()
000000ebeaa7e860 00007ff990836a8c System.Threading.ThreadPoolWorkQueue.Dispatch()
000000ebeaa7ece0 00007ff99443f643 [DebuggerU2MCatchHandlerFrame: 000000ebeaa7ece0] 

@amanda-tarafa
Contributor

I'll take a closer look before end of week and also try and see if I can reproduce something like this.

@moserware

moserware commented Jan 11, 2019

Thanks for looking into this @amanda-tarafa! We now have some more info. To summarize the issue so far:

  1. We're processing hundreds of requests that are each doing their Cloud Trace root Span (which has many child Spans).
  2. When each request finishes, it takes the semaphore so that it can add its proto to the end of a List queue and then completes.
  3. After the tracing middleware finishes, it chains to other middleware that do things like calling .Dispose() on our scoped DI container (which will do things like disposing an EF context, which frees up a DB connection).
  4. Periodically (e.g. via a Timer) the buffered list is flushed by forwarding its protos to the server via a gRPC call while holding the semaphore.
  5. The gRPC call sometimes takes a while, possibly because it needs to be retried (possibly due to quota being exceeded) and/or possibly because of a large payload (or maybe some other reason).

The problem is that if the single underlying gRPC consumer Receive inside a FlushWithSemaphoreHeld stalls (for whatever reason), then all requests start to back up (because they are all gated by the shared semaphore), which starts to put pressure on other things (in our case our DB connection pool), because they're not being Disposed, because that is waiting on the in-memory list add to finish, because that uses the same SemaphoreSlim. One possible fix to consider is to have the FlushWithSemaphoreHeld portion just do something like:

protected override void FlushWithSemaphoreHeld()
{
    if (_items.Count == 0)
    {
        return;
    }

    _nextBatch = Interlocked.Exchange(ref _items, new List<T>());
}

protected override void FlushAfterSemaphoreReleased()
{
    _consumer.Receive(_nextBatch);
}

Then you'd need to update the FlushableConsumerBase<T>.Flush (and -Async ones) to do something like:

public void Flush()
{
    _semaphore.Wait();
    try
    {
        FlushWithSemaphoreHeld();
    }
    finally
    {
        _semaphore.Release();
    }
    FlushAfterSemaphoreReleased();
}

One possible downside to this approach is that it could trade a semaphore-stall problem for a memory-explosion problem. You could potentially work around that by using something like a bounded System.Threading.Channels.Channel<T> instead of an in-memory List. If you go with the channel approach, you could avoid needing your own semaphore.
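
For what it's worth, a rough sketch of that bounded-channel alternative (illustrative names, assuming the System.Threading.Channels package): producers never block because writes to a full buffer are dropped, a single background loop drains and sends batches, and no explicit semaphore is needed.

using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

class ChannelConsumerSketch<T>
{
    private readonly Channel<T> _channel = Channel.CreateBounded<T>(
        new BoundedChannelOptions(10_000) { FullMode = BoundedChannelFullMode.DropWrite });

    // Called on the request path; never blocks, at worst drops the trace.
    public void Receive(IEnumerable<T> items)
    {
        foreach (var item in items)
        {
            _channel.Writer.TryWrite(item);
        }
    }

    // Single background loop that drains the channel and sends batches.
    public async Task PumpAsync(Func<IReadOnlyList<T>, Task> sendAsync)
    {
        var batch = new List<T>();
        while (await _channel.Reader.WaitToReadAsync())
        {
            while (_channel.Reader.TryRead(out var item))
            {
                batch.Add(item);
            }
            await sendAsync(batch);
            batch.Clear();
        }
    }
}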

(For now, we've decided to disable the Cloud Trace middleware to help with stability)

@erdalsivri
Author

I would suggest that Flush do the Wait on the Semaphore with a timeout so that flushes don't pile up. The Timer will keep ticking even if a previous Flush is still going on. Instead of a Timer, there is SequentialThreadingTimer, which could be better because then flushes won't pile up and the waits would be async.

@amanda-tarafa
Contributor

Thanks both for the extra info and suggestions, which all make sense. I've been trying to reproduce the pileup of threads waiting on the semaphore without much luck, but I see how it could happen. And I agree that at least waiting asynchronously on the semaphore, and releasing the shared semaphore before actually sending the traces to Stackdriver, would alleviate the problem with request pileups and thread explosion. I'll chat with the team early next week and start working on it.
In the meantime, and as a workaround, there's a tracer option called QPS that will reduce the number of traces collected; you can see how to use it in some of the integration tests.

@moserware

Thanks! To more easily reproduce the issue, you might consider mocking out the IConsumer<T> that is used by TimedBufferingConsumer (and likely also by the size-based one). From the stack traces, it looks like it then calls RpcRetryConsumer => GrpcTraceConsumer => TraceServiceClient. From a brief look at the code, the only seam I could find would be to mock out TraceServiceClient to add a noticeable delay to the write calls. Alternatively, you could have your own TraceServiceClient proxy that calls the real one but with an extra Task.Delay added to it.
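
For example, a delaying test double could look something like this; the interface below is a local stand-in shaped like the library's IConsumer<T>, not the real type, and the delay simulates a slow PatchTraces call.

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Local stand-in mirroring the shape of the consumer being mocked.
interface ISlowConsumer<T>
{
    void Receive(IEnumerable<T> items);
    Task ReceiveAsync(IEnumerable<T> items, CancellationToken cancellationToken = default(CancellationToken));
}

class DelayingConsumer<T> : ISlowConsumer<T>
{
    private readonly TimeSpan _delay;
    public DelayingConsumer(TimeSpan delay) => _delay = delay;

    public void Receive(IEnumerable<T> items) =>
        Thread.Sleep(_delay); // simulate a blocking gRPC call that stalls

    public Task ReceiveAsync(IEnumerable<T> items, CancellationToken cancellationToken = default(CancellationToken)) =>
        Task.Delay(_delay, cancellationToken); // simulate a slow async send
}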

@amanda-tarafa amanda-tarafa added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. api: cloudtrace Issues related to the Cloud Trace API. priority: p2 Moderately-important priority. Fix may not be included in next release. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. status: investigating The issue is under investigation, which is determined to be non-trivial. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. labels Jan 14, 2019
@amanda-tarafa
Contributor

Just as an FYI, I'm working on this.

@amanda-tarafa
Contributor

I've just submitted PR #2836. We are going to start small and then go from there if needed, so the first thing we are doing is not blocking the flushes on the receive semaphore. You can take a look and leave your comments there.
While doing that I also found this bug #2836, which meant that you were probably missing some traces, although I don't think it was in any way provoking the thread explosion that you were seeing.

@moserware

Thanks @amanda-tarafa ! I left a few comments on #2836

@amanda-tarafa
Contributor

I've pushed #2836, and you'll hopefully see this issue solved entirely or at least greatly improved. I'm working on a couple of other PRs and plan to do a release tomorrow so you can test this. I will update here once the release is done.

@amanda-tarafa
Contributor

The fix in #2836 is now included in release 3.0.0-beta08. If you can try that out and let us know whether you see improvements, that'd be great. I'll be closing this issue for now, but feel free to reopen at any time.
