Google.Cloud.Diagnostics: ISpan is doing blocking I/O #2791
I'll take a look at this tomorrow. |
My colleagues pointed out that the library does timed buffering by default. We use the default settings, so the flush interval seems to be 5 seconds.
|
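For reference, a minimal sketch of how that buffering is typically configured when registering the tracer in ASP.NET Core. The AddGoogleTrace / TraceOptions / BufferOptions names are assumed from the Google.Cloud.Diagnostics.AspNetCore surface of that era, and the project ID is a placeholder:

```csharp
// Sketch only: the default is a timed buffer that flushes periodically;
// spelling the buffer options out here just makes that behaviour visible.
using Google.Cloud.Diagnostics.AspNetCore;
using Google.Cloud.Diagnostics.Common;
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddGoogleTrace(options =>
        {
            options.ProjectId = "your-project-id";   // placeholder
            // Timed buffering is the default behaviour described above.
            options.Options = TraceOptions.Create(
                bufferOptions: BufferOptions.TimedBuffer());
        });
    }
}
```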
Hi @erdalsivri, here are some notes on your questions:
The problem with having the span use the async version of Receive is that spans are ended from Dispose, which is synchronous, so the wait would still have to happen somewhere. That said:
This shouldn't be blocking to send traces to the server unless you have configured buffering off; with the default timed buffer, the actual send happens on a timer, not on the request. Can you share how you have configured the tracer? |
Oh, it seems we replied to the thread at almost the same time. I'm taking a closer look to see if there is a bug with the timed buffered consumer. |
I think one improvement would be fixing #2182, and the other would be calling the async version of Receive from the span. I will try to use a custom consumer to see whether exceptions are being swallowed somewhere.
|
I've submitted #2792, which I think might help and which in any case fixes a bug. As for your last comments:
This is non-trivial, which is why it's currently in the backlog. That doesn't mean it's forgotten, just that it's not prioritized or in the active pipeline. But honestly, I don't see how swallowing the exceptions as we do now would make a thread block, and even if the timer thread were blocked by this, your application threads wouldn't be (unless the exception happened while releasing the semaphore, which is highly unlikely), so your application would be responding normally. The symptom would just be fewer traces being logged.
Again, even if calling the async version of Receive, the wait on the semaphore still has to happen somewhere; making it async changes what waits (a task instead of a thread), not whether anything waits.
Yes, that seems like a way to find if there are exceptions being thrown at some point.
Yes, that would be similar to what I was proposing. Could you send the following info so we can keep looking at this in more detail: the ratio of blocked threads to incoming requests, and the age of the oldest blocked thread?
|
I agree fixing #2182 will in no way affect threads. I just thought it could be useful to see if there were exceptions being eaten up. I understand it may be difficult and it is definitely not urgent for us. You can consider this as a "me too" for that bug.
Sorry I wasn't clear when I said threads were getting blocked. Our application request threads are not blocked, but we see many blocked threads in the dump. I assume these are the threads with FlushableConsumerBase in their stack traces.
I don't mind if one out of a thousand requests gets a bit delayed posting the traces (assuming buffering is not turned off); what worries me is the number of blocked threads piling up.
Responses are fine. As I mentioned above, I didn't explain myself very well before. We experienced a high thread count in our application and saw FlushableConsumerBase in many of the blocked threads' stack traces.
I will try to collect more data or even try to reproduce it. Feel free to close the bug until then (I hope I can reproduce it). Thanks! |
OK, I think I see your concern now, which is "mostly" about not having that many threads blocked. Yes, "asyncing" everything will have the effect of threads being returned to the pool, but those blocked tasks (tasks, not threads) will still be "blocked" waiting on the semaphore (if that is indeed what is happening), and we won't know why it's happening either. Admittedly, the number of blocked tasks (and, at the moment, threads) might be a natural occurrence due to some network errors or even a surge in requests (as per your original post, the blocking happens in the middleware, which triggers per request). That's why I was asking about the ratio of blocked threads vs requests and the age of the oldest blocked thread. But the number of blocked tasks might also be due to a bug (I haven't been able to find anything suspicious though; the semaphore seems to be released properly). If there's a bug I'd really like to find it. |
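To make the thread-vs-task distinction concrete, here is a hypothetical sketch (not the library's code) of the difference between blocking on an internal semaphore and awaiting it asynchronously:

```csharp
using System.Threading;
using System.Threading.Tasks;

class BufferSketch<T>
{
    private readonly SemaphoreSlim _mutex = new SemaphoreSlim(1, 1);

    // Blocking wait: a thread-pool thread sits parked until the semaphore frees up.
    public void Receive(T item)
    {
        _mutex.Wait();
        try { /* add the item to the in-memory buffer */ }
        finally { _mutex.Release(); }
    }

    // Async wait: the thread goes back to the pool; only a task stays pending,
    // which is what "asyncing everything" buys you.
    public async Task ReceiveAsync(T item)
    {
        await _mutex.WaitAsync().ConfigureAwait(false);
        try { /* add the item to the in-memory buffer */ }
        finally { _mutex.Release(); }
    }
}
```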
Makes total sense thanks! In the meantime, I will try to collect some data. |
I have more information. Below is a sample stack trace extracted from a dump (we have about 600 stack traces like this). It looks like CloudTracer cannot keep up with the volume of traces we produce, so flushing them is taking a long time (not sure, just guessing). Maybe the solution is to add a timeout to the semaphore wait.

For context, I should probably mention that our application may have a configuration error or other bugs. We are trying to migrate an ASP.NET application to ASP.NET Core and running canaries, which blow up because of a high number of threads or sometimes because of other issues.
We also have threads blocked on Receive.
There are 132 stack traces with Receive in them. |
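A hypothetical sketch of the timeout idea mentioned above, assuming nothing beyond SemaphoreSlim: bound how long a receive is willing to wait for the shared semaphore and drop the trace rather than stall the request thread.

```csharp
using System;
using System.Threading;

static class ReceiveWithTimeoutSketch
{
    // Returns false (and drops the trace) if the semaphore isn't acquired in time.
    public static bool TryReceive(SemaphoreSlim mutex, Action addToBuffer, TimeSpan timeout)
    {
        if (!mutex.Wait(timeout))
        {
            return false; // losing one trace beats blocking the request thread
        }
        try
        {
            addToBuffer();
            return true;
        }
        finally
        {
            mutex.Release();
        }
    }
}
```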
OK, that makes sense, the fact that it is blocking on both Receive and Flush I mean, because the semaphore is the same for both methods. Now, my big question is: if it is blocking on Receive, which definitely happens on the user request, and you are not seeing responses affected, could this just be normal behaviour? I mean, could it be that you are simply getting lots of requests and the competition for the semaphore is just due to the volume of requests? In that case it would make even more sense to try and have all semaphore waiting done asynchronously, so as not to create and block so many threads. Knowing the age of the oldest blocked thread, or the average time these threads stay blocked, would be a good indicator of whether this is natural or due to a bug. |
It definitely affected the requests, but I think we just didn't notice it. Our canary failed miserably for other reasons (related to db connections), so CloudTracer delaying a few requests didn't catch our attention. My guess is that receives are blocking on the semaphore because of flushes: receive is very simple, just adding a new item to a list, so it shouldn't really cause contention, whereas flush actually sends the traces to the server. And there are in fact threads blocked on receive in the dump. I don't know how many QPS we were receiving at that time, but flushes happen every 5 seconds regardless of QPS and they seem to be blocking the receives (just guessing).

Making the semaphore wait async (my initial request) could definitely help with the thread explosion, but as you mentioned maybe there is a bigger issue here. I think flush needs to take a copy of the buffered items, release the semaphore, and only then send the copy to the server, so receives are never stuck behind the network call.

I am speculating a bit based on the stack traces; my speculation rests on the fact that flushes are taking a long time and keep piling up. I will try to find a stack trace in which a flush is blocked on actually sending the data (I/O); the ones I've seen so far are just blocked on the semaphore. I will update the bug if I find anything interesting. |
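A rough sketch of that copy-then-release flush idea (again hypothetical, not the library's actual implementation): swap the buffer out while holding the semaphore, release it, and only then perform the slow send.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class FlushSketch<T>
{
    private readonly SemaphoreSlim _mutex = new SemaphoreSlim(1, 1);
    private List<T> _buffer = new List<T>();

    public async Task FlushAsync(Func<IReadOnlyList<T>, Task> sendAsync)
    {
        List<T> batch;

        await _mutex.WaitAsync().ConfigureAwait(false);
        try
        {
            batch = _buffer;            // take the current batch...
            _buffer = new List<T>();    // ...and start a fresh buffer
        }
        finally
        {
            _mutex.Release();           // Receive can proceed immediately
        }

        if (batch.Count > 0)
        {
            // The potentially slow network send happens outside the semaphore.
            await sendAsync(batch).ConfigureAwait(false);
        }
    }
}
```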
Update: I've finally found a thread with a flush that is blocked on actually sending the data to the server, not just on the semaphore.
|
I'll take a closer look before the end of the week and also try to see if I can reproduce something like this. |
Thanks for looking into this @amanda-tarafa ! We now have some more info. To summarize the issue so far:
The problem is that if the single underlying gRPC consumer is slow to send a batch, the flush holds the shared semaphore for the whole send, and every request that calls Receive in the meantime ends up blocked behind it.
Then you'd need to update the flush path so that the actual send to the server happens outside the shared semaphore (or at least asynchronously).
One possible downside to this approach is that it could trade a semaphore stall problem for a memory explosion problem. You could potentially work around this by using something like a bounded queue that drops traces when full, as sketched below. (For now, we've decided to disable the Cloud Trace middleware to help with stability.) |
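One hedged way to realize that bounded-queue suggestion is System.Threading.Channels; this is not something the library uses, just an illustration, and the capacity and drop policy below are arbitrary:

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

class BoundedTraceQueue<T>
{
    private readonly Channel<T> _channel = Channel.CreateBounded<T>(
        new BoundedChannelOptions(capacity: 10_000)
        {
            FullMode = BoundedChannelFullMode.DropWrite, // drop new traces when full
            SingleReader = true
        });

    // Called per request: cheap and non-blocking; traces may be dropped under load.
    public void Enqueue(T trace) => _channel.Writer.TryWrite(trace);

    // A single background loop drains the queue and sends traces to the backend.
    public async Task ConsumeAsync(Func<T, Task> sendAsync)
    {
        await foreach (var trace in _channel.Reader.ReadAllAsync())
        {
            await sendAsync(trace);
        }
    }
}
```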
I would suggest |
Thanks both for the extra info and suggestions, which all make sense. I've been trying to reproduce the pile-up of threads waiting on the semaphore without much luck, but I see how it could happen. And I agree that, at the very least, waiting on the semaphore asynchronously and releasing the shared semaphore before actually sending the traces to Stackdriver will alleviate the problem with request pile-ups and thread explosion. I'll chat with the team early next week and start working on it. |
Thanks! In order to more easily reproduce the issue, you might consider mocking out the underlying gRPC consumer with one that is artificially slow to send. |
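For example (purely hypothetical code: the IConsumer&lt;T&gt; shape with Receive/ReceiveAsync is only assumed from the call chain quoted in this issue), a fake consumer that pretends every send is slow should make the semaphore pile-up reproducible locally:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Swap this in for the real gRPC-backed consumer in a test harness.
class SlowFakeConsumer<T> // : IConsumer<T>  (interface shape assumed, verify locally)
{
    private static readonly TimeSpan SimulatedLatency = TimeSpan.FromSeconds(5);

    public void Receive(IEnumerable<T> items)
    {
        Thread.Sleep(SimulatedLatency);      // pretend the backend call is slow
    }

    public Task ReceiveAsync(IEnumerable<T> items, CancellationToken cancellationToken = default)
    {
        return Task.Delay(SimulatedLatency, cancellationToken);
    }
}
```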
Just as an FYI, I'm working on this. |
I've just submitted PR #2836. We are going to start small and then go from there if needed, so the first thing we are doing is not blocking the flushes on the receive semaphore. You can take a look and leave your comments there. |
Thanks @amanda-tarafa ! I left a few comments on #2836 |
I've pushed #2836 and you'll hopefully see this issue solved entirely or greatly improved. I'm working on a couple other PRs and plan to do a release tomorrow so you can test this. I will update here once the release is done. |
The fix in #2836 is now included in release 3.0.0-beta08. If you can try that out and let us know whether you see improvements, that'd be great. I'll be closing this issue for now, but feel free to reopen at any time. |
We are using the Google.Cloud.Diagnostics.{Common,AspNetCore} packages for our ASP.NET Core web application. The Google Cloud tracer seems to be doing a blocking flush to send traces to the server. In production we've identified several hundred threads with Google.Cloud.Diagnostics.Common.FlushableConsumerBase in their stack traces.
We are using CloudTraceMiddleware, which calls Dispose on the span object. Here is the complete call chain:

CloudTraceMiddleware.Invoke -> Span.Dispose -> SimpleManagedTracer.EndSpan -> SimpleManagedTracer.Flush -> IConsumer<TraceProto>.Receive

There is an async version of the Receive method, but it is not used by the Span class. Is there a way to make spans make non-blocking calls?

Environment details
Thanks!
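For readers landing here, a minimal sketch of the pattern that triggers the chain above when tracing manually (the middleware does the equivalent per request); the IManagedTracer.StartSpan usage is assumed from the Google.Cloud.Diagnostics.Common surface described in this issue:

```csharp
using Google.Cloud.Diagnostics.Common;

public class TracedWorker
{
    public void DoWork(IManagedTracer tracer)
    {
        using (tracer.StartSpan("my-operation"))
        {
            // ... application work ...
        }
        // Leaving the using block calls Dispose -> EndSpan -> Flush ->
        // IConsumer<TraceProto>.Receive, which is the synchronous path
        // this issue is about.
    }
}
```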