Endless GET traces on ApplicationInsights #2919
Is the SDK configured to use gateway mode?
Sorry, but I didn't get your question. We are in Direct mode; we initialize the client with:
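(The original snippet is not included in this thread; the following is a minimal sketch of a Direct-mode initialization, with placeholder endpoint and key.)

```csharp
// Minimal sketch of a Direct-mode initialization (the original snippet is not
// shown in this thread; endpoint and key are placeholders).
using Microsoft.Azure.Cosmos;

CosmosClient client = new CosmosClient(
    "https://<account>.documents.azure.com:443/",
    "<key>",
    new CosmosClientOptions
    {
        ConnectionMode = ConnectionMode.Direct
    });
```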
Isn't this enough? Do other configurations need to be set?
Sorry, I missed that it was set to Direct mode in the sample. The Cosmos DB SDK should only make HTTP calls to initialize and update the caches. Are you by any chance doing a database or container operation, like a read or create-if-not-exists, on every item operation?
No, we decided to have manual control over these resources. This helper is called at startup and registers the Document Client as a singleton. The base class used to operate on the container opens the client in this way:
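(The original snippet is not included here; this is a hypothetical sketch of the pattern described, with illustrative names such as RepositoryBase and containerName.)

```csharp
// Hypothetical sketch of the pattern described above: each repository resolves
// its Container from the Database singleton registered at startup. Names
// (RepositoryBase, containerName) are illustrative only.
using Microsoft.Azure.Cosmos;

public abstract class RepositoryBase
{
    protected Container Container { get; }

    protected RepositoryBase(Database database, string containerName)
    {
        // GetContainer only builds a local proxy; it does not make a network call.
        Container = database.GetContainer(containerName);
    }
}
```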
The repositories are InstancePerLifetimeScope with Autofac.
Your code is creating one client per database name; are you handling multiple databases in your application? Ideally you should have one client per account; the client can be used to interact with any database and container within the account. Also, how frequent is the GET you are seeing in the screenshot? If it's every 5 minutes, the SDK has a background Task that refreshes the account information periodically. I don't know why App Insights renders it in a nested way; we don't have any App Insights specific code or references, this is entirely how App Insights interprets the requests.
I confirm that they are 5 minutes apart. But why are they linked to the same operationId as my operation, and why does this behavior differ across the different environments?
Each microservice will manage only one database; this is the reason we initialize a Database client at startup and register it as a singleton. From that Database client, each "Repository" gets its client to the Container.
This seems to be a behavior of App Insights. The background refresh is not related to any user operation; it is an independent Task that runs in the background and performs one HTTP GET every 5 minutes: https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/Microsoft.Azure.Cosmos/src/Routing/GlobalEndpointManager.cs#L462
Ideally you'd want the CosmosClient to be the singleton. Your model works well because the application accesses one database, but if at some point it accesses two or more, you would be generating more clients than needed. If you have the CosmosClient as a singleton, you can always do client.GetContainer(databaseName, containerName).
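(A sketch of this recommendation; the Autofac registration is shown because the thread mentions Autofac, and the endpoint and key are placeholders rather than the project's actual code.)

```csharp
// Sketch of the recommendation above: one CosmosClient per account, registered
// as a singleton. Containers are then resolved on demand from the same client.
using Autofac;
using Microsoft.Azure.Cosmos;

var builder = new ContainerBuilder();

builder.Register(_ => new CosmosClient(
        "https://<account>.documents.azure.com:443/",
        "<key>",
        new CosmosClientOptions { ConnectionMode = ConnectionMode.Direct }))
    .AsSelf()
    .SingleInstance();

// Anywhere in the application:
// Container container = client.GetContainer(databaseName, containerName);
```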
Do all of the requests have the same activity id?
I know; I made this simplification as an architectural choice. I wanted to hide the concept of the Cosmos instance, above the Database, from the dev team that will implement the repositories. But thanks for letting me know how to use it in the best way.
From a client-side perspective, nothing has changed; we didn't update the ApplicationInsights SDK. And I don't think this could be related to a specific change on the server side. So my doubts remain:
Do you suggest opening an issue on the AI SDK to investigate it on their side?
Did you update the Cosmos DB SDK? If not, then the Cosmos SDK behavior did not change either; this background refresh has been there since the beginning. If nothing in the client application changed and this suddenly started to appear in the AI UI, then it means something changed on AI that is now identifying this activity as related to user requests, when it's not.
We updated our Cosmos SDK on 2021-11-30 (but from the traces the problem was present way before) and the ApplicationInsights SDK on 2020-12-18 (I know, it is time to update). So the updates aren't the problem. All our environments have almost the same creation date and the same configuration, so there seems to be no reason why we experience these different behaviors. I could simply filter out these diagnostics with an ITelemetryProcessor, but it worries me that I don't understand the root cause.
When the SDK initializes (this happens on the first request ever), that's when the background worker starts. It is easy to identify these requests because they are HTTP GETs to the account endpoint. The only logical explanation is that AI is, in their engine, correlating the first ever request of a new client instance with the kickoff of the background worker, and then it correlates all future executions of the worker with that initial first request. One way to rule that out would be to do a dummy request when you create the client (outside of your hot path), or use the client optimization that warms it up before any requests (https://devblogs.microsoft.com/cosmosdb/improve-net-sdk-initialization/). If doing either of those things gets rid of the ever-growing nested requests, then it confirms AI is correlating the background worker's work with the first operation.
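(A minimal sketch of the warm-up approach using CosmosClient.CreateAndInitializeAsync from the linked post; the database/container names, endpoint, and key are placeholders.)

```csharp
// Sketch of the warm-up approach: CreateAndInitializeAsync opens connections
// and fills the caches up front, so the background refresh is not tied to the
// first user request. All identifiers below are placeholders.
using System.Collections.Generic;
using Microsoft.Azure.Cosmos;

var containersToWarmUp = new List<(string databaseId, string containerId)>
{
    ("my-database", "my-container")
};

CosmosClient client = await CosmosClient.CreateAndInitializeAsync(
    "https://<account>.documents.azure.com:443/",
    "<key>",
    containersToWarmUp,
    new CosmosClientOptions { ConnectionMode = ConnectionMode.Direct });
```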
I've experimented a bit with one of our microservices in a DEV environment. The dummy request seems to work, although I still don't understand why production is different, since the instances are always created by a real request. But if this could solve the problem, I will have to study how we could implement CreateAndInitializeAsync. This API wants the list of containers, but we create the client much earlier and the containers used aren't known a priori. Our central component, provided to the different teams, is the one that registers the Database client; they use this instance to get a reference to the Container in their repositories. So a good amount of work is needed. CreateAndInitializeAsync is also not available with the CosmosClientFactory, so I would have to remove the fluent initialization, which is much more elegant. I need more time to be sure that this is really the solution.
I think that is the culprit: the instance is created (maybe you are using DI?) on demand during a real request (maybe one coming into your API/web project).
Sorry for the long wait. After some time I have more information. I confirm that, as proposed, creating a dummy request detaches all these "GET"s from the functional requests served by the microservices. This doesn't mean these dependencies aren't tracked: ApplicationInsights will trace the "GET" even when it is performed outside of a request context, and each one gets a new operation id. Moving from version 3.15 to 3.23 we experienced a drastic increase of "GET" dependencies, even in PRODUCTION; until the new release we had seen fewer traces. I haven't deepened my understanding of the linked section of code, but I see that a lot of work was done in that class from 3.15 to the current 3.23. Right now everything is registered with DI and the resources are created only when a request comes in, so using CreateAndInitializeAsync would require a lot of rework; we will probably introduce a filter to get rid of them manually. The fact that the traces end up in the logs anyway, even if with a different operation id, could make analysis harder by creating background noise. But I understand that this is necessary behavior for the Cosmos SDK, and someone might want these traces for debugging unexpected scenarios (so it is correct that the ApplicationInsights SDK traces them too). So I think this is our specific problem. From my point of view the issue could be closed, unless you want to investigate how the experience could be improved for the final user.
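(A sketch of such a filter as an ITelemetryProcessor that drops these dependency items; the matching conditions are assumptions and would need to be adjusted to how the GETs actually appear in the telemetry.)

```csharp
// Sketch of the filter mentioned above: drop the periodic account-read
// dependencies instead of passing them down the processor chain. The matching
// conditions below are assumptions, not the project's actual values.
using System;
using Microsoft.ApplicationInsights.Channel;
using Microsoft.ApplicationInsights.DataContracts;
using Microsoft.ApplicationInsights.Extensibility;

public class CosmosAccountRefreshFilter : ITelemetryProcessor
{
    private readonly ITelemetryProcessor _next;

    public CosmosAccountRefreshFilter(ITelemetryProcessor next) => _next = next;

    public void Process(ITelemetry item)
    {
        if (item is DependencyTelemetry dependency
            && dependency.Type == "Http"
            && dependency.Name.StartsWith("GET ", StringComparison.Ordinal)
            && dependency.Target != null
            && dependency.Target.Contains("documents.azure.com"))
        {
            return; // drop the item instead of forwarding it
        }

        _next.Process(item);
    }
}
```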
Having all those logs has cost implications. Could we have a way to limit logs (a LogLevel applied to the Cosmos DB SDK via configuration?) so that end users only get valuable logs?
@YohanSciubukgian which logs are you referring to? The SDK produces Event Traces with different log levels; users can configure their Event Listeners to listen for a particular level.
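(A generic sketch of that idea with a System.Diagnostics.Tracing.EventListener; the source-name check is a placeholder assumption, so verify which event sources the SDK version you use actually emits before relying on it.)

```csharp
// Generic sketch: only enable events at or above a chosen level. The name
// check below is a placeholder assumption, not a documented source name.
using System.Diagnostics.Tracing;

public class CosmosEventLevelFilter : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        // Placeholder: match whatever Cosmos-related source you observe at runtime.
        if (eventSource.Name.Contains("Cosmos"))
        {
            // Listen only for warnings and errors; informational traces are skipped.
            EnableEvents(eventSource, EventLevel.Warning);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        // Forward eventData to your own logging pipeline as needed.
    }
}
```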
@ealsur |
I am still seeing this. I think the problem is that the parentId and operationId are being captured in the Cosmos DB refresh check, which causes the logs to be correlated forever. This really messes up the transaction view in App Insights: things that took only a few minutes are now displayed as if they took hours or days because of this correlation, with a lot of additional noise. I think the fix should be that the refresh starts its own operation instead of capturing the context it is in, as the refresh calls (especially after the first) are not logically part of what started the loop. I also don't think this issue has anything to do with App Insights, as I see it locally in the Aspire dashboard (we are using OpenTelemetry) as well.
I ran into this issue as well using the OpenTelemetry instrumentation. Perhaps the SDK would need to disconnect the Activity for a refresh cycle from its parent by calling
Adding a potential master issue: open-telemetry/opentelemetry-dotnet#984. The OpenTelemetry .NET SDK has some example code working around the issue:
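(A generic sketch of the workaround pattern discussed in that issue, clearing Activity.Current so background work starts a new root span; this would have to live in the code that kicks off the background loop and is shown only to illustrate the technique.)

```csharp
// Sketch: detach background work from the caller's ambient Activity so its
// spans become new roots instead of children of whatever request was in flight.
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

static Task StartDetachedLoop(Func<Task> refresh, CancellationToken token)
{
    Activity previous = Activity.Current;
    Activity.Current = null; // detach from the caller's ambient Activity
    try
    {
        return Task.Run(async () =>
        {
            while (!token.IsCancellationRequested)
            {
                await refresh();
                await Task.Delay(TimeSpan.FromMinutes(5), token);
            }
        }, token);
    }
    finally
    {
        Activity.Current = previous; // restore it for the caller's remaining work
    }
}
```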
We are experiencing strange behavior across our different environments even though they are configured identically. Thanks to ApplicationInsights we encountered traces like:
The services create the CosmosClient as follows (DatabaseClientHelper is used to create a singleton of the DatabaseClient):
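(The original snippet is not included in this issue text; the following is a hypothetical sketch of the described helper, with placeholder endpoint, key, and database name.)

```csharp
// Hypothetical sketch of the helper described above: it builds one CosmosClient
// and exposes the Database, which is then registered as a singleton at startup.
using Microsoft.Azure.Cosmos;

public static class DatabaseClientHelper
{
    public static Database CreateDatabaseClient(string endpoint, string key, string databaseName)
    {
        var client = new CosmosClient(endpoint, key,
            new CosmosClientOptions { ConnectionMode = ConnectionMode.Direct });

        // GetDatabase returns a local proxy; the registered singleton wraps it.
        return client.GetDatabase(databaseName);
    }
}
```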
We extracted from AI how many traces of that GET we have in our environments. This is the Kusto query used:
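(The original query is not shown here; a hypothetical example of such a query might look like the following, where the dependency type and target values are assumptions.)

```kusto
// Hypothetical example: count the HTTP GET dependencies against the account
// endpoint, bucketed per hour.
dependencies
| where type == "Http"
| where name startswith "GET"
| where target contains "documents.azure.com"
| summarize count() by bin(timestamp, 1h)
| render timechart
```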
And these graphs are the results from our three development environments:
Our stage environment (we deploy every 2 weeks; the last deploy was on 02/12, and only today did we start experiencing so many GET traces):
This is from the production environment:
Could you help us understand the root cause of these different traces and how to configure our environments to prevent these endless GET traces?