Missing AppTraces in Log Analytics Workspace #2129
Comments
did you check App Insights' data cap?
hi @stevendick-swissre, can you try with the latest (3.2.6)? 3.2.4 started logging when telemetry overflowed an internal queue and got dropped, and 3.2.5 fixed an issue that was discovered by this new logging (#2062). (no change in 3.2.6 that I think would impact this, but worth going to it if you're already bumping)
As we had logging from immediately before and after the missing entries, I am assuming no data cap was breached. I have confirmed that the LAW did not have a data cap breach by checking the _LogOperation() function. How can I check if a data cap on the App Insights instance is exceeded? I don't want to share the ikey in a public forum or via email. I have asked the developer to upgrade to v3.2.6.
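For reference, a minimal sketch of that kind of workspace data-cap check, assuming the built-in _LogOperation function and its Ingestion category (the exact Operation text is an assumption):

```kusto
// Look for workspace health records indicating the daily cap was reached.
// "Data collection stopped" entries mean ingestion was halted by the cap.
_LogOperation
| where TimeGenerated > ago(7d)
| where Category == "Ingestion"
| where Operation has "Data collection"
| project TimeGenerated, Level, Operation, Detail
| order by TimeGenerated desc
```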
We updated to 3.2.6 but saw the same problem. We did see some new output from the App Insights agent: 2022-02-21 12:17:21.606Z WARN c.m.a.a.i.t.BatchSpanProcessor - In the last 5 minutes, the following operation has failed 295785 times (out of 571670): Queuing span:
Does this mean that the agent failed to publish to App Insights? Does this suggest a network issue or a problem on the App Insights side?
this means that the single export thread in the JVM can't keep up with the volume of telemetry data being produced. up to 2048 telemetry records are buffered (by default), but once that limit is reached, if it still can't keep up it starts dropping telemetry records.

from the warning message, it looks like you are sending 571670 telemetry records over a 5-minute window, which is ~2000 telemetry records per second, and roughly half of those records are being dropped.

can you check the distribution of records that are being ingested? it may help to know the distribution across requests, dependencies, traces, customEvents, and anything else, at least to confirm that this volume is expected. customMetrics should have their own (larger) queue now in 3.2.6, but your warning message points to the "general" queue, which is the queue for all other telemetry records.

another factor is how long the ingestion service takes to respond, since the single export thread blocks while waiting for the response. if you can email me your instrumentation key, I can check our internal data to see whether this is likely an issue, though I suspect that at a sustained rate of ~2000 telemetry records per second we may just need to bump the number of export threads
Given we're using the agent, is the option to configure the number of export threads exposed? How do we check the distribution of records being ingested?
here's a query to check the distribution of records; if you can run it over a timeframe that represents one of the heavy 5-minute windows, that would be ideal:
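A sketch of such a query, assuming the workspace-based App Insights tables and the built-in Type column that carries each record's table name (the time window below is taken from the warning message above and is illustrative):

```kusto
// Distribution of ingested telemetry records by table over one heavy 5-minute window.
union AppRequests, AppDependencies, AppTraces, AppEvents, AppExceptions, AppMetrics
| where TimeGenerated between (datetime(2022-02-21T12:12:00Z) .. datetime(2022-02-21T12:17:00Z))
| summarize RecordCount = count() by Type
| order by RecordCount desc
```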
if you can email me your instrumentation key ([email protected]), I can check what your ingestion service response times are, which will give us another clue about the best solution here
app-insights-ingested.csv
thanks @stevendick-swissre, can you try with this SNAPSHOT build? https://github.com/microsoft/ApplicationInsights-Java/suites/5427505526/artifacts/171896362 it still only uses a single export thread, but that thread should no longer wait for the response from the ingestion service before proceeding to export the next batch
hey @stevendick-swissre, I think #2181 is a really good improvement that we can include in an upcoming release. In a simple local logging test, I can get ~30x(!) more throughput without dropping any telemetry (still using the single thread, just not blocking on responses from the ingestion service). would you be able to test this out if we release 3.2.9-BETA with this one change?
I think it's likely we can test this, but I'm waiting on feedback from the developer. The potential issue is that the impacted component may not need to load data at volume again, given where we are in the testing cycle.
Hi @trask, I can test this if you can point me to the 3.2.9-BETA release.
asproll is the developer in question on our project who discovered the issue.
great, we will plan to release 3.2.9-BETA tomorrow
@asproll @stevendick-swissre 3.2.9 is released: https://github.com/microsoft/ApplicationInsights-Java/releases
Hi @trask, thanks a lot. We will test this as soon as we can and let you know.
@trask, I was able to successfully test the release. It solved our problem with the missing log statements; all the expected traces are now logged. Thanks a lot. One new thing I noticed: we keep getting the following message in the log, and I'm not sure what it means or whether it indicates a real problem: WARN c.m.a.a.i.t.BatchSpanProcessor - In the last 5 minutes, the following operation has failed 1 times (out of 13): Add async export: * Max number of concurrent exports 1 has been hit, may see some export throttling due to this (1 times)
This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 7 days. It will be closed if no further activity occurs within 7 days of this comment.
Closing.
Expected behavior
We have an application that logs the sequential progression of a work package comprising 18819 items.
For example, we log the following at INFO level:
2022-02-17 07:47:24.257 INFO 1 — [-StreamThread-6] c.s.g.i.b.p.k.t.p.PackageDoneTransformer : {"RUN_ID": 1071282}: Status update (17202/18819) received [{"PACKAGE_ID": "1071282|Close|Close|IFA|20220701|300532312|", "RUN_ID": 1071282, "CREINS_ID": null, "TREATY": null, "REPUN_1": null, "REPUN_2": null, "CoverageId": null, "BusinessDate": null, "CB_TO_DATE": null, "PackageCount": 18819, "RecordCountTotal": 83, "RecordCountProcessed": 0, "RecordCountError": 0, "OutputCountAms": 0, "OutputCountStp": 0, "OutputCountTav": 0, "LastUpdated": null, "Errors": []}]
This was scraped from the pod log output in AKS.
Actual behavior
The above example never makes it to the Log Analytics Workspace.
We do see log statements from before and after.
Item 17201 is logged, but we don't see any items logged again until 17217.
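As a cross-check, a query along these lines can confirm the gap in the workspace; it assumes the workspace-based AppTraces table, and the run ID and time window are taken from the example above (the message filter is illustrative):

```kusto
// List the "Status update" trace messages around the gap to see which items arrived.
AppTraces
| where TimeGenerated between (datetime(2022-02-17T07:30:00Z) .. datetime(2022-02-17T08:00:00Z))
| where Message has "Status update" and Message has "1071282"
| project TimeGenerated, Message
| order by TimeGenerated asc
```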
What we've checked
The loss of log messages does not coincide with data cap resets on either the LAW or the App Insights resource.
Sampling is 100% on the App Insights instance (a workspace-side check is sketched after this list).
We don't see any App Insights network/connection errors reported in the pod log.
I have previously seen that there can be extra latency with ingestion that causes the TimeGenerated to be later than expected, but I don't see any evidence of this.
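A combined sketch of the checks mentioned above, assuming the ItemCount column carried by the workspace-based App Insights tables (a total above the row count would indicate ingestion sampling) and the built-in ingestion_time() function (the actual arrival time of each record in the workspace):

```kusto
// For the affected window: did sampling reduce stored rows, and how far did ingestion lag?
AppTraces
| where TimeGenerated between (datetime(2022-02-17T07:00:00Z) .. datetime(2022-02-17T09:00:00Z))
| extend IngestionLag = ingestion_time() - TimeGenerated
| summarize RowsStored = count(),
            RecordsRepresented = sum(ItemCount),
            IngestionLagP95 = percentile(IngestionLag, 95)
```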
Goal
What else can we check that might contribute to the missing logging?
Is this approach to data capture with the App Insights agent not recommended, given that we cannot guarantee 100% log data capture?