-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
win_perf_counters - measurement gaps #14097
Comments
Hi,
That would explain why there is no error and we simply return so quickly. Looking at what modifies
Could you possible set |
It's a testconfig, so yes, I could do that. For the affected production, I don't think we can do that. It's process monitoring, and if I understand correctly the refresh is neccessary to add new processes to the monitoring and remove dead ones. |
So I did that and set the But now I am getting TONS of |
Thanks for trying that out! So we have our light switch, that the refresh interval is getting hit, clearing out hostCounters.
Agreed, I propose we look at the refresh logic here. It appers to correctly zero things out, but the subsequent calls seem to indicate that we attempt to reset the host counters, collect data, and sleep. I would want to see what ParseConfig is producing. Are we getting anything back from that? Or do we have to wait for a second refresh interval to trigger? |
How would I check that? |
Are you able to build a custom telegraf? Essentially would need to add some logging since I don't think debugging this live is a viable option. We would want to focus on the area I linked to above, specifically to see what the state of |
release-1.27...knollet:telegraf:debug-14097 So, I am running this debug modification now and wait for gaps. |
Usual:First a bunch of "Valid:"s
Then a bunch of found counters in the refresh:
Measurement Gap:No "Valid:"s at all
Mind that my diff (linked above) is not against master but against the version we were running: tag release-1.27 |
Thank you for trying this out!
This would seem to indicate that we did not get any counters from Windows? Did you see this same output for 3 collection intervals? |
No. I put the "REFRESHED"-Line into the refresh-block, so it's only displayed when a refresh runs. The usual gather runs (not the refreshs) don't update the counters. So it's quite clear, I think. There are two jobs:
I had some Ideas about that:
So I am at a loss... |
I agree in principle with your description of the two phases.
This is what I was suggesting in my last comment ;) During that refresh cycle, the only reason it should still be empty is if we get nothing back from ParseConfig. The call to AddItem specifically is what modifies the My next step would be to dig into that AddItem and see if AddItem even called? If it is not called, then I agree with your next thought, that there is some Windows bug or delay. If it is called, then we need to dig into that function with greater detail.
Do you have multiple instances of win_perf_counters plugin? |
So. I modified the debug-prints. I ran it on a different machine (where I built it), Win11 this time, and I got a gap: this is how it should be
this is how the gap looks now
Finding: These two seem to be missing with the gap
exerpt from my code modification
|
Aaannnnd I've got it. Ok, I set
|
Buffer too small. |
This shouldn't be If this error occurs, there are a growing number of Items (processes, in this case), and telegraf silently reports none. I would think that would warrant a warning. The documentation recommending to not enable the warnings is weird in this regard. https://github.com/knollet/telegraf/blob/dc0835ad80493f5ca20fa72d475e4accf1f4c6d2/plugins/inputs/win_perf_counters/performance_query.go#L130 |
@knollet thank you for tracking this down and digging in! If you want to put up a PR I'd be happy to review. Otherwise, I will take a deeper look tomorrow as well. Thanks again! |
I won't be able to address this for the next week, I think. The first relevant thing is, I would say, to set either
The second thing would be to fix this race condition. I have run this now on my machine for some hours, and got 15 gaps. |
Thanks for the heads up and trying the fix. If you get back and it all looks good, let's make that change.
If this issue is resolved, would we still want to make this change? For users using wildcards would they start getting extra warnings if we were to switch the default? Thanks again for your help on this issue! |
So. @powersj , could you have a quick look at the commit above? It compiles and seems to fix the bug (no gaps anymore). I think that not all of them need a loop because I don't think that every one of them is susceptible to a race condition. On the other hand a loop probably wouldn't hurt, so why not put a loop with (I now set it to 5...) some limited retries in all of them? |
Sven and I chatted about this today. He was looking at the Windows docs which mentioned not trusting the reported buffer size. Sounds like your retries, while they might be working is not the ideal fix. Instead, we will look at increasing the buffer each time (i.e. double the buffer size) until the value is completely read. Thank you again for reproducing this and digging into the root cause. It is extremely helpful with these Windows issues :) Sven will look into a PR soon. |
I'd like to, but it seems you didn't test compile your changes before committing?
|
I did, but not on a 32-bit machine... ;-) |
well, ok. that makes sense. I wouldn't've either. |
Give me some seconds to fix that... |
@knollet you should have a binary now... |
... testing ... |
seems to work. Didn't get any new gaps. |
@knollet thank you for your work on narrowing down the issue and testing the fix! |
Relevant telegraf.conf
Logs from Telegraf
System info
Telegraf 1.27.3 (git: HEAD@afcf0133), Windows Server 2022
Docker
No response
Steps to reproduce
Expected behavior
No gaps
Actual behavior
Gaps
Additional info
It always seems to fail thrice in a row, I suspect it's because how the "last time refreshed" is recorded.
6:49:30 - plugin starts, has to refresh
6:49:31 - refreshed and recorded "refreshed - 6:49:31", then doing a gather.
somehow there are maybe no counters?
I believe this for loop
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/win_perf_counters/win_perf_counters.go#L245
is going empty, not producing any measurements and not logging any "Valid:..."s.
Gathering nothing and not looping at all takes time.Duration=000
6:50:00 - No new refresh, and as the refresh at 6:49:30 didn't produce any counters, again no measurements
6:50:30 - The last refresh is not 60s ago, yet, as it is 6:50:30 and last refresh was recorded at 6:49:31.
-> still no measurements.
So:
It seems apparent why there's always a gap of three and roughly where to look for the error, but I can't find it, lest it be windows itself returning an empty process list and no error. (not even PDH_NO_DATA)
We have had this problem for months and many telegraf-versions now. It's a "sometimes" problem (which is always bad, it's a bad look for reliability), so we tried gathering information and solving it on our own, first.
The text was updated successfully, but these errors were encountered: