Windows exporter frequently shows container collector success status as 0 #1473

Closed
raneshiv opened this issue May 9, 2024 · 13 comments · Fixed by #1561

@raneshiv

raneshiv commented May 9, 2024

Windows exporter frequently shows container collector success status as 0.

 C:\>curl http://localhost:9182/metrics | findstr windows_exporter_collector_success
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current 
                                 Dload  Upload   Total   Spent    Left  Speed  
  0     0    0     0    0     0      0      0 --:--:--  0:00:01 --:--:--     0
# HELP windows_exporter_collector_success windows_exporter: Whether the collector was successful.
# TYPE windows_exporter_collector_success gauge
windows_exporter_collector_success{collector="container"} 0
windows_exporter_collector_success{collector="cpu"} 1
windows_exporter_collector_success{collector="cpu_info"} 1
windows_exporter_collector_success{collector="cs"} 1
windows_exporter_collector_success{collector="logical_disk"} 1
windows_exporter_collector_success{collector="memory"} 1
windows_exporter_collector_success{collector="net"} 1
windows_exporter_collector_success{collector="os"} 1
windows_exporter_collector_success{collector="physical_disk"} 1
windows_exporter_collector_success{collector="process"} 1
windows_exporter_collector_success{collector="service"} 1
windows_exporter_collector_success{collector="system"} 1
windows_exporter_collector_success{collector="textfile"} 1
windows_exporter_collector_success{collector="time"} 1
100 1196k    0 1196k    0     0   897k      0 --:--:--  0:00:01 --:--:--  897k
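
The same check can be scripted instead of piping through findstr. The following is a minimal sketch, assuming the metrics text has already been fetched as a string; the function name failedCollectors is hypothetical and not part of windows_exporter.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// failedCollectors returns the names of collectors whose
// windows_exporter_collector_success sample is 0 in the given metrics text.
// Hypothetical helper for illustration only.
func failedCollectors(metrics string) []string {
	const prefix = `windows_exporter_collector_success{collector="`
	var failed []string
	sc := bufio.NewScanner(strings.NewReader(metrics))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, prefix) {
			continue // skips # HELP, # TYPE and unrelated metrics
		}
		rest := strings.TrimPrefix(line, prefix)
		// rest looks like: container"} 0
		name, value, ok := strings.Cut(rest, `"} `)
		if ok && strings.TrimSpace(value) == "0" {
			failed = append(failed, name)
		}
	}
	return failed
}

func main() {
	sample := `# TYPE windows_exporter_collector_success gauge
windows_exporter_collector_success{collector="container"} 0
windows_exporter_collector_success{collector="cpu"} 1`
	fmt.Println(failedCollectors(sample)) // prints [container]
}
```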

The following errors appear in the windows-exporter container logs. They refer to containers that do not exist:

  |   |2024-05-09 10:12:08.871 | ts=2024-05-09T10:12:08.871Z caller=prometheus.go:168 level=warn msg="Collection timed out, still waiting for [container]" | 
  |   | 2024-05-09 10:12:16.206 | ts=2024-05-09T10:12:16.197Z caller=container.go:345 level=warn collector=container msg="Failed to collect network stats for container 30d694b1bb2dbe475f97bae02cdb6bb17d659f967c09bde7bea57bba9e354bf6" |  
  |   | 2024-05-09 10:12:16.208 | ts=2024-05-09T10:12:16.207Z caller=container.go:345 level=warn collector=container msg="Failed to collect network stats for container 6ed502afa7dd02f0f04fe7a1819d597ceaffc1ec71b744c6f22fc6e3a5d69ec8" |  
  |   | 2024-05-09 10:12:29.391 | ts=2024-05-09T10:12:29.391Z caller=container.go:246 level=error collector=container msg="err in fetching container Statistics" containerId=80dd707243d5fe6505a5e6da1e37097654d4ab52e895239db5bf3d7799a9fac1 err="container 80dd707243d5fe6505a5e6da1e37097654d4ab52e895239db5bf3d7799a9fac1 encountered an error during hcs::System::Properties: failure in a Windows system call: Element not found. (0x490)" |  
  |   | 2024-05-09 10:12:30.873 | ts=2024-05-09T10:12:30.873Z caller=container.go:345 level=warn collector=container msg="Failed to collect network stats for container 80dd707243d5fe6505a5e6da1e37097654d4ab52e895239db5bf3d7799a9fac1" |  
  |   | 2024-05-09 10:12:59.394 | ts=2024-05-09T10:12:59.394Z caller=container.go:246 level=error collector=container msg="err in fetching container Statistics" containerId=00c560664f3e1046d593bc6ce3a00de51d5060e4ed0ec5d5116498de3ed27dd6 err="container 00c560664f3e1046d593bc6ce3a00de51d5060e4ed0ec5d5116498de3ed27dd6 encountered an error during hcs::System::Properties: failure in a Windows system call: Element not found. (0x490)" |  
  |   | 2024-05-09 10:12:59.442 | ts=2024-05-09T10:12:59.441Z caller=container.go:246 level=error collector=container msg="err in fetching container Statistics" containerId=cc60bc889ae8ea9dfcd2b55f7390b9ca7f9e2735420159dbb6a395769247235a err="container cc60bc889ae8ea9dfcd2b55f7390b9ca7f9e2735420159dbb6a395769247235a encountered an error during hcs::System::Properties: failure in a Windows system call: Element not found. (0x490)"

All data and metrics are visible in Grafana on the Windows dashboard. Please suggest how to resolve this issue, what causes the container collector status to be 0, and what these error logs signify.

@raneshiv
Author

@jkroepke Could you please take a look and share your opinion on this?

@jkroepke
Member

No idea, I'm not an expert in Windows container monitoring, but there is currently nothing to configure here.

@raneshiv
Author

@jsturtevant @breed808 Could you please take a look and suggest a solution?

@jsturtevant
Contributor

There are two different errors. It's a bit difficult to understand why they are happening since we don't have a picture of the containers on the system, but I would guess it is because the containers are starting up, shutting down, or somewhere in between.

The first, ts=2024-05-09T10:12:30.873Z caller=container.go:345 level=warn collector=container msg="Failed to collect network stats for container 80dd707243d5fe6505a5e6da1e37097654d4ab52e895239db5bf3d7799a9fac1", is a warning because this is expected to happen in some cases, such as when a container is removed before we iterate through the stats. Depending on how often this happens, we could adjust the logging level.

The second, ts=2024-05-09T10:12:29.391Z caller=container.go:246 level=error collector=container msg="err in fetching container Statistics" containerId=80dd707243d5fe6505a5e6da1e37097654d4ab52e895239db5bf3d7799a9fac1 err="container 80dd707243d5fe6505a5e6da1e37097654d4ab52e895239db5bf3d7799a9fac1 encountered an error during hcs::System::Properties: failure in a Windows system call: Element not found. (0x490)", is similar to what happens in microsoft/hcsshim#933 and https://github.com/microsoft/hcsshim/pull/934/files. We probably need to handle errors better here so that we don't stop collection in this case, or just skip the container (instead of failing hard) if this function errors.

@jkroepke
Member

Instead of ignoring errors, can we check beforehand whether the container is in a ready state?

@jsturtevant
Contributor

Not really, since there are two different systems working on the same object at the same time. We could query whether it is in a ready state, but between the time we query and the time we execute the call to get the stats, it could begin its shutdown process because kubelet (or whatever is controlling containers) has told it to shut down. This is essentially what is happening now: we should only get containers that are "ready", but by the time we actually execute the statistics command, the container is already being shut down.

@raneshiv
Author

raneshiv commented May 29, 2024

@jkroepke @jsturtevant The containers are not in a ready state; they do not exist at all. If they have already been shut down, why is the exporter (windows_exporter_collector_success) still trying to collect their status? Can you explain in detail which two systems you mean? If that is the case, can you suggest a solution that stops the container collector from reporting a failure status?

@jsturtevant
Contributor

jsturtevant commented May 29, 2024

I am not sure what your setup is (Kubernetes, etc.). If you can share that, it could help.

In a lot of environments there will be something creating, stopping, and deleting containers (such as kubelet in Kubernetes), and then you have windows_exporter, which is monitoring the system. windows_exporter doesn't know about the lifecycle of the containers; it just queries the system, collects the stats, and reports them back.

Stats collection in windows_exporter is a three-step process: get all the containers, then get stats for each container, then get the network stats. This means the system can change between the time it got the list of containers and the time it gets the stats for a given container (i.e., the container is deleted). Using the Kubernetes use case as an example, imagine the following:

In a successful run you will get:

  1. kubelet creates pod A
  2. windows_exporter gets all containers on the system (finds A)
  3. windows_exporter queries stats for container A
  4. windows_exporter succeeds

In an unsuccessful run you will get:

  1. kubelet creates pod A
  2. windows_exporter gets all containers on the system (finds A)
  3. kubelet deletes pod A
  4. windows_exporter queries stats for container A (which has been deleted)
  5. windows_exporter fails

This can also happen while a container is starting up, when steps 1, 2, and 3 (from the first scenario) happen microseconds apart and the container hasn't fully booted and therefore doesn't have stats yet.

As this is an expected occurrence, particularly in high-volume systems, the solution is to handle the errors properly and not error out. An example of that is in the Windows containerd shim: https://github.com/jsturtevant/hcsshim/blob/6103d69d1f2604098781c8e848ab196239bb9aa6/cmd/containerd-shim-runhcs-v1/task_wcow_podsandbox.go#L251-L254 and in kubelet (which I linked above, but the code has since been removed from kubelet).

@jkroepke
Member

I will implement a toggle where end-users can decide to ignore "Element not found" errors.

@jkroepke
Member

In #1473 I decided not to implement a toggle; instead, the exporter no longer fails hard if a container can't be scraped.

I also took note of microsoft/hcsshim#933: if the container can't be found, the error will be logged as a debug message.

Lastly, fetching statistics is now done once. That should also solve the issue related to the warning.

However, I have to test the changes in #1473, which may take some time.

@jkroepke
Member

Could someone help verify whether the changes from #1473 are fine?

@jsturtevant
Contributor

"Could someone help verify whether the changes from #1473 are fine?"

#1561 is the PR, I believe.
