[Metricbeat][System] system.process.state reports sleeping #38120
Comments
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
Pinging @elastic/elastic-agent (Team:Elastic-Agent)
Sorry @tomgregoryelastic, we had to move this to another repo. Are you able to provide any detail about the customer who is currently asking about this? Even better, could they communicate directly on this issue? Anything to help us understand if this is a bug (and if so, what the root cause is)...
Sleeping is a valid Linux process state, and is therefore in the list of process states our system metrics code can report: https://github.com/elastic/elastic-agent-system-metrics/blob/01e9cf49607b993bb398e7925274fffcf5128304/metric/system/process/process_common.go#L139-L153

	// PidStates is a Map of all pid states, mostly applicable to linux
	var PidStates = map[byte]PidState{
		'S': Sleeping,
		'R': Running,
		'D': DiskSleep, // Waiting in uninterruptible disk sleep, on some kernels this is marked as I below
		'I': Idle,      // in the scheduler, TASK_IDLE is defined as (TASK_UNINTERRUPTIBLE | TASK_NOLOAD)
		'T': Stopped,
		'Z': Zombie,
		'X': Dead,
		'x': Dead,
		'K': WakeKill,
		'W': Waking,
		'P': Parked,
	}

https://access.redhat.com/sites/default/files/attachments/processstates_20120831.pdf explains what sleeping means, here is part of the explanation:
To find out if this is a bug, the first thing to check would be if other Linux tools like If they do, then this is correct. If they don't, then this is something we can investigate. I should note that the process state will change over time; what is important is that the states we are reporting to Elasticsearch and Kibana match what was happening on the system at that time.
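For anyone who wants to cross-check a single process by hand, here is a minimal sketch in Go that reads the one-letter state straight from `/proc/<pid>/stat`, which is where Linux exposes it. The `parseStatState` helper and the `stateNames` labels are illustrative names made up for this example, not the actual code in elastic-agent-system-metrics; the labels only mirror a subset of the `PidStates` map quoted above.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// stateNames is a hypothetical label table mirroring part of the PidStates
// map quoted above. The strings are illustrative, not the library's constants.
var stateNames = map[byte]string{
	'S': "sleeping",
	'R': "running",
	'D': "disk_sleep",
	'I': "idle",
	'T': "stopped",
	'Z': "zombie",
	'X': "dead",
}

// parseStatState extracts the one-letter state from the contents of
// /proc/<pid>/stat. The command name is wrapped in parentheses and may itself
// contain spaces or parentheses, so the split happens at the LAST ')'.
func parseStatState(stat string) (byte, error) {
	idx := strings.LastIndexByte(stat, ')')
	if idx < 0 {
		return 0, errors.New("no closing parenthesis in stat line")
	}
	fields := strings.Fields(stat[idx+1:])
	if len(fields) == 0 || len(fields[0]) != 1 {
		return 0, errors.New("missing state field")
	}
	return fields[0][0], nil
}

func main() {
	// Reads the state of this program's own process; fails on non-Linux hosts.
	data, err := os.ReadFile("/proc/self/stat")
	if err != nil {
		fmt.Println("cannot read /proc/self/stat:", err)
		return
	}
	state, err := parseStatState(string(data))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("state=%c (%s)\n", state, stateNames[state])
}
```

Swapping `/proc/self/stat` for `/proc/<pid>/stat` of the process in question gives a point-in-time value to compare against what was reported for the same moment.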
@cmacknz there was recently an SDH that touched upon this very issue: https://github.com/elastic/sdh-kibana/issues/4598. There are more details in there, which might be useful to determine what might be happening.
I'm also encountering this. In almost all my tests I'm seeing a sleeping status on RHEL and Ubuntu with the only exception being a few zombie processes and one case of seeing a running process today. They're running when I check top, and even Elastic Agent and its child processes show as sleeping. One of the agent processes was the one I did see running today. @crespocarlos I can't seem to view that repo, but if there's anything helpful in there could it be shared? I'd be glad to provide any additional datapoints if they can assist in resolving this. The two directions I was leaning towards (at least before seeing the one running process) were A) the calls being blocked by AppArmor as described in the Linux Metrics
If these are the only calls being made I am willing to write up a profile for it. When I tried last time I got hung up on the path (hash changes between versions). I'm not seeing any AppArmor events for elastic in auditd though. I see other processes so it may not be this. B) I increased the interval to cut down on log volume and it's very possible that a 30 second interval is almost never catching these.
This definitely seems wrong. @fearful-symmetry any ideas on what might be causing this?
So, at least for certain workloads, this isn't too weird: at any given time the majority of processes on a system are likely to be in an
I suspect that for a great many user workloads, "how many processes are in a running state?" is not useful information; for example, while I ran that above command, the server was performing a build operation, and using between 50-99% CPU. However, that's a separate issue. (note that the
I'm a tad confused by the wording here. Are you saying that the child processes of agent itself were incorrectly reporting as the wrong state? If this is happening while monitoring the host system under docker, it may be related to a bug I found yesterday. I don't see anything in the code that would cause us to fall back to an Note that it's also possible for beats to "miss" a PID's state if it doesn't line up with the period that metricbeat is running with. For example, if you have a process that's mostly in a If a user can verify that there's a consistent error in the reported state for a PID, the best route would be to increase the
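To make the "missed state" point concrete, here is a small, deterministic simulation in Go. It is only a sketch under made-up assumptions: a hypothetical process that is on the CPU for the first `runMs` of every `cycleMs` and asleep otherwise, sampled once per `periodMs`. None of these names or numbers come from metricbeat itself.

```go
package main

import "fmt"

// sampleStates simulates point-in-time sampling of a hypothetical process that
// runs for the first runMs of every cycleMs and sleeps for the rest. It takes
// `samples` observations, one every periodMs, starting at offsetMs, and
// returns how many observations landed on "running" versus "sleeping".
func sampleStates(cycleMs, runMs, periodMs, offsetMs, samples int) (running, sleeping int) {
	for i := 0; i < samples; i++ {
		t := offsetMs + i*periodMs
		if t%cycleMs < runMs {
			running++
		} else {
			sleeping++
		}
	}
	return running, sleeping
}

func main() {
	// A process that is busy 20 ms out of every second, sampled every 30 s.
	r, s := sampleStates(1000, 20, 30000, 500, 100)
	fmt.Printf("30s period:     running=%d sleeping=%d\n", r, s)

	// The same process sampled on a period that drifts against its cycle.
	r, s = sampleStates(1000, 20, 10007, 0, 100)
	fmt.Printf("10.007s period: running=%d sleeping=%d\n", r, s)
}
```

With the first set of parameters every sample falls in the sleeping part of the cycle, so the process is never observed as running even though it does work every second. A sampled state is a snapshot, not a summary of what the process did during the period.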
@fearful-symmetry oh sorry. I meant that However, I do agree that the likely cause is really just the timing of when it runs. This, plus the limited number of top N processes and a third to half of them (at least on our hosts) being used up by Elastic Agent and the beats processes it launches, obscures anything that could be in a running state when metricbeat collects its metrics. This ends up giving the appearance that nothing is actually running and that something is broken. Grabbing some examples, I did run into other cases where something was definitely wrong though. Here on one host 9/10 of the top processes for memory are elastic and 9/10 of the top processes by cpu are also elastic. There is one running process but it's not in the top n. In some cases, there do seem to be actual errors. Here, I was able to capture something with a running status but the count is Kibana, regardless of the starting timestamp, say it's the 1 minute preceding the end timestamp, and that it's aggregated.
Is it possible that the issue is the aggregation itself and how it handles status changes?
Oh also, not on docker. Though, noting they're Ubuntu and RHEL (7/9) VMs on ESXi hosts may help if it's a specific virtualization thing here as well. With a bit of testing, and looking at logs. Sleeping vs running occurs in about 98-99% sleeping & 1-2% running for of Running Got periodic output that I could compare to the host in kibana. Not the most reliable method because of timing but this output would consistently remain at 238±2. Within Kibana, idle, zombie, and sleeping match the output. Except it shows
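For comparing host-side counts with what shows up in Kibana, a tally of every process by state at one instant is handy. Below is a rough Go sketch that does this by reading `/proc/<pid>/stat` for all PIDs. The `stateOf` and `tally` helpers are hypothetical names for this example, and the timing caveat above still applies: the snapshot will not line up exactly with the collection time.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
)

// stateOf returns the one-letter state from a /proc/<pid>/stat line, or 0 if
// the line cannot be parsed. The state is the first field after the last ')'.
func stateOf(stat string) byte {
	idx := strings.LastIndexByte(stat, ')')
	if idx < 0 {
		return 0
	}
	fields := strings.Fields(stat[idx+1:])
	if len(fields) == 0 || len(fields[0]) != 1 {
		return 0
	}
	return fields[0][0]
}

// tally counts how many stat lines fall into each state letter, skipping
// lines that cannot be parsed.
func tally(statLines []string) map[byte]int {
	counts := make(map[byte]int)
	for _, line := range statLines {
		if s := stateOf(line); s != 0 {
			counts[s]++
		}
	}
	return counts
}

func main() {
	// Glob only fails on a malformed pattern, so the error is ignored here.
	paths, _ := filepath.Glob("/proc/[0-9]*/stat")
	var lines []string
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			continue // the process exited between the glob and the read
		}
		lines = append(lines, string(data))
	}

	counts := tally(lines)
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, string(k))
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s: %d\n", k, counts[k[0]])
	}
}
```

Running this in a loop next to the agent gives a series of per-state counts to line up against the documents for the same timestamps.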
Yeah, I get the feeling that we're running into the limits of how aggregation works. Or perhaps it's just a bug? I suspect that a certain amount of this is also metricbeat missing processes that happen between query periods.
In this case, the It's also important to note that
I did a bit of digging last night and now I don't think it's the aggregation.
I can't tell if there's an issue in Metricbeat, the elastic library, or the Go library. Maybe it's timing again, how it's interpreting what is or isn't flapping (though that was in heartbeat I think, not Metricbeat), or somewhere along the line it's also parsing statuses wrong and missing running statuses along the way. It's entirely possible nothing is wrong and, as your issue notes, the dashboard is just confusing in what it's actually showing.
@jvalente-salemstate yeah, as you noted, the visualization depends on both the
As far as the actual behavior itself, unless we have evidence of a bug, I think our best conclusion is that it's a combination of how Linux scheduling works and the limited granularity of metricbeat's time-based reporting.
Thanks everyone for the detailed diving on this. I have a question - do we know if, given the correct configuration, this would reasonably replicate what a As PM on the UI feature here, I'm trying to figure out:
We do fairly consistently get negative feedback about the 'accuracy' of this and I'm trying to figure out what options we might have at our disposal to address why users are complaining. I'm wondering if we're in a position to suggest some paths forward?
I think the best solution may just be removing or heavily reworking the process summary. On the rework end, I'm not particularly sure what a good alternative is. Maybe modifying the aggregation to look back a bit more and break down process counts for what's still "active" and what has been killed in that period, and maybe the same metric on the count of distinct parent pids? The way it's reported does require a more in-depth understanding of both Windows and Linux, as the results swing to opposing extremes.
So sleeping vs running on Linux has little information of use while Running vs anything on Windows has the same issue. Personally, I would see more value in removing those metrics and just adding more context and data to the topN metrics. I'll keep it light unless you'd prefer it here and not a separate FR but at a high level:
Version: 8.12
Description of the problem including expected versus actual behavior:
Various users have surfaced that processes show as 'sleeping' within the Host 'processes' view which is leading to customers not trusting the 'processes' functionality offered by Elastic.
Example customer issue
It seems that the majority of processes are reporting as 'sleeping':
Discover - breaking down processes by state (demo cluster)
Host processes (demo cluster)
Could we confirm:
system.process.state field
Steps to reproduce:
process.state fields being emitted in Discover (or the host experience)
Note: We'll try and get some direct customer feedback to help debug this