Description
We have found that powershell.exe processes launched by nri-flex.exe on Windows Server consume 100% CPU for as long as each powershell.exe process runs. This prevents other critical processes from running during that window, as though powershell.exe were being given priority over them. However, when inspected, the powershell.exe process is running at 'Normal' priority on the affected system.
Note: This behaviour is observed on servers of various performance specifications. However, it is only a problem on servers that are either already under high load or only have one or two CPUs. Allocated memory does not seem to be a significant factor.
Work Around 1
Using the Windows Services Flex example, a quick and dirty workaround seems to be adding "(Get-Process -Id $pid).PriorityClass = 'Idle';" to the start of each Flex command.
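A minimal sketch of a command with the priority prefix applied, assuming a single service check. The service name 'Spooler' and the selected properties are placeholders and are not taken from the original Flex example; in a Flex configuration this would be the text of one command.

```powershell
# Drop this PowerShell process to the lowest scheduling priority so that
# busier processes on the host are scheduled ahead of it.
(Get-Process -Id $PID).PriorityClass = 'Idle'

# Placeholder service name; in the Flex example the name comes from the lookup file.
Get-Service -Name 'Spooler' |
    Select-Object Name, DisplayName, @{ Name = 'Status'; Expression = { $_.Status.ToString() } } |
    ConvertTo-Json
```

Setting the priority first matters: anything that runs before that line still executes at 'Normal' priority.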
The CPU usage still spikes to 100%, but the process then gracefully gives way to other processes on the host when needed. As a long-term fix, however, this isn't ideal. After a few stress tests, it seems that the Flex integration times out before returning any data when the server is under load. While it might be possible to increase the timeout window on Flex integrations, the increase needed carries an inherent risk that critical monitoring data is missed in more sensitive environments.
Work Around 2
This may be a separate issue in its own right, but it also works as a workaround. nri-flex.exe seems to run multiple powershell.exe processes consecutively, one after the other, which contributes to the issue, since starting and stopping many processes in a short window has a cost of its own.
Using the Windows Services Flex example again, it is possible to reference a lookup.json file directly from inside a PowerShell script and, using a JSON array converted into a hash table, batch all Windows Service status checks into one powershell.exe process rather than launching several. This can optionally be combined with Work Around 1. The approach uses three files: newrelic-infra-flex-windows-services.yml, Get-WindowsServiceStatus.ps1 and Get-WindowsServiceStatus-Config.json.
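The batching idea can be sketched roughly as follows. This is not the original Get-WindowsServiceStatus.ps1, whose contents are not shown here; the config shape (a "services" array), the output property names and the assumption that the JSON file sits next to the script are all hypothetical.

```powershell
# Optional: combine with Work Around 1 so the batch yields to other processes.
(Get-Process -Id $PID).PriorityClass = 'Idle'

# Assumed (hypothetical) config shape: { "services": ["Spooler", "W32Time"] }
$configPath = Join-Path $PSScriptRoot 'Get-WindowsServiceStatus-Config.json'
$config = Get-Content -Path $configPath -Raw | ConvertFrom-Json

# Turn the JSON array into a hash table keyed by service name for quick lookups.
$wanted = @{}
foreach ($name in $config.services) { $wanted[$name] = $true }

# A single Get-Service call covers every listed service within this one process.
$results = Get-Service |
    Where-Object { $wanted.ContainsKey($_.Name) } |
    ForEach-Object {
        [pscustomobject]@{
            name        = $_.Name
            displayName = $_.DisplayName
            status      = $_.Status.ToString()
        }
    }

# Emit a JSON array for Flex to parse; @() keeps a single result as an array.
ConvertTo-Json -InputObject @($results)
```

The Flex configuration would then run this script once per interval instead of once per service, so only one powershell.exe process is started.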
Though this approach still doesn't prevent PowerShell from consuming 100% CPU, it drastically reduces the amount of time spent at 100% and, when combined with the first workaround, also gives way gracefully to other processes running on the host.
It doesn't seem to be possible to reference a JSON array using the usual ${lf:variable} approach. When tested, this seems to concatenate the items listed in the JSON array into one long string. As such, this approach needs to read the JSON directly within the script.
Risks
For both workarounds, more sensitive environments risk missing data due to increased timeout windows, with subsequent delays in alerting. This becomes even more critical when generated alerts are used as triggers for automated remediation processes or other time-sensitive tasks.
Expected Behavior
powershell.exe processes launched by nri-flex.exe should gracefully give way to critical processes running on the host, in a manner that doesn't risk the Flex integration timing out during periods of prolonged high load and generating false positive alerts.
Troubleshooting or NR Diag results
See workarounds above.
Steps to Reproduce
1. Install the New Relic Infrastructure agent on a Windows Server 2012/2019 VM with one or two vCPUs. (Confirmed on agent versions 1.23.1 and 1.24.0; older versions have not yet been checked.)
2. Copy down and configure any of the Windows PowerShell-related Flex integrations. (Ones with an associated lookup file that lists multiple items illustrate the issue best.)
3. Run the integration and observe CPU spikes of 100% and a significant drop in performance on the host for the duration of each spike.
Your Environment
Agent version 1.23.1 & 1.24.0 confirmed.
Windows Server VMs running on VMware infrastructure.
Observed on all CPU core counts, but only really problematic where only one or two CPUs are available to the OS. There are exceptions, and we have seen problems on hosts with as many as 4 vCPUs.
Windows Server Datacenter 2012 R2 through Windows Server Datacenter 2019.
Additional context
Was directed to this issue board by a New Relic representative.