Encountered channel not found error on adding Windows integration to the Windows agent. #5746

Closed
amolnater-qasource opened this issue Oct 9, 2024 · 16 comments
Labels
bug Something isn't working impact:medium QA:Validated Validated by the QA Team Team:Security-Windows Platform Team:Security-Windows Platform

Comments

@amolnater-qasource

amolnater-qasource commented Oct 9, 2024

Kibana Build details:

VERSION: 8.16.0 SNAPSHOT
BUILD: 78938
COMMIT: 7b832691e8b07c67b411da95b0398a04711da864

Artifact: https://snapshots.elastic.co/8.16.0-39df64b4/downloads/beats/elastic-agent/elastic-agent-8.16.0-SNAPSHOT-windows-x86_64.zip

Image

Host: Windows Server 2022 - Test Signing ON

Preconditions:

  1. 8.16.0 SNAPSHOT Kibana cloud environment should be available.
  2. Agent should be installed with a policy that has the System and Windows integrations.

Steps to reproduce:

  1. Navigate to Agents tab.
  2. Observe that the Agent becomes unhealthy, then navigate to the policy details page.
  3. Observe error for Windows integration: Encountered channel not found error

Expected Result:
No error should be displayed on adding Windows integration to the Windows agent.

Logs:
elastic-agent-diagnostics-2024-10-09T06-48-15Z-00.zip

Screenshots:
Image
Image

@amolnater-qasource amolnater-qasource added bug Something isn't working impact:medium Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Oct 9, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@amolnater-qasource
Author

@muskangulati-qasource Please review.

@muskangulati-qasource

Secondary review is done for this ticket!

@cmacknz
Member

cmacknz commented Oct 9, 2024

Looking in agent-info.yaml, I see this is a privileged/admin agent:

agent_id: 881c5687-32af-4bf9-b62f-4b74f2f688ec
headers: {}
log_level: info
snapshot: true
unprivileged: false
version: 8.16.0

Also, this is coming from the winlog input. Tagging @nfritts and @elastic/sec-windows-platform.

            input-winlog-default-winlog-windows-4ea5f67a-48fc-41ea-b586-2a29eac6423a:
                message: 'Encountered channel not found error when opening Windows Event Log: The specified channel could not be found.'
                payload:
                    streams:
                        winlog-windows.forwarded-4ea5f67a-48fc-41ea-b586-2a29eac6423a:
                            error: ""
                            status: HEALTHY
                        winlog-windows.powershell-4ea5f67a-48fc-41ea-b586-2a29eac6423a:
                            error: ""
                            status: HEALTHY
                        winlog-windows.powershell_operational-4ea5f67a-48fc-41ea-b586-2a29eac6423a:
                            error: ""
                            status: HEALTHY
                        winlog-windows.sysmon_operational-4ea5f67a-48fc-41ea-b586-2a29eac6423a:
                            error: 'Encountered channel not found error when opening Windows Event Log: The specified channel could not be found.'
                            status: DEGRADED
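
To pick out which streams are degraded in a payload like the one above, a small helper can walk the `streams` mapping. This is only a sketch: it assumes the diagnostics YAML has already been loaded into a plain dict, and `degraded_streams` is a hypothetical helper name, not part of Elastic Agent.

```python
# Hypothetical helper; the dict below mirrors the unit payload quoted above.
def degraded_streams(unit):
    """Return {stream_id: error} for every stream whose status is not HEALTHY."""
    streams = unit.get("payload", {}).get("streams", {})
    return {
        stream_id: info.get("error", "")
        for stream_id, info in streams.items()
        if info.get("status") != "HEALTHY"
    }


NOT_FOUND = (
    "Encountered channel not found error when opening Windows Event Log: "
    "The specified channel could not be found."
)
SUFFIX = "4ea5f67a-48fc-41ea-b586-2a29eac6423a"

unit = {
    "message": NOT_FOUND,
    "payload": {
        "streams": {
            f"winlog-windows.forwarded-{SUFFIX}": {"error": "", "status": "HEALTHY"},
            f"winlog-windows.powershell-{SUFFIX}": {"error": "", "status": "HEALTHY"},
            f"winlog-windows.powershell_operational-{SUFFIX}": {"error": "", "status": "HEALTHY"},
            f"winlog-windows.sysmon_operational-{SUFFIX}": {"error": NOT_FOUND, "status": "DEGRADED"},
        }
    },
}

for stream_id, error in degraded_streams(unit).items():
    print(f"{stream_id}: {error}")
```

Run against this payload, only the `sysmon_operational` stream is reported, which matches the single DEGRADED entry in the diagnostics.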

@cmacknz cmacknz added Team:Security-Windows Platform Team:Security-Windows Platform and removed Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Oct 9, 2024
@bjmcnic
Contributor

bjmcnic commented Oct 21, 2024

This issue has arisen following changes in this PR (elastic/beats#40163).

The default configuration for the Windows Integration has historically included the Sysmon Operational channel. Sysmon is not a core component of Windows; it's a Sysinternals tool (https://learn.microsoft.com/en-us/sysinternals/downloads/sysmon) that users can download and use at their discretion. As such, the Windows Integration has historically failed to open the Sysmon Operational channel on hosts without Sysmon, but that didn't propagate a DEGRADED status until the recent PR.

Users can remedy the degraded status by either installing Sysmon, causing the channel to exist; or by deselecting the Sysmon Operational channel in the Windows Integration configuration...

Image

Possible solutions:

  • Revert the code change that causes a DEGRADED status when a configured channel is not found.
    • Obviously that could risk missing a helpful DEGRADED status that could indicate an unexpectedly missing non-default channel.
  • Document and explain how this is now more correct.
  • Change the Windows Integration default to not include Sysmon.

Thoughts @cmacknz @nfritts @andrewkroh ?

@cmacknz
Member

cmacknz commented Oct 22, 2024

Change the Windows Integration default to not include Sysmon.

This sounds like the most correct path if Sysmon is not expected to be present the majority of the time. The counterargument is that this is a breaking change.

Something we did for some of the system metricsets that were in a similar situation is keep the error message but report the status as healthy, since the input was working as well as it could with the configuration of the host system it was running on.

@jamiehynds

In an ideal world, we'd move Sysmon out of Windows and have it as a standalone integration but that'd be very disruptive for existing users and would likely impact rules, dashboards, etc.

As a quick fix, could we exclude Sysmon from our DEGRADED logic? So keep Sysmon within the Windows integration, but if a user doesn't have Sysmon installed, we don't trigger a DEGRADED status?

@intxgo
Contributor

intxgo commented Oct 28, 2024

Can't Agent get it from https://live.sysinternals.com/Sysmon64.exe and install it if it's missing when a policy with Sysmon data collection is added?

@bjmcnic
Contributor

bjmcnic commented Oct 28, 2024

@jamiehynds

As a quick fix, could we exclude Sysmon from our DEGRADED logic? So keep Sysmon within the Windows integration, but if a user doesn't have Sysmon installed, we don't trigger a DEGRADED status?

I don't see a technical reason we couldn't just insert a check for whether we're trying to grab that particular channel near the code that changed. But that's filebeat code and not the Windows integration code. That'd be kind of an awkward place for the check in the long term and wouldn't scale well if we want to handle other things differently.

I noticed the PR that changed this was addressing elastic/beats#39735, which was about wanting to see failures for channels when permission is denied, typically when Agent is installed unprivileged. I recreated that and now DO see the desired Access is denied error:

c:\>"c:\Program Files\Elastic\Agent\elastic-agent.exe" status
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a failed state
   ├─ system/metrics-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '8300'
   │  └─ system/metrics-default-system/metrics-system-68951c20-696b-4518-b64c-63de7317ef29
   │     └─ status: (DEGRADED) Error fetching data for metricset system.diskio: disk io counters: cannot open new key in the registry in order to enable the performance counters: Access is denied.
   ├─ windows/metrics-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '5976'
   │  └─ windows/metrics-default-windows/metrics-windows-e02b045d-e899-4ef1-b6ce-400be3d94119
   │     └─ status: (FAILED) 1 error: initialization of reader failed: failed to expand counter (query='\Process(*)\% Processor Time'): Unable to connect to the specified computer or the computer is offline.
   └─ winlog-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '5416'
      ├─ winlog-default-winlog-system-68951c20-696b-4518-b64c-63de7317ef29
      │  └─ status: (DEGRADED) failed to open Windows Event Log channel "Security": Access is denied.
      └─ winlog-default-winlog-windows-e02b045d-e899-4ef1-b6ce-400be3d94119
         └─ status: (DEGRADED) Encountered channel not found error when opening Windows Event Log: The specified channel could not be found.
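
For anyone triaging output like this, a short script can list just the entries that are not HEALTHY. This is a rough sketch that only assumes the tree layout shown above (a name line followed by its `status:` line); `unhealthy_entries` is a hypothetical helper, and the sample is an excerpt of the output above.

```python
import re

# Matches "status: (STATE) message" once the tree-drawing prefix is removed.
STATUS_LINE = re.compile(r"status: \((?P<state>[A-Z]+)\) (?P<message>.*)$")
# Tree-drawing characters and indentation at the start of each line.
TREE_PREFIX = re.compile(r"^[\s│├└─┌]+")


def unhealthy_entries(status_text):
    """Return (name, state, message) for each entry whose status is not HEALTHY."""
    entries = []
    current_name = None
    for raw in status_text.splitlines():
        line = TREE_PREFIX.sub("", raw).strip()
        if not line:
            continue
        match = STATUS_LINE.match(line)
        if match is None:
            # Assumption: "key: value" lines are attributes; anything else is a node name.
            if ": " not in line:
                current_name = line
            continue
        if match.group("state") != "HEALTHY":
            entries.append((current_name, match.group("state"), match.group("message")))
    return entries


SAMPLE = """\
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a failed state
   └─ winlog-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '5416'
      ├─ winlog-default-winlog-system-68951c20-696b-4518-b64c-63de7317ef29
      │  └─ status: (DEGRADED) failed to open Windows Event Log channel "Security": Access is denied.
      └─ winlog-default-winlog-windows-e02b045d-e899-4ef1-b6ce-400be3d94119
         └─ status: (DEGRADED) Encountered channel not found error when opening Windows Event Log: The specified channel could not be found.
"""

for name, state, message in unhealthy_entries(SAMPLE):
    print(f"{state:9} {name}: {message}")
```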

The thing that strikes me there is we have multiple types of failures for winlog-default. One we newly want reported that hadn't been, and one we don't want reported that previously hadn't been.

I wonder if a good long term solution would be for the integration to feed in some type of filter data along with each channel it wants. Some enum or struct that tells it to actually degrade on access denied errors for this channel, but ignore not found errors for this channel, or warn/log (but not DEGRADE) for some other type of error for some other channel. Such that the integration can communicate how specific channel subscription failures should impact the integration's state. Trying to place that logic in code in filebeat seems awkward.
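
To make the idea concrete, here is a minimal sketch of what such per-channel filter data could look like. Everything in it is hypothetical: the `ErrorKind` and `Action` enums, the `CHANNEL_POLICIES` table, and the string matching in `classify` are illustrative only (real code would more likely inspect the underlying Windows error code), and it is written in Python rather than the Go used by filebeat.

```python
from enum import Enum


class ErrorKind(Enum):
    ACCESS_DENIED = "access_denied"
    CHANNEL_NOT_FOUND = "channel_not_found"
    OTHER = "other"


class Action(Enum):
    DEGRADE = "degrade"  # report the unit as DEGRADED
    WARN = "warn"        # log the error but keep the unit HEALTHY
    IGNORE = "ignore"    # keep the unit HEALTHY and stay quiet


# By default every kind of failure degrades the unit.
DEFAULT_POLICY = {kind: Action.DEGRADE for kind in ErrorKind}

# Hypothetical policy data the integration would pass along with each channel.
CHANNEL_POLICIES = {
    "Microsoft-Windows-Sysmon/Operational": {
        **DEFAULT_POLICY,
        ErrorKind.CHANNEL_NOT_FOUND: Action.IGNORE,  # Sysmon is optional on the host
    },
}


def classify(message):
    """Map an error message to an ErrorKind (naive substring matching)."""
    lowered = message.lower()
    if "access is denied" in lowered:
        return ErrorKind.ACCESS_DENIED
    if "channel could not be found" in lowered:
        return ErrorKind.CHANNEL_NOT_FOUND
    return ErrorKind.OTHER


def action_for(channel, message):
    """Look up how a failure on this channel should affect the unit's state."""
    policy = CHANNEL_POLICIES.get(channel, DEFAULT_POLICY)
    return policy[classify(message)]
```

With this table, a missing Sysmon channel is ignored, while an Access is denied error on Security (or on the Sysmon channel itself) still degrades the unit.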

I'm not sure that change could be ready for 8.16.0. Perhaps we should roll back the change that caused this issue to arise and continue to tolerate the missing Access is denied status, since unprivileged mode appears to still be technically beta. Then we could incorporate the channel-specific failure handling from the Windows integration in the next release. Thoughts? @cmacknz @jamiehynds @nfritts

@cmacknz
Member

cmacknz commented Oct 28, 2024

Having a per-input way to turn off the "errors mark the Beat as degraded" behavior would make more sense than just reverting the entire feature. This config could later expand into a list of specific errors to mute. For system/metrics we had similar ideas in elastic/beats#40543 but it hasn't been implemented yet.
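
As an illustration only, a per-input switch plus a mute list could behave roughly like the sketch below. The option names `degrade_on_error` and `muted_errors` are hypothetical and do not correspond to any existing Beats setting; the error message is kept even when the status stays HEALTHY.

```python
from dataclasses import dataclass, field


@dataclass
class InputStatusConfig:
    # Hypothetical option names, invented for this sketch.
    degrade_on_error: bool = True
    muted_errors: list = field(default_factory=list)  # substrings of errors to mute


def report_status(config, error):
    """Return (status, message); the message survives even if the status is HEALTHY."""
    if not error:
        return ("HEALTHY", "")
    muted = any(pattern in error for pattern in config.muted_errors)
    if config.degrade_on_error and not muted:
        return ("DEGRADED", error)
    return ("HEALTHY", error)


not_found = "The specified channel could not be found."
print(report_status(InputStatusConfig(), not_found))
print(report_status(InputStatusConfig(degrade_on_error=False), not_found))
print(report_status(InputStatusConfig(muted_errors=["channel could not be found"]), not_found))
```

The first call degrades as in the current behavior; the other two keep the input HEALTHY while still carrying the error text for display or logging.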

We have been more focused on fixing the specific errors, which in many cases have been actual bugs or permissions errors we were handling improperly. For system/metrics we also only get these errors when unprivileged.

The OTel collector process scraper allows muting specific categories of error, which is what we'd eventually want to emulate: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/hostmetricsreceiver/README.md#process

@cmacknz
Member

cmacknz commented Oct 28, 2024

@bjmcnic followed up separately and we agreed we should revert the winlog specific change here in 8.16 while we work on a proper fix for 8.17 considering that:

  1. This affects privileged agents for winlog in the default configuration, AKA everybody on Windows.
  2. The final 8.16.0 BC is Thursday.
  3. The linked PR is specific to the winlog input.

There isn't time to do a more in-depth fix, and the current state will probably lead to a flood of support cases. The revert PR is elastic/beats#41468.

@cmacknz
Member

cmacknz commented Oct 29, 2024

elastic/beats#41468 was merged; the change is now reverted in 8.16.

@amolnater-qasource
Author

Hi Team,

While testing on 8.17.0 SNAPSHOT, we have found this issue reproducible there too.

Observations:

  • Encountered channel not found error on adding Windows integration to the Windows agent.

Build details:
VERSION: 8.17.0 SNAPSHOT
BUILD: 80188
COMMIT: fdb16ae8cbdf4236db3696aa00d0bb98c943d864
Artifact Link: https://snapshots.elastic.co/8.17.0-7a041bf5/downloads/beats/elastic-agent/elastic-agent-8.17.0-SNAPSHOT-windows-x86_64.zip

Image

Screenshot:
Image

Logs:
elastic-agent-diagnostics-2024-11-18T09-03-35Z-00.zip

Please let us know if anything else is required from our end.

Thanks!

@bjmcnic
Contributor

bjmcnic commented Nov 18, 2024

It's fixed in 8.16.0. Looks like the revert of the change was made to the 8.16 branch, but hasn't hit main.

c:\>"c:\Program Files\Elastic\Agent\elastic-agent.exe" version
Binary: 8.16.0 (build: 3f07f2fd932f20e972399306d394763ade6b74b4 at 2024-11-07 13:33:43 +0000 UTC)
Daemon: 8.16.0 (build: 3f07f2fd932f20e972399306d394763ade6b74b4 at 2024-11-07 13:33:43 +0000 UTC)

c:\>"c:\Program Files\Elastic\Agent\elastic-agent.exe" status --output full
┌─ fleet
│  └─ status: (HEALTHY) Connected
└─ elastic-agent
   ├─ status: (HEALTHY) Running
   ├─ info
   │  ├─ id: 4eb139dd-aa42-4ec4-9463-39f1b7b37f60
   │  ├─ version: 8.16.0
   │  └─ commit: 3f07f2fd932f20e972399306d394763ade6b74b4
   ├─ beat/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '5128'
   │  ├─ beat/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ beat/metrics-monitoring-metrics-monitoring-beats
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ filestream-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '520'
   │  ├─ filestream-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ filestream-monitoring-filestream-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ http/metrics-monitoring
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '3876'
   │  ├─ http/metrics-monitoring
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ http/metrics-monitoring-metrics-monitoring-agent
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ log-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '5944'
   │  ├─ log-default
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ log-default-logfile-system-f57fba90-d162-43bb-8f6b-546015d84c78
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ system/metrics-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '3500'
   │  ├─ system/metrics-default
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ system/metrics-default-system/metrics-system-f57fba90-d162-43bb-8f6b-546015d84c78
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   ├─ windows/metrics-default
   │  ├─ status: (HEALTHY) Healthy: communicating with pid '2220'
   │  ├─ windows/metrics-default
   │  │  ├─ status: (HEALTHY) Healthy
   │  │  └─ type: OUTPUT
   │  └─ windows/metrics-default-windows/metrics-windows-e715bcd7-f597-4666-9a17-c75be66c9e02
   │     ├─ status: (HEALTHY) Healthy
   │     └─ type: INPUT
   └─ winlog-default
      ├─ status: (HEALTHY) Healthy: communicating with pid '4692'
      ├─ winlog-default
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: OUTPUT
      ├─ winlog-default-winlog-system-f57fba90-d162-43bb-8f6b-546015d84c78
      │  ├─ status: (HEALTHY) Healthy
      │  └─ type: INPUT
      └─ winlog-default-winlog-windows-e715bcd7-f597-4666-9a17-c75be66c9e02
         ├─ status: (HEALTHY) Healthy
         └─ type: INPUT

c:\>

@amolnater-qasource amolnater-qasource added the QA:Ready For Testing Code is merged and ready for QA to validate label Nov 19, 2024
@amolnater-qasource
Author

amolnater-qasource commented Dec 6, 2024

Hi Team,
We have revalidated this issue on the latest 8.17.0 BC5 Kibana cloud environment and found it fixed now.

Observations:

  • No errors are displayed on adding Windows integration to the Windows agent and agent remains Healthy throughout.

Screenshots:
Image
Image
Image

Logs:

elastic-agent-diagnostics-2024-12-06T04-47-57Z-00 (1).zip

Build details:
VERSION: 8.17.0 BC5
BUILD: 80495
COMMIT: 5c78fb5e4e9b5063bd83ae9bd1e5b32c63f5cc34
Artifact Link: https://staging.elastic.co/8.17.0-a18e6540/downloads/beats/elastic-agent/elastic-agent-8.17.0-windows-x86_64.zip

Hence we are closing and marking this issue as QA:Validated.

Thanks!

@amolnater-qasource amolnater-qasource added QA:Validated Validated by the QA Team and removed QA:Ready For Testing Code is merged and ready for QA to validate labels Dec 6, 2024