[Flaky Test]: TestLongRunningAgentForLeaks/TestHandleLeak – Metricbeat input status reporting makes Windows agent permanently degraded #5300

Closed
Tracked by #40542
rdner opened this issue Aug 14, 2024 · 6 comments
Assignees
Labels
flaky-test Unstable or unreliable test cases. Team:Elastic-Agent Label for the Agent team

Comments

@rdner
Member

rdner commented Aug 14, 2024

Failing test case

TestLongRunningAgentForLeaks/TestHandleLeak

Error message

agent isn't healthy, current status: DEGRADED

Build

OS

Linux, Windows

Stacktrace and notes

=== RUN   TestLongRunningAgentForLeaks/TestHandleLeak
    agent_long_running_leak_test.go:226: unit ID: beat/metrics-monitoring
    agent_long_running_leak_test.go:226: unit ID: beat/metrics-monitoring-metrics-monitoring-beats
    agent_long_running_leak_test.go:235: component state: Healthy: communicating with pid '27007'
    agent_long_running_leak_test.go:226: unit ID: filestream-monitoring
    agent_long_running_leak_test.go:226: unit ID: filestream-monitoring-filestream-monitoring-agent
    agent_long_running_leak_test.go:235: component state: Healthy: communicating with pid '27000'
    agent_long_running_leak_test.go:226: unit ID: http/metrics-monitoring
    agent_long_running_leak_test.go:226: unit ID: http/metrics-monitoring-metrics-monitoring-agent
    agent_long_running_leak_test.go:235: component state: Healthy: communicating with pid '27016'
    agent_long_running_leak_test.go:226: unit ID: log-default
    agent_long_running_leak_test.go:226: unit ID: log-default-logfile-apache-a09cc97f-6135-4a1b-894c-3652908819fd
    agent_long_running_leak_test.go:226: unit ID: log-default-logfile-system-68e60f33-cd06-475a-b448-cdb7040a25ee
    agent_long_running_leak_test.go:235: component state: Healthy: communicating with pid '27028'
    agent_long_running_leak_test.go:226: unit ID: system/metrics-default
    agent_long_running_leak_test.go:226: unit ID: system/metrics-default-system/metrics-system-68e60f33-cd06-475a-b448-cdb7040a25ee
    agent_long_running_leak_test.go:235: component state: Healthy: communicating with pid '27034'
    agent_long_running_leak_test.go:374: created handle watcher for beat/metrics (27007)
    agent_long_running_leak_test.go:374: created handle watcher for filestream (27000)
    agent_long_running_leak_test.go:374: created handle watcher for http/metrics (27016)
    agent_long_running_leak_test.go:374: created handle watcher for log (27028)
    agent_long_running_leak_test.go:374: created handle watcher for system/metrics (27034)
    agent_long_running_leak_test.go:164: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/agent_long_running_leak_test.go:164
        	Error:      	Received unexpected error:
        	            	agent isn't healthy, current status: DEGRADED
        	Test:       	TestLongRunningAgentForLeaks/TestHandleLeak
--- FAIL: TestLongRunningAgentForLeaks/TestHandleLeak (30.92s)
@rdner rdner added Team:Elastic-Agent Label for the Agent team flaky-test Unstable or unreliable test cases. labels Aug 14, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@cmacknz
Member

cmacknz commented Aug 14, 2024

This is the regression test for elastic/beats#37142, so we need to get it working again as soon as we can.

As long as it fails this way, the test isn't testing what it is supposed to, so I'm going to skip it to give us time to sort that out.

@cmacknz
Member

cmacknz commented Aug 14, 2024

I am going to relax the check in the test to allow the degraded state while we figure out how to handle it (#5301).

We do not want all Windows Metricbeat instances to report degraded by default. There is nothing unique about this test (except perhaps the increased metrics collection interval).
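
As a rough illustration of what relaxing the check could look like, here is a minimal Go sketch. It is not the change made in #5301: the `State` type, its constants, and `checkHealthy` are hypothetical names invented for this example, and only the error text is taken from the failing test output above.

```go
package main

import "fmt"

// State is a hypothetical stand-in for the status the agent reports.
type State string

const (
	Healthy  State = "HEALTHY"
	Degraded State = "DEGRADED"
	Failed   State = "FAILED"
)

// checkHealthy returns nil when the state is HEALTHY, or when it is
// DEGRADED and allowDegraded is set. Any other state is an error.
func checkHealthy(s State, allowDegraded bool) error {
	if s == Healthy || (allowDegraded && s == Degraded) {
		return nil
	}
	return fmt.Errorf("agent isn't healthy, current status: %s", s)
}

func main() {
	// Strict check: DEGRADED is rejected, as in the failing test.
	fmt.Println(checkHealthy(Degraded, false))
	// Relaxed check: DEGRADED is tolerated, FAILED would still be rejected.
	fmt.Println(checkHealthy(Degraded, true))
}
```

With a flag like this, a leak test could keep running against a degraded agent while still failing fast on any state worse than degraded.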

@cmacknz cmacknz changed the title [Flaky Test]: TestLongRunningAgentForLeaks/TestHandleLeak – agent isn't healthy, current status: DEGRADED [Flaky Test]: TestLongRunningAgentForLeaks/TestHandleLeak – Metricbeat input status reporting makes Windows agent permanently degraded Aug 14, 2024
@cmacknz
Member

cmacknz commented Aug 15, 2024

This has been mitigated by #5301, so the test can still detect memory leaks, but the underlying problems causing the degraded state remain.

There appear to be two separate errors happening:

        units:
            input-beat/metrics-monitoring-metrics-monitoring-beats:
                message: 'Error fetching data for metricset beat.stats: error making http request: Get "http://unix/stats": dial unix /opt/Elastic/Agent/data/tmp/iThI_df0cBKC6YUNGGlKscMkOfz3FBH3.sock: connect: no such file or directory'
                payload:
                    streams:
                        metrics-monitoring-filebeat:
                            error: ""
                            status: HEALTHY
                        metrics-monitoring-metricbeat:
                            error: 'Error fetching data for metricset beat.stats: error making http request: Get "http://unix/stats": dial unix /opt/Elastic/Agent/data/tmp/iThI_df0cBKC6YUNGGlKscMkOfz3FBH3.sock: connect: no such file or directory'
                            status: DEGRADED
                state: 3

and

            input-system/metrics-default-system/metrics-system-5f5e65eb-2fd6-41e1-8c29-f24d57e66509:
                message: |-
                    Error fetching data for metricset system.process_summary: Not enough privileges to fetch information: Not enough privileges to fetch information: GetInfoForPid: could not get all information for PID 0: error fetching name: OpenProcess failed for pid=0: The parameter is incorrect.
                    error fetching status: OpenProcess failed for pid=0: The parameter is incorrect.
                    GetInfoForPid: could not get all information for PID 4: error fetching name: GetProcessImageFileName failed for pid=4: GetProcessImageFileName failed: invalid argument
                payload:
                    streams:
                        system/metrics-system.process-5f5e65eb-2fd6-41e1-8c29-f24d57e66509:
                            error: |-
                                Error fetching data for metricset system.process: Not enough privileges to fetch information: Not enough privileges to fetch information: GetInfoForPid: could not get all information for PID 0: error fetching name: OpenProcess failed for pid=0: The parameter is incorrect.
                                error fetching status: OpenProcess failed for pid=0: The parameter is incorrect.
                                GetInfoForPid: could not get all information for PID 4: error fetching name: GetProcessImageFileName failed for pid=4: GetProcessImageFileName failed: invalid argument
                                non fatal error fetching PID some info for 116, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied.
                                non fatal error fetching PID some info for 360, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied.
                                non fatal error fetching PID some info for 472, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied.
                                non fatal error fetching PID some info for 556, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied.
                                non fatal error fetching PID some info for 564, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied.
                                non fatal error fetching PID some info for 696, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied.
                                non fatal error fetching PID some info for 4304, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied.
                                non fatal error fetching PID some info for 3108, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied.
                                non fatal error fetching PID some info for 2116, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied.
                                non fatal error fetching PID some info for 1716, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied.
                                non fatal error fetching PID some info for 4856, metrics are valid, but partial: FillMetricsRequiringMoreAccess: error fetching process args: Not enough privileges to fetch information: OpenProcess failed: Access is denied.
                            status: DEGRADED
                        system/metrics-system.process.summary-5f5e65eb-2fd6-41e1-8c29-f24d57e66509:
                            error: |-
                                Error fetching data for metricset system.process_summary: Not enough privileges to fetch information: Not enough privileges to fetch information: GetInfoForPid: could not get all information for PID 0: error fetching name: OpenProcess failed for pid=0: The parameter is incorrect.
                                error fetching status: OpenProcess failed for pid=0: The parameter is incorrect.
                                GetInfoForPid: could not get all information for PID 4: error fetching name: GetProcessImageFileName failed for pid=4: GetProcessImageFileName failed: invalid argument
                            status: DEGRADED
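
To make the dumps above easier to read, the following Go sketch walks a unit-to-stream structure and lists only the streams that are not healthy. The `Unit` and `Stream` types and the `degradedStreams` helper are assumptions made for illustration and do not reflect the agent's actual status types. The sample data copies the unit and stream IDs from the first dump and omits the error text.

```go
package main

import (
	"fmt"
	"sort"
)

// Stream and Unit are hypothetical types that mirror the shape of the
// status dump: each unit carries a payload of named streams.
type Stream struct {
	Status string
	Error  string
}

type Unit struct {
	Message string
	Streams map[string]Stream
}

// degradedStreams returns "unit -> stream" for every stream whose status
// is not HEALTHY, sorted so the output is deterministic.
func degradedStreams(units map[string]Unit) []string {
	var out []string
	for unitID, u := range units {
		for streamID, s := range u.Streams {
			if s.Status != "HEALTHY" {
				out = append(out, unitID+" -> "+streamID)
			}
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// IDs copied from the first dump; error text left out for brevity.
	units := map[string]Unit{
		"input-beat/metrics-monitoring-metrics-monitoring-beats": {
			Streams: map[string]Stream{
				"metrics-monitoring-filebeat":   {Status: "HEALTHY"},
				"metrics-monitoring-metricbeat": {Status: "DEGRADED"},
			},
		},
	}
	for _, line := range degradedStreams(units) {
		fmt.Println(line)
	}
}
```

For the first dump this would single out the `metrics-monitoring-metricbeat` stream, which matches the one stream reported as DEGRADED there.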

@cmacknz
Member

cmacknz commented Aug 15, 2024

One way to permanently mitigate the second error is elastic/beats#40542, which would let us revert #5301.

@VihasMakwana
Contributor

@ycombinator @pierrehilbert this is fixed by elastic/beats#40565.

The extended runtime leak tests are passing.

I'm closing this for now.

5 participants