Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change upgrade watcher to use StateWatch #3622

Merged
merged 16 commits into from
Oct 19, 2023

Conversation

blakerouse
Copy link
Contributor

@blakerouse blakerouse commented Oct 17, 2023

What does this PR do?

Refactors how the upgrade watcher works removing the need to parse the PID from any system service manager.

The watcher now tracks the inability to connect to the daemon, lost connections to the daemon, agent failures, component failures, and flip-flopping of healthy/error states. The tracking of the lost connections to the daemon allows the removal of the PID tracking that used to be present. The error tracking is more robust as it requires that an error be an error for at least 30 seconds, allowing the Elastic Agent to possibly have an issue and then recover (before if the State call happened when an error was present it would just become failed).

This also includes unit testing for the AgentWatcher service that previously had no unit tests.

Why is it important?

Completely removes the requirement of executing and reading output of service managers (removing the need for root access). Improves the overall logic and provides better testing.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test (covered by existing tests)

@blakerouse blakerouse added Team:Elastic-Agent Label for the Agent team backport-skip labels Oct 17, 2023
@blakerouse blakerouse self-assigned this Oct 17, 2023
@blakerouse blakerouse marked this pull request as ready for review October 17, 2023 14:55
@blakerouse blakerouse requested a review from a team as a code owner October 17, 2023 14:55
@elasticmachine
Copy link
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@elasticmachine
Copy link
Contributor

elasticmachine commented Oct 17, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-10-18T20:41:10.213+0000

  • Duration: 25 min 25 sec

Test stats 🧪

Test Results
Failed 0
Passed 6513
Skipped 59
Total 6572

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages.

  • run integration tests : Run the Elastic Agent Integration tests.

  • run end-to-end tests : Generate the packages and run the E2E Tests.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Copy link
Contributor

elasticmachine commented Oct 17, 2023

🌐 Coverage report

Name Metrics % (covered/total) Diff
Packages 98.81% (83/84) 👍
Files 66.997% (203/303) 👎 -0.109
Classes 65.946% (366/555) 👍 0.227
Methods 53.128% (1155/2174) 👍 0.131
Lines 39.542% (13633/34477) 👍 0.28
Conditionals 100.0% (0/0) 💚

@ycombinator
Copy link
Contributor

crash_check.interval: 15s

Should we be removing this line? Or do we need to leave it in there for older Agent versions (that don't have the watcher implementation in this PR)?

case err := <-failedCh:
if err != nil {
if failedErr == nil {
failedCount++
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we only increment failedCount when we received a non-nil error from failedCh but previously there was no error (i.e. failedErr == nil). So that means failedCount actually represents the number of times we flip-flopped from a non-error state to an error state, right? If my understanding is correct, could we rename this to something like flipFlopCount or something else indicating that it's counting flip-flops?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@blakerouse
Copy link
Contributor Author

crash_check.interval: 15s

Should we be removing this line? Or do we need to leave it in there for older Agent versions (that don't have the watcher implementation in this PR)?

This has to remaining for older versions.

Copy link
Contributor

@ycombinator ycombinator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@blakerouse
Copy link
Contributor Author

buildkite test this

@blakerouse blakerouse enabled auto-merge (squash) October 19, 2023 01:05
@blakerouse
Copy link
Contributor Author

another unrelated failure... this time the endpoint integration

@blakerouse
Copy link
Contributor Author

buildkite test this

@elastic-sonarqube
Copy link

@blakerouse blakerouse merged commit d64d704 into elastic:main Oct 19, 2023
8 checks passed
@blakerouse blakerouse deleted the watcher-stream branch October 19, 2023 13:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants