Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Elastic Agent] Elastic Agent takes long to shut down #29650

Merged
merged 27 commits into from
Feb 14, 2022
Merged

Conversation

AndersonQ
Copy link
Member

@AndersonQ AndersonQ commented Dec 30, 2021

What does this PR do?

WIP:

  • The agent shuts down the applications concurrently
  • stop ignoring errors when stopping applications/services
  • set process (x-pack/elastic-agent/pkg/core/plugin/process/app.go) state to failure when it fails to stop
  • fix typos
  • add more logs: erros, info, debug

Why is it important?

It's part of the effort to support the new M1 Macs.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Author's Checklist

  • do the TODOs
  • remove investigation logs
  • remove investigation changes

How to test this PR locally

Related issues

Screenshots

Logs

Before the fix:

[vagrant@archlinux elastic-agent-7.16.3-linux-x86_64]$ time sudo systemctl stop elastic-agent.service

real	1m30.173s
user	0m0.009s
sys	0m0.009s
[vagrant@archlinux elastic-agent-7.16.3-linux-x86_64]$ 

{"log.level":"info","@timestamp":"2022-01-24T16:01:15.460Z","log.origin":{"file.name":"cmd/run.go","file.line":184},"message":"Shutting down Elastic Agent and sending last events...","ecs.version":"1.6.0"}

[...]

{"log.level":"info","@timestamp":"2022-01-24T16:01:47.452Z","log.origin":{"file.name":"log/reporter.go","file.line":40},"message":"2022-01-24T16:01:47Z - message: Application: filebeat--7.16.3--36643631373035623733363936343635[8e074b30-f029-409d-869b-d5fd78181398]: State changed to DEGRADED: Missed last check-in - type: 'STATE' - sub_type: 'RUNNING'","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-01-24T16:01:55.497Z","log.origin":{"file.name":"process/app.go","file.line":176},"message":"Signaling application to stop because of shutdown: metricbeat--7.16.3","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-01-24T16:01:55.497Z","log.origin":{"file.name":"log/reporter.go","file.line":40},"message":"2022-01-24T16:01:55Z - message: Application: filebeat--7.16.3[8e074b30-f029-409d-869b-d5fd78181398]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-01-24T16:02:35.547Z","log.origin":{"file.name":"process/app.go","file.line":176},"message":"Signaling application to stop because of shutdown: filebeat--7.16.3--36643631373035623733363936343635","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-01-24T16:02:35.547Z","log.origin":{"file.name":"log/reporter.go","file.line":40},"message":"2022-01-24T16:02:35Z - message: Application: metricbeat--7.16.3[8e074b30-f029-409d-869b-d5fd78181398]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'","ecs.version":"1.6.0"}

After the fix:

[vagrant@archlinux ~]$ time systemctl stop elastic-agent.service

[vagrant@archlinux ~]$ time systemctl stop elastic-agent.service

real	0m10.535s
user	0m0.008s
sys	0m0.007s
[vagrant@archlinux ~]$ 
{"log.level":"info","@timestamp":"2022-01-26T15:14:44.858Z","log.origin":{"file.name":"cmd/run.go","file.line":189},"message":"Shutting down Elastic Agent and sending last events...","ecs.version":"1.6.0"}

[...]

{"log.level":"error","@timestamp":"2022-01-26T15:14:45.359Z","log.origin":{"file.name":"process/app.go","file.line":290},"message":"failed to stop metricbeat: os: process already finished","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-01-26T15:14:45.359Z","log.origin":{"file.name":"process/app.go","file.line":290},"message":"failed to stop metricbeat_monitoring: os: process already finished","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-01-26T15:14:46.359Z","log.origin":{"file.name":"log/reporter.go","file.line":40},"message":"2022-01-26T15:14:46Z - message: Application: metricbeat--7.16.3-SNAPSHOT--36643631373035623733363936343635[0c1bba5c-6961-4c23-ae43-f4c8847a27c9]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-01-26T15:14:46.359Z","log.origin":{"file.name":"log/reporter.go","file.line":40},"message":"2022-01-26T15:14:46Z - message: Application: metricbeat--7.16.3-SNAPSHOT[0c1bba5c-6961-4c23-ae43-f4c8847a27c9]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-01-26T15:14:55.360Z","log.origin":{"file.name":"process/app.go","file.line":315},"message":"gracefulKill timed out after 10000000000, killing filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-01-26T15:14:55.360Z","log.origin":{"file.name":"process/app.go","file.line":315},"message":"gracefulKill timed out after 10000000000, killing filebeat_monitoring","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-01-26T15:14:55.360Z","log.origin":{"file.name":"log/reporter.go","file.line":40},"message":"2022-01-26T15:14:55Z - message: Application: filebeat--7.16.3-SNAPSHOT--36643631373035623733363936343635[0c1bba5c-6961-4c23-ae43-f4c8847a27c9]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-01-26T15:14:55.360Z","log.origin":{"file.name":"log/reporter.go","file.line":40},"message":"2022-01-26T15:14:55Z - message: Application: filebeat--7.16.3-SNAPSHOT[0c1bba5c-6961-4c23-ae43-f4c8847a27c9]: State changed to STOPPED: Stopped - type: 'STATE' - sub_type: 'STOPPED'","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-01-26T15:14:55.361Z","log.origin":{"file.name":"application/managed_mode.go","file.line":319},"message":"Agent is stopped","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-01-26T15:14:55.361Z","log.origin":{"file.name":"cmd/run.go","file.line":197},"message":"Shutting down completed.","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-01-26T15:14:55.361Z","log.logger":"api","log.origin":{"file.name":"api/server.go","file.line":66},"message":"Stats endpoint (/opt/Elastic/Agent/data/tmp/elastic-agent.sock) finished: accept unix /opt/Elastic/Agent/data/tmp/elastic-agent.sock: use of closed network connection","ecs.version":"1.6.0"}

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Dec 30, 2021
@mergify
Copy link
Contributor

mergify bot commented Dec 30, 2021

This pull request does not have a backport label. Could you fix it @AndersonQ? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v./d./d./d is the label to automatically backport to the 7./d branch. /d is the digit

NOTE: backport-skip has been added to this pull request.

@mergify mergify bot added the backport-skip Skip notification from the automated backport with mergify label Dec 30, 2021
@elasticmachine
Copy link
Collaborator

elasticmachine commented Dec 30, 2021

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2022-02-11T07:20:32.917+0000

  • Duration: 101 min 28 sec

Test stats 🧪

Test Results
Failed 0
Passed 7232
Skipped 16
Total 7248

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

// Types is the list of package types
var SelectedPackageTypes []PackageType
// SelectedPackageTypes is the list of package types
var SelectedPackageTypes []PackageType = []PackageType{TarGz}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

windows does not have targz. careful and test on windows vm for sure

LDFlags: []string{
"-s", // Strip all debug symbols from binary (does not affect Go stack traces).
// "-s", // Strip all debug symbols from binary (does not affect Go stack traces).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure having debug symbols on prod binaries was intended here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove it before the merge. Actually, I think if it could be configurable it'd valuable to have it in development mode

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a "DEV" flag that exists in other places, it could be used to enable or not symbols?

DevBuild, err = strconv.ParseBool(EnvOr("DEV", "false"))
if err != nil {
panic(errors.Wrap(err, "failed to parse DEV env value"))
}

@@ -279,8 +289,13 @@ func (a *Application) gracefulKill(proc *process.Info) {
return
}

alog := logp.NewLogger("anderson")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reminder to remove this before merge

@michalpristas
Copy link
Contributor

can you add some results from tests. how long it took before vs now after a change

@jlind23
Copy link
Collaborator

jlind23 commented Jan 17, 2022

@ph @AndersonQ could you please update this PR and make it move?

@jlind23 jlind23 added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Jan 17, 2022
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jan 17, 2022
@ph
Copy link
Contributor

ph commented Jan 17, 2022

Thanks @AndersonQ

@AndersonQ AndersonQ marked this pull request as ready for review January 17, 2022 16:39
@AndersonQ AndersonQ requested a review from a team as a code owner January 17, 2022 16:39
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

Copy link
Contributor

@ph ph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few comments, is there enough existing test around shutdown?

LDFlags: []string{
"-s", // Strip all debug symbols from binary (does not affect Go stack traces).
// "-s", // Strip all debug symbols from binary (does not affect Go stack traces).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a "DEV" flag that exists in other places, it could be used to enable or not symbols?

DevBuild, err = strconv.ParseBool(EnvOr("DEV", "false"))
if err != nil {
panic(errors.Wrap(err, "failed to parse DEV env value"))
}

@@ -90,6 +91,8 @@ func run(streams *cli.IOStreams, override cfgOverrider) error {
return err
}

// TODO: remove before review
cfg.Settings.LoggingConfig.Level = logp.DebugLevel
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AndersonQ Oops, did you need to have the logging in test? while testing you can update the configuration to have it enable?

@@ -280,7 +286,9 @@ func (a *Application) gracefulKill(proc *process.Info) {
}

// send stop signal to request stop
proc.Stop()
if err := proc.Stop(); err != nil {
a.logger.Error(fmt.Errorf("failed to stop %s: %w", a.Name(), err))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of calling fmt.Errorf( you can call a.logger.Errorf(

@@ -275,6 +278,7 @@ func (a *Application) Stop() {
func (a *Application) Shutdown() {
a.appLock.Lock()
defer a.appLock.Unlock()
a.logger.Infof("Signaling service to stop because of shutdown: %s", a.id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/Signaling/signaling Don't use capitalization when starting a logging statement.

@@ -294,6 +298,7 @@ func (a *Application) Shutdown() {
//
// It updates the status of the application and handles restarting the application is needed.
func (a *Application) OnStatusChange(s *server.ApplicationState, status proto.StateObserved_Status, msg string, payload map[string]interface{}) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra space?

Copy link
Contributor

@ph ph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should have used request changes.

Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to fix the lock

Copy link
Contributor

@ph ph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets split build changes vs fixing the shutdow issue.

// Strip all debug symbols from binary (does not affect Go stack traces).
args.LDFlags = append(args.LDFlags, "-s")
}

return args
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AndersonQ Could you move this into his own pull request? Since this changes is not required for solving the "takes long to shutdown".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done: #29961

@AndersonQ AndersonQ self-assigned this Jan 24, 2022
Copy link
Contributor

@ph ph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a changelog entry?

@AndersonQ AndersonQ added the bug label Jan 28, 2022
@AndersonQ
Copy link
Member Author

@ph, @blakerouse could you re-review it please?

Copy link
Contributor

@ph ph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AndersonQ Changes LGTM, is there enough test coverage around that area of code? Can you investigate that? That the only thing that is preventing me from approving it.

x-pack/elastic-agent/CHANGELOG.next.asciidoc Outdated Show resolved Hide resolved
@AndersonQ
Copy link
Member Author

AndersonQ commented Feb 1, 2022

@AndersonQ Changes LGTM, is there enough test coverage around that area of code? Can you investigate that? That the only thing that is preventing me from approving it.

not for the shutdown function. the coverage on the package isn't that great:

ok  	github.com/elastic/beats/v7/x-pack/elastic-agent/pkg/agent/operation	16.567s	coverage: 50.2% of statements

Well, I could do a manual test, compare the elastic and beats logs when shutting down to check that they shutdown successfully

@AndersonQ AndersonQ requested a review from ph February 1, 2022 16:25
@@ -120,6 +125,8 @@ func TestExecFile_StdOut(t *testing.T) {
}

func TestExecFile_NoOutput(t *testing.T) {
t.Skip("skipping failing tests")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AndersonQ is this new? I haven't seen that in the original PR commits?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's a mistake, the're failing on master as well... let me fix it

Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, lock is fixed! Thanks.

@ph
Copy link
Contributor

ph commented Feb 2, 2022

Change LGTM thanks !

@AndersonQ
Copy link
Member Author

/package

@jlind23
Copy link
Collaborator

jlind23 commented Feb 10, 2022

/test

@jlind23
Copy link
Collaborator

jlind23 commented Feb 10, 2022

@AndersonQ @ph the only issue left in the e2e testing repo is about the k8s autodiscover. Can we merge this PR then?

@ph
Copy link
Contributor

ph commented Feb 10, 2022

@jlind23 LGTM, this would be 8.0 8.1 8.2?

@mergify
Copy link
Contributor

mergify bot commented Feb 10, 2022

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b 27530-long-shutdown upstream/27530-long-shutdown
git merge upstream/main
git push upstream 27530-long-shutdown

@jlind23
Copy link
Collaborator

jlind23 commented Feb 11, 2022

I would rather keep it in 8.2 only to avoid any side effect.

@ph ph added backport-v8.2.0 Automated backport with mergify and removed backport-skip Skip notification from the automated backport with mergify labels Feb 11, 2022
@AndersonQ AndersonQ merged commit 952c2eb into main Feb 14, 2022
@AndersonQ AndersonQ deleted the 27530-long-shutdown branch February 14, 2022 09:25
v1v added a commit to v1v/beats that referenced this pull request Feb 21, 2022
…into feature/use-with-kind-k8s-env

* 'feature/use-with-kind-k8s-env' of github.com:v1v/beats: (52 commits)
  ci: home is declared within withBeatsEnv
  ci: use withKindEnv step
  ci: use getBranchesFromAliases and support next-patch-8 (elastic#30400)
  Update fields.yml (elastic#29609)
  Heartbeat: fix browser metrics and trace mappings (elastic#30258)
  Apply light edits to 8.0 changelog (elastic#30351)
  packetbeat/beater: make sure Npcap installation runs before interfaces are needed (elastic#30396)
  Add a ring-buffer reporter to libbeat (elastic#28750)
  Osquerybeat: Add install verification for osquerybeat (elastic#30388)
  update windows matrix support (elastic#30373)
  Refactor of metricbeat process-gathering metrics and system/process (elastic#30076)
  adjust next changelog wording (elastic#30371)
  [Metricbeat] azure: move event report into loop validDim loop (elastic#29945)
  fix: report GitHub Check before the cache (elastic#30372)
  Add support for non-unique keys in Kafka output headers (elastic#30369)
  ci: 6 major branch reached EOL (elastic#30357)
  reduce Elastic Agent shut down time by stopping processes concurrently (elastic#29650)
  [Filebeat] Add message to register encode/decode debug logs (elastic#30271)
  [libbeat] kafka message header support (elastic#29940)
  Heartbeat: set duration to zero for syntax errors (elastic#30227)
  ...
@amolnater-qasource
Copy link

amolnater-qasource commented Feb 28, 2022

Hi Team
Test results on 8.1 BC-4 for this PR has been shared at #27530 (comment)

UPDATE:
We will be validating the same on 8.2 Snapshot.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v8.2.0 Automated backport with mergify bug Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Elastic Agent] Elastic Agent takes long to shut down
7 participants