[Agent] "Metricbeat" service crashed on restarting elastic-agent, want improved beat side logging to inform user #25785

Closed
amolnater-qasource opened this issue May 19, 2021 · 15 comments
Assignees
Labels
bug Team:Elastic-Agent Label for the Agent team v7.14.0

Comments

@amolnater-qasource

amolnater-qasource commented May 19, 2021

Kibana version: 7.13.0 BC-7 Kibana self-managed environment

Host OS and Browser version: Windows 10 x64, All

Build Details:

 Artifact link used: https://staging.elastic.co/7.13.0-8eb98cbf/summary-7.13.0.html
 BUILD: 40864
 COMMIT: 6ce6847436ff9bef0ad91268b6585e0f9339c9fd

Preconditions:

  1. 7.13.0 BC-7 self-managed Kibana environment should be available.
  2. Windows 10 x64 Fleet Server agent must be installed using Default Fleet server policy having only Fleet Server integration.

Steps to reproduce:

  1. Login to Kibana environment.
  2. Assign "Agent" to Default policy having System, Endpoint and Fleet Server integrations.
  3. Restart elastic-agent from services.
  4. Navigate to agent "Logs" tab and observe Metricbeat service stuck in "Starting-Restarting-Crashed" loop.

Expected Result:
[Self managed]: "Metricbeat" service should be in "Running" state on restarting elastic-agent.

Logs:
logs.zip

Screenshot: [image omitted]

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label May 19, 2021
@amolnater-qasource amolnater-qasource added impact:high Short-term priority; add to current release, or definitely next. and removed needs_team Indicates that the issue/PR needs a Team:* label labels May 19, 2021
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label May 19, 2021
@amolnater-qasource amolnater-qasource added Team:Fleet Label for the Fleet team and removed needs_team Indicates that the issue/PR needs a Team:* label labels May 19, 2021
@elasticmachine
Collaborator

Pinging @elastic/fleet (Team:Fleet)

@amolnater-qasource
Author

@dikshachauhan-qasource Please review.

@dikshachauhan-qasource

Reviewed and assigned to @EricDavisX

@elasticmachine
Collaborator

Pinging @elastic/agent (Team:Agent)

@EricDavisX EricDavisX removed the Team:Fleet Label for the Fleet team label May 20, 2021
@EricDavisX
Contributor

@ph @ruflin this seems like the root cause of the issue I just pinged you on. It also sounds familiar and pretty severe. Is this not a blocker for self-managed, where we expect the stack / agents to go on/offline quite a bit?

@michalpristas
Contributor

Cannot reproduce on either darwin or windows, using the same version.

Could I get the contents of the data/elastic-agent-{hash}/logs directory? That would help.

@michalpristas
Contributor

Also, if possible, the whole Program Files/Elastic/Agent directory would be good. I'm not sure whether GitHub allows uploading a file that large; if not, please DM me on Slack.

@amolnater-qasource
Author

Sure @michalpristas as per request I will share all the required details on slack.
Thanks

@michalpristas
Contributor

There was an orphaned metricbeat process blocking the data path. When the agent was stopped, this process was not stopped with it and kept running, eating a lot of RAM (2.5 GB).

When I killed this process, the issue went away.
The problem with the restart was that the orphaned metricbeat was using the same data path as the one we were trying to start. The new one found that the path was already locked and exited with an error.

We should find out why the orphaned metricbeat was created in the first place, and we could possibly isolate data paths more. But I think it's good to learn that you have such a process by seeing the agent complain, rather than have multiple beats running and eating resources.

@EricDavisX
Contributor

@michalpristas thanks for the thoughts. Do you think we can check whether the data path is in use at startup and add more specific error logging to help guide users / devs about this particular undesired occurrence? We could use this ticket for that for now. @amolnater-qasource if you were doing regular operations / test steps, let's put some time into figuring out which specific steps caused this and log a new ticket. I expect the description would be 'after doing xyz, Agent is shut down but metricbeat is orphaned and still running'.

@michalpristas
Contributor

This is internal logic of Beats, and I would like to avoid having checks like this in the agent.
We could check periodically, or at start, for processes running from our path before starting anything, and raise it as an ERROR or WARNING. But I guess this is up for discussion.

cc @urso
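
The "check for processes running from our path" idea above could look roughly like the following Go sketch. It only shows the filtering step: the `procInfo` type and the `processesUnder` function are hypothetical names for this example, the sample paths are made up, and obtaining the list of running processes is platform-specific and deliberately left out.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// procInfo is a hypothetical record describing a running process.
// How the list is gathered depends on the platform and is not shown.
type procInfo struct {
	PID int
	Exe string // absolute path of the executable
}

// processesUnder returns the processes whose executable lives inside root,
// skipping the process identified by selfPID.
func processesUnder(procs []procInfo, root string, selfPID int) []procInfo {
	var found []procInfo
	for _, p := range procs {
		if p.PID == selfPID {
			continue
		}
		rel, err := filepath.Rel(root, p.Exe)
		if err != nil {
			continue // paths cannot be related, so Exe is not under root
		}
		// A relative path that climbs out of root means Exe is elsewhere.
		if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
			continue
		}
		found = append(found, p)
	}
	return found
}

func main() {
	// Sample data standing in for a real process listing.
	procs := []procInfo{
		{PID: 100, Exe: "/opt/agent/elastic-agent"},
		{PID: 200, Exe: "/opt/agent/data/metricbeat"},
		{PID: 300, Exe: "/usr/bin/other"},
	}

	// Before starting anything, warn about leftovers from a previous run.
	for _, p := range processesUnder(procs, "/opt/agent", 100) {
		fmt.Printf("WARNING: process %d (%s) is already running from the install path\n", p.PID, p.Exe)
	}
}
```

Using `filepath.Rel` rather than a plain string prefix check avoids treating a sibling directory such as `/opt/agent-other` as part of `/opt/agent`.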

@amolnater-qasource
Author

amolnater-qasource commented May 24, 2021

Hi @EricDavisX
As per your feedback above, we have filed a ticket with the required steps at #25829

Thanks
QAS

@EricDavisX EricDavisX changed the title from '[Self managed]: "Metricbeat" service crashed on restarting elastic-agent when assigned to Default policy having System, Endpoint and Fleet server integration.' to '[Agent] "Metricbeat" service crashed on restarting elastic-agent, want improved beat side logging to inform user' May 24, 2021
@EricDavisX EricDavisX removed impact:high Short-term priority; add to current release, or definitely next. v7.13.0 labels May 24, 2021
@EricDavisX
Contributor

I've updated the short description and modified the labels. Improving the logging is not urgent for 7.13; the other ticket noted above is high priority for self-managed usage and is tagged for 7.13.1 review (the usual urgent-issues review process).

@amolnater-qasource
Author

Hi @EricDavisX
We have revalidated this issue on 7.14.0 self-managed Snapshot Kibana and found it fixed.

Build Details:

Artifact link: https://snapshots.elastic.co/7.14.0-28665d9b/downloads/beats/elastic-agent/elastic-agent-7.14.0-SNAPSHOT-windows-x86_64.zip
Build: 41559
Commit: 9838db392e7fcfc12f004b68fb1b09739f131148

Steps followed:

  1. Login to self-managed Kibana environment.
  2. Install Agent with Default Fleet Server Policy.
  3. Assign "Agent" to Default policy having System, Endpoint and Fleet Server Integration.
  4. Restart elastic-agent from services.

Observations:
We observed Filebeat and Metricbeat in running state after restart under agent Logs tab.

Screenshot: [image omitted]

Hence closing this out.

Thanks
QAS

@amolnater-qasource
Author

Hi @michalpristas
We have observed an inconsistent Metricbeat Starting, Restarting and Crashed loop just after installing agent on 7.14.0 Snapshot Kibana cloud environment.

  • However, it resolves itself after a few minutes: the agent goes offline and then comes back online.
  • All the binaries come back to the RUNNING state.

Build details:

Build: 42366
Commit: 22dee04008b9936be37225b97a6456e750d559a7
Artifact Link: https://snapshots.elastic.co/7.14.0-ef1f955b/downloads/beats/elastic-agent/elastic-agent-7.14.0-SNAPSHOT-windows-x86_64.zip

Logs:
logs.zip

Please let us know if anything else is required from our end.
cc: @EricDavisX
Thanks
QAS
