[Agent] "Metricbeat" service crashed on restarting elastic-agent, want improved beat side logging to inform user #25785

amolnater-qasource · 2021-05-19T11:44:36Z

Kibana version: 7.13.0 BC-7 Kibana self-managed environment

Host OS and Browser version: Windows 10 x64, All

Build Details:

 Artifact link used: https://staging.elastic.co/7.13.0-8eb98cbf/summary-7.13.0.html
 BUILD: 40864
 COMMIT: 6ce6847436ff9bef0ad91268b6585e0f9339c9fd

Preconditions:

7.13.0 BC-7 self-managed Kibana environment should be available.
Windows 10 x64 Fleet Server agent must be installed using Default Fleet server policy having only Fleet Server integration.

Steps to reproduce:

Login to Kibana environment.
Assign "Agent" to Default policy having System, Endpoint and Fleet Server Policy.
Restart elastic-agent from services.
Navigate to agent "Logs" tab and observe Metricbeat service stuck in "Starting-Restarting-Crashed" loop.

Expected Result:
[Self managed]: "Metricbeat" service should be in "Running" state on restarting elastic-agent.

Logs:
logs.zip

Screenshot:

elasticmachine · 2021-05-19T11:45:12Z

Pinging @elastic/fleet (Team:Fleet)

amolnater-qasource · 2021-05-19T11:47:35Z

@dikshachauhan-qasource Please review.

dikshachauhan-qasource · 2021-05-19T11:48:54Z

Reviewed and assigned to @EricDavisX

elasticmachine · 2021-05-20T17:31:24Z

Pinging @elastic/agent (Team:Agent)

EricDavisX · 2021-05-20T17:41:03Z

@ph @ruflin this seems like the root cause of the issue I just pinged you on. It also sounds familiar and pretty severe. Isn't this a blocker for self-managed, where we expect the stack / agents to go on/off line quite a bit?

michalpristas · 2021-05-24T09:59:52Z

Cannot reproduce on either darwin or Windows, using the same version.

Could I get the contents of the data/elastic-agent-{hash}/logs directory? That would help.

michalpristas · 2021-05-24T10:01:24Z

Also, if possible, the whole Program Files/Elastic/Agent directory would be good. I'm not sure whether GitHub allows sending a file that large; if not, please DM me on Slack.

amolnater-qasource · 2021-05-24T10:36:43Z

Sure @michalpristas, as requested I will share all the required details on Slack.
Thanks

michalpristas · 2021-05-24T10:54:01Z

There was an orphaned metricbeat process blocking the data path. It was not stopped when the agent was stopped and continued running, eating a lot of RAM (2.5 GB).

When I killed this process the issue went away.
The problem with the restart was that the orphaned metricbeat was using the same data path as the one we were trying to start. The new one found that the path was already locked and exited with an error.

We should find out why the orphaned metricbeat was created in the first place, and we could possibly isolate data paths more. But I think it's better to learn that you have such a process by seeing the agent complain than to have multiple beats running and eating resources.
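
To illustrate the failure mode described above, here is a minimal sketch of a data-path lock that fails with a descriptive message when a second process tries to start on the same path. It is not the Beats implementation: the function name, the lock file name `example.lock`, and the error wording are all hypothetical, and exclusive file creation stands in for whatever locking Beats actually uses.

```python
import os


class DataPathLockedError(RuntimeError):
    """Raised when another process already holds the data path lock."""


def acquire_data_path_lock(data_path: str, lock_name: str = "example.lock") -> str:
    """Create an exclusive lock file in data_path and return its path.

    The lock file name and the error text are assumptions for illustration only.
    """
    lock_file = os.path.join(data_path, lock_name)
    try:
        # O_EXCL makes the creation fail if the lock file already exists.
        fd = os.open(lock_file, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise DataPathLockedError(
            f"data path {data_path!r} is already locked by another process; "
            "check for an orphaned process that is still using it"
        ) from None
    with os.fdopen(fd, "w") as handle:
        # Record the owning PID so the lock holder can be identified later.
        handle.write(str(os.getpid()))
    return lock_file
```

A lock file created this way is left behind if its owner crashes, so a real implementation would more likely rely on an OS-level file lock that is released when the process exits. The point of the sketch is only that the error names the data path and hints at an orphaned process.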

EricDavisX · 2021-05-24T12:23:11Z

@michalpristas thanks for the thoughts. Do you think we can check for the data path being in use at startup and put more specific error logging in place to help guide users / devs about this particular undesired occurrence? We could use this ticket for that for now. @amolnater-qasource if you were doing regular operations / test steps, let's put some time into figuring out which specific steps caused this and log a new ticket. I expect the description would be 'after doing xyz Agent is shutdown but metricbeat is orphaned and still running'.

michalpristas · 2021-05-24T12:55:34Z

This is internal logic of beats, and I would like to avoid having checks like this in the agent.
We could check periodically, or at start, for processes running from our path before starting anything, and raise it as an ERROR or WARNING. But I guess this is up for discussion.

cc @urso
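
The "processes running from our path" check suggested above could be sketched roughly as follows. This is an assumption-laden illustration, not agent code: `find_processes_under` is a hypothetical helper, the caller is assumed to supply `(pid, executable path)` pairs from whatever process listing the platform offers, and the paths in the usage example are made up.

```python
from pathlib import PurePath
from typing import Iterable, List, Tuple


def find_processes_under(
    root: str, processes: Iterable[Tuple[int, str]]
) -> List[Tuple[int, str]]:
    """Return the (pid, exe) pairs whose executable lives below root.

    Enumerating the running processes is platform specific and is left
    to the caller; this helper only does the path comparison.
    """
    root_path = PurePath(root)
    found = []
    for pid, exe in processes:
        # Compare whole path components, so "/opt/AgentX" does not match "/opt/Agent".
        if root_path in PurePath(exe).parents:
            found.append((pid, exe))
    return found


if __name__ == "__main__":
    # Hypothetical process list and install path, for illustration only.
    running = [(10, "/opt/Elastic/Agent/data/metricbeat"), (11, "/usr/bin/python3")]
    for pid, exe in find_processes_under("/opt/Elastic/Agent", running):
        print(f"WARNING: process {pid} is already running from {exe}")
```

Comparing path components rather than string prefixes avoids false matches on sibling directories whose names merely start with the install directory's name.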

amolnater-qasource · 2021-05-24T14:08:18Z

Hi @EricDavisX
As per your feedback above, we have reported a ticket with the required steps: #25829

Thanks
QAS

EricDavisX · 2021-05-24T16:33:56Z

I've updated the short description and modified the labels. Improving the logging is not urgent for 7.13. The other ticket noted above is high priority for self-managed usage and is tagged for 7.13.1 review (the usual urgent-issues review process).

amolnater-qasource · 2021-06-10T10:58:04Z

Hi @EricDavisX
We have revalidated this issue on 7.14.0 self-managed Snapshot Kibana and found it fixed.

Build Details:

Artifact link: https://snapshots.elastic.co/7.14.0-28665d9b/downloads/beats/elastic-agent/elastic-agent-7.14.0-SNAPSHOT-windows-x86_64.zip
Build: 41559
Commit: 9838db392e7fcfc12f004b68fb1b09739f131148

Steps followed:

Login to self-managed Kibana environment.
Install Agent with Default Fleet Server Policy.
Assign "Agent" to Default policy having System, Endpoint and Fleet Server Integration.
Restart elastic-agent from services.

Observations:
We observed Filebeat and Metricbeat in running state after restart under agent Logs tab.

Screenshot:

Hence closing this out.

Thanks
QAS

amolnater-qasource · 2021-07-08T07:50:55Z

Hi @michalpristas
We have observed an inconsistent Metricbeat Starting, Restarting and Crashed loop just after installing the agent on a 7.14.0 Snapshot Kibana cloud environment.

However, it resolves itself after a few minutes: the agent goes offline and then comes back online.
All the binaries come back in RUNNING state.

Build details:

Build: 42366
Commit: 22dee04008b9936be37225b97a6456e750d559a7
Artifact Link: https://snapshots.elastic.co/7.14.0-ef1f955b/downloads/beats/elastic-agent/elastic-agent-7.14.0-SNAPSHOT-windows-x86_64.zip

Logs:
logs.zip

Please let us know if anything else is required from our end.
cc: @EricDavisX
Thanks
QAS

amolnater-qasource assigned ghost May 19, 2021

amolnater-qasource added the bug label May 19, 2021

botelastic bot added the needs_team label May 19, 2021

amolnater-qasource added impact:high and removed needs_team labels May 19, 2021

botelastic bot added the needs_team label May 19, 2021

amolnater-qasource added Team:Fleet and removed needs_team labels May 19, 2021

amolnater-qasource assigned dikshachauhan-qasource and unassigned ghost May 19, 2021

dikshachauhan-qasource assigned EricDavisX and unassigned dikshachauhan-qasource May 19, 2021

EricDavisX mentioned this issue May 20, 2021

[Self managed]: elastic_agent.metricbeat/filebeat datastreams generated on installing fleet-server agent. elastic/fleet-server#376

Closed

EricDavisX removed their assignment May 20, 2021

EricDavisX added the Team:Elastic-Agent label May 20, 2021

EricDavisX removed the Team:Fleet label May 20, 2021

michalpristas self-assigned this May 24, 2021

michalpristas added the v7.13.0 label May 24, 2021

EricDavisX removed the impact:high and v7.13.0 labels May 24, 2021

amolnater-qasource closed this as completed Jun 10, 2021

amolnater-qasource added the v7.14.0 label Jun 10, 2021
