Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Elastic Agent does not open bootstrap port for Elastic Endpoint after reboot #21424

Closed
gogochan opened this issue Sep 30, 2020 · 20 comments
Closed
Assignees
Labels
Agent bug failed-test indicates a failed automation test relates v8.0.0

Comments

@gogochan
Copy link

Elastic Agent does not serve GRPC bootstrap info over TCP 6788 after reboot, or restart. This results in Policy failure for the Endpoint because it cannot establish connection

For confirmed bugs, please report:

  1. Create a Policy on Fleet with Elastic Security.
  2. Deploy an Elastic Agent with the policy created from step 1 to a Windows 10 Machine. We used https://snapshots.elastic.co/8.0.0-7e40d4e0/downloads/beats/elastic-agent/elastic-agent-8.0.0-SNAPSHOT-windows-x86_64.zip
  3. Observe that port 6788 is open NETSTAT.EXE -an | grep 6788 from a Command.exe
  4. Wait until Elastic Endpoint is installed and is started
  5. Reboot the Windows machine
  6. Notice that Agent is no longer listening on port 6788 and Policy is failed due to Agent connectivity.
    image
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Sep 30, 2020
@gogochan
Copy link
Author

FYI @paul-tavares

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ingest-management (Team:Ingest Management)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Sep 30, 2020
@ph
Copy link
Contributor

ph commented Sep 30, 2020

@gogochan is elastic agent started? how it was installed? @EricDavisX I think we have covered that case in our "install/uninstall" test cases?

@gogochan
Copy link
Author

@ph yes, it did run and makes connection to the Fleet. However it didn't open port 6788, until we make modification to the configuration.

@ph
Copy link
Contributor

ph commented Sep 30, 2020

OK, so it's installed, it's come back up when the computer restart. But it doesn't open the port. This is odd because, I presume the local configuration "persisted" to disk by the Elastic Agent would tell him to have endpoint running.

@blakerouse Can you take a look?

@ph ph assigned blakerouse and unassigned EricDavisX Sep 30, 2020
@blakerouse
Copy link
Contributor

I think this could be because there was an issue with Dynamic Inputs that broke saving the inputs into the action_store.yml. With no inputs in the action_store.yml then on restart the Elastic Agent would not think Endpoint should be running so port 6788 would not be open.

This was fixed in #21298. Can you confirm that your build includes that PR?

@EricDavisX
Copy link
Contributor

@ph to your question, we have re-start of agent tested in e2e-testing, but it doesn't have endpoint in the config at the time. bummer. but, we can do that - should it be a separate test tho? or should we update all relevant test cases to also have Endpoint enabled in policy and test for the related expectations? If its the latter it will have impact on the nomenclature and layout of the tests (I bet the Robots team will insist on keeping it in top-notch logical layout). The one-off change is easily doable, it just needs someone's time to get done, too.

here is the line: https://github.com/elastic/e2e-testing/blob/master/e2e/_suites/ingest-manager/features/fleet_mode_agent.feature#L37

I've logged this ticket for us to improve this with priority:
elastic/e2e-testing#336

@ph
Copy link
Contributor

ph commented Sep 30, 2020

@EricDavisX Yes probably a separate test seems the simplest route?

@blakerouse
Copy link
Contributor

Is this still an issue or was it fixed by #21298?

@gogochan
Copy link
Author

gogochan commented Oct 5, 2020

I am not able to validate this as I cannot spawn an instance of 8.0 at the moment.

@gogochan
Copy link
Author

gogochan commented Oct 6, 2020

Validated using the latest_snapshot https://snapshots.elastic.co/8.0.0-af3cda3c/downloads/beats/elastic-agent/elastic-agent-8.0.0-SNAPSHOT-windows-x86_64.zip

Other than Agent doesn't restart on system reboot, if started manually, the Agent does serve GRPC bootstrap information via 6788

Closing this.

@gogochan gogochan closed this as completed Oct 6, 2020
@EricDavisX EricDavisX added the failed-test indicates a failed automation test relates label Oct 6, 2020
@EricDavisX EricDavisX reopened this Oct 6, 2020
@EricDavisX
Copy link
Contributor

I don't doubt that the manual start of Endpoint works as Chan notes - but I am seeing this with a linux endpoint on latest code. The host has Endpoint up and running and after it is re-booted it throws errors in the Agent log and has 'Agent Connectivity' errors in the Endpoint side log:

Screen Shot 2020-10-06 at 4 01 09 PM

Screen Shot 2020-10-06 at 3 59 35 PM

I think we urgently need to pair program this on Agent + Endpoint. @ph and @ferullo - @gogochan and @blakerouse are you free to find an environment and check it out? I have one now and can give logs if helpful...

@EricDavisX
Copy link
Contributor

logs attached
agent-endpoint-reboot-test.txt

hash of agent is: af3cda3c from today’s *just finished build artifacts here
Kibana info is$ git show -s 6f983728d7f8c2cf065a6d5099157a5cfdc3cd08
Date: Tue Oct 6 09:46:56 2020 +0300

installed it with 'install' command and while it took 5 mins for it to come on line (not reflected in logs) it did eventually and then gave the 1 minute check-in calls successfully. i wanted to see that before I rebooted it.

@blakerouse
Copy link
Contributor

@EricDavisX Based on the error reported in the screenshot, it seems that you might actually have 2 Elastic Agents installed and running?

Are you sure you don't have both the *.deb and the install based installation running? From the log message it shows bind: address is already in use. So that means that Elastic Agent could not open that port for Elastic Endpoint to connect back.

@EricDavisX
Copy link
Contributor

EricDavisX commented Oct 6, 2020

it was online happily before I ran it. and i captured the ps ax output as such:
[zeus@mainqa-atlcolo-10-0-6-147 elastic-agent-8.0.0-SNAPSHOT-linux-x86_64]$ ps ax | grep elastic
23306 ? Ssl 0:03 elastic-agent
23419 ? Ssl 0:01 /opt/Elastic/Endpoint/elastic-endpoint run
23463 pts/0 S+ 0:00 grep --color=auto elastic

lets pair up and figure out and post back what we find. its a long enough thread already, lol

@EricDavisX
Copy link
Contributor

we decided we think this is fixed, and are going to re-test with a full 'regular' build of snapshot that has the needed .asc files tomorrow.

@blakerouse
Copy link
Contributor

@EricDavisX have you been able to test 7.10.0 BC1 to see if this is fixed? I have not been able to reproduce it on my Windows testing.

@EricDavisX
Copy link
Contributor

have not tested 7.10 BC yet - I saw a good report that 7.10 Agent tests were all passing tho... so its likely fixed. let me review the specific vm / test case later on 7.10 and 8.0 both I guess, and we can close it out.

@ph ph assigned EricDavisX and unassigned blakerouse Oct 13, 2020
@ph
Copy link
Contributor

ph commented Oct 13, 2020

reassigned to @EricDavisX send it back to us if its still an issue.

@EricDavisX
Copy link
Contributor

EricDavisX commented Oct 13, 2020

it is not reproducible as noted here with the 7.10 BC1 build of Agent - i have other issues, but this is fixed. closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Agent bug failed-test indicates a failed automation test relates v8.0.0
Projects
None yet
Development

No branches or pull requests

6 participants