[Ingest-Management]: "Enable elastic security agent" page instead of host appears under "Administrator>Host" tab, when user first forcefully un-enroll the agent and then re-enrolled the agent from Fleet tab. #73272

Closed
ghost opened this issue Jul 27, 2020 · 22 comments
Labels
bug Fixes for quality problems that affect the customer experience Feature:Fleet Fleet team's agent central management project Team:Fleet Team label for Observability Data Collection Fleet team

Comments

@ghost

ghost commented Jul 27, 2020

Kibana version:
Kibana: 7.9 BC4

Elasticsearch version:
Elasticsearch: 7.9 BC4

Agent version:
Agent: 7.9 BC4

Browser version:
Windows 10, Chrome

Original install method (e.g. download page, yum, from source, etc.):
From 7.9 BC4

Description
[Ingest-Management]: The "Enable elastic security agent" page appears instead of the host under the "Administrator>Host" tab when the user first force-unenrolls the agent and then re-enrolls it from the Fleet tab.

Preconditions

  1. A Kibana 7.9 BC4 cloud environment should be available.
  2. The agent should be enrolled under the Fleet tab from the Endpoint Security app. [Note: the default config is now integrated with Endpoint Security.]

Steps to Reproduce

  1. Open the Kibana 7.9 BC4 cloud environment in a browser, then click the Ingest Manager>Fleet tab.
  2. Click the "Action>Unenroll" option next to the enrolled agent, then force-unenroll the same agent.
  3. Notice that the agent moves to the Inactive state.
  4. Re-enroll the same agent with the default config by running the enrollment string, including the token, from the Fleet section.
  5. Observe that the agent enrolls successfully under the Fleet tab.
  6. Navigate to the "Endpoint security>Administrator>Host" tab.
  7. Observe that the "Enable elastic security agent" page appears instead of the host under the "Administrator>Host" tab.

Test data
N/A

Impacted Test case id
N/A

Actual Result
The "Enable elastic security agent" page appears instead of the host under the "Administrator>Host" tab when the user first force-unenrolls the agent and then re-enrolls it from the Fleet tab.

Expected Result
The host should appear with Online status under the "Administrator>Host" tab when the user first force-unenrolls the agent and then re-enrolls it from the Fleet tab.

What's working
N/A

What's not working
N/A

Screenshot
Endpointsecurityagent

Logs
N/A

@ghost ghost self-assigned this Jul 27, 2020
@ghost
Author

ghost commented Jul 27, 2020

Please review the defect @rahulgupta-qasource

@ghost ghost added bug Fixes for quality problems that affect the customer experience impact:medium Addressing this issue will have a medium level of impact on the quality/strength of our product. Team:Fleet Team label for Observability Data Collection Fleet team labels Jul 27, 2020
@elasticmachine
Contributor

Pinging @elastic/ingest-management (Team:Ingest Management)

@ghost ghost added the Feature:Fleet Fleet team's agent central management project label Jul 27, 2020
@ghost

ghost commented Jul 27, 2020

Reviewed and assigned to @EricDavisX

@ghost ghost added the impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. label Jul 27, 2020
@ghost ghost assigned EricDavisX Jul 27, 2020
@ghost ghost removed the impact:medium Addressing this issue will have a medium level of impact on the quality/strength of our product. label Jul 27, 2020
@ph ph assigned ph and unassigned ph Jul 27, 2020
@EricDavisX
Contributor

I'm sorry this sat idle for so many days. Can you re-test on BC 6 (not BC 7), please? Specific fixes for unenrolling went into BC5 and BC6 that I hope will help here. If it still reproduces, please provide the browser dev console output so we can see which calls are made and whether any returned errors or strange responses.

@EricDavisX
Contributor

@rahulgupta-qasource can you take the re-test on this if you have time?

@EricDavisX EricDavisX removed their assignment Aug 10, 2020
@EricDavisX EricDavisX removed the impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. label Aug 10, 2020
@EricDavisX
Contributor

Please re-assign this to me when action is back on my side. I'm also removing the impact:high label; this feels more moderate to me, if it still amounts to Endpoint being installed.

Also, after reviewing more, I think @kevinlog should review the screenshots, as this might be on the Security App workflow side. Can you poke in, please?

@kevinlog
Contributor

@rahulgupta-qasource this will most likely happen when the Endpoint hasn't sent any documents to ES. Can you verify that the Endpoint is successfully stood up and communicating with ES in this scenario? I'll run through the scenario myself as well to see.

FYI @EricDavisX

@kevinlog
Contributor

kevinlog commented Aug 10, 2020

@EricDavisX @rahulgupta-qasource here's my test:

BC8 Stack and Agent/Endpoint

  1. Initial Enroll, Agent and Endpoint stand up as expected

Agent
image

Agent logs
image

Endpoint
image

  2. Unenroll and then forcefully unenroll:

Agent inactive
image

Endpoint gone (shows onboarding screen - potentially confusing, but expected for now):
image

  3. When trying to re-enroll, I noticed I was unable to because the config yml is still in use:

image

I think this is because the Endpoint is still running (I assume because the force-unenroll didn't fully send all the correct messages).

image

What has your experience been when re-enrolling after a forced un-enroll? Have you seen this case?

@EricDavisX
Contributor

I think it may genuinely just take a full minute or more for Endpoint to finish uninstalling and deleting files, so your results may be expected. If we had time estimates for the gap between your unenroll and the re-enroll attempt, that could help. But this is good evidence, I think, that it's working.

@kevinlog
Contributor

kevinlog commented Aug 10, 2020

@EricDavisX thanks for the insight.

I went ahead and manually uninstalled the Endpoint, re-enrolled the Agent and everything is back up and running again.

image

image

@kevinlog
Contributor

kevinlog commented Aug 10, 2020

@EricDavisX @rahulgupta-qasource after running through unenroll + force unenroll again, I'm seeing that the Endpoint is not being stopped after 15 or so minutes. I'm not sure what's the expected behavior here.

Note that I just ran the Agent from the cmd, I didn't install the service on Windows. I'm not sure if that makes a difference in force unenroll

FYI @ph @ruflin @blakerouse

@ruflin
Contributor

ruflin commented Aug 11, 2020

@kevinlog Do you by chance have any log files from Agent / Endpoint so we can see what is happening there?

@kevinlog
Contributor

@ruflin here are the Endpoint logs.
endpoint-000000.log

Here are the Agent logs (I zipped the entire folder)
logs.zip

I'm not seeing anything in the Endpoint logs about receiving a "stop", etc. Although, I'm not quite sure what that would look like. FYI @ferullo

@kevinlog
Contributor

@EricDavisX @ruflin sorry to spam you - but as I was collecting the logs above, the Endpoint did finally stop running, but it took about 30 min after I force unenrolled the Agent. So it seems like it is working, it just takes significantly longer than when you unenroll normally.

image

@EricDavisX
Contributor

EricDavisX commented Aug 11, 2020

Endpoint stopping after 30 mins seems like an Endpoint-side feature: it hadn't heard from Agent in 30 mins, so it shut itself down. With the logs, we can hopefully track what Agent did and didn't send prior to that, so we can find where there may be an Agent/Endpoint integration bug.

@ferullo
Contributor

ferullo commented Aug 11, 2020

> Endpoint stopping after 30 mins seems like an Endpoint side feature, that it hadn't heard from Agent in 30 mins so it shut itself down

Endpoint does not have this feature.

@gogochan can you help with any Endpoint coordination needed for this.

@ruflin
Contributor

ruflin commented Aug 12, 2020

@michalpristas @blakerouse Would be great to get your eyes on this when you are back (both are out at the moment).
@gogochan Let us know what you find.

@gogochan

gogochan commented Aug 12, 2020

It seems that Endpoint is not able to write documents to Elasticsearch, as @kevinlog described. I see a 401 in the Endpoint log:

{"@timestamp":"2020-08-12T16:00:14.578506194Z","agent":{"id":"01375897-7434-42ed-b071-edde0d00199b","type":"endpoint"},"ecs":{"version":"1.5.0"},"log":{"level":"error","origin":{"file":{"line":243,"name":"Client.cpp"}}},"message":"Client.cpp:243 HTTP Status Code (401): {\"error\":{\"header\":{\"WWW-Authenticate\":[\"Bearer realm=\\\"security\\\"\",\"ApiKey\",\"Basic realm=\\\"security\\\" charset=\\\"UTF-8\\\"\"]},\"reason\":\"missing authentication credentials for REST request [/_cluster/health]\",\"root_cause\":[{\"header\":{\"WWW-Authenticate\":[\"Bearer realm=\\\"security\\\"\",\"ApiKey\",\"Basic realm=\\\"security\\\" charset=\\\"UTF-8\\\"\"]},\"reason\":\"missing authentication credentials for REST request [/_cluster/health]\",\"type\":\"security_exception\"}],\"type\":\"security_exception\"},\"status\":401}","process":{"pid":95533,"thread":{"id":95538}}}
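For anyone scanning a larger Endpoint log for the same symptom, here is a minimal Python sketch that pulls the HTTP status and the Elasticsearch error reason out of a line shaped like the one above. It assumes every relevant line is a single JSON object whose `message` field ends with `HTTP Status Code (NNN): <json body>`, as this one does; the function name `parse_http_error` and the trimmed sample line are illustrative only and are not part of Endpoint.

```python
import json
import re
from typing import Optional

# Matches the "HTTP Status Code (NNN): <json body>" suffix seen in the message field.
_STATUS_RE = re.compile(r"HTTP Status Code \((\d{3})\): (.*)$", re.DOTALL)


def parse_http_error(log_line: str) -> Optional[dict]:
    """Return status, error type and reason for an HTTP error log line, else None."""
    record = json.loads(log_line)
    match = _STATUS_RE.search(record.get("message", ""))
    if match is None:
        return None
    try:
        body = json.loads(match.group(2))
    except json.JSONDecodeError:
        body = {}  # body was not JSON; keep the status code only
    error = body.get("error", {}) if isinstance(body, dict) else {}
    return {
        "status": int(match.group(1)),
        "type": error.get("type"),
        "reason": error.get("reason"),
    }


# Trimmed stand-in for the log line above, built with json.dumps to avoid hand-escaping.
_body = {
    "error": {
        "reason": "missing authentication credentials for REST request [/_cluster/health]",
        "type": "security_exception",
    },
    "status": 401,
}
sample_line = json.dumps({
    "log": {"level": "error"},
    "message": "Client.cpp:243 HTTP Status Code (401): " + json.dumps(_body),
})

result = parse_http_error(sample_line)
print(result)
```

Filtering a log file for entries where `status` is 401 would show roughly when the credentials stopped being accepted.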

When a user clicks unenroll and then force-unenrolls, the Agent keeps running, along with Elastic Endpoint and Beats.

If the user comes back to the machine and re-enrolls the Agent, I suppose this process terminates the Agent from the previous enrollment but leaves Elastic Endpoint untouched.

I think this is where we have a potential problem. The token Endpoint received from the previous Agent is no longer valid; it needs to be reloaded.

@gogochan

gogochan commented Aug 12, 2020

Further investigation shows that upon force unenroll, the API token becomes invalid and Endpoint cannot create the necessary index in Elasticsearch.

We observed that Elastic Agent didn't send the new API token to Elastic Endpoint even after re-enrolling, leaving Elastic Endpoint with the old, invalid API token.

A workaround is to trigger a revision number change by modifying the configuration from Fleet.
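To make the failure mode easier to reason about, here is a toy Python model of it. It is not the Agent or Endpoint code: the `Agent` and `Endpoint` classes, the key names, and in particular the rule that config is only pushed when the revision number differs are assumptions inferred from the fact that bumping the revision works around the problem.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Endpoint:
    """Hypothetical stand-in for Elastic Endpoint's view of its config."""
    api_key: Optional[str] = None
    revision: Optional[int] = None

    def can_write(self, valid_keys: set) -> bool:
        # Elasticsearch only accepts keys that are still valid.
        return self.api_key in valid_keys


@dataclass
class Agent:
    """Hypothetical stand-in for Elastic Agent."""
    api_key: str
    revision: int

    def sync(self, endpoint: Endpoint) -> bool:
        # ASSUMPTION: config (including the API key) is pushed only on a revision change.
        if endpoint.revision == self.revision:
            return False
        endpoint.api_key = self.api_key
        endpoint.revision = self.revision
        return True


# Initial enrollment: Endpoint receives key-1 and can write.
valid_keys = {"key-1"}
endpoint = Endpoint()
Agent("key-1", revision=1).sync(endpoint)
works_initially = endpoint.can_write(valid_keys)

# Force unenroll invalidates key-1; re-enroll issues key-2 with the same config revision.
valid_keys = {"key-2"}
agent = Agent("key-2", revision=1)
pushed_after_reenroll = agent.sync(endpoint)
works_after_reenroll = endpoint.can_write(valid_keys)

# Workaround: editing the config in Fleet bumps the revision, which forces a push.
agent.revision += 1
pushed_after_bump = agent.sync(endpoint)
works_after_bump = endpoint.can_write(valid_keys)

print(works_initially, works_after_reenroll, works_after_bump)
```

Under these assumptions the stale key survives the re-enroll because nothing tells Endpoint that its config changed, which is consistent with the 401 responses seen in the log.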

@ghost ghost removed their assignment Aug 14, 2020
@EricDavisX
Contributor

I'm so pleased the team persisted and we found the bug to fix! Excellent work folks.

@kamalpreetpahwa-qasource @rahulgupta-qasource I think we should add some new content to the regression suite, I think there is actually a much larger matrix of state changes to cover than I realized.

I'd like to review with @gogochan @kevinlog and @blakerouse to see what we have automated and what we need to cover better manually until we have more automation around this. The 'timing' of when the user unenrolls and then possibly 'too quickly' clicks the force-unenroll is challenging, as we don't have much insight into it. I'd like to get some help drawing out a nicer state diagram to track what test cases there are, to start. Something like the below (but better):

Test content that covers a few scenarios, all starting from a known working, happy Endpoint/Agent state:

  1. setting up agent/endpoint and re-starting agent (without re-enrolling) and validate
  2. setting up agent/endpoint, unenrolling Agent and re-enrolling agent with the same config and starting and validate
  3. setting up agent/endpoint, unenrolling Agent and re-enrolling agent with a new config in same folder and starting.
  4. setting up agent/endpoint, unenrolling Agent and re-enrolling agent with a new config in a new folder and starting.
  5. repeating 2-4 with a 'force unenroll' directly after the (standard) unenroll call
  6. repeating 1-5 for Windows, Linux and macOS
  • Do we think this warrants testing on all the major OS types, or is the logic abstracted from the OS so that doing so is overkill? If we can prove it with code knowledge, we can save lots of time in future testing and automation around it.
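As a starting point for that tracking, the list above can be expanded into an explicit matrix. This is only a Python sketch: the scenario labels paraphrase items 1-4, and the dictionary keys and the `build_matrix` name are made up for illustration.

```python
from itertools import product

# Items 1-4 above, paraphrased.
BASE_SCENARIOS = {
    1: "restart agent without re-enrolling",
    2: "unenroll, re-enroll with the same config",
    3: "unenroll, re-enroll with a new config in the same folder",
    4: "unenroll, re-enroll with a new config in a new folder",
}
# Item 5: only scenarios 2-4 are repeated with a force unenroll.
FORCE_ELIGIBLE = (2, 3, 4)
# Item 6: every case is repeated per platform.
PLATFORMS = ("Windows", "Linux", "macOS")


def build_matrix():
    """Expand the scenario list into one test case per platform and variant."""
    per_platform = [(number, False) for number in BASE_SCENARIOS]
    per_platform += [(number, True) for number in FORCE_ELIGIBLE]
    return [
        {
            "platform": platform,
            "scenario": BASE_SCENARIOS[number],
            "force_unenroll": force,
        }
        for platform, (number, force) in product(PLATFORMS, per_platform)
    ]


matrix = build_matrix()
print(len(matrix))  # 7 cases per platform, 21 in total
```

If the unenroll logic turns out to be OS-independent, dropping entries from `PLATFORMS` shrinks the matrix without touching the scenarios.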

@ghost

ghost commented Aug 17, 2020

Hi @EricDavisX

Thank you for sharing the feedback.

We have validated this ticket and the above-mentioned scenarios on Windows 10, a Linux 'CentOS 7' VM, and macOS Mojave 10.14.1 in the Kibana BC9 cloud environment and found the issue fixed.

We executed the steps below to validate the ticket:

  1. Navigate to the Security app and enroll the Fleet Agent with 'Default config'. Note that the 'Elastic Endpoint Security' integration is now added under 'Default config'.
  2. Navigate to the Ingest Manager->Fleet tab and unenroll the agent.
  3. Force-unenroll the agent and observe that it moves to the inactive state.
  4. Re-enroll the same agent with 'Default config' by running the enrollment string from the Fleet section.
  5. After the agent is re-enrolled successfully under the Fleet tab, wait for some time (say 15-20 minutes) to let the Endpoint send documents to ES (as per @kevinlog's comment on #73272).
  6. Navigate to "Security->Administration" tab.
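If this validation is automated later, the fixed wait in step 5 could be replaced by polling with a timeout. Below is a minimal Python sketch of such a helper; `wait_until`, the `FakeClock` used for the dry run, and the `host_online` check are all hypothetical, and a real check would have to query whatever source backs the Administration page.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


class FakeClock:
    """Simulated clock so the example runs instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


fake = FakeClock()


def host_online() -> bool:
    # Hypothetical check: pretend the host shows up after 10 simulated minutes.
    return fake.now >= 600


appeared = wait_until(
    host_online, timeout_s=20 * 60, interval_s=30, clock=fake.time, sleep=fake.sleep
)
print(appeared, fake.now)
```

Injecting `clock` and `sleep` keeps the helper testable without actually waiting 20 minutes.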

Observation:
We observed that the host is displayed with Online status on the "Security->Administration" tab after unenrolling, force-unenrolling, and re-enrolling the agent with the Elastic Endpoint Security integration.

Screenshot:
#73272_Fix

Moreover, we have created 21 test cases for the above-mentioned scenarios (07 each for Windows, Linux, and Mac) and passed them under the 'Agent status on Unenroll, Force unenroll, Re-enroll and restarting' TestRun.

Hence, we are closing this bug.

@ghost ghost closed this as completed Aug 17, 2020
@ghost

ghost commented Aug 19, 2020

This issue was closed.

7 participants