[Feature Request] Have Elastic Agent send a final message to its fleet server when making changes #484
@aarju could you please elaborate on the issue with an example workflow that created the problem for you (and indeed whether this is a bug that needs to be addressed)? I don't see why, say on an enroll, we should be sending a "final" message to the current fleet server.

There seem to be a lot of duplicate agents in the screenshot attached. Could you explain the workflow that got you there - perhaps that is the issue we need to address.

We are in the process of enhancing the status reporting on the agent, including when the integrations have an issue and are not operating in a healthy status. Once this is available, users will be able to build alerts when the status on the agent changes. That should address the actual concern raised here regarding notifications on status change.
@nimarezainia The workflow that resulted in these duplicate agents came from testing how our Jamf script for installing agent handles things like agents that are already installed but talking to a different fleet server, agents that are installed but unhealthy, and agents that are running an older version than the currently deployed version. Since elastic-agent deploys our Endpoint Security for workstations, we decided that it was more important to guarantee that the agent is running, and we can deal with the duplicate agents. Because we are aggressive about making sure a healthy agent is running, we see the occasional duplicate agent due to a reinstall or re-enroll.

From a security point of view, the scenario that is most concerning to me is an insider threat or a hacker with admin rights using the `elastic-agent enroll` command to bypass all of our XDR protections and logging. The hacker could enroll the agent in their own fleet server with all protections disabled, do anything they want, and we would never know that it had happened. The Jamf script to deploy agent runs every 24h, so it would be a considerable blind spot.
What would be nice to have is:
@aarju it is correct that upon a new enrollment we consider that a new agent instance and treat it as such.

> from a security point of view the scenario that is most concerning to me is the insider threat or hacker with admin rights using the elastic-agent enroll command to bypass all of our XDR protections and logging. The hacker could enroll the agent in their own fleet server with all protections disabled, do anything they want, and we would never know that it had happened. The jamf script to deploy agent runs every 24h so it would be a considerable blind spot.

How would the hacker enroll the agent into their own Fleet Server? The bigger issue worth addressing is how the hacker obtains their own fleet server that is connected to ES/Kibana (they need service tokens, the right creds, etc.). If there's a loophole here we should address it.
My big concern is that fleet is not aware of commands and actions that happen on the agent if those actions cause the agent to stop communicating with fleet. There are a lot of talks and blog posts from security conferences where red teams and hackers trade notes on how to disable various XDR/EDR agents in order to cover their tracks. For example, last month in Munich a talk was given about various complex ways to disable EDR after popping a shell on a host. With elastic agent you don't need any of those complex techniques; you just need to enroll it in your own fleet server or ask it nicely to uninstall, and the defenders will never know.
When I say their own fleet server, I mean the hacker creates their own completely separate Elastic stack that they control and then has elastic agent connect to their stack. They could even create a free trial account in Elastic Cloud to use for their hacks.
Great idea, @aarju. To add some additional context, we ran this exact scenario during a recent ON Week in Protections. In a lab environment, we used a phishing email with a macro-enabled Word document to run the `elastic-agent enroll` command.
@aarju do you still see this issue as a concern given we have tamper protection available on the agent? There should be no unauthorized manipulation of the agent. Currently there's no way to differentiate between agents that were moved from one Fleet to another due to malicious activity and those which have legitimately gone offline (say someone going on holiday for 2 weeks). The crux of this issue is to identify agents that are illegitimately uninstalled or enrolled into a different Fleet. Tamper protection somewhat protects us from that event.
There are still scenarios in which an Agent might be uninstalled, even with tamper protection enabled. For example, an IT department might want to enable folks to do a (re)install of Agent for whatever reason (through Jamf or Intune). It would still be nice to know when this happens from a Fleet perspective. I could also imagine that not all our customers can or want to enable tamper protection; they would still benefit from these messages.
This should now be prevented with tamper protection.
Given the compliance requirement, I assume this can't be best effort. That is, if agent is uninstalled, it must notify Fleet Server. It can't send a message which may or may not be processed by Fleet before the uninstall happens. In the case of uninstall, we would have to have a mode that prevents uninstall until Fleet has been notified that the uninstall has completed. This would prevent uninstall unless the target machine is online.

We have talked within the team about creating a separate watchdog service to monitor agent, like a permanently running instance of our upgrade watcher, but with Fleet connectivity. I think this would be what we'd need to solve this correctly, essentially making a separate auditbeat instance a mandatory part of the installation. In this type of solution the primary agent process would get uninstalled as requested, but the watchdog wouldn't remove itself until it had notified Fleet about what is happening.

For unenroll it is a bit easier, as the agent is still going to be running. It can just keep attempting to notify the old cluster in the background.

The suggestions around de-duplicating agents re-enrolled to the same cluster are a separate and less complex problem to solve. I'd create a separate issue just for that, as the solution there doesn't overlap with the idea of a final message at all. Potentially it can be solved in Fleet if we fingerprint the agent host machine and start keeping a history of what the agent did on that machine, or grouping agents we think are the same machine together.
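The "block uninstall until Fleet has been notified" behaviour described above can be sketched as a retry loop. This is an illustrative Python sketch, not actual elastic-agent code; `deliver_final_message` and its parameters are assumptions:

```python
import time

def deliver_final_message(send, message, max_attempts=5, base_delay=1.0,
                          sleep=time.sleep):
    """Try to deliver a final 'uninstalling' notification to Fleet.

    `send` is any callable that returns True on an acknowledged delivery.
    Retries with exponential backoff and returns True only once Fleet has
    acknowledged, so a caller in the guaranteed mode can block uninstall
    until then.
    """
    for attempt in range(max_attempts):
        if send(message):
            return True
        sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return False  # caller would persist the message for a watchdog to retry
```

In the guaranteed ("at least once") variant, uninstall proceeds only on `True`; in the best-effort variant discussed later in the thread, the return value is simply logged.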
Are tamper protection events logged anywhere?
Things have changed slightly since this issue was created. Those duplicates will show as offline and then become inactive (the user can adjust the timer, but I believe the default is 7 days). Inactive agents are not shown in the default view. We have an open issue to create another timer to automatically wipe clean the inactive agents if the user wishes.
@nfritts @roxana-gheorghe would you know ^^ ?
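The offline/inactive transition mentioned a couple of comments up can be sketched as a status derivation from the last successful checkin. This is an illustrative Python sketch; the 5-minute offline threshold is an assumption, while the 7-day inactivity default comes from the comment above:

```python
from datetime import datetime, timedelta

def derive_status(last_checkin, now,
                  offline_after=timedelta(minutes=5),
                  inactive_after=timedelta(days=7)):
    """Derive a Fleet-UI-style status from the last checkin timestamp.

    Agents past the inactivity timer are hidden from the default view;
    the timer is user-adjustable in the real Fleet settings.
    """
    age = now - last_checkin
    if age >= inactive_after:
        return "inactive"
    if age >= offline_after:
        return "offline"
    return "online"
```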
@cmacknz Can the check-in message be utilized for this function? A general-purpose section in the check-in that would require an acknowledgment from Fleet. Most of the time there won't be anything in there; in this case Agent would wait until the ack is received from Fleet before proceeding.
We could maybe reuse the checkin message to do the exchange, but the core problem with sending a final message reliably is the uninstall case. Once uninstalled, there is nothing left to do the check-in. I don't like blocking uninstall until you check in with Fleet one last time as a solution, because it creates the potential for accidentally unremovable agents. There are ways to deal with this; they are just more complicated.
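The "checkin section that requires an ack" idea above can be sketched like this. The field names (`pending_event`, `acked_event`) are invented for illustration and are not part of the real Fleet checkin API:

```python
def build_checkin(agent_id, pending_event=None):
    """Build a checkin body; most of the time pending_event is empty."""
    body = {"agent_id": agent_id, "status": "online"}
    if pending_event:
        body["pending_event"] = pending_event  # e.g. {"id": "...", "type": "unenroll"}
    return body

def handle_checkin_response(response, pending_event):
    """Return None once Fleet has acked the event, else keep it pending."""
    if pending_event and response.get("acked_event") == pending_event.get("id"):
        return None
    return pending_event
```

The agent would keep resending `pending_event` on every checkin until `handle_checkin_response` returns `None`, which is the behaviour that cannot work for uninstall (no process left to retry).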
From what I understand of this issue, we want to nudge fleet to unenroll an agent (either on an uninstall or on an enroll into a different fleet).
If we don't want to block on an uninstall, then I think the nudge to fleet-server should make fleet-server mark the agent as unenrolled. If we want to try to get all data sent from the agent, the nudge can instead result in an unenroll action once the data has been delivered. I think it's better to introduce a new endpoint for this.
From the description:
It's not about nudging; it's about alerting when an agent is uninstalled or unenrolled, which is required for compliance with certain standards organizations. Since this is tied to compliance, it can't be best effort; it has to have "at least once" guarantees. The feature can't be "maybe send an alert to Fleet", it has to be "always send an alert to Fleet". This is where the complexity comes from.
OK, I think then as a start the uninstall command should notify fleet-server.

If the agent is running, do we want to send an unenroll notification as well?

If we wanted to add an escape hatch for the user to uninstall the agent with a single command (and not manually remove the agent's install dir, de-register services, etc.), we could introduce a new flag for the uninstall command.

I'll start working on an RFC for this.
From a compliance perspective this doesn't work. Users that want this feature will not want it to be trivially disabled. One of the use cases here is that the end user wants a notification when a user with root privileges removes agent from their machine. The best example of this would be an Elastic engineer temporarily removing the InfoSec-managed agent from their machine; InfoSec will want a notification that this happened.

It is probably best to think of this as an optional feature of the agent policy, similar to tamper protection. Potentially it should be part of tamper protection.

You will need to handle the case where we want to uninstall, but the network is down. A user should not be able to bypass the audit notification or hang the uninstall of the agent by temporarily turning off wifi on their machine.
We've started designing how we could guarantee that you get a notification, and it isn't simple to do. We were wondering if we could change the solution to instead remove the need to guarantee a notification.

@aarju If we supported tamper protection of the elastic-agent process, and tamper protected the enroll command, so that only trusted users could perform the enroll or uninstall operations, would the need for a notification still exist? Within Elastic, this means users with root or admin privileges on their machines would no longer be able to manipulate the InfoSec Elastic Agent at all without the uninstall token. Today we only tamper protect the endpoint-security process.

To implement a reliable notification for uninstall or enroll, we would likely have to add yet another service whose job is to perform these notifications, but since that service would not be tamper protected, a privileged enough user could just stop it if they knew it existed.
@cmacknz I think that adding tamper protection to the agent process would be a good solution. I still think a 'best effort' final log event letting us know that an enroll or uninstall command was run would be nice, but then it wouldn't have to be 'guaranteed' to meet the regulatory requirements. A process that follows the normal logging path prior to running the enroll or uninstall commands would be good enough.
Thanks, being able to relax the notification to best effort simplifies the implementation significantly. We should be able to do that along with tamper protecting uninstall+enroll for agent itself.
Thanks, @michel-laterman, for driving the tech definition for this issue along with inputs from the Endpoint team and other engineers. And thank you for updating this issue's description with a task list of concrete next steps.

@nimarezainia I'm re-assigning this issue to you for product prioritization, presumably in consultation with the Endpoint team, as there's some work to be done on their end as well.
@ycombinator I don't think the prioritization changes in this case. This issue as it stands should be addressed in some fashion. Expanding tamper protection was one idea, but it seems to be a lot more risky and a lot more involved. Aside from tamper protection, I think a best-effort, non-blocking log message to indicate the agent was being uninstalled or unenrolled to a new fleet server would be a good approach here. Let me know if that makes sense.
@intxgo and I had a meeting to discuss concerns about the proposal. |
This issue is quite old. I totally agree with addressing the problem, but I don't necessarily agree with the proposed solution, as a lot has changed since then. To begin with, it seems the problem does not exist when Agent is unenrolled from Kibana, only when that's done on the endpoint from the command line. Here we have the following situations:
Case 1.
Case 2.
Case 3.
Agent should always check with Endpoint first (regardless of whether tamper protection is enabled).

Nonetheless, Agent is completely vulnerable to an admin, so we can easily have a situation where it's gone, removed by a malicious actor, leaving Endpoint orphaned (with no link to Kibana). In this case, the malicious actor should not be able to tamper with Endpoint's config due to the policy signature. Their fake stack won't have the same key to sign the policy and/or actions. In short, Endpoint itself should not be taken over; we should address any security bug here. Even right now, if Agent gets abruptly re-enrolled to a different stack, Endpoint keeps protecting the host with the last config, as the new stack can't override the policy through Agent.

Summary

In my opinion, a notification about disconnecting Agent from the current fleet from the command line (by uninstall, enroll, etc.) is a nice addition but does not require special hacks*, as Agent is totally vulnerable to admin. Tamper protection ensures Endpoint can't be uninstalled nor trivially taken over by an admin, but Agent is the weak link in the chain, so there should be a backup direct channel between Fleet <-> Endpoint to show in Kibana that some Endpoints are still enrolled to the current fleet but are detached.

*hacks - because the Endpoint installer can't see whether it's being invoked by Agent via an action originating from the stack or from the command line, and it does not support force flags. Adding new parameters here would only complicate the already very complex tamper protection flow, see https://github.com/elastic/endpoint-dev/pull/14268. I hope it's clear why I do not recommend complicating it further.
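The policy-signature check described above (a fake stack can't sign a policy Endpoint will accept) can be illustrated as follows. This sketch uses HMAC purely as a stdlib stand-in; the point is only that the attacker's stack lacks the signing key, so its policies and actions fail verification and Endpoint keeps its last good config:

```python
import hmac
import hashlib
import json

def sign_policy(policy: dict, key: bytes) -> str:
    """Produce a signature over a canonical serialization of the policy."""
    payload = json.dumps(policy, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def accept_policy(policy: dict, signature: str, key: bytes) -> bool:
    """Endpoint applies a new policy only if the signature verifies;
    otherwise it keeps protecting the host with the previous config."""
    return hmac.compare_digest(sign_policy(policy, key), signature)
```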
I've added task lists to the issue description. For the basic implementation of this issue, we'll add the new API to fleet-server and call it from the elastic-agent on uninstall (best effort with limited retries), after component uninstallation is successful. For detecting orphaned Endpoints we can take one of two approaches.
The other tasks from the RFC mainly involve Endpoint.

cc @kpollich
@ycombinator and I had a brief discussion; we're leaning towards option 2 in my comment above. I have a PR up to add the new endpoint to fleet-server. In order to support the second approach, we will need to change the behaviour of the checkin API to remove the recorded audit state when an agent checks in again.

Does anyone have objections to this approach?
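The behaviour being proposed (the new endpoint records why an agent went away; a later checkin clears it) can be sketched like this. Field names such as `audit_unenroll_reason` are illustrative assumptions about the fleet-server agent document, not a confirmed schema:

```python
import time

AGENTS = {}  # in-memory stand-in for the fleet-server agents index

def audit_unenroll(agent_id, reason, now=time.time):
    """Record why an agent went away, e.g. 'uninstall' or 'orphaned'."""
    agent = AGENTS.setdefault(agent_id, {})
    agent["audit_unenroll_reason"] = reason
    agent["audit_unenroll_time"] = now()

def checkin(agent_id, now=time.time):
    """A successful checkin means the agent is alive again, so the
    previously recorded audit state is cleared."""
    agent = AGENTS.setdefault(agent_id, {})
    agent.pop("audit_unenroll_reason", None)
    agent.pop("audit_unenroll_time", None)
    agent["last_checkin"] = now()
```

The UI question later in the thread then reduces to surfacing `audit_unenroll_reason` (and time since `last_checkin`) on the agent details page.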
Orphaned Endpoint is not a normal state, so I'm leaning towards explicitly using the new API.
To show on the UI the seconds since the last checkin.
@michel-laterman Looking at the Tasks list in this issue's description:
Does #5302 cover both these items?
Yes, I've checked both of them off.
Thanks @michel-laterman. Looks like the next couple of tasks are on the UI or Endpoint side:
I'll let @kpollich or @nfritts weigh in on prioritization. Meanwhile, on our end, we should take on as the next step:
I've converted this task to its own issue — https://github.com/elastic/elastic-agent/issues/5703 — and assigned it to you for our next sprint. Removing your assignment from this issue. Thanks for the great work so far!
I think the plan is that Endpoint will use the new API; @intxgo correct me if I'm wrong.
Do we have work tracked to show the unenrollment/uninstall reason in the Fleet UI yet? If not, we should; otherwise this is effectively a hidden feature.
Yes, it was also bothering me that it's only "audit" for now, but at least it's some progress. I would love to see an Agent status of "Orphaned" instead of "Offline" if merely the Elastic Agent is nuked (not working) on the machine.
Created elastic/kibana#197731 to track the UI work. |
Describe the enhancement:
When the `elastic-agent enroll` or the `elastic-agent uninstall` commands are run, the binary should send a final message to the current fleet server before enrolling with the new fleet server. The fleet server can use this message to change the status of the agent in fleet and to notify admins if an agent was unexpectedly uninstalled. Currently, when the agent makes changes, the Fleet server is unaware of them, and this results in lots of identical agents that are no longer active, with admins not knowing which one is still the active agent.

Describe a specific use case for the enhancement or feature:
One of the governance requirements for multiple compliance frameworks, such as FedRAMP or PCI, is that we have to have alerting in place for when an endpoint security agent stops running. This feature would help bring Elastic Agent into compliance without the need for a separate auditbeat process to monitor for the agent's removal.
This would also help keep fleet servers clean in a devops environment where agents are managed via code.
Link to RFC
https://docs.google.com/document/d/1gYbsGfvjc7NhbURwYNqEl25ouar81nZ_8bkpsi0Dc6Y/edit
Tasks
Future Work