[Fleet] Add fleet server URL #89442

Closed
mostlyjason opened this issue Jan 27, 2021 · 39 comments · Fixed by #94364
Labels
Feature:Fleet Fleet team's agent central management project Team:Fleet Team label for Observability Data Collection Fleet team

Comments

@mostlyjason
Contributor

mostlyjason commented Jan 27, 2021

Currently, our global output settings in Fleet list a Kibana URL. With the new Fleet server, we need a way for users to specify the fleet server URL.

Requirements

Updated 2021-03-10

Match current behavior for populating the URL
On ESS/ECE, the fleet server URL will be automatically populated by cloud to make it easy to get started. For self-managed clusters, users must manually populate this URL in Kibana's Fleet settings when they are setting up Fleet server. We decided not to attempt to magically fill in the IP/DNS name at this time because this is error-prone and opens up more complexity. Manual entry allows the user to confirm it is correct, assign a static URL, account for proxy servers, etc. This is the current behavior, so no change.

Match current behavior for multiple URLs
The user can set multiple fleet server URLs, which is also the current behavior. The first URL in the list is used to populate the add agent dialog. The Elastic Agent will connect to this URL when installing, then download the rest of the URLs in the agent policy. When a fleet server is added or removed, the agent policies are updated automatically. The Elastic Agent will iterate through URLs until it connects to one successfully. This allows for automatic failover and subnets.
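
A minimal sketch of the behavior described above, assuming a hypothetical synchronous `probe` callback that reports whether a URL accepts a connection. The function names and example hosts are illustrative and are not taken from the Fleet or Elastic Agent code.

```typescript
// Hypothetical callback: returns true when the agent can connect to the URL.
type Probe = (url: string) => boolean;

// The first URL in the list is the one used to populate the add agent dialog.
function enrollmentUrl(fleetServerUrls: string[]): string | undefined {
  return fleetServerUrls[0];
}

// Walk the list in order and return the first URL that connects,
// or undefined when none of them is reachable.
function firstReachable(fleetServerUrls: string[], probe: Probe): string | undefined {
  for (const url of fleetServerUrls) {
    if (probe(url)) {
      return url;
    }
  }
  return undefined;
}

// Example with made-up hosts: the first server is down, the second is up.
const exampleUrls = ["https://fleet-a.example:8220", "https://fleet-b.example:8220"];
const onlySecondIsUp: Probe = (url) => url === exampleUrls[1];
const chosenUrl = firstReachable(exampleUrls, onlySecondIsUp);
```

Keeping the enrollment URL and the failover walk as separate functions mirrors the split in the requirement: the dialog only ever needs the first entry, while a running agent may need the whole list.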

Add a confirmation dialog for changes
One new feature we'd like to add is a confirmation dialog when the user changes the Fleet server URL. If this field is incorrect, there is a risk that agents will lose connection to Fleet and will need to be manually re-enrolled. A confirmation dialog will explain these risks and allow the user to proceed if desired. We can use this dialog in both ESS/ECE and self-managed use cases. We can also add a section to our troubleshooting guide explaining how to reset the URL if needed, for both cloud and self-managed use cases. We will not completely lock the field or build cloud-specific logic at this time. If we see many users running into a problem, we can add more guardrails later.
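
As a rough sketch of when such a dialog might be triggered, the following compares the saved list with the edited one. The names `diffFleetServerUrls` and `needsConfirmation` are hypothetical, and treating a reorder as a change is an assumption (it follows from the first URL being the one used for enrollment).

```typescript
// What changed between the saved list and the edited list.
interface UrlListChange {
  added: string[];
  removed: string[];
}

// Compute added and removed URLs, e.g. to list them in the dialog text.
function diffFleetServerUrls(saved: string[], edited: string[]): UrlListChange {
  return {
    added: edited.filter((url) => saved.indexOf(url) === -1),
    removed: saved.filter((url) => edited.indexOf(url) === -1),
  };
}

// Assumption: any difference, including a reorder, should prompt the user,
// because the first URL is the one used for enrollment.
function needsConfirmation(saved: string[], edited: string[]): boolean {
  if (saved.length !== edited.length) {
    return true;
  }
  return saved.some((url, i) => url !== edited[i]);
}
```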

Stretch goal: add the same confirmation dialog for the ES output URL

@mostlyjason mostlyjason added Feature:Fleet Fleet team's agent central management project Team:Fleet Team label for Observability Data Collection Fleet team labels Jan 27, 2021
@elasticmachine
Contributor

Pinging @elastic/ingest-management (Team:Ingest Management)

@elasticmachine
Contributor

Pinging @elastic/fleet (Feature:Fleet)

@nchaulet
Member

Thanks for creating this one @mostlyjason. Should we support two fields in the settings, Kibana URL and Fleet Server URL, in 7.13 so we have a seamless transition for the user? A user with agents enrolled through Kibana could then set the Fleet Server URL to migrate them to Fleet Server.

@mostlyjason
Contributor Author

@nchaulet that sounds reasonable. I want to understand the process for migrating from kibana to fleet server better, but we can treat that as a separate issue.

@mostlyjason mostlyjason changed the title from "[Fleet] Rename Kibana URL" to "[Fleet] Add fleet server URL" Feb 3, 2021
@ruflin
Contributor

ruflin commented Feb 16, 2021

There is also the case where multiple fleet-servers potentially exist with different URLs. For example, one could be in Cloud and the other running on prem.

@mostlyjason
Contributor Author

@ruflin that is a good point! Are you thinking it could be per agent policy? Either way, it'd be nice to have a global default so new agent policies can be initialized automatically. We could start with that and come back to per-agent policy URLs later.

@ruflin
Contributor

ruflin commented Feb 16, 2021

I'm not sure I follow the above. The policy does not really matter here. An enrollment key can be used with any fleet server and all policies can be retrieved through all fleet-servers.

Let's take the current enrollment screen. Currently you can select the policy to enroll an Agent into. As an advanced option, you could also select the fleet-server you want to enroll into, so we can show you the correct enrollment command.

Screenshot 2021-02-16 at 20 22 40

Thinking of how the Kibana URL is set today, this would become more like a dropdown to select the default, and maybe to configure the correct host for each in case it is reported with the wrong value by fleet-server:

Screenshot 2021-02-16 at 20 24 12

@mostlyjason
Contributor Author

Interesting, the dropdown method is simple and could work as an MVP.

I can think of a few use cases where it'd be advantageous to add it to the agent policy. What if the user wants to take a fleet server out for maintenance or replacement? One option is to use a load balancer to fail over to another instance. If the user doesn't have a load balancer, they'd have to reenroll all the agents. The nice thing about storing the URLs in the agent policy is that we could centrally manage them without reenrolling the agents. To take a fleet server out for maintenance, just update the policy with a new URL. This is the same way the elasticsearch URL is updated. If we had a global setting for a fleet server (or a list of them), we could add it to agent policy along with the elasticsearch URL.

For a longer term use case, I imagine users wouldn't want to always default to the same fleet server in the add agent dialog because it could get overloaded. One solution is to randomize the default fleet server. Alternatively, if the operator set up agent policies for each geographic region, it could default to the matching fleet server for each region. The person adding agents doesn't need to understand geographic assignments, because there is a smart default. We may not even need a dropdown box, which would simplify the add agent dialog. Choosing a non-default fleet server could be an advanced use case.

This one is a stretch, but an even cooler long term use case is that the fleet server could be assigned via variable in the agent policy. That means the fleet server shown in the agent dialog is just for enrollment and retrieving the initial agent policy. Then it can be reassigned dynamically. This would be great if you want to have a single script to deploy agents across multiple regions. I don't think we need this near term, just thinking of another advantage of adding it to the agent policy.

Can you think of any downsides?

@ruflin
Contributor

ruflin commented Feb 18, 2021

For the maintenance part, I think it should be the same as with Elasticsearch where you can have an array of hosts (assuming no proxy exists). This is more for the on prem use case. So the Elastic-Agent itself will switch over to an alternative url if one is not available. But this makes the drop-down trickier.

We should probably discuss the use cases where users have fleet-servers in different regions and not all agents have access to all fleet-servers. I would assume at first, we can get away with just making it an array and the Elastic Agent will pick the "best" fleet server?

@mostlyjason
Contributor Author

@ruflin how would the elastic agent get an array of fleet server URLs? Would you pass an array in via command line parameter, or would you seed one value and deliver the rest via agent policy or an ES document?

Also to pick the "best" fleet server from an array, would it start with a random one and switch to the others if it can't connect?

@ruflin
Contributor

ruflin commented Feb 18, 2021

I would assume that the policy would contain a list, but I have not fully thought it through yet. I missed the detail around the command line. I'm thinking of the command line as the initial connection setup, but that fleet-server could disappear over time. So as soon as the initial setup is done, it should rely on the content of the policy. @blakerouse probably has more / better opinions here.

What best means is a tricky question, but round robin / random selection sounds like a good start.

@blakerouse

Elastic Agent already supports multiple Kibana URLs to connect to Fleet. The Kibana URLs are added to the policy, so if you add a new Kibana it updates the policy and the Elastic Agent gets that new URL. Elastic Agent will round-robin connect to the Kibanas in the policy.

To the Elastic Agent there is no difference between talking to Kibana or a Fleet Server. So the same code path, round-robin, and policy updating just work for the Fleet Server in Elastic Agent.

We probably need to add a new configuration key/values to the policy so we can migrate to the Fleet Server from an existing Kibana. With that new configuration the Elastic Agents can transition over to the newly deployed Fleet Servers.

For enrollment Elastic Agent only connects to the Kibana/Fleet Server that is provided in the --url parameter. Once connected and enrolled then it will perform round-robin on the list of Kibana/Fleet Server from the policy.
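
To make the two phases concrete, here is an illustrative sketch, not the Elastic Agent implementation: the agent starts with the single URL given at enrollment, then rotates over the hosts delivered in the policy. The class and method names are hypothetical, and keeping the current list when a policy carries no hosts is an assumption.

```typescript
// Illustrative only: tracks the host list an agent would rotate through.
class FleetHostRotation {
  private hosts: string[];
  private cursor = 0;

  // seedUrl stands in for the value passed via --url at enrollment.
  constructor(seedUrl: string) {
    this.hosts = [seedUrl];
  }

  // Called when a policy arrives; its host list replaces the current one.
  // Assumption: an empty list is ignored so the agent keeps its last hosts.
  applyPolicyHosts(policyHosts: string[]): void {
    if (policyHosts.length === 0) {
      return;
    }
    this.hosts = policyHosts.slice();
    this.cursor = 0;
  }

  // Return the next host in round-robin order.
  nextHost(): string | undefined {
    const host = this.hosts[this.cursor];
    this.cursor = (this.cursor + 1) % this.hosts.length;
    return host;
  }
}
```

A caller would construct the rotation with the enrollment URL, call `applyPolicyHosts` each time a new policy revision is received, and call `nextHost` before each connection attempt.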

@mostlyjason
Contributor Author

Thanks Blake! I'm glad it already works that way. Sorry for the noise, I should have seen it in the agent policy when I looked.

I just tested multiple URLs with an existing cluster. It looks like only 1 URL is shown in the add agent dialog. That's probably fine because it's just used for the initial connection setup. A dropdown could work for the use case where a user wants to choose a specific fleet server. Do we really need that for GA, or can we add it later? I'd lean towards waiting for customer feedback because the user can edit the value on the command line.

I assume if the first fleet server in the list is not accessible it will automatically iterate through the list until it can successfully connect to one? That will handle the use cases of taking a fleet server out for maintenance, too much load on one server, and some fleet servers not being reachable from all agents. It doesn't account for the use case of wanting to prefer specific fleet servers based on geography, but I don't think we need to handle that use case for GA.

> configure the correct host for each in case it is reported with the wrong value by fleet-server

Does the fleet server report its own URL? I assumed that cloud would pass in the proxy endpoint through kibana.yml and in self-managed clusters the user has to manually add the fleet server URLs either in kibana.yml or global settings? If the user knows the correct values, they can just enter them in the settings.

@ruflin
Contributor

ruflin commented Feb 19, 2021

Agree to keep it as simple / minimal as possible for the first version. As long as a user can edit these in Kibana, as they can today for the Kibana endpoint, I assume things should work? I think we should have it in Kibana like today and not move it to kibana.yml, as otherwise changes require a restart, which is not nice.

@hbharding
Contributor

Assuming we remove Kibana URL, we can replace it with "Fleet Server URL" and make it a combobox to support multiple URLs, like we do for the ES URL. For on prem users who have yet to set up Fleet Server, I imagine this field would appear empty. When Fleet Server(s) are added, I think this field should update to include the Fleet Server URLs automatically, otherwise the user would have to know to go here to add the URL manually.

(screenshot from 7.9, quickest I could find)
image

@mostlyjason
Contributor Author

mostlyjason commented Feb 22, 2021

> I think this field should update to include the Fleet Server URLs automatically, otherwise the user would have to know to go here to add the URL manually.

How much extra effort is this? The simple solution is that the Fleet server would attempt to print its own URL on the installer stdout, which the user could verify and copy into the Fleet settings screen. The disadvantage is that the user has to leave the onboarding flow for an integration to go to the Fleet settings page, paste in the URL, then come back to the integration onboarding flow again. If the server could notify Kibana of the new URL, we could detect it and populate the URL in the add agent dialog automatically. The user could verify that it's correct there when adding their agent.

@blakerouse

At the moment an enrolled Fleet Server places its IP addresses into the .fleet-servers index. The idea behind that was so Kibana could use that data to generate the correct URL for the server.

This might need to be extended more to add which IP address is attached to the interface that is used as the gateway for the machine. Giving that IP address precedence over the other IP addresses in the system.

@hbharding
Contributor

The way I read @blakerouse's comment is that Kibana already has access to the information, so it should be possible to populate this information into the "Fleet Server Url(s)" setting?

From issue description:

> With the new Fleet server, we need a way for users to specify the fleet server URL

@mostlyjason can you expand on why users should be able to specify the Fleet server URL? I'm probably missing something, but I think it makes sense for this field to be auto-populated (like it is today for Kibana URL) when Fleet Servers are added, both for cloud and self-managed.

As an aside, we should also provide better description about what these fields are used for and maybe even warn the user about making changes since these could have a big impact on their agents.

Pending the outcome (and if necessary), can you make a design issue that summarizes what needs to change? The idea of being able to specify a "default" fleet server sounded interesting.

@mostlyjason
Contributor Author

mostlyjason commented Feb 24, 2021

> can you expand on why users should be able to specify the Fleet server URL

I'm thinking of the use case where the fleet server is running behind a proxy, and the agent isn't able to determine its external URL. In this case, the auto-populated information would be incorrect. I don't see a parameter in Blake's install example that contains the external URL of Fleet server. That could be one way to populate it, the other being the UI. The advantage of exposing this in the UI is that it gives users visibility into which URLs are present in the list.

@blakerouse is there already a concept of a default fleet server in the index? That might be nice because the cluster may have fleet servers in private networks that are not accessible externally, so they wouldn't want to use it as a default server necessarily.

> This might need to be extended more to add which IP address is attached to the interface that is used as the gateway for the machine. Giving that IP address precedence over the other IP addresses in the system.

Good point! Sorry I didn't understand your comment initially, but I edited my comment and this is a good point. I think we can populate more than one interface if there are multiple. The other agents can iterate through the list to find one that routes.

@mostlyjason
Contributor Author

I'm thinking about the concept of auto-populating the fleet server URL/IP and I wonder if there are security concerns. If we populate a DHCP IP or URL, could it be assigned to a compromised fleet server later? Ideally, it would be a static IP or URL that is accessible from the endpoints, accounts for proxies, etc. @blakerouse do we ask for user confirmation before saving it to ES and provide a way for users to edit it if needed? That could be via a parameter or an interactive prompt.

@blakerouse

@mostlyjason At the moment we only populate the index with the list of IP addresses. I was currently leaving it up to Kibana to use that information to create the URL.

I could see it being an administrator's job to create the initial URL presented to the users, by selecting from the list of IPs or by providing a DNS address.

@hbharding
Contributor

I've been working on the Fleet Server onboarding UX #89396 and one of the last steps (for self-managed) is for the user to confirm the Fleet Server URL which will be used to enroll agents. I made a mockup for the Fleet Settings flyout and shared with @blakerouse and @nchaulet today. The idea is that after Fleet Server connects to ES, it will add its IP addresses to an editable list in Kibana, shown in the Fleet Settings flyout. The user will have to select and confirm a default URL to use. They can also add a DNS address to the list and choose that as the default URL.

A few questions and ideas came up that we need to resolve. More detail below:

image

  1. It was my understanding that Fleet Server would add all of its IP addresses to the .fleet-servers index when it connects to ES. @blakerouse explained this is the current behavior, but we can change the code to improve this. Fleet Server should only report a single addressable IP address, and we should automatically set that as the default. This is a todo item for Blake.
  2. When a Fleet Server connects, when do agent policies get updated? Applying these settings will push updates to all policies and affect every agent. Should we:
    • Automatically push updates to agent policies whenever a Fleet Server connects?
    • Require the user to manually approve the list of Fleet Server URLs before pushing changes?
  3. For cloud deployments, we can provide a single DNS address rather than list multiple IP addresses in the case of multiple nodes. Should the user be able to remove the cloud URL?
    • Blake explained a use case where a user may want to use APM Server on cloud but not Fleet Server. They may want to run a self-managed Fleet Server on their own infrastructure. Sounds like a unique use case we may not need to support during Beta.
  4. The list of Fleet Server URLs comes from the .fleet-servers index, but Blake mentioned users can't delete documents in the index since they refresh every ~20 seconds. We'll need a solution that allows users to be explicit about what URLs are allowed.
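
One possible shape for item 4, sketched under assumptions rather than taken from any agreed design: because the reported documents keep reappearing, the user's decision is stored separately and applied as a filter when the list is read. The function name and the idea of a separate `removedByUser` list are both hypothetical.

```typescript
// reported: addresses that Fleet Servers keep writing to the index.
// removedByUser: addresses the user explicitly removed in Fleet settings,
// stored separately so they stay hidden even when a server reports them again.
function selectableHosts(reported: string[], removedByUser: string[]): string[] {
  return reported.filter(
    (host, i) =>
      removedByUser.indexOf(host) === -1 && // drop hosts the user removed
      reported.indexOf(host) === i // keep only the first copy of duplicates
  );
}

// Example with made-up addresses.
const shownHosts = selectableHosts(
  ["10.0.0.1", "10.0.0.2", "10.0.0.1"],
  ["10.0.0.2"]
);
```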

@blakerouse @nchaulet please correct me if anything I said was wrong or add more as you see fit.

Also, any input on the form field descriptions would be much appreciated. I tried my best based on my own understanding.

@ruflin
Contributor

ruflin commented Mar 9, 2021

Few thoughts:

  • fleet-server should only report 1 thing, be it an IP or a url
  • In case of multiple fleet-servers, can we require for now that they must be all behind a proxy or at least all must be accessible? As long as the Fleet API was in Kibana we also did not really account for the case where fleet-server could be deployed in any place. We potentially get there one day but to keep it simple, I would not allow this. So for enrollment, we would always just pick the first url in the list, no user selection.
  • fleet-server should be an array, like the Kibana URLs today. Users can edit these and remove a URL if needed. If the fleet-server is still running, I assume it would get re-added automatically? In Cloud we should not allow modifying it (interestingly, we allow it today for Kibana in Cloud).
  • Update for new fleet-server: To keep it simple, we could just wait until the policy is modified next time. What happens today if the Kibana urls are updated?

@simitt
Contributor

simitt commented Mar 9, 2021

> 3. For cloud deployments, we can provide a single DNS address rather than list multiple IP addresses in the case of multiple nodes. Should the user be able to remove the cloud URL?
>
> Blake explained a use case where a user may want to use APM Server on cloud but not Fleet Server. They may want to run a self-managed Fleet Server on their own infrastructure. Sounds like a unique use case we may not need to support during Beta.

Yes, for Cloud we want to show the publicly available, single Fleet Server URL.

@ph
Contributor

ph commented Mar 9, 2021

I would like to move the discussion toward simplicity. If we promote the proxy deployment scenario, this would remove a lot of complexity on our end: a single URL, and if you have more than one Fleet Server, you use a proxy. If your scale requires multiple Fleet Servers, you probably have a proxy available to you. By doing so, the Cloud and on-premise deployments become more similar.

I don't think we are in a position, or have a requirement, to auto-detect IPs or do anything magic.

@nchaulet
Member

nchaulet commented Mar 9, 2021

> I would like to move the discussion toward simplicity. If we promote the proxy deployment scenario, this would remove a lot of complexity on our end: a single URL, and if you have more than one Fleet Server, you use a proxy. If your scale requires multiple Fleet Servers, you probably have a proxy available to you. By doing so, the Cloud and on-premise deployments become more similar.
> I don't think we are in a position, or have a requirement, to auto-detect IPs or do anything magic.

In this case, configuring the Fleet Server URL will be really similar to what we currently have for the Kibana URL and ES URL, and it would be automatically configured in cloud.

@blakerouse

Yes, after thinking about it for a day and speaking with @ph this morning, I think we should just go with exactly what we have with Kibana today, just renaming it from Kibana URL to Fleet Server URL.

This would greatly simplify the implementation. With the current approach it would never be 100% accurate that we would be providing the correct addresses. There is the whole issue with when to update policy based on new Fleet Servers, etc.

They already have to get the URL of elasticsearch correct and set it in the Kibana configuration on start or by updating it in the Settings flyout. Why not just make it the same for Fleet Server and remove all the magic? The magic is just going to be wrong, open up more issues, and introduce more complexity.

@ph ph assigned nchaulet and unassigned afgomez Mar 9, 2021
@mostlyjason
Contributor Author

mostlyjason commented Mar 9, 2021

> we remove all the magic. The magic is just going to either be wrong, open up more issues, and introduce more complexity.

++

> Exactly what we have with Kibana today. Just renaming it from Kibana URL to Fleet Server URL.

Sounds good to me! We already offer multiple kibana URLs today. This seems like a powerful feature allowing management across multiple networks and automatic failover with a relatively simple solution, no proxy required. We already pick the first URL off the list in the add agent dialog, and the agent already tries the next server until it finds one that successfully connects. What's the downside of keeping the current behavior?

@blakerouse

The downside of keeping the current behavior is that the new behavior would, in some cases, have been more streamlined and required less configuration from an administrator. But I think the fact that it will not always be correct and will still require manual configuration, together with the complexity of adding it, greatly outweighs the benefits.

@hbharding
Contributor

hbharding commented Mar 10, 2021

Here's my take at a summary of what's been discussed. I'd like to close out the items which are not clear to me and come to a decision.

What's clear to me

  • We won't auto-detect the Fleet Server URL. Users will have to add their Fleet Server URL or a DNS proxy manually after they connect a Fleet Server.
  • Users should not be able to modify the Fleet Server URL from cloud. (@ruflin)
  • On cloud, the Fleet Server URL will be pre-populated (@nchaulet)

Updated:

  • Fleet Server URL input accepts an array of values. The first URL provided will be shown in the enrollment command displayed in the UI.
  • Users can add additional Fleet Server URLs on both cloud and self-managed clusters.
  • These settings are applied to all agent policies when the user clicks save in "Fleet Settings".
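
The "apply to all agent policies on save" behavior could look roughly like the sketch below. `AgentPolicySketch` is a hypothetical, simplified shape (real agent policies carry many more fields), and bumping a `revision` counter only when the host list actually changes is an assumption made for illustration.

```typescript
// Hypothetical, simplified policy shape used only for this sketch.
interface AgentPolicySketch {
  id: string;
  revision: number;
  fleetServerHosts: string[];
}

// Return updated copies of every policy carrying the saved host list.
// Policies that already have the same list are returned unchanged.
function applyHostsToPolicies(
  policies: AgentPolicySketch[],
  hosts: string[]
): AgentPolicySketch[] {
  return policies.map((policy) => {
    const unchanged = policy.fleetServerHosts.join("\n") === hosts.join("\n");
    if (unchanged) {
      return policy;
    }
    return {
      id: policy.id,
      revision: policy.revision + 1,
      fleetServerHosts: hosts.slice(),
    };
  });
}
```

Returning new objects instead of mutating the input keeps the previous policy state available, which may be useful if the save is cancelled from a confirmation step.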

What's not

  • Is the Fleet Server URL input an array of values or just a single URL?
    • @ruflin says "fleet-server should be an array". @blakerouse says to keep the current behavior, which is an array.
    • I think @ph suggests it should just be single url for simplicity. If a user has multiple Fleet Servers, then they probably have access to a proxy. @ruflin also says "in the case of multiple fleet-servers, can we require they must be behind a proxy"
  • If we decide there can be multiple Fleet Server URLs (array), does the position of the URLs matter?
    • From Jason: We already pick the first URL off the list in the add agent dialog, and the agent already tries the next server until it finds one that successfully connects.

    • I think I recall Blake or Chaulet saying that if an agent tries to enroll and can't connect using the first URL in the list, then it will never gain access to the full list of URLs. Only once it connects to Fleet Server will it receive the full list. This leads me to believe that the position in the list is important for enrollment, especially if we're not providing a way for the user to select a Fleet Server URL in the "add agent" flyout.
  • For cloud deployments, are users able to add additional URLs, or is the entire input disabled?
    • I'll need to check if EuiComboBox allows some values to be disabled, but others editable. It may be simpler to disable the entire input and only allow a single URL for cloud.
  • When does a Fleet Server URL get applied to an agent policy?
    • After the user clicks "save" in "Fleet Settings", apply to all policies?
    • After a user updates the Fleet Server URL(s), user will need to modify and save each agent policy. (@ruflin's suggestion to "keep it simple")

@simitt
Contributor

simitt commented Mar 10, 2021

> For cloud deployments, are users able to add additional URLs, or is the entire input disabled?

I'd say start simple and disable the whole input; if we identify use cases, we could loosen the restrictions later.

@blakerouse

> Is the Fleet Server URL input an array of values or just a single URL?

I think it's best to keep the current behavior and have it be an array.

> If we decide there can be multiple Fleet Server URLs (array), does the position of the URLs matter?

They don't really matter, because if you have a Fleet Server URL in the array, then all Agents really should be able to communicate to that Fleet Server otherwise it does not belong in the list.

So picking the first in the array makes this simple and easy for a user to understand that it just shows the first for enrollment.

> When does a Fleet Server URL get applied to an agent policy?

Keep the current behavior - After the user clicks "save" in "Fleet Settings", apply to all policies.

@mostlyjason
Contributor Author

++ on Blake's response

> Users should not be able to modify the Fleet Server URL from cloud. (@ruflin)

I just had a chat with Ruflin on Slack. We came up with an alternative idea that would explain to users the impact/risks of changing this field and ask if they want to continue in a confirmation dialog. We can also add a section to our troubleshooting guide explaining how to recover from an incorrect or missing URL by updating this field and then reenrolling agents. The advantages are that it still allows cloud users to add their own fleet servers, it would also benefit self-managed users, and it reduces the amount of special code we need for cloud. We can apply this same behavior to the ES output field as well.

@nchaulet
Member

For the cloud part, how will we get the URL? Currently we get the ES and Kibana URLs from the cloudId; are we going to have this too? (@simitt do you know how it's done for APM server currently?)

@mostlyjason
Contributor Author

This thread is getting long so I added a summary at the top based on the latest conversation so we can move towards a clear definition. Please let me know if you have concerns with that approach.

@hbharding
Contributor

Thanks Jason - I just updated my previous comment / summary.

@ph
Contributor

ph commented Mar 10, 2021

Thanks everyone, I like where this is going! ++++ on @mostlyjason for updating the description.

@hbharding
Contributor

Forgot to post the updated wireframe. I'm going to open a separate design issue where I'll post final designs. I still need to account for the confirmation step.

image

@hbharding
Contributor

@nchaulet I added designs for this in #94624; it should be ready to be worked on.

9 participants