[Elastic-Agent] Use internal port for local fleet server #28993

Merged
merged 7 commits into from
Nov 23, 2021

Conversation

@michalpristas michalpristas commented Nov 16, 2021

What does this PR do?

This PR fixes a problem where, under heavier load, the agent managing Fleet Server could not check in to its own server and therefore fell back to unenrolling.

The problem is solved by adding an internal endpoint that is used for local communication with the agent that runs Fleet Server.
It lets Fleet Server spin up two sets of handlers: one on the public port 8220 and one on a port defined in the config.

Why is it important?

Prevents unwanted unenrollment of the most important agent in a deployment.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

This needs to be tested with work on fleet-server Link: elastic/fleet-server#880

  • Start stack
  • Install agent with FS in a policy
  • Check ports
sh-3.2# lsof -i -P | grep LISTEN | grep fleet
fleet-ser  7056            root   19u  IPv4 0xba7881a9227099a5      0t0    TCP localhost:{random_port} (LISTEN)
fleet-ser  7056            root   21u  IPv6 0xba7881a91284721d      0t0    TCP *:8220 (LISTEN)
  • Run Wireshark and filter on the random port; there should be some traffic
  • Filter on port 8220; there should be no traffic
  • Enroll a new agent from another VM
  • There should now be traffic on both ports

@michalpristas michalpristas added bug backport-v7.16.0 Automated backport with mergify labels Nov 16, 2021
@michalpristas michalpristas self-assigned this Nov 16, 2021
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 16, 2021
@jlind23 jlind23 added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Nov 16, 2021
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Nov 16, 2021
@elasticmachine
Collaborator

elasticmachine commented Nov 16, 2021

💚 Build Succeeded

Build stats

  • Start Time: 2021-11-19T13:54:59.755+0000

  • Duration: 86 min 17 sec

  • Commit: 778c564

Test stats 🧪

Test Results
Failed 0
Passed 7128
Skipped 16
Total 7144

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@jlind23 jlind23 requested review from ruflin and lykkin November 17, 2021 08:28
Contributor

@ruflin ruflin left a comment

Can you share your thoughts on why you made port and host configurable?

@@ -308,6 +311,24 @@ func enroll(streams *cli.IOStreams, cmd *cobra.Command, args []string) error {

ctx := handleSignal(context.Background())

if localFleetServer := fServer != ""; localFleetServer {
Contributor

Could you move this to a separate function and then have some unit tests on it to validate the behaviour?

Collaborator

I think that port should be configurable. What if the user is already using that port for any other process running?

Contributor

Further down a random port is selected that is free. So this conflict should never happen.

This line is really hard to understand. Can you break it up somehow?

Contributor Author

@michalpristas michalpristas Nov 18, 2021

I made it configurable so that the port can be set by users who want it on a managed port. I can take it out if that's not important at the moment.

@@ -308,6 +311,24 @@ func enroll(streams *cli.IOStreams, cmd *cobra.Command, args []string) error {

ctx := handleSignal(context.Background())

if localFleetServer := fServer != ""; localFleetServer {
if fInternalHost == "" {
Contributor

What is the use case where we would not use localhost? Should we even support something else?

Collaborator

Fleet-server and its own Elastic Agent are always gonna be on the same host aren't they?

Contributor

It must be on the same host. Elastic Agent starts fleet-server as a process.

Collaborator

Then it will always be a localhost communication. Gotcha.

Should always be on localhost, however making it overridable could be useful for testing.

@@ -384,3 +407,19 @@ func mapFromEnvList(envList []string) map[string]string {
}
return m
}

func getRandomPort() (uint16, error) {

Does this work on all OSes? It seems risky to me: it assumes that the OS will allow a rebind within a short period of time and won't aggressively reuse the port elsewhere.

Can we just default to 8221 or something?

Contributor Author

I was also thinking about a reversed flow, if we agree that having this configurable is OK.
8221 is OK as well; it does not seem to be used by many services. Initially I wanted to avoid low ports, but I'm OK with this as well.
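
For context on the concern raised in this thread, the usual way to pick a free port in Go is to bind to port 0 and read back what the kernel assigned. The body of the PR's `getRandomPort` is not shown here, so the sketch below is an assumption about the general technique, with a hypothetical `freePort` helper, not the PR's code.

```go
package main

import (
	"fmt"
	"net"
)

// freePort asks the kernel for an unused TCP port by binding to port 0 on
// the loopback interface, then closes the listener and returns the port.
// The port is only known to be free at the moment of the call: another
// process may grab it before the caller binds it again.
func freePort() (uint16, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer ln.Close()

	addr, ok := ln.Addr().(*net.TCPAddr)
	if !ok {
		return 0, fmt.Errorf("unexpected address type %T", ln.Addr())
	}
	return uint16(addr.Port), nil
}

func main() {
	port, err := freePort()
	if err != nil {
		fmt.Println("could not pick a port:", err)
		return
	}
	fmt.Println("picked port", port)
}
```

The gap between closing the listener and rebinding the port is the race the reviewer describes, and a fixed default such as 8221 avoids it.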

@jlind23 jlind23 added the v8.0.0 label Nov 17, 2021
@@ -32,6 +32,8 @@ rules:
selectors:
- fleet.server.host
- fleet.server.port
- fleet.server.internal_host

still need this?

Contributor Author

nope, thanks

Contributor

@ruflin ruflin left a comment

I'm OK with exposing the internal fleet-server port as it can help with testing. But let's make sure we note in our docs and in the log files that it is unsupported in production. Users should NOT use this flag.

@@ -401,6 +405,11 @@ func (c *enrollCmd) prepareFleetTLS() error {
if c.options.URL == "" {
return errors.New("url is required when a certificate is provided")
}

if c.options.FleetServer.InternalPort > 0 {
c.options.InternalURL = fmt.Sprintf("%s:%d", defaultFleetServerInternalHost, c.options.FleetServer.InternalPort)
Contributor

Can we add a log entry here stating which internal port is used and that this is not "officially" supported?

Contributor Author

I can put it there, but as soon as something else on the machine binds this port, there needs to be an official workaround, and I think this is it.

Contributor

Let's wait until this happens. Having the log message and users pinging us about it will also help us detect whether this becomes a common problem and whether the port we selected is not ideal.
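
A minimal sketch of what the requested log entry might look like, assuming a hypothetical `internalURL` helper and the standard library logger rather than the agent's own logging package. The message wording is illustrative only.

```go
package main

import (
	"log"
	"net"
	"strconv"
)

// Mirrors the constant name visible in the diff above.
const defaultFleetServerInternalHost = "localhost"

// internalURL returns host:port for internal traffic, or an empty string
// when no internal port is configured. This is a hypothetical helper.
func internalURL(internalPort int) string {
	if internalPort <= 0 {
		return ""
	}
	return net.JoinHostPort(defaultFleetServerInternalHost, strconv.Itoa(internalPort))
}

func main() {
	if u := internalURL(8221); u != "" {
		// Illustrative wording; the actual message in the PR is not shown.
		log.Printf("fleet-server internal endpoint: %s (overriding the internal port is not officially supported)", u)
	}
}
```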

@@ -865,6 +878,9 @@ func createFleetServerBootstrapConfig(
if port == 0 {
port = defaultFleetServerPort
}
if internalPort <= 0 {
port = defaultFleetServerInternalPort
Contributor

Shouldn't this be if internalPort > 0 { internalPort = defaultFleetServerInternalPort } ?

Contributor Author

Your condition sets the default in the case where somebody tried to configure it.

Contributor

🤦‍♂️ Yes. Why is it that on line 878 we have == and here <=?

But I think the second part still holds, this should be assigned to internalPort and not port? Maybe I'm blind here too :-D

Contributor Author

yep, fixed
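
The outcome of this thread can be sketched as below: each default is assigned to its own variable, so an unset internal port no longer overwrites `port`. `resolvePorts` is a hypothetical helper, and the use of `int` with `<= 0` for both checks is an assumption; the real function signature is not shown in the diff.

```go
package main

import "fmt"

// Values taken from the constants shown later in this review.
const (
	defaultFleetServerPort         = 8220
	defaultFleetServerInternalPort = 8221
)

// resolvePorts applies defaults for unset ports. The key point from the
// review is that the internal default is assigned to internalPort, not port.
func resolvePorts(port, internalPort int) (int, int) {
	if port <= 0 {
		port = defaultFleetServerPort
	}
	if internalPort <= 0 {
		internalPort = defaultFleetServerInternalPort
	}
	return port, internalPort
}

func main() {
	fmt.Println(resolvePorts(0, 0))       // prints "8220 8221"
	fmt.Println(resolvePorts(9000, 9001)) // prints "9000 9001"
}
```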

defaultFleetServerHost = "0.0.0.0"
defaultFleetServerPort = 8220
defaultFleetServerInternalHost = "localhost"
defaultFleetServerInternalPort = 8221
Contributor

Let's make sure we document this internal port as needed by fleet-server. @jlind23 Can you make sure we follow up on this? We also need to have this in the changelog because if by chance someone already uses this port on the machine, it would break things (hopefully unlikely).

Collaborator

@ruflin Where should it be documented though?

Contributor

Two ideas:

@bmorelli25 Where would these internal docs around elastic-agent and fleet-server fit best?

Member

This is @dedemorton's territory. DeDe, can you take a look?

Contributor

OK, so it sounds like you don't want to document the internal port setting in the main docs (command reference + Fleet Server settings) because users will not set it in a production environment. It also sounds like we're covered for the situation where the port is already in use (the code will select an unused port when there's a conflict).

I'd like to understand specific use cases for this setting, though, because I'm always suspicious of hidden settings that we don't want to document for users. We have a lot of users who are on the support team and might look in the reference first. My preference is to have reference docs that are complete, but to make the usage clear in the description.

Sounds like it also makes sense to cover this setting in the troubleshooting docs, but we need to explain when/why to change this setting during testing.

We could also mention in the Fleet Server docs that Fleet Server uses port 8221 internally, but will look for another port if that one isn't available.

Contributor

It will not look for another port. We might want to use the port for testing purposes or as an escape hatch if a user does indeed have a conflict.

There might be 2 things we should document:

  • Architecture / Internals: What port is used for internal communication -> public
  • Hidden config to change it -> developer focus, mentioned that not officially supported

Contributor

@ruflin ruflin left a comment

LGTM. Only comment left is around the log message.

@michalpristas michalpristas merged commit 9b154ad into elastic:master Nov 23, 2021
mergify bot pushed a commit that referenced this pull request Nov 23, 2021
[Elastic-Agent] IUse itnernal port for local fleet server (#28993)

(cherry picked from commit 9b154ad)
v1v added a commit to v1v/beats that referenced this pull request Nov 24, 2021
…ws-on-file-changes

* upstream/master:
  override host on statsd metricset (elastic#29103)
  Skip config check in autodiscover for duplicated configurations (elastic#29048)
  Change "filebeat.config.modules.enabled" to "true" (elastic#28769)
  Remove deprecated spool queue from Beats (elastic#28869)
  Add `beat` field back to beat.stats (elastic#29094)
  Revert "Move labels and annotations under kubernetes.namespace. (elastic#27917)" (elastic#29069)
  heartbeat: remove w2008 in the CI (elastic#29093)
  Remove deprecated `--template` and `--index-policy` flags (elastic#28870)
  Fix parsing of apache trace log levels (elastic#28717)
  [Elastic-Agent] IUse itnernal port for local fleet server (elastic#28993)
  [Heartbeat] Log error on dupe monitor ID instead of strict req (elastic#29041)
  Enable pprof for elastic-agent and beats (elastic#28983)
@michalpristas michalpristas changed the title [Elastic-Agent] IUse itnernal port for local fleet server [Elastic-Agent] Use internal port for local fleet server Dec 7, 2021
@michalpristas michalpristas added the backport-v8.0.0 Automated backport with mergify label Dec 7, 2021
mergify bot pushed a commit that referenced this pull request Dec 7, 2021
[Elastic-Agent] IUse itnernal port for local fleet server (#28993)

(cherry picked from commit 9b154ad)
michalpristas added a commit that referenced this pull request Dec 7, 2021
…29088)

[Elastic-Agent] IUse itnernal port for local fleet server (#28993)

(cherry picked from commit 9b154ad)

Co-authored-by: Michal Pristas <[email protected]>
michalpristas added a commit that referenced this pull request Dec 7, 2021
…29320)

[Elastic-Agent] IUse itnernal port for local fleet server (#28993)

(cherry picked from commit 9b154ad)

Co-authored-by: Michal Pristas <[email protected]>
Labels
backport-v7.16.0 Automated backport with mergify backport-v8.0.0 Automated backport with mergify bug Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team v8.0.0
7 participants