Cannot start Beats - fails with error: could not get FQDN #34910

Closed
andrewkroh opened this issue Mar 23, 2023 · 14 comments · Fixed by #34946
Labels
bug Team:Elastic-Agent Label for the Agent team

Comments

@andrewkroh
Member

I'm unable to start Metricbeat and Filebeat.

For confirmed bugs, please report:

  • Version: main from b8e0449
  • Operating System: macOS 13.2.1 (22D68)
  • Discuss Forum URL:
  • Steps to Reproduce:
  1. Try to get the Metricbeat version.
./metricbeat version
error initializing beat: failed to get host information: 1 error: could not get FQDN, all methods failed: failed looking up CNAME: lookup mac16-m1: no such host: failed looking up IP: lookup mac16-m1: no such host
@andrewkroh andrewkroh added bug Team:Elastic-Agent Label for the Agent team labels Mar 23, 2023
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@amitkanfer
Collaborator

Isn't this implementation behind a feature flag (FF)? Or did you enable it, @andrewkroh?

@amitkanfer
Collaborator

Anyway, marking this as a P0 since it looks like a regression.

@andrewkroh
Member Author

I did not enable the feature. I didn't use a config file or specify any options. It failed when I ran just the filebeat version sub-command, which should only output the build info.

@cmacknz
Member

cmacknz commented Mar 23, 2023

What is interesting is that we do have a system test for the version sub-command using mockbeat that apparently isn't catching this: https://github.com/elastic/beats/blob/main/libbeat/tests/system/test_cmd_version.py

Whatever the cause turns out to be, we should make sure that test can catch this before fixing it.

@ycombinator
Contributor

ycombinator commented Mar 23, 2023

@andrewkroh I just built Filebeat (OSS) from main (same commit hash as you mentioned in this issue's description) on my Intel Mac running 13.2.1 (same version as you mentioned in the issue description). Strangely enough, the command worked for me:

$ ./filebeat version
filebeat version 8.8.0 (amd64), libbeat 8.8.0 [b8e0449d12b1a4eb8072e4fb6142516aebe9af51 built 2023-03-23 22:38:52 +0000 UTC]

Same with Metricbeat (OSS):

$ ./metricbeat version
metricbeat version 8.8.0 (amd64), libbeat 8.8.0 [b8e0449d12b1a4eb8072e4fb6142516aebe9af51 built 2023-03-23 22:42:49 +0000 UTC]

[EDIT] I also tried running both commands with no network connection, since the error you posted looks like it's doing some kind of DNS lookup. Both commands still worked for me.

Hmmm... trying to figure out what else might be different between our environments so we can narrow down exactly what needs to be set up during an automated test for this issue. What do hostname, hostname -f, and hostname -s report for you?

@ycombinator ycombinator self-assigned this Mar 23, 2023
@andrewkroh
Member Author

% git rev-parse HEAD
b8e0449d12b1a4eb8072e4fb6142516aebe9af51
% uname -a
Darwin mac16-m1 22.3.0 Darwin Kernel Version 22.3.0: Mon Jan 30 20:38:37 PST 2023; root:xnu-8792.81.3~2/RELEASE_ARM64_T6000 arm64
% hostname
mac16-m1
% hostname -f
mac16-m1
% hostname -s
mac16-m1
% ./filebeat version
error initializing beat: failed to get host information: 1 error: could not get FQDN, all methods failed: failed looking up CNAME: lookup mac16-m1: no such host: failed looking up IP: lookup mac16-m1: no such host

@ycombinator
Contributor

Thanks, here's the same output for mine:

$ git rev-parse HEAD
b8e0449d12b1a4eb8072e4fb6142516aebe9af51
$ uname -a
Darwin Shaunaks-MBP.attlocal.net 22.3.0 Darwin Kernel Version 22.3.0: Mon Jan 30 20:42:11 PST 2023; root:xnu-8792.81.3~2/RELEASE_X86_64 x86_64
$ hostname
Shaunaks-MBP.attlocal.net
$ hostname -f
Shaunaks-MBP.attlocal.net
$ hostname -s
Shaunaks-MBP
$ ./filebeat version
filebeat version 8.8.0 (amd64), libbeat 8.8.0 [b8e0449d12b1a4eb8072e4fb6142516aebe9af51 built 2023-03-23 22:38:52 +0000 UTC]

Looks like the only difference is your machine's hostname is a short one (no domain component). Let me mess around with that setup and see if I can reproduce your error locally. Thanks again!
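
For anyone who wants to check their own machine for the same condition, here is a minimal Go sketch. The helper name hasDomainComponent is made up for this example and is not part of Beats or go-sysinfo. A dot in the hostname is only a hint: it does not guarantee that DNS can resolve the name.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// hasDomainComponent reports whether name contains a dot that separates two
// non-empty parts, for example "Shaunaks-MBP.attlocal.net". A single
// trailing dot is ignored. This is a hypothetical helper for illustration.
func hasDomainComponent(name string) bool {
	name = strings.TrimSuffix(name, ".")
	i := strings.Index(name, ".")
	return i > 0 && i < len(name)-1
}

func main() {
	// The two hostnames reported in this thread.
	for _, name := range []string{"mac16-m1", "Shaunaks-MBP.attlocal.net"} {
		fmt.Printf("%s: has domain component = %t\n", name, hasDomainComponent(name))
	}

	// The hostname of the machine running this program.
	if host, err := os.Hostname(); err == nil {
		fmt.Printf("%s: has domain component = %t\n", host, hasDomainComponent(host))
	}
}
```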

@ycombinator
Contributor

ycombinator commented Mar 23, 2023

Yep, there it is:

$ hostname
Shaunaks-MBP.attlocal.net
$ sudo scutil --set HostName shortypants
$ hostname
shortypants
$ ./filebeat version
error initializing beat: failed to get host information: 1 error: could not get FQDN, all methods failed: failed looking up CNAME: lookup shortypants: no such host: failed looking up IP: lookup shortypants: no such host

Okay, I'll work on a test PR that sets up a short hostname and makes sure it fails on the current build, then work on the fix.

@andrewkroh
Member Author

andrewkroh commented Mar 24, 2023

This is the stack of the main goroutine when the error originates.

github.com/elastic/go-sysinfo/providers/shared.FQDN(fqdn.go:58)
github.com/elastic/go-sysinfo/providers/darwin.(*reader).fqdn(host_darwin.go:218)
github.com/elastic/go-sysinfo/providers/darwin.newHost(host_darwin.go:163)
github.com/elastic/go-sysinfo/providers/darwin.darwinSystem.Host(host_darwin.go:43)
github.com/elastic/go-sysinfo/providers/darwin.(*darwinSystem).Host(<autogenerated>:1)
github.com/elastic/go-sysinfo.Host(system.go:52)
github.com/elastic/beats/v7/libbeat/cmd/instance.NewBeat(beat.go:247)
github.com/elastic/beats/v7/libbeat/cmd/instance.NewInitializedBeat(beat.go:223)
github.com/elastic/beats/v7/libbeat/cmd/instance.Run.func1(beat.go:212)
github.com/elastic/beats/v7/libbeat/cmd/instance.Run(beat.go:218)
github.com/elastic/beats/v7/libbeat/cmd.genRunCmd.func1(run.go:36)
github.com/spf13/cobra.(*Command).execute(command.go:860)
github.com/spf13/cobra.(*Command).ExecuteC(command.go:974)
github.com/spf13/cobra.(*Command).Execute(command.go:902)
main.main(main.go:22)

FWIW I think the algorithm used to determine the FQDN could use some godoc to explain what it does and why. I think this is what it's doing.

  1. It performs a CNAME lookup (dig CNAME mac16-m1). If it gets a response then it trims the trailing dot and uses the result.
  2. It performs a forward lookup on the os.Hostname() value. If that returns a response with IPs then it does a reverse lookup on those until the first one yields a hostname.

In addition, I think we need to document the prerequisites for the FQDN feature to work. That will help when we need to support users who ask why their Beat is not reporting the FQDN of the machine after enabling the feature flag.

@ycombinator
Contributor

> FWIW I think the algorithm used to determine the FQDN could use some godoc to explain what it does and why.

Agreed, happy to document the what and why of the algorithm.

However, the "why" of the algorithm is unclear to me. In particular, why do we need to do the CNAME lookup followed by the reverse lookup? Why don't we "just" run hostname -f instead (on Mac/Linux)? @AndersonQ perhaps you can shed some light on this, since the algorithm was implemented in elastic/go-sysinfo#144?

@ycombinator
Contributor

Had a good chat with @leehinman about the FQDN lookup algorithm.

We agreed that the algorithm is fine as-is, in that it tries to look up the FQDN and reports an error if it fails. However, the consumers of the FQDN should not fail on error. In other words, they should treat the FQDN lookup as "best effort" and, if it fails, log an error so we are not blind to the failure. After logging the error, they should fall back to the OS-reported hostname and continue execution.

This change in behavior will require code changes in go-sysinfo, beats, and elastic-agent. I'll make PRs to each of the repos and post links to them here.

@ycombinator
Contributor

Added a PR to make the FQDN lookup algorithm a bit more testable: elastic/go-sysinfo#158.

@ycombinator
Contributor

Added another PR to report FQDN errors from the lookup algorithm separately, so consumers can handle them with whatever severity they choose: elastic/go-sysinfo#159.
