added more detailed logs around ES communication failure #992

Merged
merged 2 commits into from
Dec 28, 2021

@lykkin lykkin commented Dec 13, 2021

What is the problem this PR solves?

This PR adds a log line signifying the recovery of fleet server while polling ES. It also surfaces ES communication failures through the HTTP router as a 503 response.

How does this PR solve the problem?

For the logging change:
Keeps track of the errored state of the monitor and, on the first successful poll of ES after a failure, logs the recovery at info level.

For surfacing the error through the http router:
Returns a 503 when the request to ES fails with an error containing "connection refused". This method of error classification was copied from here, as I was unable to hunt down an error instance to compare the error against.

How to test this PR locally

Run make local, then run the binary against a local ES instance. Kill the ES instance.
To test the logging change:
Wait until the logs show that fleet server is having difficulty communicating with ES, then restart the ES instance. A log line similar to the one below should appear shortly:

{"log.level":"info","ecs.version":"1.6.0","service.name":"fleet-server","ctx":"policy leader manager","@timestamp":"2021-12-13T06:30:03.203Z","message":"Policy leader monitor successfully recovered after 3 attempts"}

To test the error surfacing:
While the ES instance is down, attempt to install an agent with the fleet server. This should result in an error similar to:

{"log.level":"warn","@timestamp":"2021-12-12T22:29:19.403-0800","log.logger":"tls","log.origin":{"file.name":"tlscommon/tls_config.go","file.line":105},"message":"SSL/TLS verifications disabled.","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2021-12-12T22:29:20.288-0800","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":454},"message":"Starting enrollment to URL: http://localhost:8220/","ecs.version":"1.6.0"}
Error: fail to enroll: fail to execute request to fleet-server: status code: 503, fleet-server returned an error: ServiceUnavailable, message: Fleet server unable to communicate with Elasticsearch

Checklist

  • I have commented my code, particularly in hard-to-understand areas

@lykkin lykkin added the cleanup, backport-v8.0.0, Team:Elastic-Agent-Control-Plane, and backport-v8.1.0 labels Dec 13, 2021
@lykkin lykkin self-assigned this Dec 13, 2021

elasticmachine commented Dec 13, 2021

💚 Build Succeeded

Build stats

  • Start Time: 2021-12-28T16:47:44.971+0000

  • Duration: 9 min 40 sec

  • Commit: 1e55584

Test stats 🧪

Test Results
Failed 0
Passed 186
Skipped 26
Total 212

Review thread on cmd/fleet/error.go (outdated, resolved)
@lykkin lykkin removed the backport-v8.0.0 label Dec 28, 2021
@lykkin lykkin merged commit 4daa0b5 into elastic:master Dec 28, 2021
Labels
backport-v8.1.0, cleanup, Team:Elastic-Agent-Control-Plane
3 participants