
[Fleet]: Agents installed with the default Fleet Server URL go online and offline repeatedly after multiple other Fleet Servers are configured #25940

Closed
amolnater-qasource opened this issue May 27, 2021 · 12 comments
Assignees
Labels
bug impact:high Short-term priority; add to current release, or definitely next. P1 Team:Fleet Label for the Fleet team v7.14.0

Comments

@amolnater-qasource

Kibana version: 7.14.0 Snapshot Kibana cloud environment

Host OS and Browser version: All, All

Build Details:

  Artifact link used: https://snapshots.elastic.co/7.14.0-f385fee6/downloads/beats/elastic-agent/elastic-agent-7.14.0-SNAPSHOT-windows-x86_64.zip
  Build: 41264
  Commit: 58f7eff1ce01e9ae850803e092d0b30ee7251849

Preconditions:

  1. 7.14.0 Snapshot Kibana cloud environment should be available.
  2. A Windows 10 x64 Fleet Server agent must be installed on the 7.14.0 snapshot, with a new policy that includes the Fleet Server integration.

Steps to reproduce:

  1. Log in to the Kibana environment.
  2. Add the URL of the self-created Fleet Server agent, "https://10.0.x.x:8220", under Fleet Settings.
  3. Install another agent using the default Fleet Server agent URL (steps 2-3 are sketched as commands after the logs below).
  4. Observe that the agent installs successfully.
  5. After a few minutes, observe the agent going inactive while it tries to connect to every Fleet Server listed in Fleet Settings.
  6. Observe the errors below under the Logs tab:
{"log.level":"error","@timestamp":"2021-05-27T02:43:37.042-0400","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},"message":"Could not communicate with fleet-server Checking API will retry, error: fail to checkin to fleet-server: Post \"https://localhost:8220/api/fleet/agents/7587872d-7351-41af-b9d2-da9ff6976e0a/checkin?\": dial tcp 127.0.0.1:8220: connectex: No connection could be made because the target machine actively refused it.","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2021-05-27T02:45:29.441-0400","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},"message":"Could not communicate with fleet-server Checking API will retry, error: fail to checkin to fleet-server: Post \"https://10.0.5.84:8220/api/fleet/agents/7587872d-7351-41af-b9d2-da9ff6976e0a/checkin?\": x509: certificate signed by unknown authority","ecs.version":"1.6.0"}

Expected Result:
Agents should remain healthy (active) throughout.

Logs:
Logs.zip

Note:

  • This issue is also observed on the 7.13.0 released build.

Screenshots:
Fleet connectivity

@amolnater-qasource amolnater-qasource added bug impact:high Short-term priority; add to current release, or definitely next. Team:Fleet Label for the Fleet team labels May 27, 2021
@elasticmachine
Collaborator

Pinging @elastic/fleet (Team:Fleet)

@amolnater-qasource
Author

@dikshachauhan-qasource Please review.

@dikshachauhan-qasource

Reviewed and assigned to @EricDavisX

@EricDavisX
Contributor

Thanks for logging this. I've been investigating the same area and am setting up the Demo/Test server to help us test this routinely.

I see from your notes and screenshots that:

In step 2: Add the URL of the self-created Fleet Server agent "10.0.x.x:8220" under Fleet Settings.

  • you kept the cloud-hosted Fleet Server agent, added one more real one, and also added a 'localhost' Fleet Server

In step 5: After a few minutes, observe the agent going inactive while it tries to connect to every Fleet Server listed in Fleet Settings.

  • is the agent in this step a Fleet Server agent on 'New Policy 02', or is it a non-Fleet Server agent? An agent running Fleet Server should not itself be 'enrolled' with "./elastic-agent install -url" (the two install modes are sketched below). If that is what happened, it points to error handling we could improve so that we fail up front.

I will report back more findings myself; let us know, @dikshachauhan-qasource, thanks.
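For reference, the two install modes being distinguished here look roughly like this (a minimal sketch; the hosts and tokens are placeholders, and the Fleet Server flags are as in 7.13+):

# An agent that runs Fleet Server is bootstrapped directly against Elasticsearch:
.\elastic-agent.exe install --fleet-server-es=https://<elasticsearch-host>:9200 --fleet-server-service-token=<service-token>

# A regular (non-Fleet Server) agent is enrolled against an existing Fleet Server:
.\elastic-agent.exe install --url=https://<fleet-server-host>:8220 --enrollment-token=<enrollment-token>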

@amolnater-qasource
Author

amolnater-qasource commented May 31, 2021

Hi @EricDavisX
Thanks for looking into this.

you kept the cloud-hosted Fleet Server agent, added one more real one, and also added a 'localhost' Fleet Server

Yes

is the agent in this step a Fleet Server agent on 'New Policy 02', or is it a non-Fleet Server agent?

This is a non-Fleet Server agent: the elastic-agent installed with the first URL, i.e. the cloud-hosted Fleet Server agent.
Note: During this elastic-agent installation, other Fleet Server agents were also running in Kibana.

Further, per the logs, we observed that the new secondary agent (i.e. the elastic-agent) tries to connect to all the existing Fleet Servers. When it fails to communicate with the other Fleet Server agents, this secondary agent goes "Offline" (one way to observe this from the host is sketched below).
As soon as it reconnects to the cloud-hosted Fleet Server agent, it becomes "Healthy" again.

Thanks
QAS
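
One way to observe this cycling from the affected Windows host (a minimal sketch: elastic-agent status is a real subcommand, but the log path and file name below are assumptions based on a default Windows install):

# Ask the agent for its own view of its health while it cycles:
& "C:\Program Files\Elastic\Agent\elastic-agent.exe" status

# Tail the agent's JSON log for the failed check-ins quoted above (PowerShell):
Get-Content "C:\Program Files\Elastic\Agent\data\elastic-agent-*\logs\elastic-agent-json.log" -Wait -Tail 20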

@ruflin
Contributor

ruflin commented May 31, 2021

Is my understanding correct that this scenario mixes an on-prem and a hosted fleet-server? Even though it should work, I remember we mentioned that it is not something we support at the moment (@ph @mostlyjason, please clarify if this is not correct).

Having said that, we should still dig into this to understand what happens, as we should support it in the future.

@mostlyjason

mostlyjason commented Jun 1, 2021

I think we talked about not supporting this case, but I'm not sure we made a decision on it. The behavior we defined in elastic/kibana#89442 (comment) allows for multiple hosts: "The Elastic Agent will iterate through URLs until it connects to one successfully. This allows for automatic failover and subnets."

It looks like the agent tried only a single host in 41 seconds and then switched to the degraded status. It's great that it eventually switches to a healthy host, but changing the status in between seems unexpected. Shouldn't the logic be to try all the hosts and switch to degraded only if none succeed (a sketch of that logic follows the logs below)? Could we just treat this as a bug?

{"log.level":"info","@timestamp":"2021-05-26T23:23:17.718-0700","log.origin":{"file.name":"log/reporter.go","file.line":40},"message":"2021-05-26T23:23:17-07:00 - message: Application: metricbeat--7.14.0-SNAPSHOT--36643631373035623733363936343635[f422f8a3-2003-44ce-ba97-c36b2fe59468]: State changed to RUNNING: Running - type: 'STATE' - sub_type: 'RUNNING'","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2021-05-26T23:23:17.942-0700","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},"message":"Could not communicate with fleet-server Checking API will retry, error: fail to checkin to fleet-server: Post \"https://10.0.5.84:8220/api/fleet/agents/f422f8a3-2003-44ce-ba97-c36b2fe59468/checkin?\": x509: certificate signed by unknown authority","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2021-05-26T23:23:58.072-0700","log.origin":{"file.name":"log/reporter.go","file.line":40},"message":"2021-05-26T23:23:58-07:00 - message: Application: endpoint-security--7.14.0-SNAPSHOT[f422f8a3-2003-44ce-ba97-c36b2fe59468]: State changed to DEGRADED: Missed last check-in - type: 'STATE' - sub_type: 'RUNNING'","ecs.version":"1.6.0"}

@EricDavisX
Contributor

I think it is important to support this for 2 reasons:

@EricDavisX
Contributor

Another mention in Slack today, from Patrick Boulanger @pboulanger74, of users who want to set up one Fleet Server on the local intranet and a separate one in a DMZ (low side).

@ph
Contributor

ph commented Jun 22, 2021

@blakerouse This seems to be a recurrent issue; could you take a look? The report dates from a few weeks ago, but the last comment from Eric is from last week.

Not sure what the issue is here; we need to investigate it seriously. Hi @nimarezainia @urso @ruflin.

@amolnater-qasource
Author

Hi @EricDavisX
We have revalidated this issue on the 7.14.0 snapshot and found it is not reproducible today.

  • Agents installed with the default Fleet Server agent after setting multiple other Fleet Servers remain Healthy.

Build details:

Build: 41896
Commit: e26582638988179d134e77e59b66ed8f982ab064
Artifact link: https://snapshots.elastic.co/7.14.0-df0371f0/downloads/beats/elastic-agent/elastic-agent-7.14.0-SNAPSHOT-windows-x86_64.zip

Hence we are closing this out. However, if we observe this issue again in further testing, we will reopen it.

Thanks
QAS

@EricDavisX
Contributor

See above: I logged a proper placeholder for this work, as this issue captured only one part of the puzzle and was closed when that one bug was fixed. ;)
