
[Fleet]: Agents installed with the default Fleet Server URL go online and offline repeatedly after multiple other Fleet Servers are configured #25940

Closed
amolnater-qasource opened this issue May 27, 2021 · 12 comments
Assignees
Labels
bug impact:high Short-term priority; add to current release, or definitely next. P1 Team:Fleet Label for the Fleet team v7.14.0

Comments

@amolnater-qasource

Kibana version: 7.14.0 Snapshot Kibana cloud environment

Host OS and Browser version: All, All

Build Details:

  Artifact link used: https://snapshots.elastic.co/7.14.0-f385fee6/downloads/beats/elastic-agent/elastic-agent-7.14.0-SNAPSHOT-windows-x86_64.zip
  Build: 41264
  Commit: 58f7eff1ce01e9ae850803e092d0b30ee7251849

Preconditions:

  1. 7.14.0 Snapshot Kibana cloud environment should be available.
  2. A Windows 10 x64 Fleet Server agent must be installed on the 7.14.0 snapshot, with a new policy that includes the Fleet Server integration.

Steps to reproduce:

  1. Log in to the Kibana environment.
  2. Add the URL of the self-created Fleet Server agent, "https://10.0.x.x:8220", under Fleet Settings.
  3. Install another agent using the default Fleet Server agent URL (steps 2-3 are sketched as commands after the logs below).
  4. Observe that the agent installs successfully.
  5. After a few minutes, observe the agent going inactive while it tries to connect to every Fleet Server listed in Fleet Settings.
  6. Observe the errors below under the Logs tab:
{"log.level":"error","@timestamp":"2021-05-27T02:43:37.042-0400","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},"message":"Could not communicate with fleet-server Checking API will retry, error: fail to checkin to fleet-server: Post \"https://localhost:8220/api/fleet/agents/7587872d-7351-41af-b9d2-da9ff6976e0a/checkin?\": dial tcp 127.0.0.1:8220: connectex: No connection could be made because the target machine actively refused it.","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2021-05-27T02:45:29.441-0400","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},"message":"Could not communicate with fleet-server Checking API will retry, error: fail to checkin to fleet-server: Post \"https://10.0.5.84:8220/api/fleet/agents/7587872d-7351-41af-b9d2-da9ff6976e0a/checkin?\": x509: certificate signed by unknown authority","ecs.version":"1.6.0"}

Expected Result:
Agents should remain healthy (active) throughout.

Logs:
Logs.zip

Note:

  • This issue is also observed on the 7.13.0 released build.

Screenshots:
Fleet connectivity

@amolnater-qasource amolnater-qasource added bug impact:high Short-term priority; add to current release, or definitely next. Team:Fleet Label for the Fleet team labels May 27, 2021
@elasticmachine
Collaborator

Pinging @elastic/fleet (Team:Fleet)

@amolnater-qasource
Author

@dikshachauhan-qasource Please review.

@dikshachauhan-qasource

Reviewed and assigned to @EricDavisX

@EricDavisX
Contributor

Thanks for logging this. I've been investigating the same area and am setting up the Demo/Test server to help us test this routinely.

I see from your notes and screenshots that:

In step 2: Add the URL of the self-created Fleet Server agent "10.0.x.x:8220" under Fleet Settings.

  • you kept the cloud-hosted Fleet Server agent, added one more real one, and also added a 'localhost' Fleet Server

In step 5: After a few minutes, observe the agent going inactive while it tries to connect to every Fleet Server listed in Fleet Settings.

  • is the agent in this step a Fleet Server agent on 'New Policy 02', or is it a non-Fleet Server agent? An agent running Fleet Server should not itself be 'enrolled' with "./elastic-agent install -url" (the two install modes are sketched below). If that is what happened, it points to error handling we could improve so that we fail up front.

I will report back more findings myself; let us know, @dikshachauhan-qasource, thanks.
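For reference, the two install modes being distinguished here look roughly like this (a minimal sketch; the hosts and tokens are placeholders, and the Fleet Server flags are as in 7.13+):

# An agent that runs Fleet Server is bootstrapped directly against Elasticsearch:
.\elastic-agent.exe install --fleet-server-es=https://<elasticsearch-host>:9200 --fleet-server-service-token=<service-token>

# A regular (non-Fleet Server) agent is enrolled against an existing Fleet Server:
.\elastic-agent.exe install --url=https://<fleet-server-host>:8220 --enrollment-token=<enrollment-token>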

@amolnater-qasource
Author

amolnater-qasource commented May 31, 2021

Hi @EricDavisX
Thanks for looking into this.

you kept the cloud-hosted Fleet Server agent, added one more real one, and also added a 'localhost' Fleet Server

Yes

is the agent in this step a Fleet Server agent on 'New Policy 02', or is it a non-Fleet Server agent?

This is a non-Fleet Server agent: the elastic-agent installed with the first URL, i.e. the cloud-hosted Fleet Server agent.
Note: During this elastic-agent installation, other Fleet Server agents were also running in Kibana.

Further, per the logs, we observed that the new secondary agent (i.e. the elastic-agent) tries to connect to all the existing Fleet Servers. When it fails to communicate with the other Fleet Server agents, this secondary agent goes "Offline" (one way to observe this from the host is sketched below).
As soon as it reconnects to the cloud-hosted Fleet Server agent, it becomes "Healthy" again.

Thanks
QAS
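
One way to observe this cycling from the affected Windows host (a minimal sketch: elastic-agent status is a real subcommand, but the log path and file name below are assumptions based on a default Windows install):

# Ask the agent for its own view of its health while it cycles:
& "C:\Program Files\Elastic\Agent\elastic-agent.exe" status

# Tail the agent's JSON log for the failed check-ins quoted above (PowerShell):
Get-Content "C:\Program Files\Elastic\Agent\data\elastic-agent-*\logs\elastic-agent-json.log" -Wait -Tail 20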

@ruflin
Contributor

ruflin commented May 31, 2021

Is my understanding correct that this scenario mixes an on-prem and a hosted fleet-server? Even though it should work, I remember we mentioned that it is not something we support at the moment (@ph @mostlyjason, please clarify if this is not correct).

Having said that, we should still dig into this to understand what happens, as we should support it in the future.

@mostlyjason

mostlyjason commented Jun 1, 2021

I think we talked about not supporting this case, but I'm not sure we made a decision on it. The behavior we defined in elastic/kibana#89442 (comment) allows for multiple hosts: "The Elastic Agent will iterate through URLs until it connects to one successfully. This allows for automatic failover and subnets."

It looks like the agent tried only a single host in 41 seconds and then switched to the degraded status. It's great that it eventually switches to a healthy host, but changing the status in between seems unexpected. Shouldn't the logic be to try all the hosts and switch to degraded only if none succeed (a sketch of that logic follows the logs below)? Could we just treat this as a bug?

{"log.level":"info","@timestamp":"2021-05-26T23:23:17.718-0700","log.origin":{"file.name":"log/reporter.go","file.line":40},"message":"2021-05-26T23:23:17-07:00 - message: Application: metricbeat--7.14.0-SNAPSHOT--36643631373035623733363936343635[f422f8a3-2003-44ce-ba97-c36b2fe59468]: State changed to RUNNING: Running - type: 'STATE' - sub_type: 'RUNNING'","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2021-05-26T23:23:17.942-0700","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},"message":"Could not communicate with fleet-server Checking API will retry, error: fail to checkin to fleet-server: Post \"https://10.0.5.84:8220/api/fleet/agents/f422f8a3-2003-44ce-ba97-c36b2fe59468/checkin?\": x509: certificate signed by unknown authority","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2021-05-26T23:23:58.072-0700","log.origin":{"file.name":"log/reporter.go","file.line":40},"message":"2021-05-26T23:23:58-07:00 - message: Application: endpoint-security--7.14.0-SNAPSHOT[f422f8a3-2003-44ce-ba97-c36b2fe59468]: State changed to DEGRADED: Missed last check-in - type: 'STATE' - sub_type: 'RUNNING'","ecs.version":"1.6.0"}

@EricDavisX
Contributor

I think it is important to support this for 2 reasons:

@EricDavisX
Contributor

Another mention in Slack today, from Patrick Boulanger @pboulanger74, of users who want to set up one Fleet Server on the local intranet and a separate one in a DMZ (low side).

@ph
Contributor

ph commented Jun 22, 2021

@blakerouse This seems to be a recurrent issue; could you take a look? The report dates from a few weeks ago, but the last comment from Eric is from last week.

Not sure what the issue is here; we need to investigate it seriously. Hi @nimarezainia @urso @ruflin.

@amolnater-qasource
Author

Hi @EricDavisX
We have revalidated this issue on the 7.14.0 snapshot and found it is not reproducible today.

  • Agents installed with the default Fleet Server agent after setting multiple other Fleet Servers remain Healthy.

Build details:

Build: 41896
Commit: e26582638988179d134e77e59b66ed8f982ab064
Artifact link: https://snapshots.elastic.co/7.14.0-df0371f0/downloads/beats/elastic-agent/elastic-agent-7.14.0-SNAPSHOT-windows-x86_64.zip

Hence we are closing this out. However, if we observe this issue again in further testing, we will reopen it.

Thanks
QAS

@EricDavisX
Contributor

See above: I logged a proper placeholder for this work, as this issue captured only one part of the puzzle and was closed when that one bug was fixed. ;)
