Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add /api/fleet/agents/:id/audit/unenroll endpoint #3818

Merged
merged 6 commits into from
Aug 20, 2024

Conversation

michel-laterman
Copy link
Contributor

@michel-laterman michel-laterman commented Aug 13, 2024

What is the problem this PR solves?

Uninstalled agents appear as offline in the fleet UI.

How does this PR solve the problem?

Add /api/fleet/agents/:id/audit/unenroll API that an elastic-agent or Endpoint process may use to annotate the agent document so the agent may appear with a different status.

Changes from RFC

RFC here

Changes are:

  • 409 responses will include a JSON body to be consistent with other errors
  • Successfull requests will alter the following agent document attributes:
    • audit_unenrolled_time - time from request
    • audit_unenrolled_reason - reason from request
    • unenrolled_at - server current time UTC
    • updated_at - server current time UTC (new)
    • active - set to false (new) removed change, was causing the api key auth to fail breaking integration and e2e tests

How to test this PR locally

make test-e2e will run e2e tests as we need to change the elastic-agent or other test tools in order to use the endpoint

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

@michel-laterman michel-laterman added enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Aug 13, 2024
@michel-laterman michel-laterman requested a review from a team as a code owner August 13, 2024 23:17
Copy link
Contributor Author

@michel-laterman michel-laterman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'll also need to make a change to the ES templates in order to add audit_unenrolled_time and audit_unenrolled_reason to the agent documents. (edit: pr here)

I'll also make a followup PR to change the behaviour of the checkin API to clear the attributes that are set by the audit/unenroll API to allow Endpoint to mark itself as "orphaned" temporarily.

@@ -53,3 +53,7 @@ server_limits:
interval: 1ms
burst: 2000
max: 4000
audit_unenroll_limit:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default/env limits for the audit/unenroll endpoint are set to the same values as the ack endpoint (excluding body which is set to 1 kb)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

@blakerouse blakerouse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done!

Just a few comments, no blockers.

}

if err := audit.bulk.Update(ctx, dl.FleetAgents, agent.Id, body, bulk.WithRefresh(), bulk.WithRetryOnConflict(3)); err != nil {
return fmt.Errorf("auditUnenroll update: %w", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems this will return a 500 error back to the caller, is that the correct error in the case its a conflict? Should it instead be the same conflict error in the case the field is set? I don't know if we want the caller to error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -349,7 +349,7 @@ key: %s`,
cfg.TLS = tlsCFG

st := NewStatusT(cfg, nil, nil)
srv := NewServer(addr, cfg, nil, nil, nil, nil, st, sm, fbuild.Info{}, nil, nil, nil, nil, nil)
srv := NewServer(addr, cfg, nil, nil, nil, nil, st, sm, fbuild.Info{}, nil, nil, nil, nil, nil, nil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has become a ton of parameters will most being nil. Switching to a struct as a parameter would be better and cleaner. (don't do it in this PR, but an overall improvement)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #3823 to track

@@ -53,3 +53,7 @@ server_limits:
interval: 1ms
burst: 2000
max: 4000
audit_unenroll_limit:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1


for _, endpoint := range (&cfg.Inputs[0].Server).BindEndpoints() {
apiServer := api.NewServer(endpoint, &cfg.Inputs[0].Server, ct, et, at, ack, st, sm, f.bi, ut, ft, pt, bulker, tracer)
apiServer := api.NewServer(endpoint, &cfg.Inputs[0].Server, ct, et, at, ack, st, sm, f.bi, ut, ft, pt, auditT, bulker, tracer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

struct would be much better, this is a lot...

internal/pkg/api/error.go Outdated Show resolved Hide resolved
Copy link
Member

@pchila pchila left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a typo and a question about an error message (non-blocking as it can also be fixed in a follow-up PR), looks good otherwise...

+1 on refactoring long lists of parameters into structs though... in those places the code is not very readable 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants