Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix performance issues with processors scaling under agent #35031

Merged

Conversation

fearful-symmetry
Copy link
Contributor

What does this PR do?

Fixes #35000

Related to #34149

This fixes a performance issue where we were previously starting agent mode global processors per-input, which could cause some performance issues in cases where we have lots of inputs starting and stopping.

This does this by adding a fleetDefaultProcessors argument to the MakeDefaultSupport function that's used throughout the beats to instantiate processors. Under fleet mode, this function will now use the specified default global processors, unless processors have been manually specified. This extra bit of logic allows us to disable global processors under fleet, which previously wasn't possible.

While I have tested this, there's a few conditions I haven't tested:

I'm not sure how to reliably test these two cases, so if someone wants to tell me, or test it themselves, go ahead.

Why is it important?

This is a major performance issue.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@fearful-symmetry fearful-symmetry added bug Team:Elastic-Agent Label for the Agent team labels Apr 5, 2023
@fearful-symmetry fearful-symmetry self-assigned this Apr 5, 2023
@fearful-symmetry fearful-symmetry requested review from a team as code owners April 5, 2023 20:28
@fearful-symmetry fearful-symmetry requested review from rdner and cmacknz and removed request for a team April 5, 2023 20:28
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Apr 5, 2023
@mergify
Copy link
Contributor

mergify bot commented Apr 5, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @fearful-symmetry? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@elasticmachine
Copy link
Collaborator

elasticmachine commented Apr 5, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-04-12T12:33:59.507+0000

  • Duration: 95 min 53 sec

Test stats 🧪

Test Results
Failed 0
Passed 26109
Skipped 1978
Total 28087

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@cmacknz cmacknz requested review from blakerouse and andrewkroh April 5, 2023 22:20
Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make it so that the default processors appear in the beat-rendered-config.yml dumped out in the diagnostics?

I don't see anything in here that would add them, and people expect that file to be a complete list of everything the beat is running.

@@ -15,8 +15,7 @@ import (
)

func filebeatCfg(rawIn *proto.UnitExpectedConfig, agentInfo *client.AgentInfo) ([]*reload.ConfigWithMeta, error) {
procs := defaultProcessors()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth moving the defaultProcessors function into the file that uses it, since it isn't actually used here anymore.

This comment applies to each of the beats.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yah, good point, was kinda unsure of how to clean those up.

@fearful-symmetry
Copy link
Contributor Author

Can you make it so that the default processors appear in the beat-rendered-config.yml dumped out in the diagnostics?

I'm not sure there's a particularly "clean" way to do that at that anymore, there's no real state sharing between the central management components and the core beat anymore, at least not that I can see. The easiest way to do that would be expand the Manager interface in libbeat/management to have a RegisterDiagnosticHook method, so the core beat runtime can dump state. Not sure if we want to put that in the scope of this PR?

@cmacknz
Copy link
Member

cmacknz commented Apr 5, 2023

The easiest way to do that would be expand the Manager interface in libbeat/management to have a RegisterDiagnosticHook method, so the core beat runtime can dump state. Not sure if we want to put that in the scope of this PR?

This sounds like a reasonable approach. This is always going to feel a bit ugly because we are dealing with global state far away from the management interface, but if we want proper input and output status reporting back to agent we are going to be doing a lot of more of this in the future regardless.

I think it's important not to regress on what is visible in the diagnostics. The complete state of the Beat should be visible or someone is extremely likely to waste time making an assumption that it is complete when it is not.

Copy link
Contributor

@michalpristas michalpristas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change looks good, i'm just thinking, is there a way to benchmark pipeline?
i'd like to see that perf actually improved after merging this

@jlind23
Copy link
Collaborator

jlind23 commented Apr 6, 2023

@alexsapran Does your benchmarks contains some processors applied? If yes this should be easily testable.

@alexsapran
Copy link
Contributor

@alexsapran Does your benchmarks contains some processors applied? If yes this should be easily testable.

I fear not, from what I understand the test case here has many inputs with some processors right?
My benchmarks contain 1 large file. I could try and put something together but I maybe if @fearful-symmetry can assist me with the exact conditions to test.

@fearful-symmetry
Copy link
Contributor Author

I fear not, from what I understand the test case here has many inputs with some processors right?

So, based on my understanding (@cmacknz might be able to be more specific) is that the performance bottleneck occurs in cases where the beat creates large numbers of new inputs, which will happen most often with filestream and aws-s3, although theoretically it could be reproduce with any input, if we create enough of them.

@alexsapran
Copy link
Contributor

And what is the performance indicator to identify the improvement. Is it EPS, CPU?

@fearful-symmetry fearful-symmetry requested a review from a team as a code owner April 6, 2023 18:41
@fearful-symmetry
Copy link
Contributor Author

And what is the performance indicator to identify the improvement. Is it EPS, CPU?

@alexsapran I assume CPU; in my experience, just starting a beat with the default processor sets results in a noticeable startup delay, so it probably won't be subtle.

@fearful-symmetry
Copy link
Contributor Author

fearful-symmetry commented Apr 6, 2023

Alright, added a diagnostics callback, it now reports the debug data that the individual processors make available:

add_host_metadata=[netinfo.enabled=[true], cache.ttl=[5m0s]]
add_cloud_metadata={}
add_docker_metadata=[match_fields=[] match_pids=[process.pid, process.parent.pid]]
add_kubernetes_metadata

@fearful-symmetry
Copy link
Contributor Author

fearful-symmetry commented Apr 6, 2023

Annoyingly, can't seem to reproduce the test failures locally. Not sure if they're related.

@fearful-symmetry
Copy link
Contributor Author

/test

Copy link
Contributor

@belimawr belimawr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but there is a comment regarding a debug message.

I'm also not quite sure how to test it. If I understood it correctly:

  1. If I run any Beat under Agent, and there are no global processors defined in the policy, the default ones defined on x-pack/filebeat/cmd/root.go:40 (for Filebeat) will be used
  2. If any global processor is defined in the policy, they should be used.

Is that it?

libbeat/publisher/processing/default.go Outdated Show resolved Hide resolved
@belimawr
Copy link
Contributor

@fearful-symmetry I also took a quick look at the failing tests, they seem to be failing due to some processors not being added when running a standalone Filebeat (x-pack/filebeat-pythonIntegTest), I can easily reproduce it by running the tests locally (multiple tests are failing, this is a single example):

cd x-pack/filebeat/tests/system
INTEGRATION_TEST=1 pytest -s -k 'test_http_endpoint_correct_auth_header' ./test_http_endpoint.py

On main it generates the event:

{
  "@timestamp": "2023-04-11T09:22:09.522Z",
  "@metadata": {
    "beat": "filebeat",
    "type": "_doc",
    "version": "8.8.0"
  },
  "json": {
    "testmessage": "somerandommessage"
  },
  "input": {
    "type": "http_endpoint"
  },
  "ecs": {
    "version": "8.0.0"
  },
  "host": {
    "name": "millennium-falcon"
  },
  "agent": {
    "name": "millennium-falcon",
    "type": "filebeat",
    "version": "8.8.0",
    "ephemeral_id": "bb15b979-01db-402f-a593-cae2cc6ddd7b",
    "id": "5687e3bc-c526-4a0c-971f-7392f2928cb6"
  }
}

on your branch:

{
  "@timestamp": "2023-04-11T09:18:50.837Z",
  "@metadata": {
    "beat": "filebeat",
    "type": "_doc",
    "version": "8.8.0"
  },
  "json": {
    "testmessage": "somerandommessage"
  },
  "input": {
    "type": "http_endpoint"
  }
}

Did you make sure to rebuild the test binary with mage buildSystemTestBinary before running the python tests on your branch?

Copy link
Contributor

@belimawr belimawr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just changing the review to 'request changes' because the failing tests seem to be related to this PR.

@fearful-symmetry
Copy link
Contributor Author

Did you make sure to rebuild the test binary with mage buildSystemTestBinary before running the python tests on your branch?

Ah, I didn't know that was a thing I needed to do, might explain why I couldn't reproduce the failures locally.

@fearful-symmetry
Copy link
Contributor Author

Okay, FINALLY managed to reproduce the failures, turns out I just misread the jenkins UI and ran the wrong tests, argh

@fearful-symmetry
Copy link
Contributor Author

Alright, let's see what that does...

@fearful-symmetry
Copy link
Contributor Author

Okay, did some very quick and unscientific tests, it looks like with the patch, metricbeat under agent running 17 inputs uses about half the CPU?

// main
alexk    1293186  3.0  0.1 2090116 121440 pts/69 Sl+  12:42   0:00
// this branch
alexk    1401459  1.5  0.2 1942668 123640 pts/69 Sl+  13:02   0:00 

@fearful-symmetry
Copy link
Contributor Author

Alright, did a little test with pprof to verify the changes in performance. This was agent running ten system/cpu inputs. You can see the docker client (client.(*Client).Events.func1) using about a quarter of the CPU time.

Screenshot 2023-04-11 at 3 32 08 PM

Screenshot 2023-04-11 at 3 32 22 PM

@alexsapran
Copy link
Contributor

I managed to run some benchmarks myself, I will touch base with @fearful-symmetry on the results, but from the initial reading indeed there is a difference in the metadata CPU usage which results in higher EPS.

@pierrehilbert
Copy link
Collaborator

The changelog is missing but I can't contribute on this fork so I will add it separately.

Copy link
Contributor

@belimawr belimawr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I didn't manage to manually test it, but the system tests seem to have a good coverage for this PR.

@pierrehilbert pierrehilbert added the backport-v8.7.0 Automated backport with mergify label Apr 12, 2023
@pierrehilbert pierrehilbert merged commit ea1293f into elastic:main Apr 12, 2023
mergify bot pushed a commit that referenced this pull request Apr 12, 2023
* fix performance issues with processors scaling under agent

* make linter happy

* fix test

* add comment

* move around defaultProcessors

* fix tests, add diagnostics

* fix default processor on filebeat

* change log line

* Update CHANGELOG.next.asciidoc

Adding changelog entry

---------

Co-authored-by: Pierre HILBERT <[email protected]>
(cherry picked from commit ea1293f)

# Conflicts:
#	libbeat/management/management.go
pierrehilbert added a commit that referenced this pull request Apr 12, 2023
… under agent (#35066)

* Fix performance issues with processors scaling under agent (#35031)

* fix performance issues with processors scaling under agent

* make linter happy

* fix test

* add comment

* move around defaultProcessors

* fix tests, add diagnostics

* fix default processor on filebeat

* change log line

* Update CHANGELOG.next.asciidoc

Adding changelog entry

---------

Co-authored-by: Pierre HILBERT <[email protected]>
(cherry picked from commit ea1293f)

# Conflicts:
#	libbeat/management/management.go

* Fixing conflict

---------

Co-authored-by: Alex K <[email protected]>
Co-authored-by: Pierre HILBERT <[email protected]>
chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
* fix performance issues with processors scaling under agent

* make linter happy

* fix test

* add comment

* move around defaultProcessors

* fix tests, add diagnostics

* fix default processor on filebeat

* change log line

* Update CHANGELOG.next.asciidoc

Adding changelog entry

---------

Co-authored-by: Pierre HILBERT <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v8.7.0 Automated backport with mergify bug Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Elastic Agent] Default processors created per input can result in high agent CPU usage
9 participants