Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix leaks with metadata processors #16349

Merged
merged 7 commits into from
Oct 13, 2020

Conversation

jsoriano
Copy link
Member

@jsoriano jsoriano commented Feb 14, 2020

What does this PR do?

Add a closer interface for processors so their resources can be released when the processor is not needed anymore.

Explicitly close publisher pipelines so their processors are closed.

Add closers for add_docker_metadata, add_kubernetes_metadata and add_process_metadata.

Why is it important?

Processors can be defined in dynamic configurations, as in autodiscover templates. Some processors open connections, files, or start goroutines that need to be closed or stopped.

Processors can be dynamically created at least when using autodiscover and when used in the script processor. When this issue appears it can be usually solved by configuration (as described in #14389). So this case is probably not so serious.

Processors can include an on-disk cache since #20775, files open by this cache should be properly closed on shutdown.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works

Author's Checklist

Check that resources are released in:

  • Filebeat
  • Metricbeat
    • Processors at the module level are open multiple times? (One per metricset, to be reviewed as a different issue, only a real problem when a processor cannnot be instantiated twice with the same config, what only happens with add_cloudfoundry_metadata)
  • Autodiscover
  • Config reload
  • Check if there are other different pipelines that use processors in other beats (Everything using libbeat publisher pipeline should be covered, to open new issues if something that is not covered by this assumption should be covered).
  • Processors used in the script processor. Options to solve this:
    • Somehow call close on all the instantiated processors once the script is finished.
    • Avoid registering processors implementing Closer for the script processor.

Add tests:

  • Native processors with closer in script processor.
  • Processing pipelines created by supporters don't close the global processors.

Related issues

Use cases

  • Add processors in dynamic configuration and release its resources when not needed anymore.
  • Use processors that make use of files that should be closed (as cache on disk used by add_cloudfoundry_metadata).

@jsoriano jsoriano added bug [zube]: In Progress Team:Services (Deprecated) Label for the former Integrations-Services team labels Feb 14, 2020
@jsoriano jsoriano self-assigned this Feb 14, 2020
@jsoriano jsoriano force-pushed the autodiscover-processors-leak branch 2 times, most recently from 315a8d6 to faaa97f Compare February 20, 2020 19:30
@jsoriano jsoriano added the discuss Issue needs further discussion. label Feb 21, 2020
@jsoriano jsoriano force-pushed the autodiscover-processors-leak branch 2 times, most recently from b2ca802 to a4860e2 Compare February 21, 2020 14:00
@jsoriano
Copy link
Member Author

@andrewkroh @exekias I would like to have your opinions on this.

Summarizing, I think this issue would only affect add_docker_metadata and add_kubernetes_metadata, as they start watchers that need their own logistics.

The issue can be reproduced by using one of these processors in an autodiscover template, or in a script processor. But actually I cannot think on a use case for this. Metadata is added in any case in autodiscover, and even if needed these processors could be placed in the main static configuration. Regarding scripts, I don't think it is useful at all to call these processors from scripts (but we are exposing them at the moment, and they can be used as in the test case added here).

I started this PR to add closers to the processors that need them, and call these closers from all the places where they are instantiated. The problem I see with this approach:

  • We allow processors in many places, so a lot of things need to be touched.
  • Does it worth it? It only happens with some special processors, and when happens is because they are being used incorrectly. Calling the closers on scripts can be tricky, we would be creating and destroying heavy processors for every event.

I am thinking on continuing adding the closers support in common cases, as it may be also useful for other processors in the future, but adding a check in the script processor so processors that define a closer cannot be registered (this would imply to stop registering these processors for the script processor, what would be a breaking change, but I guess nobody is using them).

An alternative would be to do nothing, as this is something that can be solved with a proper configuration. Though it'd be nice to don't allow the problem to happen.

Opinions?

Copy link
Member

@andrewkroh andrewkroh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its a good idea build in support for Close()ing processors. We have processors holding resources/state that need to be released so we need to release them. The script processor should disallow constructing stateful processors (more on this below).

For the script processor, I would consider it an error to instantiate any processor during the process() invocation. Any processors that are needed should be constructed once at initialization. So I wouldn't try to handle releasing resources that are allocated for every invocation of the script processor in the process() function. But we could release the resources associated with the processors constructed at initialization. We control their constructors so we can track what needs to be Close()ed in a list.

BUT I made an assumption that all processors were stateless (hey, that's how their interface is designed). So the script processor uses a sync.Pool containing javascript VMs to enable concurrent usage of the processor (a VM instance supports only one invocation at a time). Because it uses a sync.Pool, instances can be garbage collected any time they are not being used. This means we don't have a way to invoke Close() on the processors when an instance in GC'ed. You'd have to use something other than sync.Pool to make this all work. So it's easier to just disallow stateful processors from scripts.

libbeat/publisher/processing/processors.go Outdated Show resolved Hide resolved
@jsoriano
Copy link
Member Author

Any processors that are needed should be constructed once at initialization.

Oh, I don't know why I was thinking that the whole script was executed every time the processor is run. Then I think it would be ok to implement a closer for the script processor that closes the processors that need it.

instances can be garbage collected any time they are not being used.

Do you mean that the instance can be garbage collected even when the processor is still in use? Would that mean that if the processor is run again, it will run the initialization again?

@jsoriano
Copy link
Member Author

instances can be garbage collected any time they are not being used.

Do you mean that the instance can be garbage collected even when the processor is still in use? Would that mean that if the processor is run again, it will run the initialization again?

Oh, ok, I understand it now looking at the code. Then I think that I will remove by now stateful processors from the registry and we can try in the future to look for an alternative to sync.Pool.

Thanks!

@andrewkroh
Copy link
Member

I was still thinking about this issue for some reason 😆 and have another consideration that I haven't vetted. This comes into play if you plan on closing the processors associated with publisher clients. It's about this sharing of global processors in publisher pipelines. Anecdotally, I think global processors get shared by each publisher client's processor pipeline. If that is the case then Close()ing the processors from one client could affect another client. The publisher client would need to have two lists of processors (global, local) instead of one list to be able to free the right resources.

@jsoriano
Copy link
Member Author

I was still thinking about this issue for some reason 😆

Thanks! 😂

I think global processors get shared by each publisher client's processor pipeline. If that is the case then Close()ing the processors from one client could affect another client.

Yes, I am afraid that you are right and in some places processors are reused, but I am not sure if only in static configurations, I don't see it happening in autodiscover for example.

I was also taking a look to the code and there are several places where we create processors, I think that supporting stateful processors properly is going to be a bigger refactor than I originally expected. As you mentioned, their interface is designed to be stateless (only New and Run methods), so actually just adding a Closer interface may be tricky as it wouldn't match with such interface. Also it is going to be difficult to control when to call these closers if we are reusing them and using them in contexts where we cannot control their finalization. I was thinking on using some kind of context or done channel that we can propagate, so processors are closed when their creator is closed, but we would need to pass it to New, what would still be a bit counterintuitive in this interface, we would need some kind of Open/Start method.

Another option for refactor would be to re-think the processors with state, maybe they should be a different kind of enrichers. But this would open other different rabbit holes for sure.

I have been talking with @exekias too about this, and we think that we might by now try to address this issue by avoiding the cases where this can be more problematic, and leave further refactors for the future. The solutions to implement by now would be:

  • Stop registering add_docker_metadata and add_kubernetes_metadata for the script processor.
  • Reject add_docker_metadata and add_kubernetes_metadata processors when validating configs in autodiscover. This would be hacky, but would prevent issues like Filebeat too many open files using Docker autodiscover #14389 to happen.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/integrations-platforms (Team:Platforms)

@elasticmachine
Copy link
Collaborator

elasticmachine commented Sep 23, 2020

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: [Pull request #16349 updated]

  • Start Time: 2020-10-13T09:11:34.058+0000

  • Duration: 84 min 56 sec

Test stats 🧪

Test Results
Failed 0
Passed 16325
Skipped 1343
Total 17668

@jsoriano
Copy link
Member Author

jsoriano commented Oct 13, 2020

add_process_metadata also has a cache that needs to be stopped, I have added a closer that is only used when the cache is used. In the script processor the cache is disabled so the processor can still be used there.

@exekias
Copy link
Contributor

exekias commented Oct 13, 2020

Nice work here! LGTM

@jsoriano
Copy link
Member Author

Thanks for the feedback and the reviews!

@jsoriano jsoriano merged commit a3fe796 into elastic:master Oct 13, 2020
@jsoriano jsoriano deleted the autodiscover-processors-leak branch October 13, 2020 11:08
@jsoriano jsoriano removed the needs_backport PR is waiting to be backported to other branches. label Oct 13, 2020
jsoriano added a commit to jsoriano/beats that referenced this pull request Oct 13, 2020
Add a closer interface for processors so their resources can be released
when the processor is not needed anymore.
Explicitly close publisher pipelines so their processors are closed.
Add closers for add_docker_metadata, add_kubernetes_metadata and add_process_metadata.
Script processor will fail if a processor that needs to be closed is used.

(cherry picked from commit a3fe796)
jsoriano added a commit to jsoriano/beats that referenced this pull request Oct 13, 2020
Add a closer interface for processors so their resources can be released
when the processor is not needed anymore.
Explicitly close publisher pipelines so their processors are closed.
Add closers for add_docker_metadata, add_kubernetes_metadata and add_process_metadata.
Script processor will fail if a processor that needs to be closed is used.

(cherry picked from commit a3fe796)
v1v added a commit to v1v/beats that referenced this pull request Oct 13, 2020
* upstream/master: (127 commits)
  Update obs app links (elastic#21682)
  fix: update fleet test suite name (elastic#21738)
  Remove dot from file.extension value in Auditbeat FIM (elastic#21644)
  Fix leaks with metadata processors (elastic#16349)
  Add istiod metricset (elastic#21519)
  [Ingest Manager] Change Sync/Close call order (elastic#21735)
  [Ingest Manager] Syncing unpacked files (elastic#21706)
  Fix concurrent map read and write in socket dataset (elastic#21690)
  Fix conditional coding to remove seccomp info from Winlogbeat (elastic#21652)
  [Elastic Agent] Fix issue where inputs without processors defined would panic (elastic#21628)
  Add configuration of filestream input (elastic#21565)
  libbeat/logp: introduce Logger.WithOptions (elastic#21671)
  Make o365audit input cancellable (elastic#21647)
  fix: remove extra curly brace in script (elastic#21692)
  [Winlogbeat] Remove brittle configuration validation from wineventlog (elastic#21593)
  Fix function that parses from/to/contact headers (elastic#21672)
  [CI] Support Windows-2016 in pipeline 2.0 (elastic#21337)
  Skip publisher flaky tests (elastic#21657)
  backport: add 7.10 branch (elastic#21635)
  [CI: Packaging] fix: push ubi8 images too (elastic#21621)
  ...
jsoriano added a commit that referenced this pull request Oct 13, 2020
Add a closer interface for processors so their resources can be released
when the processor is not needed anymore.
Explicitly close publisher pipelines so their processors are closed.
Add closers for add_docker_metadata, add_kubernetes_metadata and add_process_metadata.
Script processor will fail if a processor that needs to be closed is used.

(cherry picked from commit a3fe796)
jsoriano added a commit that referenced this pull request Oct 13, 2020
Add a closer interface for processors so their resources can be released
when the processor is not needed anymore.
Explicitly close publisher pipelines so their processors are closed.
Add closers for add_docker_metadata, add_kubernetes_metadata and add_process_metadata.
Script processor will fail if a processor that needs to be closed is used.

(cherry picked from commit a3fe796)
v1v added a commit to v1v/beats that referenced this pull request Oct 14, 2020
…dependencies-goals

* upstream/master: (46 commits)
  Use badger code from upstream repository (elastic#21705)
  Disable writes sync in persistent cache (elastic#21754)
  Make API address and Shard ID required in Cloud Foundry settings (elastic#21759)
  [CI] Support skip-ci label (elastic#21377)
  Increase recommended memory when deploying in Cloud foundry (elastic#21755)
  typofix for dns timeout configuration option (elastic#21069)
  chore: create CI artifacts for DEV usage (elastic#21645)
  [Ingest Manager] Atomic installed forgotten changelog elastic#21780
  [Ingest Manager] Agent atomic installer (elastic#21745)
  Add missing configuration annotations (elastic#21736)
  [Filebeat] Add check for context.DeadlineExceeded error (elastic#21732)
  Remove kafka partition ISR from Metricbeat docs (elastic#21709)
  Skip flaky test with oauth2 config in httpjson input (elastic#21749)
  Fix for azure retrieve resource by ids (elastic#21711)
  Update obs app links (elastic#21682)
  fix: update fleet test suite name (elastic#21738)
  Remove dot from file.extension value in Auditbeat FIM (elastic#21644)
  Fix leaks with metadata processors (elastic#16349)
  Add istiod metricset (elastic#21519)
  [Ingest Manager] Change Sync/Close call order (elastic#21735)
  ...
v1v added a commit to v1v/beats that referenced this pull request Oct 14, 2020
…ne-2.0-arm

* upstream/master: (93 commits)
  Fix non-windows fields on system/filesystem (elastic#21758)
  disable TestReceiveEventsAndMetadata/TestSocketCleanup/TestReceiveNewEventsConcurrently in Windows (elastic#21750)
  Use badger code from upstream repository (elastic#21705)
  Disable writes sync in persistent cache (elastic#21754)
  Make API address and Shard ID required in Cloud Foundry settings (elastic#21759)
  [CI] Support skip-ci label (elastic#21377)
  Increase recommended memory when deploying in Cloud foundry (elastic#21755)
  typofix for dns timeout configuration option (elastic#21069)
  chore: create CI artifacts for DEV usage (elastic#21645)
  [Ingest Manager] Atomic installed forgotten changelog elastic#21780
  [Ingest Manager] Agent atomic installer (elastic#21745)
  Add missing configuration annotations (elastic#21736)
  [Filebeat] Add check for context.DeadlineExceeded error (elastic#21732)
  Remove kafka partition ISR from Metricbeat docs (elastic#21709)
  Skip flaky test with oauth2 config in httpjson input (elastic#21749)
  Fix for azure retrieve resource by ids (elastic#21711)
  Update obs app links (elastic#21682)
  fix: update fleet test suite name (elastic#21738)
  Remove dot from file.extension value in Auditbeat FIM (elastic#21644)
  Fix leaks with metadata processors (elastic#16349)
  ...
@jsoriano jsoriano mentioned this pull request Dec 10, 2020
6 tasks
@zube zube bot removed the [zube]: Done label Jan 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug discuss Issue needs further discussion. Team:Platforms Label for the Integrations - Platforms team Team:Services (Deprecated) Label for the former Integrations-Services team v7.10.0 v7.11.0
Projects
None yet
6 participants