Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Auditbeat] Memory leak in 7.5.2 #16879

Closed
wants to merge 19 commits into from

Conversation

andrewstucki
Copy link

@andrewstucki andrewstucki commented Mar 6, 2020

What does this PR do?

  1. Adds cache timeout policies for sockets, processes, and the short-term thread cache for tracing kprobe call/return pairs.
  2. Refactors a lot of the auditbeat code so it's a bit more ergonomic.

Why is it important?

Without the timeout-based cache eviction, we're leaking memory.

Checklist

  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works

Related issues

Original Issue at: https://discuss.elastic.co/t/auditbeat-memory-leak-in-7-5-2/218335

@elasticmachine
Copy link
Collaborator

Pinging @elastic/siem (Team:SIEM)

@andrewstucki andrewstucki changed the title Socket timeouts [Auditbeat] Memory leak in 7.5.2 Mar 6, 2020
Copy link
Contributor

@adriansr adriansr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, but there's some issues.

I'm also worried about process expiration. Expiring a socket or flow is not an issue because further traffic will just re-create them. The worst that can happen is to have a network flow split in two output events.

But with processes, once they expire they won't be recreated and we lose the ability to enrich flows with process information. So for example any process that doesn't perform network i/o for 30s will not be detected.

Given that PIDs, unlike socket pointers, are reused, if they're not expired the cache will max out at whatever the system limit is (/proc/sys/kernel/pid_max seems to default at 32768). So I think maybe it's safe to not expire processes, WDYT?

x-pack/auditbeat/module/system/socket/state/state.go Outdated Show resolved Hide resolved
}

func NewState(r mb.PushReporterV2, log helper.Logger, processTimeout, inactiveTimeout, closeTimeout, clockMaxDrift time.Duration) *State {
s := makeState(r, log, processTimeout, inactiveTimeout*2, inactiveTimeout, closeTimeout, clockMaxDrift)
Copy link
Contributor

@adriansr adriansr Mar 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is setting the socket timeout to 2*inactiveTimeout, which is config.FlowInactiveTimeout, but at config.go says that InactiveTimeout is used for both sockets and processes.

"github.com/elastic/beats/v7/x-pack/auditbeat/tracing"
)

type Endpoint struct {
Copy link
Contributor

@adriansr adriansr Mar 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All public declarations should have a comment starting with the type/function/variable name:

// Endpoint blah blah.

We have a bot that reviews all this automatically, seems that it's been broken for some weeks now.


func (e *UDPQueueRcvSkbCall) Flow() *common.Flow {
var remote *common.Endpoint
if valid := validIPv4Headers(e.IPHdr, e.UDPHdr, e.Packet[:]); valid {
Copy link
Contributor

@adriansr adriansr Mar 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that during the refactor the logic in here was messed up. The second call to validIPv4Headers below should only happen if the first call returns false. It's a workaround for the case where e.IPHdr/e.UDPHdr are (the lowest 16bit) of pointers instead of indices inside e.Packet.

@adriansr
Copy link
Contributor

Also, it needs an entry in CHANGELOG.next.asciidoc and a rebase to pass CI.

@andrewstucki
Copy link
Author

rebased from #17500

@botelastic
Copy link

botelastic bot commented Jul 17, 2020

Hi!
We just realized that we haven't looked into this PR in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it in as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1.
Thank you for your contribution!

@botelastic botelastic bot added the Stalled label Jul 17, 2020
@botelastic
Copy link

botelastic bot commented Aug 16, 2020

Hi!
This PR has been stale for a while and we're going to close it as part of our cleanup procedure.
We appreciate your contribution and would like to apologize if we have not been able to review it, due to the current heavy load of the team.
Feel free to re-open this PR if you think it should stay open and is worth rebasing.
Thank you for your contribution!

@botelastic botelastic bot closed this Aug 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Auditbeat] Memory leak in 7.5.2
3 participants