-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[agent-smith] account for egress traffic #4677
Conversation
Codecov Report
@@ Coverage Diff @@
## main #4677 +/- ##
=========================================
+ Coverage 0 26.92% +26.92%
=========================================
Files 0 5 +5
Lines 0 739 +739
=========================================
+ Hits 0 199 +199
- Misses 0 518 +518
- Partials 0 22 +22
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had a read through the code, currently deploying, will try then
t := value.(time.Time) | ||
infr, err := agent.checkEgressTrafficCallback(p, t) | ||
if err != nil { | ||
return true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reckon at least some debug logging would be handy here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A log here would be extremely verbose, I can add one with a non logged by default priority.
Keep in mind that this is done for every found process every 30 seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent point. We could add a metric here also, to see how these checks behave, akin to the signature check failures.
Awesome! I was able to provoke agent smith to step in on excessive egress |
@@ -47,6 +51,7 @@ type Smith struct { | |||
|
|||
notifiedInfringements *lru.Cache | |||
perfHandler chan perfHandlerFunc | |||
pidsMap sync.Map |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can probably expose a prometheus metrics to see how this grows, wdyt @csweichel ?
/werft run 👍 started the job as gitpod-build-lf-agent-smith-egress-policing.16 |
0168e35
to
e8d313b
Compare
03094bc
to
7ebfcf5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last nit
4fd86f7
to
0d6dede
Compare
/werft run 👍 started the job as gitpod-build-lf-agent-smith-egress-policing.22 |
0d6dede
to
6be1f3b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome. Thank you 🚀
/werft run 👍 started the job as gitpod-build-lf-agent-smith-egress-policing.24 |
6be1f3b
to
da335ba
Compare
I'd love to do some performance tuning before merging. I noticed that cpu usage is quite high compared to other services
I need to understand if this is because of a resource leak or because of the mole of data we process. And I added a counter of active pids that seems that we are currently processing many of them at once
In the case above |
This is the top of agent smith in the current situation (before this change), there's probably something wrong - the whole pids processing pipeline needs to be improved.
|
/werft run 👍 started the job as gitpod-build-lf-agent-smith-egress-policing.27 |
Made a change so that only supervisor processes are taken in account for checking for egress. This reflects into a performance improvement because we handle 1 pid instead of thousands. This can be easily seen in the metrics
This metric says, one workspace was started |
Resource usage seems also better now, will investigate further optimizations.
|
Just tested against the current main. The cpu usage was not introduced by this PR, main already has it.
Will investigate on main. |
I reckon we can decouple the CPU use of agent smith and this PR, considering how unoptimised the signature checks are. |
Agent Smith found the excessive egress event, but failed to stop my workspace.
|
That seems very strange because I didn't change anything in that code path since your last review @csweichel - is that happening consistently? I just tried it again now and it worked for me, maybe there's something I'm not considering? |
That is most odd indeed. Let's get this in and observe the behaviour in the real world. |
Introducing the first iteration of a mechanism to count excessive egress traffic over a period of time. Based on the original code from @csweichel - the only difference is the way we account for processes using bpf.
This is what happens when a workspace uses too much egress.