jb: java perf monitoring #9796
...nts/ide/jetbrains/backend-plugin/src/main/kotlin/io/gitpod/jetbrains/remote/GitpodManager.kt
I'm testing this PR
How to check if
It is a bit tricky:
Tested and works well. (Do we need the expected result in the "How to test" section?) Should we explain what
/hold
Code looks good
func Top(ctx context.Context) (*api.ResourcesStatusResponse, error) {
	memory, err := resolveMemoryStatus()
	if err != nil {
		return nil, err
	}
	cpu, err := resolveCPUStatus()
	if err != nil {
		return nil, err
	}
	return &api.ResourcesStatusResponse{
		Memory: memory,
		Cpu:    cpu,
	}, nil
}
At the moment the code in ws-daemon is not written to be used by other components. Factoring this out would be equivalent to starting our own cgroup library. Considering that the other options, like using containerd's cgroup, are awkward to use, I still think this is the best way forward. This does not need to be done in this PR, though.
Could you point me to relevant code please? 🙏
There can also be a slight difference: I subtract evictable memory to give the user more useful data, similar to what GCP shows. I am not sure what ws-daemon does.
CPU:
v1: https://github.com/gitpod-io/gitpod/blob/main/components/ws-daemon/pkg/cpulimit/cfs.go
v2: https://github.com/gitpod-io/gitpod/blob/main/components/ws-daemon/pkg/cpulimit/cfs_v2.go
Memory:
v1: https://github.com/gitpod-io/gitpod/blob/main/components/ws-daemon/pkg/cgroup/plugin_cachereclaim.go
v2: n.a.
IO:
v1: https://github.com/gitpod-io/gitpod/blob/main/components/ws-daemon/pkg/cgroup/plugin_iolimit_v1.go
v2: https://github.com/gitpod-io/gitpod/blob/main/components/ws-daemon/pkg/cgroup/plugin_iolimit_v2.go
We should harmonise this with the way ws-daemon does it.
After looking at the code, I don't think it can be harmonised without refactoring. ws-daemon lacks some functions, such as getting used memory without evictable caches; instead it compares page caches after trying to reclaim memory. It also has special cases for different workspace states, which are not applicable inside a workspace.
I'm not sure that refactoring to create a reusable library makes sense right now. The code in ws-daemon is not really a source of truth, i.e. there should not be a situation where you change one and need to change the other.
I also wonder why ws-daemon does not use container runtime endpoints but needs to read from the disk?
Thanks for your feedback! Sorry, I couldn't get what the disk is.
Thanks for your feedback! Sorry, I couldn't get what the disk is.
I mean that the container runtime has an API which can return usage/limit stats for CPU and memory: https://github.com/opencontainers/runc/blob/067aaf8548d78269dcb2c13b856775e27c410f9c/libcontainer/cgroups/stats.go#L158-L161
Supervisor cannot reach this endpoint, so we have to read cgroup data from the filesystem. I am asking whether ws-daemon can rely on it instead and provide this information to supervisor somehow?
The API you have linked would mean executing runc stat and then parsing the output; this is not a service endpoint. We also need to write to the cgroup filesystem, not only read from it, and we should strive to do so in a runtime-agnostic manner. We could ask containerd for this information, but if we allow the option to use any CRI-compatible runtime in the future (like CRI-O), this would require us to adapt the code again.
OK, sorry, I don't really know this area. I thought opencontainers/runc was an abstraction over different runtimes and that using this API would, on the contrary, make things more stable.
@csweichel Would it be alright to continue as suggested here? Could you approve it please if so? 🙏
/werft run
👍 started the job as gitpod-build-ak-java-perf.41
Code looks good, will test it with Prometheus
Feel free to unhold it 👍 @akosyakov
/unhold
Description
See internal doc for results of perf testing.
This PR adds tooling which can be used by us or by end users to control and monitor the resource consumption of workspaces in general and JB backends in particular:
- /.supervisor/supervisor top to monitor resources status based on endpoint
- [product]_XMX, i.e. INTELLIJ_XMX=4096M
- node ./dev/ide/profile-workspace.js which captures avg/max/90th percentile of memory and CPU usage as well as max/used JB backend memory for a single workspace running on VM
Out of scope
Related Issue(s)
Fixes #9521
How to test
Important: only one person can test at a time, so comment below that you are testing and remove the comment when done.
- ./dev/preview/install-k3s-kubeconfig.sh to configure k8s context.
- ./dev/preview/portforward-monitoring-satellite.sh -c harvester to port forward prometheus API endpoint.
- node ./dev/ide/profile-workspace.js to start monitoring resource usage of a workspace pod in preview environments.
- curl http://localhost:22999/_supervisor/v1/status/resources to check resource status of supervisor.
- /.supervisor/supervisor top to see resources status of the current workspace via internal CLI
- watch tail -n 1 /workspace/gitpod/dev/ide/perf.log
Release Notes
Documentation