Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jb: java perf monitoring #9796

Merged
merged 3 commits into from
May 18, 2022
Merged

jb: java perf monitoring #9796

merged 3 commits into from
May 18, 2022

Conversation

akosyakov
Copy link
Member

@akosyakov akosyakov commented May 5, 2022

Description

See internal doc for results of perf testing.

This PR adds tooling which can be used by us or end users to control and monitor resources consumptions of workspaces generally and particularly JB backends:

  • adds a supervisor endpoint to fetch resources status (memory and cpu) with respect of cgroup v1
  • provide internal CLI command /.supervisor/supervisor top to monitor resources status based on endpoint
  • contributes env var for each JB product to control max heap size (XMX) via [product]_XMX, i.e. INTELLIJ_XMX=4096M
  • monitor JB backends max/used memory in bytes and report it to prometheus for each workspace pod
  • contributes a dev time script node ./dev/ide/profile-workspace.js which capture avg/max/90% percentile of memory and CPU usage as well as max/used JB backend memory for a single workspace running on VM

Out of scope

Related Issue(s)

Fixes #9521

How to test

Important only one person can test at the time. So comment below that you are testing and when done remove a comment.

Release Notes

NONE

Documentation

  • /werft with-vm
  • /werft without-vm
  • /werft with-clean-slate-deployment

@akosyakov akosyakov force-pushed the ak/java_perf branch 5 times, most recently from 3f08535 to cba707d Compare May 10, 2022 13:00
@akosyakov akosyakov changed the title java perf testing java perf testing tooling May 12, 2022
@akosyakov akosyakov changed the title java perf testing tooling jb: java perf testing tooling May 12, 2022
@akosyakov akosyakov changed the title jb: java perf testing tooling jb: java perf monitoring tooling May 12, 2022
@akosyakov akosyakov changed the title jb: java perf monitoring tooling jb: java perf monitoring May 12, 2022
@akosyakov akosyakov force-pushed the ak/java_perf branch 2 times, most recently from 9469cea to 110b544 Compare May 13, 2022 09:09
@akosyakov akosyakov marked this pull request as ready for review May 13, 2022 09:13
@akosyakov akosyakov requested a review from a team May 13, 2022 09:13
@akosyakov akosyakov requested a review from csweichel as a code owner May 13, 2022 09:13
@mustard-mh
Copy link
Contributor

mustard-mh commented May 13, 2022

Follow up How to test section, works well for me need further test

curl top-cli watch-log
curl top-cli watch-log

@mustard-mh
Copy link
Contributor

I'm testing this PR

@mustard-mh
Copy link
Contributor

mustard-mh commented May 13, 2022

How to check if xmx=4096m works? @akosyakov

watch tail -n 2 /tmp/jb_mem_profile.log I thought this should be 4096m, but not related.

Every 2.0s: tail -n 2 /tmp/jb_mem_profile.log                  eclipse-xtextcore-umw6kzpgi0x: Fri May 13 10:28:03 2022

allocated - current 1245M, avg 1235M, max 1245M, 90% percentile 1245M
used - current 984M,avg 766M, max 1128M, 90% percentile 996M

@akosyakov
Copy link
Member Author

How to check if xmx=4096m works? @akosyakov

It is a bit tricky:

  • in performance indicator click ... and then select main window. It will render UI of backend in another window.
  • In main window press double shift and then enable Show Memory Usage toggle
  • Check max heap size in the status bar of this window.

@mustard-mh
Copy link
Contributor

mustard-mh commented May 13, 2022

Tested and works well (Do we need expected (related) result in how to test section?), should we explain what /workspace/gitpod/dev/ide/perf.log and /tmp/jb_mem_profile.log use for?

Copy link
Contributor

@mustard-mh mustard-mh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/hold

Code looks good

Comment on lines +21 to +34
func Top(ctx context.Context) (*api.ResourcesStatusResponse, error) {
memory, err := resolveMemoryStatus()
if err != nil {
return nil, err
}
cpu, err := resolveCPUStatus()
if err != nil {
return nil, err
}
return &api.ResourcesStatusResponse{
Memory: memory,
Cpu: cpu,
}, nil
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should harmonise this with the way ws-daemon does it.
Either copy over some of that code, or stick it in something like common-go/cgroup.

@Furisto @utam0k could you guide here how to proceed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment the code in ws-daemon is not written to be used by other components. Factoring this out would be equivalent to starting our own cgroup library. Considering that the other options like using containerd's cgroup are awkward to use I still think this is the best way forward. This does not need to be done in this PR though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you point me to relevant code please? 🙏

There also can be slight difference, i.e. I subtract evict-able memory to give a user more usefull data, i.e. simlar to what GCP showing. I am not sure what ws-daemon does.

Copy link
Member

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should harmonise this with the way ws-daemon does it.

After looking at the code I don't think it can be harmonised without refactoring, ws daemon does not have some functions like getting used memory without evictable caches. Instead it compare page caches after trying to reclaim memory. It also has special cases for different state of workspaces which is not applicable inside a workspace.

I'm not sure refactoring will make such sense trying to create a reusable library right now. Code in ws-daemon is not really source of truth, i.e. there should not be a situation when you change one and need to change another.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder also why ws-daemon does not use container runtime endpoints but need to read from the disk?

Thanks for your feedback! Sorry, I couldn't get what the disk is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your feedback! Sorry, I couldn't get what the disk is.

I mean container runtime has such API which can return usage/limit stats for cpu and memory: https://github.com/opencontainers/runc/blob/067aaf8548d78269dcb2c13b856775e27c410f9c/libcontainer/cgroups/stats.go#L158-L161

Supervisor cannot reach this endpoint, so we have to read cgroup from filesystem. I am asking whether ws-daemon can rely on it instead and provide this information to supervisor somehow?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API you have linked would mean to execute runc stat and then parse the output, this is not a service endpoint. We also need to write to the cgroup filesystem not only read from it and we should strive to do so in a runtime agnostic manner. We could ask this information from containerd but if we allow the option to use any CRI compatible runtime in the future (like CRI-O) this would require us to adapt the code again.

Copy link
Member Author

@akosyakov akosyakov May 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, sorry i don't really know it and thought opencontainers/runc is an abstraction over different runtimes and using this API oppositely will make things more stable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@csweichel Would it be alright to continue as suggested here? Could you approve it please if so? 🙏

@akosyakov
Copy link
Member Author

akosyakov commented May 17, 2022

/werft run

👍 started the job as gitpod-build-ak-java-perf.41
(with .werft/ from main)

@akosyakov akosyakov marked this pull request as ready for review May 17, 2022 13:53
@mustard-mh
Copy link
Contributor

Code looks good, will test it with Prometheus

@mustard-mh mustard-mh self-requested a review May 17, 2022 14:16
Copy link
Contributor

@mustard-mh mustard-mh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested good

used max
2022-05-17_22 29 20 2022-05-17_22 29 06

Another way to check xmx=4096m in thin client (related #9796 (comment)

Img
image

@mustard-mh
Copy link
Contributor

mustard-mh commented May 17, 2022

Feel free to unhold it 👍 @akosyakov

@akosyakov
Copy link
Member Author

/unhold

@roboquat roboquat merged commit faa6b30 into main May 18, 2022
@roboquat roboquat deleted the ak/java_perf branch May 18, 2022 07:36
@roboquat roboquat added deployed: IDE IDE change is running in production deployed Change is completely running in production labels May 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deployed: IDE IDE change is running in production deployed Change is completely running in production release-note-none size/XXL team: IDE
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expose perf details for the current workspace
6 participants