
Performance test PVC with a cluster saturated with workspaces #12747

Closed
Tracked by #7901
kylos101 opened this issue Sep 7, 2022 · 18 comments

@kylos101 (Contributor) commented Sep 7, 2022

Is your feature request related to a problem? Please describe

We should test how a node behaves when it is full of workspaces using PVCs, where there is disk activity in the workspaces, and many workspaces are starting, running, and stopping.

The number of regular workspaces we decide to run as part of this test should be 125% of the peak volume we saw in our EU cluster over the last month.

Describe the behaviour you'd like

  1. Start a regular workspace in the cluster using PVC on a single node.
  2. Run loadgen once, to fill half of the cluster to capacity.
  3. Run loadgen a second time: begin stopping the first loadgen run and start the second one with ~20 workspaces (a monitoring sketch for these runs follows below).
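
A minimal way to watch saturation during these runs, as a sketch using client-go. The `component=workspace` label selector, the `default` namespace, and the poll interval are assumptions, not necessarily what any given install uses:

```go
// Hypothetical watcher for the saturation test: polls the cluster and counts
// workspace pods per phase so we can see it fill up while loadgen runs.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	for {
		// "component=workspace" and the "default" namespace are assumptions;
		// adjust both to match the cluster under test.
		pods, err := client.CoreV1().Pods("default").List(context.Background(),
			metav1.ListOptions{LabelSelector: "component=workspace"})
		if err != nil {
			panic(err)
		}
		byPhase := map[string]int{}
		for _, p := range pods.Items {
			byPhase[string(p.Status.Phase)]++
		}
		fmt.Printf("%s workspace pods by phase: %v\n", time.Now().Format(time.RFC3339), byPhase)
		time.Sleep(30 * time.Second)
	}
}
```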

Additional context

We haven't run with PVC snapshot or restore at scale yet.

@kylos101 added the aspect: testing and aspect: performance labels Sep 7, 2022
@sagor999 (Contributor) commented:

It worked without issues.
Tested with loadgen with 20 workspaces, then 30, then 100 (the 20- and 30-workspace runs stopping and starting at the same time).

@kylos101 (Contributor, Author) commented:

@sagor999 why 100 workspaces, instead of 125% of our peak volume in the last month? 🤔 I would feel comfortable, even though it will cost money to run the test, knowing that we can handle that volume. This way we're not doing it for the first time in production. 😅

Can you show the start and stop graphs from the overview dashboard, so we can peek at what those shapes look like? For example, I don't expect they'll be much different than now...but am curious to confirm/compare.

Also, can you document in our loadgen readme.md how we're supposed to do related clean-up after testing? This will be good for the future.

Lastly, if you could add this to our project and set status, that would be 👌 , too! 🙇

@sagor999 (Contributor) commented:

@kylos101 I was concerned about the cost. But if you are OK with it, I can run loadgen with 700 workspaces. OK?
[image: workspace startup graph]
Here is the graph of workspace startup. It does look the same as for regular workspaces to me.

@sagor999 sagor999 moved this to In Progress in 🌌 Workspace Team Sep 23, 2022
@sagor999 sagor999 self-assigned this Sep 23, 2022
@kylos101 (Contributor, Author) commented:

👋 @sagor999 thanks for sharing bud, I am okay with the cost 👍 👍 , especially if it finds issues or mitigates risk. Could you do 900? We recently hit 830. 🚀 So long as you can delete snapshots after, we should be 🆗 .

@kylos101 (Contributor, Author) commented Oct 3, 2022

Hi @sagor999, it looks like we sometimes have [an issue](https://github.com//issues/13353), and we're unsure if it's related to load.

For now, I recommend the following:

  1. Close this issue (#12747).
  2. In [PVC] loadgen testing Pod can't mount Volume #13353, run a load test with a large number of workspaces to see whether that stress causes trouble.

@kylos101 (Contributor, Author) commented Oct 5, 2022

As #13353 was resolved, I recommend proceeding with a large-scale load test. 👯

@jenting (Contributor) commented Oct 6, 2022

> As #13353 was resolved, I recommend proceeding with a large-scale load test. 👯

Pavel asked me to run the load test with 1,000 regular workspaces. https://www.notion.so/gitpod/PVC-roll-out-plan-for-SaaS-b2fc4aa9a6304bd283263cdce008911d#d606efcba21a4cb4b4182f530273d154

@jenting jenting assigned jenting and unassigned sagor999 Oct 7, 2022
@kylos101 (Contributor, Author) commented Oct 8, 2022

👍 Makes sense; you might need to increase the quota beforehand. Please submit a request a few days prior to the test. @vulkoingim shared some advice on how to do this; if you have any trouble or questions, let us know.

@vulkoingim (Contributor) commented:

> 👍 Makes sense; you might need to increase the quota beforehand. Please submit a request a few days prior to the test. @vulkoingim shared some advice on how to do this; if you have any trouble or questions, let us know.

I requested an increase last Thursday and contacted our account manager to help push the case. I'll keep you posted with any news.

@jenting (Contributor) commented Oct 12, 2022

Leaving a note here.

The 1k-workspace run took over an hour, so we had to increase the workspace timeout from 1 hour to 3 hours before running the loadgen test. Otherwise, by the time loadgen reached 1k workspaces, the number of running workspaces would already have dropped below 1k.

@jenting (Contributor) commented Oct 12, 2022

Grafana dashboard

We needed to increase the workspace timeout from 1 hour to 3 hours before running the loadgen test.
[images: workspace timeout dashboards]

The workspace startup time is under 8.53 minutes.
[images: workspace startup time dashboards]

Workspace failures per second:
[image: workspace failures per second]

@jenting (Contributor) commented Oct 13, 2022

Changing the loadgen custom timeout value to 180m (3 hours) when starting the workspaces does not work; the workspaces still time out after 60 minutes 🤔. We probably need to change the server's default timeout value as well.

As an alternative, I created multiple loadgen runs to keep up the stress until we reached 1k workspaces.

Below is the overall result; in general, it looks good to me, except that some workspace failures per second still occurred. I think that's because the workspace disposal process still depends on ws-daemon even though we use PVC. cc @sagor999

Grafana dashboard

[images: overall load-test dashboards]

@sagor999 (Contributor) commented:

Yeah, it does depend on ws-daemon, since ws-daemon tracks workspace state in a JSON file. So during dispose we still have to tell the daemon to clean up that state file on its end (something we can get rid of in wsman mk2).
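
A minimal sketch of that dependency, with hypothetical stand-in types (none of these are the real ws-manager/ws-daemon APIs): even on the PVC path, where the backup is a volume snapshot rather than a daemon-produced tarball, disposal still calls ws-daemon to drop its state file.

```go
// Hypothetical shape of the disposal dependency being described.
package main

import (
	"context"
	"fmt"
)

// Daemon stands in for the ws-daemon RPC surface.
type Daemon interface {
	DisposeWorkspace(ctx context.Context, instanceID string) error
}

// Snapshotter stands in for the CSI VolumeSnapshot step used with PVC.
type Snapshotter interface {
	TakeSnapshot(ctx context.Context, pvcName string) error
}

func disposeWithPVC(ctx context.Context, d Daemon, s Snapshotter, instanceID, pvcName string) error {
	// With PVC, the backup is a volume snapshot, not a daemon-produced tarball...
	if err := s.TakeSnapshot(ctx, pvcName); err != nil {
		return fmt.Errorf("snapshot: %w", err)
	}
	// ...but we still have to ask ws-daemon to delete its JSON state file,
	// which is where "can't find the workspace" failures surface.
	if err := d.DisposeWorkspace(ctx, instanceID); err != nil {
		return fmt.Errorf("daemon cleanup: %w", err)
	}
	return nil
}

func main() { fmt.Println("sketch only; see disposeWithPVC above") }
```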

Have you looked at the reason for failures?

@jenting (Contributor) commented Oct 14, 2022

> Have you looked at the reason for failures?

@sagor999

Yes, ws-daemon reports that it can't find the workspace. The workspace pod is stuck terminating forever, the PVC is bound, and the VolumeSnapshot is ready to use.

I am wondering: should we always take PVC snapshots, even if ws-daemon reports it can't find the workspace?
I made a commit with that change, but I am not sure it's the right way to go. Can you please review this commit? A sketch of the idea follows.
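
This is a hypothetical sketch of the commit's idea, not the real Gitpod code: on the PVC path, a "workspace does not exist" answer from ws-daemon should not block the snapshot. The sentinel error and function names are stand-ins.

```go
// Sketch: tolerate ws-daemon's not-found error when the backup is a PVC snapshot.
package main

import (
	"errors"
	"fmt"
)

// Stand-in for the error ws-daemon returns when its state file is gone.
var errWorkspaceNotFound = errors.New("workspace does not exist")

// disposeErrIsFatal reports whether a daemon error should abort disposal.
func disposeErrIsFatal(usesPVC bool, err error) bool {
	if err == nil {
		return false
	}
	// With PVC the backup no longer comes from ws-daemon, so a missing
	// daemon-side entry is safe to ignore and the snapshot proceeds.
	if usesPVC && errors.Is(err, errWorkspaceNotFound) {
		return false
	}
	return true
}

func main() {
	fmt.Println(disposeErrIsFatal(true, errWorkspaceNotFound))  // false: snapshot anyway
	fmt.Println(disposeErrIsFatal(false, errWorkspaceNotFound)) // true: classic backup needs the daemon
}
```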

@sagor999 (Contributor) commented:

Then it is most probably related to ws-daemon bookkeeping or something similar. At the very least it does not seem related to the PVC code, which is good. :)

Re: the commit. That does look good, I think. You are right that there is no need to wait for the daemon when using PVC. The only edge case I can think of right now: if the workspace never got ready (WaitForInit waits for the ready state), we would still create a volume snapshot and could potentially mess up the backup of that workspace. 🤔

@jenting (Contributor) commented Oct 14, 2022

> The only edge case I can think of right now: if the workspace never got ready (WaitForInit waits for the ready state), we would still create a volume snapshot and could potentially mess up the backup of that workspace. 🤔

I think we could simplify the flow by checking workspaceNeverReadyAnnotation during the workspace disposal process, because the workspaceNeverReadyAnnotation annotation is removed once workspace initialization succeeds.

Then, call finalizeWorkspaceContent only when the workspaceNeverReadyAnnotation annotation does not exist. That way, we could get rid of some code. Reference commit. A sketch of this check follows.
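
A sketch of the proposed check: `finalizeWorkspaceContent` and `workspaceNeverReadyAnnotation` are named in the thread, but the annotation key string, struct, and wiring here are stand-ins, not the real ws-manager code.

```go
// Sketch: only finalize content for workspaces that became ready at least once.
package main

import "fmt"

const workspaceNeverReadyAnnotation = "gitpod.io/neverReady" // assumed key name

type workspace struct {
	annotations map[string]string
}

func (ws *workspace) finalizeWorkspaceContent() {
	fmt.Println("taking volume snapshot / finalizing content")
}

// maybeFinalize skips never-ready workspaces: the annotation is removed when
// initialization succeeds, so its presence means there is no content worth
// backing up (and snapshotting could even corrupt the backup, per the edge
// case quoted above).
func maybeFinalize(ws *workspace) {
	if _, neverReady := ws.annotations[workspaceNeverReadyAnnotation]; neverReady {
		return
	}
	ws.finalizeWorkspaceContent()
}

func main() {
	maybeFinalize(&workspace{annotations: map[string]string{workspaceNeverReadyAnnotation: "true"}}) // skipped
	maybeFinalize(&workspace{annotations: map[string]string{}})                                      // finalized
}
```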

@kylos101 (Contributor, Author) commented:

👋 @jenting what is left for this issue? Or can we consider the stress testing for a saturated cluster done? I ask because I see you created #13856, but I'm not sure if there are other issues.

@jenting (Contributor) commented Oct 15, 2022

No other issues, let's close it.

@jenting jenting closed this as completed Oct 15, 2022
Repository owner moved this from In Progress to Awaiting Deployment in 🌌 Workspace Team Oct 15, 2022
@jenting jenting moved this from Awaiting Deployment to Done in 🌌 Workspace Team Oct 15, 2022