
Performance test PVC with a cluster saturated with workspaces #12747

Closed
Tracked by #7901
kylos101 opened this issue Sep 7, 2022 · 18 comments

@kylos101 (Contributor) commented Sep 7, 2022

Is your feature request related to a problem? Please describe

We should test how a node behaves when it is full of workspaces using PVCs, where there is disk activity in the workspaces, and many workspaces are starting, running, and stopping.

The number of regular workspaces we decide to run as part of this test should be 125% of the peak volume we saw in our EU cluster over the last month.

Describe the behaviour you'd like

  1. Start a regular workspace in the cluster using PVC on a single node.
  2. Run loadgen once, to fill half of the cluster to capacity.
  3. Run loadgen a second time: begin stopping the first loadgen run and start the second one with ~20 workspaces (a monitoring sketch for these runs follows below).
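
A minimal way to watch saturation during these runs, as a sketch using client-go. The `component=workspace` label selector, the `default` namespace, and the poll interval are assumptions, not necessarily what any given install uses:

```go
// Hypothetical watcher for the saturation test: polls the cluster and counts
// workspace pods per phase so we can see it fill up while loadgen runs.
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	for {
		// "component=workspace" and the "default" namespace are assumptions;
		// adjust both to match the cluster under test.
		pods, err := client.CoreV1().Pods("default").List(context.Background(),
			metav1.ListOptions{LabelSelector: "component=workspace"})
		if err != nil {
			panic(err)
		}
		byPhase := map[string]int{}
		for _, p := range pods.Items {
			byPhase[string(p.Status.Phase)]++
		}
		fmt.Printf("%s workspace pods by phase: %v\n", time.Now().Format(time.RFC3339), byPhase)
		time.Sleep(30 * time.Second)
	}
}
```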

Additional context

We haven't run with PVC snapshot or restore at scale yet.

@kylos101 added the aspect: testing and aspect: performance labels Sep 7, 2022
@sagor999 (Contributor) commented:

It worked without issues.
Tested with loadgen with 20 workspaces, then 30, then 100 (the 20- and 30-workspace runs stopping and starting at the same time).

@kylos101 (Contributor, Author) commented:

@sagor999 why 100 workspaces, instead of 125% of our peak volume in the last month? 🤔 I would feel comfortable, even though it will cost money to run the test, knowing that we can handle that volume. This way we're not doing it for the first time in production. 😅

Can you show the start and stop graphs from the overview dashboard, so we can peek at what those shapes look like? For example, I don't expect they'll be much different than now...but am curious to confirm/compare.

Also, can you document in our loadgen readme.md how we're supposed to do related clean-up after testing? This will be good for the future.

Lastly, if you could add this to our project and set status, that would be 👌 , too! 🙇

@sagor999 (Contributor) commented:

@kylos101 I was concerned about the cost. But if you are OK with it, I can run loadgen with 700 workspaces. OK?
[image: workspace startup graph]
Here is the graph of workspace startup. It does look the same as for regular workspaces to me.

@sagor999 sagor999 moved this to In Progress in 🌌 Workspace Team Sep 23, 2022
@sagor999 sagor999 self-assigned this Sep 23, 2022
@kylos101 (Contributor, Author) commented:

👋 @sagor999 thanks for sharing bud, I am okay with the cost 👍 👍 , especially if it finds issues or mitigates risk. Could you do 900? We recently hit 830. 🚀 So long as you can delete snapshots after, we should be 🆗 .

@kylos101 (Contributor, Author) commented Oct 3, 2022

Hi @sagor999, it looks like we sometimes have [an issue](https://github.com//issues/13353), and we're unsure if it's related to load.

For now, I recommend the following:

  1. Close this issue (#12747).
  2. In [PVC] loadgen testing Pod can't mount Volume #13353, run a load test with a large number of workspaces to see whether that stress causes trouble.

@kylos101 (Contributor, Author) commented Oct 5, 2022

As #13353 was resolved, I recommend proceeding with a large-scale load test. 👯

@jenting (Contributor) commented Oct 6, 2022

> As #13353 was resolved, I recommend proceeding with a large-scale load test. 👯

Pavel asked me to run the load test with 1,000 regular workspaces. https://www.notion.so/gitpod/PVC-roll-out-plan-for-SaaS-b2fc4aa9a6304bd283263cdce008911d#d606efcba21a4cb4b4182f530273d154

@jenting jenting assigned jenting and unassigned sagor999 Oct 7, 2022
@kylos101 (Contributor, Author) commented Oct 8, 2022

👍 Makes sense; you might need to increase the quota beforehand. Please submit a request a few days prior to the test. @vulkoingim shared some advice on how to do this; if you have any trouble or questions, let us know.

@vulkoingim (Contributor) commented:

> 👍 Makes sense; you might need to increase the quota beforehand. Please submit a request a few days prior to the test. @vulkoingim shared some advice on how to do this; if you have any trouble or questions, let us know.

I requested an increase last Thursday and contacted our account manager to help push the case. I'll keep you posted with any news.

@jenting (Contributor) commented Oct 12, 2022

Leaving a note here.

The 1k-workspace run took over an hour, so we had to increase the workspace timeout from 1 hour to 3 hours before running the loadgen test. Otherwise, by the time loadgen reached 1k workspaces, the number of running workspaces would already have dropped below 1k.

@jenting (Contributor) commented Oct 12, 2022

Grafana dashboard

We needed to increase the workspace timeout from 1 hour to 3 hours before running the loadgen test.
[images: workspace timeout dashboards]

The workspace startup time is under 8.53 minutes.
[images: workspace startup time dashboards]

Workspace failures per second:
[image: workspace failures per second]

@jenting (Contributor) commented Oct 13, 2022

Changing the loadgen custom timeout value to 180m (3 hours) when starting the workspaces does not work; the workspaces still time out after 60 minutes 🤔. We probably need to change the server's default timeout value as well.

As an alternative, I created multiple loadgen runs to keep up the stress until we reached 1k workspaces.

Below is the overall result; in general, it looks good to me, except that some workspace failures per second still occurred. I think that's because the workspace disposal process still depends on ws-daemon even though we use PVC. cc @sagor999

Grafana dashboard

[images: overall load-test dashboards]

@sagor999 (Contributor) commented:

Yeah, it does depend on ws-daemon, since ws-daemon tracks workspace state in a JSON file. So during dispose we still have to tell the daemon to clean up that state file on its end (something we can get rid of in wsman mk2).
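
A minimal sketch of that dependency, with hypothetical stand-in types (none of these are the real ws-manager/ws-daemon APIs): even on the PVC path, where the backup is a volume snapshot rather than a daemon-produced tarball, disposal still calls ws-daemon to drop its state file.

```go
// Hypothetical shape of the disposal dependency being described.
package main

import (
	"context"
	"fmt"
)

// Daemon stands in for the ws-daemon RPC surface.
type Daemon interface {
	DisposeWorkspace(ctx context.Context, instanceID string) error
}

// Snapshotter stands in for the CSI VolumeSnapshot step used with PVC.
type Snapshotter interface {
	TakeSnapshot(ctx context.Context, pvcName string) error
}

func disposeWithPVC(ctx context.Context, d Daemon, s Snapshotter, instanceID, pvcName string) error {
	// With PVC, the backup is a volume snapshot, not a daemon-produced tarball...
	if err := s.TakeSnapshot(ctx, pvcName); err != nil {
		return fmt.Errorf("snapshot: %w", err)
	}
	// ...but we still have to ask ws-daemon to delete its JSON state file,
	// which is where "can't find the workspace" failures surface.
	if err := d.DisposeWorkspace(ctx, instanceID); err != nil {
		return fmt.Errorf("daemon cleanup: %w", err)
	}
	return nil
}

func main() { fmt.Println("sketch only; see disposeWithPVC above") }
```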

Have you looked at the reason for failures?

@jenting (Contributor) commented Oct 14, 2022

> Have you looked at the reason for failures?

@sagor999

Yes, ws-daemon reports that it can't find the workspace. The workspace pod is stuck terminating forever, the PVC is bound, and the VolumeSnapshot is ready to use.

I am wondering: should we always take PVC snapshots, even if ws-daemon reports it can't find the workspace?
I made a commit with that change, but I am not sure it's the right way to go. Can you please review this commit? A sketch of the idea follows.
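
This is a hypothetical sketch of the commit's idea, not the real Gitpod code: on the PVC path, a "workspace does not exist" answer from ws-daemon should not block the snapshot. The sentinel error and function names are stand-ins.

```go
// Sketch: tolerate ws-daemon's not-found error when the backup is a PVC snapshot.
package main

import (
	"errors"
	"fmt"
)

// Stand-in for the error ws-daemon returns when its state file is gone.
var errWorkspaceNotFound = errors.New("workspace does not exist")

// disposeErrIsFatal reports whether a daemon error should abort disposal.
func disposeErrIsFatal(usesPVC bool, err error) bool {
	if err == nil {
		return false
	}
	// With PVC the backup no longer comes from ws-daemon, so a missing
	// daemon-side entry is safe to ignore and the snapshot proceeds.
	if usesPVC && errors.Is(err, errWorkspaceNotFound) {
		return false
	}
	return true
}

func main() {
	fmt.Println(disposeErrIsFatal(true, errWorkspaceNotFound))  // false: snapshot anyway
	fmt.Println(disposeErrIsFatal(false, errWorkspaceNotFound)) // true: classic backup needs the daemon
}
```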

@sagor999 (Contributor) commented:

Then it is most probably related to ws-daemon bookkeeping or something similar. At the very least it does not seem related to the PVC code, which is good. :)

Re: the commit. That does look good, I think. You are right that there is no need to wait for the daemon when using PVC. The only edge case I can think of right now: if the workspace never got ready (WaitForInit waits for the ready state), we would still create a volume snapshot and could potentially mess up the backup of that workspace. 🤔

@jenting (Contributor) commented Oct 14, 2022

> The only edge case I can think of right now: if the workspace never got ready (WaitForInit waits for the ready state), we would still create a volume snapshot and could potentially mess up the backup of that workspace. 🤔

I think we could simplify the flow by checking workspaceNeverReadyAnnotation during the workspace disposal process, because the workspaceNeverReadyAnnotation annotation is removed once workspace initialization succeeds.

Then, call finalizeWorkspaceContent only when the workspaceNeverReadyAnnotation annotation does not exist. That way, we could get rid of some code. Reference commit. A sketch of this check follows.
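
A sketch of the proposed check: `finalizeWorkspaceContent` and `workspaceNeverReadyAnnotation` are named in the thread, but the annotation key string, struct, and wiring here are stand-ins, not the real ws-manager code.

```go
// Sketch: only finalize content for workspaces that became ready at least once.
package main

import "fmt"

const workspaceNeverReadyAnnotation = "gitpod.io/neverReady" // assumed key name

type workspace struct {
	annotations map[string]string
}

func (ws *workspace) finalizeWorkspaceContent() {
	fmt.Println("taking volume snapshot / finalizing content")
}

// maybeFinalize skips never-ready workspaces: the annotation is removed when
// initialization succeeds, so its presence means there is no content worth
// backing up (and snapshotting could even corrupt the backup, per the edge
// case quoted above).
func maybeFinalize(ws *workspace) {
	if _, neverReady := ws.annotations[workspaceNeverReadyAnnotation]; neverReady {
		return
	}
	ws.finalizeWorkspaceContent()
}

func main() {
	maybeFinalize(&workspace{annotations: map[string]string{workspaceNeverReadyAnnotation: "true"}}) // skipped
	maybeFinalize(&workspace{annotations: map[string]string{}})                                      // finalized
}
```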

@kylos101 (Contributor, Author) commented:

👋 @jenting what is left for this issue? Or can we consider the stress testing for a saturated cluster done? I ask because I see you created #13856, but I'm not sure if there are other issues.

@jenting (Contributor) commented Oct 15, 2022

No other issues, let's close it.

@jenting jenting closed this as completed Oct 15, 2022
Repository owner moved this from In Progress to Awaiting Deployment in 🌌 Workspace Team Oct 15, 2022
@jenting jenting moved this from Awaiting Deployment to Done in 🌌 Workspace Team Oct 15, 2022