Performance test PVC with a cluster saturated with workspaces #12747
Comments
It worked without issues.
@sagor999 why 100 workspaces, instead of 125% of our peak volume in the last month? 🤔 Even though it will cost to run the test, I would feel comfortable knowing we can handle that volume, so that we're not doing it for the first time in production. 😅 Can you show the start and stop graphs from the overview dashboard, so we can peek at what those shapes look like? For example, I don't expect they'll be much different than now...but am curious to confirm/compare. Also, can you document in our loadgen readme.md how we're supposed to do the related clean-up after testing? This will be good for the future. Lastly, if you could add this to our project and set its status, that would be 👌 , too! 🙇
@kylos101 was concerned about the cost. But if you are ok with it, I can run loadgen with 700 workspaces, ok?
👋 @sagor999 thanks for sharing bud, I am okay with the cost 👍 👍 , especially if it finds issues or mitigates risk. Could you do 900? We recently hit 830. 🚀 So long as you can delete the snapshots afterward, we should be 🆗 .
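For the post-test clean-up mentioned above, here is a minimal sketch of what deleting the leftover snapshots could look like using the external-snapshotter client. The `loadgen=true` label selector and the `default` namespace are assumptions for illustration, not the labels loadgen actually applies:

```go
// Sketch: delete VolumeSnapshots left behind by a loadgen run.
// Assumes loadgen-created objects carry a "loadgen=true" label
// (hypothetical; check the actual labels before running this).
package main

import (
	"context"
	"log"

	snapclient "github.com/kubernetes-csi/external-snapshotter/client/v6/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := snapclient.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	// List snapshots in the workspace namespace that match the (assumed) loadgen label.
	snaps, err := client.SnapshotV1().VolumeSnapshots("default").
		List(ctx, metav1.ListOptions{LabelSelector: "loadgen=true"})
	if err != nil {
		log.Fatal(err)
	}
	for _, s := range snaps.Items {
		log.Printf("deleting VolumeSnapshot %s", s.Name)
		if err := client.SnapshotV1().VolumeSnapshots(s.Namespace).
			Delete(ctx, s.Name, metav1.DeleteOptions{}); err != nil {
			log.Printf("failed to delete %s: %v", s.Name, err)
		}
	}
}
```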
For now, I recommend the following:
As #13353 was resolved, I recommend proceeding with a large-scale load test. 👯
Pavel asked me to load test with 1000 regular workspaces. https://www.notion.so/gitpod/PVC-roll-out-plan-for-SaaS-b2fc4aa9a6304bd283263cdce008911d#d606efcba21a4cb4b4182f530273d154
👍 makes sense, you might need to increase the quota beforehand. Please submit a request a few days in advance. @vulkoingim shared some advice on how to do this; if you have any trouble or questions, let us know.
I requested an increase last Thursday, and have contacted our account manager to help push the case - I'll keep you posted with any news.
Leaving a note here. Running the 1k workspace test took over 1 hour, so we had to increase the workspace timeout from 1 hour to 3 hours before running the loadgen test. Otherwise, by the time loadgen reaches 1k workspaces, the earliest workspaces have already timed out and the number of running workspaces ends up below 1k.
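To make the reasoning concrete, a back-of-the-envelope sketch follows; the 4-second average start interval is an assumed figure for illustration, not measured from this test:

```go
// Rough check: does the ramp-up outlast the workspace timeout?
// startInterval is an assumed average gap between workspace starts.
package main

import (
	"fmt"
	"time"
)

func main() {
	const target = 1000
	startInterval := 4 * time.Second // assumption for illustration
	timeout := 60 * time.Minute      // default workspace timeout

	rampUp := time.Duration(target) * startInterval
	fmt.Printf("ramp-up: %v, timeout: %v\n", rampUp, timeout)
	if rampUp > timeout {
		// The first workspaces time out before the last ones start,
		// so the cluster never holds the full target concurrently.
		fmt.Println("increase the timeout (e.g. to 3h) or split into multiple loadgen runs")
	}
}
```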
Changing the loadgen custom timeout value to 180m (3 hours) when starting workspaces does not work: the workspaces still time out after 60 minutes 🤔. Probably we need to change the server default timeout value as well. As an alternative, I created multiple loadgen runs for the stress test until we reached 1k workspaces. Below is the overall result, and in general it looks good to me.
Yeah, it does depend on ws-daemon, since ws-daemon tracks workspace state in a JSON file. So during dispose we still have to tell the daemon to clean up that state file on its end (something we can get rid of in wsman mk2). Have you looked at the reason for the failures?
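As a rough illustration of the bookkeeping pattern described here, a hypothetical sketch is below; this is not the actual ws-daemon code, and the file layout and field names are invented:

```go
// Hypothetical sketch of per-workspace state kept in a JSON file and
// removed when the workspace is disposed. Names and layout are invented
// for illustration; the real ws-daemon bookkeeping differs.
package main

import (
	"encoding/json"
	"os"
	"path/filepath"
)

type workspaceState struct {
	InstanceID string `json:"instanceId"`
	PVCName    string `json:"pvcName"`
}

func statePath(dir, instanceID string) string {
	return filepath.Join(dir, instanceID+".json")
}

// persistState writes the workspace state file when the workspace starts.
func persistState(dir string, st workspaceState) error {
	b, err := json.Marshal(st)
	if err != nil {
		return err
	}
	return os.WriteFile(statePath(dir, st.InstanceID), b, 0o644)
}

// disposeState is what the manager asks the daemon to do on dispose:
// remove the state file so no stale bookkeeping is left behind.
func disposeState(dir, instanceID string) error {
	err := os.Remove(statePath(dir, instanceID))
	if os.IsNotExist(err) {
		// Already gone; treat as success rather than "workspace not found".
		return nil
	}
	return err
}
```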
Yes, ws-daemon reports it can't find the workspace. The workspace pod is stuck terminating forever, the PVC is bound, and the VolumeSnapshot is ready to use. I am wondering if we should always take PVC snapshots even if ws-daemon reports it can't find the workspace?
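For reference, a minimal sketch of the check described above (PVC bound, VolumeSnapshot ready to use); the namespace and object names are placeholders:

```go
// Sketch: confirm the PVC is Bound and the VolumeSnapshot is ready to use.
// Namespace and object names below are placeholders for illustration.
package main

import (
	"context"
	"fmt"
	"log"

	snapclient "github.com/kubernetes-csi/external-snapshotter/client/v6/clientset/versioned"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	core := kubernetes.NewForConfigOrDie(cfg)
	snaps := snapclient.NewForConfigOrDie(cfg)

	ctx := context.Background()
	ns, pvcName, snapName := "default", "ws-pvc-example", "ws-snapshot-example" // placeholders

	pvc, err := core.CoreV1().PersistentVolumeClaims(ns).Get(ctx, pvcName, metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("PVC %s bound: %v\n", pvcName, pvc.Status.Phase == corev1.ClaimBound)

	vs, err := snaps.SnapshotV1().VolumeSnapshots(ns).Get(ctx, snapName, metav1.GetOptions{})
	if err != nil {
		log.Fatal(err)
	}
	ready := vs.Status != nil && vs.Status.ReadyToUse != nil && *vs.Status.ReadyToUse
	fmt.Printf("VolumeSnapshot %s readyToUse: %v\n", snapName, ready)
}
```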
Then it is most probably related to ws-daemon bookkeeping or something similar. At the very least it does not seem related to the PVC code, which is good. :) Re: Commit
I think we could simplify the flow to check the … Then, call finalizeWorkspaceContent when the …
No other issues, let's close it. |
Is your feature request related to a problem? Please describe
We should test how a node behaves when it is full of workspaces using PVCs, where there is disk activity in the workspaces, and many workspaces are starting, running, and stopping.
The # of regular workspaces we decide to run as part of this test should be 125% of the peak volume we saw in our EU cluster over the last month.
Describe the behaviour you'd like
Begin to stop the first loadgen run & start the second one with ~20 workspaces
Additional context
We haven't run PVC snapshot or restore at scale yet.