Please force-stop workspace instances that are "stuck" in a bad state #5016

Closed
jankeromnes opened this issue Jul 29, 2021 · 7 comments · Fixed by #5055 or #5184
Labels
- groundwork: awaiting deployment
- operations: past incident (this issue arose during a past incident or its post-mortem)
- priority: highest (user impact) (directly user impacting)

Comments

@jankeromnes
Contributor

jankeromnes commented Jul 29, 2021

Following today's incident (#5005), a few workspaces and prebuilds got stuck in a bad state.

Specifically, they have a workspace instance stuck in phase = 'preparing', even though no build is in progress (anymore) -- and they have remained in this "temporary" state for several hours at least.

It would be great to forcefully stop them again, so that they can be restarted, e.g. from an earlier backup.

jankeromnes added the labels "operations: past incident" and "priority: highest (user impact)" on Jul 29, 2021
@jankeromnes
Contributor Author

Note: I've tried force-stopping my own stuck workspace instance (magenta-mongoose-gje7qut8 / 038c631b-c37a-4f6d-983d-3db7416a409b).

Before my intervention, the instance looked like this in Gitpod's DB:

mysql> select * from d_b_workspace_instance where workspaceId = 'magenta-mongoose-gje7qut8';
+--------------------------------------+---------------------------+--------------------------+-------------+-------------+---------------+--------+------------+--------------------------------------------------------------------------------------------------------+--------+--------------+--------------------+----------------------------+------------------------------------------------------------------+-----------+---------+----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
| id                                   | workspaceId               | creationTime             | startedTime | stoppedTime | lastHeartbeat | ideUrl | status_old | workspaceImage                                                                                         | region | deployedTime | workspaceBaseImage | _lastModified              | status                                                           | phase     | deleted | phasePersisted | configuration                                                                                                                                                                                               | stoppingTime |
+--------------------------------------+---------------------------+--------------------------+-------------+-------------+---------------+--------+------------+--------------------------------------------------------------------------------------------------------+--------+--------------+--------------------+----------------------------+------------------------------------------------------------------+-----------+---------+----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
| 038c631b-c37a-4f6d-983d-3db7416a409b | magenta-mongoose-gje7qut8 | 2021-07-29T13:47:49.079Z |             |             |               |        | NULL       | eu.gcr.io/gitpod-dev/workspace-images:a44ccc52fef46abe332af887644316e91ba904bddee31a7b90eab268be635e60 |        |              |                    | 2021-07-29 13:47:50.365675 | {"phase": "preparing", "conditions": {"neededImageBuild": true}} | preparing |       0 | preparing      | {"theiaVersion":"commit-0941a0805dc3c7345c45bd926317eaf045d4b7fb","ideImage":"eu.gcr.io/gitpod-core-dev/build/ide/code:commit-0941a0805dc3c7345c45bd926317eaf045d4b7fb","featureFlags":["fixed_resources"]} |              |
+--------------------------------------+---------------------------+--------------------------+-------------+-------------+---------------+--------+------------+--------------------------------------------------------------------------------------------------------+--------+--------------+--------------------+----------------------------+------------------------------------------------------------------+-----------+---------+----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
1 row in set (0.00 sec)

I then did this in order to force-stop it:

mysql> update d_b_workspace_instance set status = JSON_SET(status, '$.phase', 'stopped'), phasePersisted = 'stopped' where id = '038c631b-c37a-4f6d-983d-3db7416a409b';

This successfully returned my workspace to "stopped" in my Gitpod dashboard. However, when I tried to start it again, it said:

cannot initialize workspace: cannot initialize workspace: content initializer failed; last backup failed: workspace does not exist. Please contact support if you need the workspace data.

I suppose that's because there was never a backup made for this workspace.


I wonder, what is the correct fix here? Delete the instance stuck in phase = 'preparing'?

jankeromnes changed the title from 'Please force-stop a workspace instance that got stuck in "Building Image"' to 'Please force-stop workspace instances that got stuck in "Building Image"' on Jul 29, 2021
@jankeromnes
Contributor Author

On a side note, there are several workspace instances stuck in similar "temporary" states, sometimes for quite a long time:

For > 24h

mysql> select count(id) instances, phasePersisted from d_b_workspace_instance where phase != 'stopped' and creationTime < NOW() - INTERVAL 1 DAY group by phasePersisted;
+-----------+----------------+
| instances | phasePersisted |
+-----------+----------------+
|         4 | pending        |
|        36 | preparing      |
|         1 | running        |
|       878 | stopping       |
|        11 | unknown        |
+-----------+----------------+

For > 1 month

mysql> select count(id) instances, phasePersisted from d_b_workspace_instance where phase != 'stopped' and creationTime < NOW() - INTERVAL 1 MONTH group by phasePersisted;
+-----------+----------------+
| instances | phasePersisted |
+-----------+----------------+
|         4 | pending        |
|        24 | preparing      |
|         1 | running        |
|         3 | stopping       |
|         9 | unknown        |
+-----------+----------------+

I suppose these should all get forcefully stopped at some point.
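
For anyone who wants to reproduce these counts outside of MySQL (e.g. from an export of the table), here is a minimal Python sketch of the same grouping. It is only an illustration: the function name `count_stuck_by_phase` and the dict-based record shape are assumptions, and it parses the ISO-8601 `creationTime` string explicitly instead of relying on the database to compare it.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def count_stuck_by_phase(instances, now, older_than):
    """Count non-stopped instances created before `now - older_than`, per phasePersisted.

    `instances` is a list of dicts with the keys 'phase', 'phasePersisted' and
    'creationTime' (hypothetical shape, mirroring the columns queried above).
    """
    cutoff = now - older_than
    counts = Counter()
    for inst in instances:
        # creationTime looks like '2021-07-29T13:47:49.079Z'; treat the trailing Z as UTC.
        created = datetime.fromisoformat(inst["creationTime"].replace("Z", "+00:00"))
        if inst["phase"] != "stopped" and created < cutoff:
            counts[inst["phasePersisted"]] += 1
    return dict(counts)
```

Calling it with `older_than=timedelta(days=1)` corresponds to the first query above.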

@jankeromnes jankeromnes changed the title Please force-stop workspace instances that got stuck in "Building Image" Please force-stop workspace instances that are "stuck" in a bad state Jul 29, 2021
@svenefftinge
Member

Most of them are prebuilds. One that wasn't a prebuild had lots of "bash: eval: line 22: syntax error: unexpected end of file" errors logged from the workspace component.

@csweichel
Contributor

We should have ws-manager-bridge correct such instances as part of its regular reconciliation loop.
If a workspace instance is stuck in preparing for longer than 2h, we would reset it to stopped.
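
As a rough sketch of that rule (not the actual ws-manager-bridge code), the check could look like the Python below. The function names, the dict-based record shape, and the use of `creationTime` as the "stuck since" timestamp are all assumptions made for illustration; the 2h threshold is the one proposed above.

```python
from datetime import datetime, timedelta, timezone

# Threshold from the proposal above: 2 hours in "preparing".
PREPARING_TIMEOUT = timedelta(hours=2)


def parse_ts(value):
    """Parse an ISO-8601 timestamp like '2021-07-29T13:47:49.079Z' as UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def instances_to_stop(instances, now):
    """Return the ids of instances that have been 'preparing' longer than the timeout.

    `instances` is a list of dicts with the hypothetical keys 'id', 'phase'
    and 'creationTime'; a real reconciliation loop would read these from the DB.
    """
    stuck = []
    for inst in instances:
        if inst["phase"] != "preparing":
            continue  # only the 'preparing' phase is covered by this rule
        if now - parse_ts(inst["creationTime"]) > PREPARING_TIMEOUT:
            stuck.append(inst["id"])
    return stuck
```

The reconciliation loop would then mark each returned instance as stopped. This sketch only selects candidates and deliberately leaves the write to the caller.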

@csweichel
Contributor

/schedule

@svenefftinge
Member

Is this a dupe of #4955?

@csweichel
Contributor

/assign @mrsimonemms
