Please force-stop workspace instances that are "stuck" in a bad state #5016

jankeromnes · 2021-07-29T16:58:38Z

Following today's incident (#5005), a few workspaces and prebuilds got stuck in a bad state.

Specifically, they have a workspace instance stuck in phase = 'preparing', even though no build is in progress (anymore) -- and they have remained in this "temporary" state for several hours at least.

It would be great to forcefully stop them again, so that they can be restarted, e.g. from an earlier backup.


jankeromnes · 2021-07-29T17:04:14Z

Note: I've tried force-stopping my own stuck workspace instance (magenta-mongoose-gje7qut8 / 038c631b-c37a-4f6d-983d-3db7416a409b).

Before my intervention, the instance looked like this in Gitpod's DB:

mysql> select * from d_b_workspace_instance where workspaceId = 'magenta-mongoose-gje7qut8';
+--------------------------------------+---------------------------+--------------------------+-------------+-------------+---------------+--------+------------+--------------------------------------------------------------------------------------------------------+--------+--------------+--------------------+----------------------------+------------------------------------------------------------------+-----------+---------+----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
| id                                   | workspaceId               | creationTime             | startedTime | stoppedTime | lastHeartbeat | ideUrl | status_old | workspaceImage                                                                                         | region | deployedTime | workspaceBaseImage | _lastModified              | status                                                           | phase     | deleted | phasePersisted | configuration                                                                                                                                                                                               | stoppingTime |
+--------------------------------------+---------------------------+--------------------------+-------------+-------------+---------------+--------+------------+--------------------------------------------------------------------------------------------------------+--------+--------------+--------------------+----------------------------+------------------------------------------------------------------+-----------+---------+----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
| 038c631b-c37a-4f6d-983d-3db7416a409b | magenta-mongoose-gje7qut8 | 2021-07-29T13:47:49.079Z |             |             |               |        | NULL       | eu.gcr.io/gitpod-dev/workspace-images:a44ccc52fef46abe332af887644316e91ba904bddee31a7b90eab268be635e60 |        |              |                    | 2021-07-29 13:47:50.365675 | {"phase": "preparing", "conditions": {"neededImageBuild": true}} | preparing |       0 | preparing      | {"theiaVersion":"commit-0941a0805dc3c7345c45bd926317eaf045d4b7fb","ideImage":"eu.gcr.io/gitpod-core-dev/build/ide/code:commit-0941a0805dc3c7345c45bd926317eaf045d4b7fb","featureFlags":["fixed_resources"]} |              |
+--------------------------------------+---------------------------+--------------------------+-------------+-------------+---------------+--------+------------+--------------------------------------------------------------------------------------------------------+--------+--------------+--------------------+----------------------------+------------------------------------------------------------------+-----------+---------+----------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+
1 row in set (0.00 sec)

I then did this in order to force-stop it:

mysql> update d_b_workspace_instance set status = JSON_SET(status, '$.phase', 'stopped'), phasePersisted = 'stopped' where id = '038c631b-c37a-4f6d-983d-3db7416a409b';

This successfully returned my workspace to "stopped" in my Gitpod dashboard. However, when I tried to start it again, it said:

cannot initialize workspace: cannot initialize workspace: content initializer failed; last backup failed: workspace does not exist. Please contact support if you need the workspace data.

I suppose that's because there was never a backup made for this workspace.

I wonder, what is the correct fix here? Delete the instance stuck in phase = 'preparing'?
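For readers following along, here is a minimal Python sketch of what the UPDATE above does to a single row. It is an illustration only, not Gitpod code: the `force_stop` helper and the dict-shaped row are assumptions made for this example. Only the `phase` key inside the `status` JSON and the `phasePersisted` column change; the `phase` column is left alone, as in the UPDATE, since the thread does not show how it is derived.

```python
import json


def force_stop(row: dict) -> dict:
    """Return a copy of an instance row with its phase forced to 'stopped'.

    Hypothetical helper mirroring:
      status = JSON_SET(status, '$.phase', 'stopped'), phasePersisted = 'stopped'
    """
    status = json.loads(row["status"])
    status["phase"] = "stopped"  # other keys, e.g. "conditions", are kept
    return {**row, "status": json.dumps(status), "phasePersisted": "stopped"}


# Values copied from the row shown earlier in this comment.
row = {
    "id": "038c631b-c37a-4f6d-983d-3db7416a409b",
    "status": '{"phase": "preparing", "conditions": {"neededImageBuild": true}}',
    "phasePersisted": "preparing",
}
stopped = force_stop(row)
```

Note that this only rewrites the recorded state. It does not create a backup, which is consistent with the "last backup failed" error on restart.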

jankeromnes · 2021-07-29T17:06:18Z

On a side note, there are several workspace instances stuck in similar "temporary" states, sometimes for quite a long time:

Stuck for more than 24 hours:

mysql> select count(id) instances, phasePersisted from d_b_workspace_instance where phase != 'stopped' and creationTime < NOW() - INTERVAL 1 DAY group by phasePersisted;
+-----------+----------------+
| instances | phasePersisted |
+-----------+----------------+
|         4 | pending        |
|        36 | preparing      |
|         1 | running        |
|       878 | stopping       |
|        11 | unknown        |
+-----------+----------------+

Stuck for more than 1 month:

mysql> select count(id) instances, phasePersisted from d_b_workspace_instance where phase != 'stopped' and creationTime < NOW() - INTERVAL 1 MONTH group by phasePersisted;
+-----------+----------------+
| instances | phasePersisted |
+-----------+----------------+
|         4 | pending        |
|        24 | preparing      |
|         1 | running        |
|         3 | stopping       |
|         9 | unknown        |
+-----------+----------------+

I suppose these should all get forcefully stopped at some point.
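As a rough sketch of the logic in the two queries above, the same grouping can be written in Python over rows that have already been fetched. The `count_stuck` function and the dict-shaped rows are assumptions for illustration. Like the SQL, it filters on `phase` but groups by `phasePersisted`, and it assumes `creationTime` is a UTC string in the format shown in the table earlier.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def parse_creation_time(value: str) -> datetime:
    """Parse a timestamp such as '2021-07-29T13:47:49.079Z' as UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def count_stuck(rows, now: datetime, older_than: timedelta) -> Counter:
    """Count non-stopped instances created before now - older_than, per phasePersisted."""
    cutoff = now - older_than
    return Counter(
        row["phasePersisted"]
        for row in rows
        if row["phase"] != "stopped"
        and parse_creation_time(row["creationTime"]) < cutoff
    )
```

Calling it with `timedelta(days=1)` corresponds to the first query. The second query uses a calendar month, which `timedelta` can only approximate, for example with `timedelta(days=30)`.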

svenefftinge · 2021-08-01T13:23:54Z

Most of them are prebuilds. One that wasn't a prebuild had lots of "bash: eval: line 22: syntax error: unexpected end of file" errors logged from the workspace component.

csweichel · 2021-08-01T19:26:47Z

We should have ws-manager-bridge correct such instances as part of its regular reconciliation loop.
If a workspace instance is stuck in preparing for longer than 2h, we would reset it to stopped.
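The rule proposed above could look roughly like the following Python sketch. This is not the ws-manager-bridge implementation: the `reconcile` function, the in-memory instance dicts, and the choice to measure age from `creationTime` are all assumptions for illustration, since the thread does not say which timestamp the 2h would be counted from.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the proposal above.
PREPARING_TIMEOUT = timedelta(hours=2)


def reconcile(instances, now: datetime):
    """Reset instances stuck in 'preparing' past the timeout; return their ids.

    Assumption: each instance is a dict with 'id', 'phase', and a
    timezone-aware 'creationTime' datetime.
    """
    reset_ids = []
    for instance in instances:
        age = now - instance["creationTime"]
        if instance["phase"] == "preparing" and age > PREPARING_TIMEOUT:
            instance["phase"] = "stopped"
            reset_ids.append(instance["id"])
    return reset_ids
```

Running a check like this on every pass of a reconciliation loop means a stuck instance gets corrected within one loop interval after crossing the threshold, without manual DB edits.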

csweichel · 2021-08-01T19:26:52Z

/schedule

svenefftinge · 2021-08-02T07:38:09Z

Is this a duplicate of #4955?

csweichel · 2021-08-03T08:25:44Z

/assign @mrsimonemms

jankeromnes added the operations: past incident and priority: highest (user impact) labels Jul 29, 2021

jankeromnes changed the title ~~Please force-stop a workspace instance that got stuck in "Building Image"~~ Please force-stop workspace instances that got stuck in "Building Image" Jul 29, 2021

jankeromnes changed the title ~~Please force-stop workspace instances that got stuck in "Building Image"~~ Please force-stop workspace instances that are "stuck" in a bad state Jul 29, 2021

roboquat added the groundwork: scheduled label Aug 1, 2021

jankeromnes mentioned this issue Aug 2, 2021

Images stuck in "Building Image" message #5005

Closed

roboquat assigned mrsimonemms Aug 3, 2021

roboquat added groundwork: in progress and removed groundwork: scheduled labels Aug 3, 2021

mrsimonemms mentioned this issue Aug 4, 2021

fix(workspace): force-stop workspaces stuck in a bad state #5055

Merged

mrsimonemms linked a pull request Aug 6, 2021 that will close this issue

fix(workspace): force-stop workspaces stuck in a bad state #5055

Merged

roboquat added groundwork: in review and removed groundwork: in progress labels Aug 6, 2021

roboquat closed this as completed in #5055 Aug 6, 2021

roboquat added groundwork: awaiting deployment and removed groundwork: in review labels Aug 6, 2021

mrsimonemms linked a pull request Aug 13, 2021 that will close this issue

[workspace]: add force-stop check on stopping workspaces #5184

Merged
