[bridge] Ensure we do not override "failed" instance states (affects prebuilds!) #8596

geropl · 2022-03-04T10:57:47Z

Currently we're seeing that the status of failed prebuild instances is overridden by newer instance updates lacking that state (ref).

While this is a bug, we should be failsafe here, and basically:

never overwrite the failed condition
add additional logging for unexpected updates
accept and forward ws-manager version field
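
To illustrate the three points above, here is a minimal sketch of a failsafe merge. The `InstanceStatus` shape, the `mergeStatus` helper, and the `logError` callback are hypothetical names chosen for this example and are not the actual ws-manager-bridge types or API. It shows the proposal only; later comments in this thread explain why the guard could not simply be enabled.

```typescript
// Hypothetical, simplified shapes; the real ws-manager-bridge types differ.
interface Conditions {
    failed?: string;
}

interface InstanceStatus {
    conditions: Conditions;
    version?: number; // assumed to carry the version field sent by ws-manager
}

type LogFn = (message: string, context: Record<string, unknown>) => void;

// Merge an incoming status into the stored one without ever dropping
// an existing "failed" condition. The incoming version is always forwarded.
function mergeStatus(stored: InstanceStatus, incoming: InstanceStatus, logError: LogFn): InstanceStatus {
    const merged: InstanceStatus = {
        conditions: { ...incoming.conditions },
        version: incoming.version,
    };
    if (stored.conditions.failed && !incoming.conditions.failed) {
        // Unexpected update: log it, then keep the stored failure message.
        logError('incoming update would clear an existing "failed" condition', {
            storedVersion: stored.version,
            incomingVersion: incoming.version,
        });
        merged.conditions.failed = stored.conditions.failed;
    }
    return merged;
}
```

With this sketch, an update that carries no `failed` condition leaves a previously stored failure message in place and produces one log entry, while an update that sets `failed` is applied as usual.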


geropl · 2022-04-04T14:26:10Z

@easyCZ This was the issue I tried to refer to: it's relevant for prebuild reliability, but also for other workspaces

The version field won't help with this concrete problem (because ws-manager happily gives out a higher version for earlier phases that the instance has already gone through). But it's nice to forward anyway, and should help with reconnect issues in ws-manager-bridge (and potentially all downstream channels).

easyCZ · 2022-04-04T14:41:51Z

Thanks! I've also cut #9107.

> The version field won't help with this concrete problem

Are you referring to the status_version field or is this some other version field?

geropl · 2022-04-05T12:52:04Z

> Are you referring to the status_version field or is this some other version field?

No, that's the field I meant! 💯

easyCZ · 2022-04-06T07:34:25Z

The prebuild part of this issue has now landed in #9116. I am going to first validate the effects of this change on Prebuilds before making the corresponding change for Workspace Instance updates.

Leaving this open as we still need to make the same change for WorkspaceInstanceStatus updates.

easyCZ · 2022-04-20T08:36:26Z

Using prior work to detect stale events, we can inspect if stale events are a problem for WorkspaceInstance updates also. In practice, we haven't observed stale event updates for prebuilds. The code-flow is as follows:

Receive a workspace instance update
Process it, and re-map to a db-record so that we can communicate the WorkspaceInstance status
Invoke a controller loop for prebuilds. If the update is not for a Prebuild workspace instance, we ignore it.

This means that prebuild updates are a reasonable (but incomplete) baseline for "stale updates".
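
As a rough sketch of how stale events could be detected and counted from a version number: the `InstanceUpdate` shape, `isStaleUpdate`, and `StaleUpdateTracker` below are hypothetical names for this example, and treating only strictly older versions as stale is an assumption, not necessarily what #9116 does.

```typescript
// Hypothetical update shape; statusVersion is assumed to mirror status_version.
interface InstanceUpdate {
    instanceId: string;
    statusVersion?: number;
}

// An update counts as stale when its version is strictly older than the last one seen.
// Without both versions we cannot decide, so the update is treated as fresh.
function isStaleUpdate(lastSeen: number | undefined, incoming: number | undefined): boolean {
    if (lastSeen === undefined || incoming === undefined) {
        return false;
    }
    return incoming < lastSeen;
}

class StaleUpdateTracker {
    private lastSeen = new Map<string, number>();
    staleCount = 0;

    // Returns true if the update should be processed, false if it was skipped as stale.
    process(update: InstanceUpdate): boolean {
        const previous = this.lastSeen.get(update.instanceId);
        if (isStaleUpdate(previous, update.statusVersion)) {
            this.staleCount++;
            return false;
        }
        if (update.statusVersion !== undefined) {
            this.lastSeen.set(update.instanceId, update.statusVersion);
        }
        return true;
    }
}
```

The `staleCount` field plays the role of a metric here: if it stays at zero for prebuild updates, that matches the observation above that stale events have not shown up in practice.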

geropl · 2022-06-24T12:50:27Z

We're fixing this in two steps:

roll out [bridge] Add log.error in case we are about to override a previous "failed" condition #10900 and observe logs for ~1 week to see how often we're hitting that situation
based on the data we gathered, decide how to proceed with this: Ideally by ignoring destructive updates to the "failed" condition 👍

geropl · 2022-07-04T05:44:50Z

It turns out we cannot simply disable this, because there is a host of different cases where we seem to rely on it. We have to investigate the different root causes on workspace side before we can move forward here.

How to investigate:

Pull ws-manager-bridge logs with this query:

    "We received an empty \"failed\" condition overriding an existing one"
    resource.labels.container_name="ws-manager-bridge"

Download as CSV or similar, and turn into a list of instanceIds
Use this DB query:

    SELECT wsi.id, wsi.workspaceId, ws.type, pws.state, pws.error, wsi.phasePersisted,
        wsi.status->"$.conditions.failed",
        wsi.status->"$.conditions.headlessTaskFailed",
        wsi.status->"$.conditions.timeout"
    FROM d_b_workspace AS ws
    JOIN d_b_workspace_instance AS wsi
        ON wsi.workspaceId = ws.id
    LEFT JOIN d_b_prebuilt_workspace AS pws
        ON ws.id = pws.buildWorkspaceId
    WHERE wsi.id IN (<INSTANCE_IDS>);
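
The step of turning the exported CSV into a list of instance IDs could be scripted roughly as follows. This is a sketch under assumptions: the export is assumed to have a header row with a column named `instanceId` (a hypothetical name, adjust it to the actual export), the parsing is a naive comma split that does not handle quoted fields, and `extractInstanceIds` and `toInList` are helper names made up for this example.

```typescript
// Collect the unique values of one CSV column, keeping only values that look like
// IDs (hex digits and dashes) so the result is safe to paste into the query.
function extractInstanceIds(csv: string, column: string = "instanceId"): string[] {
    const lines = csv.split(/\r?\n/).filter((line) => line.trim() !== "");
    if (lines.length === 0) {
        return [];
    }
    const header = lines[0].split(",").map((name) => name.trim());
    const index = header.indexOf(column);
    if (index < 0) {
        throw new Error(`column "${column}" not found in CSV header`);
    }
    const ids = new Set<string>();
    for (const line of lines.slice(1)) {
        const value = (line.split(",")[index] ?? "").trim();
        if (/^[0-9a-fA-F-]+$/.test(value)) {
            ids.add(value);
        }
    }
    return Array.from(ids);
}

// Format the IDs as a quoted, comma-separated list for the <INSTANCE_IDS> placeholder.
function toInList(ids: string[]): string {
    return ids.map((id) => `'${id}'`).join(", ");
}
```

The string returned by `toInList` can then replace `<INSTANCE_IDS>` in the DB query above.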

I'm closing this one for now.

geropl added this to 🍎 WebApp Team Mar 4, 2022

geropl added the component: ws-manager-bridge, aspect: error-handling, and team: webapp labels Mar 4, 2022

geropl moved this to Scheduled in 🍎 WebApp Team Mar 4, 2022

This was referenced Mar 4, 2022

Failed to download OTS in US cluster (possibly happens for prebuilds, only) #8096

Closed

Epic: Improve reliability of prebuilds and prebuild logs #7812

Closed

geropl changed the title ~~[bridge] Ensure we do not override "failed" instance states~~ [bridge] Ensure we do not override "failed" instance states (affects prebuilds!) Mar 4, 2022

geropl added the type: improvement label Mar 4, 2022

This was referenced Apr 5, 2022

[db] Add statusVersion to prebuilds to track status version #9114

Merged

[ws-manager-bridge] Publish metrics with stale prebuild events #9115

Merged

easyCZ mentioned this issue Apr 5, 2022

[ws-manager-bridge] Skip stale prebuild events #9116

Merged

easyCZ mentioned this issue Apr 13, 2022

Epic: Prebuild transparency #9292

Closed

6 tasks

geropl added the type: bug label and removed the type: improvement label Apr 14, 2022

easyCZ self-assigned this Apr 20, 2022

easyCZ moved this from Scheduled to In Progress in 🍎 WebApp Team Apr 20, 2022

This was referenced Apr 21, 2022

[ws-man-bridge] Add started and completed metrics to track health #9422

Merged

[wsm-bridge] Dashboard with health metrics #9584

Merged

geropl unassigned easyCZ May 31, 2022

geropl removed the status in 🍎 WebApp Team May 31, 2022

geropl mentioned this issue May 31, 2022

Epic: Prebuilds stability #10361

Closed

11 tasks

geropl self-assigned this Jun 24, 2022

geropl moved this to In Progress in 🍎 WebApp Team Jun 24, 2022

geropl mentioned this issue Jun 24, 2022

[bridge] Add log.error in case we are about to override a previous "failed" condition #10900

Merged

1 task

geropl closed this as completed Jul 4, 2022

Repository owner moved this from In Progress to Done in 🍎 WebApp Team Jul 4, 2022
