Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bridge] Ensure we do not override "failed" instance states (affects prebuilds!) #8596

Closed
Tracked by #7812 ...
geropl opened this issue Mar 4, 2022 · 7 comments
Closed
Tracked by #7812 ...
Assignees
Labels
aspect: error-handling Issues which improve error handling when something fails in Gitpod component: ws-manager-bridge team: webapp Issue belongs to the WebApp team type: bug Something isn't working

Comments

@geropl
Copy link
Member

geropl commented Mar 4, 2022

Currently we're seeing that failed prebuild instances' status is override by newer instance updates lagging that state (ref).

While this is a bug, we should be failsafe here, and basically:

  • never overwrite the failed condition
  • add additional logging for unexpected updates
  • accept and forward ws-manager version field
@geropl geropl added component: ws-manager-bridge aspect: error-handling Issues which improve error handling when something fails in Gitpod team: webapp Issue belongs to the WebApp team labels Mar 4, 2022
@geropl geropl moved this to Scheduled in 🍎 WebApp Team Mar 4, 2022
@geropl geropl changed the title [bridge] Ensure we do not override "failed" instance states [bridge] Ensure we do not override "failed" instance states (affects prebuilds!) Mar 4, 2022
@geropl geropl added the type: improvement Improves an existing feature or existing code label Mar 4, 2022
@geropl
Copy link
Member Author

geropl commented Apr 4, 2022

@easyCZ This was the issue I tried to refer to: it's relevant for prebuild reliability, but also for other workspaces

The version field won't help with this concrete problem (because ws-manager happily gives out a higher version for earlier, already gone-thorough phases). But it's nice to forward anyway, and should help with reconnect issues in ws-manager-bridge (and potentially, all downstream channels).

@easyCZ
Copy link
Member

easyCZ commented Apr 4, 2022

Thanks! I've also cut #9107.

The version field won't help with this concrete problem

Are you referring to the status_version field or is this some other version field?

@geropl
Copy link
Member Author

geropl commented Apr 5, 2022

Are you referring to the status_version field or is this some other version field?

No, that's the field I meant! 💯

@easyCZ
Copy link
Member

easyCZ commented Apr 6, 2022

The prebuild part of this issue has now landed in #9116. I am going to first validate the effects of this change on Prebuilds before making the corresponding change for Workspace Instance updates.

Leaving this open as we still need to make the same change for WorkspaceInstanceStatus updates.

@geropl geropl added type: bug Something isn't working and removed type: improvement Improves an existing feature or existing code labels Apr 14, 2022
@easyCZ easyCZ self-assigned this Apr 20, 2022
@easyCZ easyCZ moved this from Scheduled to In Progress in 🍎 WebApp Team Apr 20, 2022
@easyCZ
Copy link
Member

easyCZ commented Apr 20, 2022

Using prior work to detect stale events, we can inspect if stale events are a problem for WorkspaceInstance updates also. In practice, we haven't observed stale event updates for prebuilds. The code-flow is as follows:

  1. Receive a workspace instance update
  2. Process it, and re-map to a db-record so that we can communicate the WorkspaceInstance status
  3. Invoke a controller loop for prebuilds. If the update is not for a Prebuild workspace instance, we ignore it.

This means that prebuild updates are a reasonable (but complete) baseline for "stale updates".

Screenshot 2022-04-20 at 10 32 33

@geropl
Copy link
Member Author

geropl commented Jun 24, 2022

We're fixing this in two steps:

  1. roll out [bridge] Add log.error in case we are about to override a previous "failed" condition #10900 and observe logs for ~1 week to see if how often we're hitting that situation
  2. based on the data we gathered, decide how to proceed with this: Ideally by ignoring destructive updates to the "failed" condition 👍

@geropl
Copy link
Member Author

geropl commented Jul 4, 2022

It turns out we cannot simply disable this, because there is a host of different cases where we seem to rely on it. We have to investigate the different root causes on workspace side before we can move forward here.

How to investigate:

  1. pull ws-manager-bridge logs with this query:
"We received an empty \"failed\" condition overriding an existing one"
resource.labels.container_name="ws-manager-bridge"
  1. Download as CSV or similar, and turn into a list of instanceIds
  2. Use this DB query:
    SELECT wsi.id, wsi.workspaceId, ws.type, pws.state, pws.error, wsi.phasePersisted, wsi.status->"$.conditions.failed", wsi.status->"$.conditions.headlessTaskFailed", wsi.status->"$.conditions.timeout"
	FROM d_b_workspace AS ws
	JOIN d_b_workspace_instance AS wsi
		ON wsi.workspaceId = ws.id
	LEFT JOIN d_b_prebuilt_workspace AS pws
		ON ws.id = pws.buildWorkspaceId
	WHERE wsi.id IN ( <INSTANCE_IDS>);

I'm closing this one for now.

@geropl geropl closed this as completed Jul 4, 2022
Repository owner moved this from In Progress to Done in 🍎 WebApp Team Jul 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
aspect: error-handling Issues which improve error handling when something fails in Gitpod component: ws-manager-bridge team: webapp Issue belongs to the WebApp team type: bug Something isn't working
Projects
Archived in project
Development

No branches or pull requests

2 participants