[ws-manager-bridge] Garbage-collect workspace instances whose workspace clusters are not available anymore #6770

Open
geropl opened this issue Nov 18, 2021 · 13 comments
Labels
component: ws-manager-bridge · meta: never-stale (This issue can never become stale) · size/XL · team: webapp (Issue belongs to the WebApp team) · type: bug (Something isn't working)

Comments

@geropl
Member

geropl commented Nov 18, 2021

Context: an internal Front conversation (link omitted).

@geropl geropl added type: bug Something isn't working component: ws-manager-bridge team: webapp Issue belongs to the WebApp team labels Nov 18, 2021
@jldec jldec moved this to In Groundwork in 🍎 WebApp Team Nov 21, 2021
@JanKoehnlein
Contributor

I am trying to reverse-engineer the action for this issue from the given context. Please clarify: Is this about

  1. adding an additional (timeout-based?) mechanism on the ws-manager-bridge that also works when the cluster goes down without deregistering, or
  2. taking further action in the ws-manager-bridge when receiving a forced deregistration request, or
  3. none of the above (please specify)?

@geropl
Member Author

geropl commented Jan 4, 2022

Sorry for the lack of details: This is about adding a mechanism that ensures we don't leak workspaces dangling in any state other than "stopped" once a workspace cluster is de-registered.

We could discuss whether this should be:
a) timeout-based, so that workspaces may re-appear if we re-register a workspace cluster quickly enough. This would make it a bit more fault-tolerant and safe. It is somewhat tricky, though, as we need to poll d_b_workspace by region, and it's currently impossible for ws-manager-bridge to distinguish between "I don't know this region but someone else does" and "no-one is governing this region".
b) using the "forced de-registration" request as a signal. This might serve as a first version, although it could lead to unwanted fall-out in case we misuse the gpctl clusters command.

a) will become possible once we have the changes required for simplified meta, hence I have not prioritized it yet.
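For illustration only, here is a minimal sketch of what option a) could look like. All names (`InstanceStore`, `ClusterRegistry`, `findNotStopped`, ...) are hypothetical and do not come from the actual ws-manager-bridge code; the sketch assumes the bridge can see the set of all registered clusters, which is exactly the prerequisite mentioned above:

```ts
// Hypothetical sketch of a timeout-based GC for dangling workspace instances.
// None of these interfaces are taken from the actual codebase; they only
// illustrate the shape of option a).

interface WorkspaceInstance {
    id: string;
    region: string; // which workspace cluster the instance runs in
    phase: string;  // "running", "stopping", "stopped", ...
    lastSeen: Date; // last time a bridge reported on this instance
}

interface InstanceStore {
    findNotStopped(): Promise<WorkspaceInstance[]>;
    markStopped(id: string, reason: string): Promise<void>;
}

interface ClusterRegistry {
    // regions for which *some* bridge is currently registered
    getRegisteredRegions(): Promise<Set<string>>;
}

// Grace period before we consider an instance leaked.
const DANGLING_TIMEOUT_MS = 10 * 60 * 1000;

async function gcDanglingInstances(store: InstanceStore, registry: ClusterRegistry): Promise<void> {
    const registered = await registry.getRegisteredRegions();
    const candidates = await store.findNotStopped();
    const now = Date.now();

    for (const instance of candidates) {
        // Only touch instances whose region nobody governs anymore...
        if (registered.has(instance.region)) continue;
        // ...and only after a grace period, so a quickly re-registered cluster
        // can "re-adopt" its instances (the fault tolerance mentioned above).
        if (now - instance.lastSeen.getTime() < DANGLING_TIMEOUT_MS) continue;

        await store.markStopped(instance.id, "workspace cluster is no longer registered");
    }
}
```

The grace period is what allows a quickly re-registered cluster to pick its instances back up instead of having them garbage-collected.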

@geropl
Member Author

geropl commented Jan 4, 2022

Assigned this to epic "Simplified Multi-Meta"

@geropl
Member Author

geropl commented Apr 4, 2022

Unassigned, because there's no immediate connection

@geropl geropl moved this to Scheduled in 🍎 WebApp Team Apr 4, 2022
@geropl geropl added the size/XL label Apr 11, 2022
@geropl geropl removed the status in 🍎 WebApp Team Apr 14, 2022
@stale

stale bot commented Jul 10, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the meta: stale This issue/PR is stale and will be closed soon label Jul 10, 2022
@stale stale bot closed this as completed Aug 13, 2022
@stale stale bot moved this to Done in 🍎 WebApp Team Aug 13, 2022
@geropl geropl added meta: never-stale This issue can never become stale and removed meta: stale This issue/PR is stale and will be closed soon labels Aug 19, 2022
@geropl geropl moved this from Done to Scheduled in 🍎 WebApp Team Aug 19, 2022
@geropl geropl reopened this Aug 19, 2022
Repository owner moved this from Scheduled to In Progress in 🍎 WebApp Team Aug 19, 2022
@geropl geropl moved this from In Progress to Scheduled in 🍎 WebApp Team Aug 19, 2022
@geropl geropl removed the status in 🍎 WebApp Team Sep 12, 2022
@geropl geropl moved this to Scheduled in 🍎 WebApp Team Sep 12, 2022
@jankeromnes jankeromnes self-assigned this Sep 12, 2022
@geropl
Member Author

geropl commented Sep 12, 2022

@jankeromnes Please ping if I can help with additional context/details 👍

@jankeromnes
Contributor

Many thanks @geropl! Planning to start looking into either this or #12283 once #12580 is done.

@jankeromnes
Contributor

jankeromnes commented Sep 13, 2022

Okay, I'm now blocked on this question and have already started one distraction, so I guess now is the time to pick this up! 😆 🚀

@jankeromnes
Contributor

From this comment #6770 (comment) I deduce that we want:

b) using the "forced de-registration" request as a signal. This might serve as a first version, although it could lead to unwanted fall-out in case we misuse the gpctl clusters command.

Please let me know if I got that wrong.

@jankeromnes
Contributor

@geropl Would love to chat about this more, if you have time. I've added TODOs in two potential fix locations in https://github.com/gitpod-io/gitpod/pull/12912/files but I'm not really sure which is best or what the implications are. 💭

@jankeromnes
Contributor

Had a brief chat this morning. Summary:

  • Marking all running instances as stopped/failed in the "force-de-register cluster" RPC call makes sense (no need to do it in the reconciler); see the sketch below
  • In a second step (maybe a follow-up), we'll likely need an additional singleton bridge that regularly garbage-collects all instances that are currently "running" in a cluster that is no longer registered
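As a rough illustration of the first bullet, here is a sketch of marking instances stopped in the handler of a (hypothetical) "force de-register cluster" RPC. The names below (`ClusterService`, `ClusterStore`, `InstanceStore`, ...) are invented for this sketch and are not the real ws-manager-bridge APIs:

```ts
// Hypothetical sketch: when a cluster is forcefully de-registered, mark all of
// its non-stopped instances as stopped/failed right in the RPC handler.

interface InstanceStore {
    findNotStoppedByRegion(region: string): Promise<{ id: string }[]>;
    markStoppedAndFailed(id: string, message: string): Promise<void>;
}

interface ClusterStore {
    deregister(name: string): Promise<void>;
}

class ClusterService {
    constructor(
        private readonly clusters: ClusterStore,
        private readonly instances: InstanceStore,
    ) {}

    // Handler for the (hypothetical) "force de-register cluster" RPC.
    async forceDeregisterCluster(name: string): Promise<void> {
        await this.clusters.deregister(name);

        // Nobody will ever report on these instances again, so stop them now
        // instead of leaving them dangling in "running"/"stopping".
        const dangling = await this.instances.findNotStoppedByRegion(name);
        await Promise.all(
            dangling.map((i) =>
                this.instances.markStoppedAndFailed(i.id, `workspace cluster ${name} was force-deregistered`),
            ),
        );
    }
}
```

The second bullet (a singleton bridge that regularly garbage-collects instances whose cluster silently disappeared) would then look roughly like the periodic GC sketched earlier in this thread.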

@geropl
Member Author

geropl commented Oct 25, 2022

Moved to next week so we can implement a GC using info about "all registered clusters".

@geropl geropl removed the status in 🍎 WebApp Team Nov 14, 2022
@geropl geropl removed their assignment Nov 14, 2022
@geropl
Member Author

geropl commented Nov 14, 2022

Dropping assignment because not actively working on it atm.

@geropl geropl moved this to Scheduled in 🍎 WebApp Team Nov 28, 2022
@geropl geropl removed the status in 🍎 WebApp Team Dec 8, 2022