When de-registering a workspace cluster, mark any leftover running instances as stopped/failed #12912
Closed
I'm not sure I like this approach.
So far, stopping a bridge was an operation that had absolutely no effect on workspaces. This made it a very cheap operation and allowed for great operational flexibility. E.g., if you had to fix up a DB entry, you could always remove/re-add the entry, with very limited downsides (workspace updates delayed by a handful of seconds). Or, if you wanted to force a reconnect, you could remove and re-add a DB entry. Now it has a very destructive side effect.
💭 I wonder what happens if we stop a `ws-manager-bridge` pod during a rollout. It would stop all workspaces on that cluster, no? @jankeromnes

---
What do you think of having a periodic clean-up where we check and stop instances for which no ws-manager exists?
---
Or a more general version: for all currently not-stopped instances, check each against its ws-manager. This would catch a broader set of problems, and we already need to solve this problem anyway.
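The periodic clean-up proposed here boils down to: on a schedule, find every not-stopped instance whose cluster no longer has a registered ws-manager. A minimal sketch of that selection step, with illustrative types and names (not the actual Gitpod code):

```typescript
// Hedged sketch of the proposed periodic clean-up. The shapes and names here
// are assumptions for illustration, not the real ws-manager-bridge types.
interface WorkspaceInstance {
    id: string;
    region: string; // the workspace cluster this instance runs on
    phase: "running" | "stopped";
}

// Find all not-stopped instances whose cluster has no registered ws-manager
// any more; these are the candidates to mark as stopped/failed.
function findOrphanedInstances(
    instances: WorkspaceInstance[],
    registeredClusters: Set<string>,
): WorkspaceInstance[] {
    return instances.filter(
        (i) => i.phase !== "stopped" && !registeredClusters.has(i.region),
    );
}

// Example: cluster "eu02" was de-registered, so its running instance is orphaned.
const orphaned = findOrphanedInstances(
    [
        { id: "a", region: "eu01", phase: "running" },
        { id: "b", region: "eu02", phase: "running" },
        { id: "c", region: "eu02", phase: "stopped" },
    ],
    new Set(["eu01"]),
);
console.log(orphaned.map((i) => i.id)); // [ 'b' ]
```

Running this selection on a timer, rather than on disconnect events, is exactly the trade-off debated below: it is less coupled to any one bridge's lifecycle, at the cost of a detection delay of up to one interval.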
---
This is something we need anyway, but with different time constraints: This PR is about unblocking Team Workspace in specific cases. This is a separate PR, but same issue.
That's a layer of abstraction above this PR/issue. tl;dr: we're already doing it; this is about the implementation details. Happy to provide context outside of this PR. 🙃
---
Many thanks for the great feedback here! 💯
But why remove/re-add an entry when you can `gpctl clusters update`? 🤔 (E.g. to adjust the score.)

Can't you kill the `ws-manager-bridge` pod, or `rollout restart` its deployment, to achieve this? I wasn't sure, so I tested it:

- `kubectl delete pod ws-manager-bridge-[tab]` several times in a row
- `kubectl rollout restart deployment ws-manager-bridge` a few times

My running workspace stayed alive and well all the time. Only when I ran `gpctl clusters deregister --name temp --force` did it get marked as stopped.
---
Sync feedback from @geropl: Marking all instances as stopped in this `bridge.stop()` lifecycle method is a change in behavior -- does it really make sense?

On the one hand (i.e. "my" view), `bridge.stop()` is only ever called when you're actually de-registering a cluster, so instead of leaving all the instances as they are in the DB without any further updates, maybe it makes more sense to mark them as stopped/failed.

On the other hand (i.e. @geropl's view), maybe we need to be clearer or more careful about the fact that calling `bridge.stop()` will also mark all of its instances as stopped -- i.e. this doesn't seem to be called currently when you try to reconnect/restart a bridge, but we need to make sure that it also won't be called in the future under the assumption that instances will be kept running. Maybe this can be achieved with a comment, or by renaming the `stop` function. We could also make a bridge mark all its instances as stopped only when we receive a `deregister --force` RPC, but this seems a bit more complicated (we'd need to extract/share the stopping code or somehow give the RPC handler access to the bridge), and it wouldn't handle the cases where a cluster is manually dropped from the DB.

My personal conclusion here would be to leave the PR as is, and simply add very explicit comments to stress the fact that calling `bridge.stop()` also marks all its instances as stopped in the DB. Or, we could even rename the method to `bridge.tearDown()` or `bridge.markAllInstancesAsStoppedAndDispose()` or similar. 😊
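The rename idea above can be sketched as follows. Everything here is hypothetical -- the class, the DB interface, and the method names are illustrative stand-ins, not the actual Gitpod implementation:

```typescript
// Hypothetical sketch of making the destructive side effect explicit in the
// method name and doc comment. Names are assumptions, not real Gitpod code.
interface InstanceDB {
    // Marks all not-stopped instances of the given cluster as stopped,
    // returning the ids of the instances that were updated.
    markAllInstancesStopped(clusterName: string): string[];
}

class WorkspaceManagerBridge {
    private readonly db: InstanceDB;

    constructor(db: InstanceDB) {
        this.db = db;
    }

    /**
     * Tears down this bridge. NOTE: this also marks ALL instances of the
     * de-registered cluster as stopped in the DB. It must only be called when
     * the cluster is actually being removed, never for a mere reconnect.
     */
    markAllInstancesAsStoppedAndDispose(clusterName: string): string[] {
        return this.db.markAllInstancesStopped(clusterName);
    }
}

// Usage with an in-memory stand-in for the DB:
const bridge = new WorkspaceManagerBridge({
    markAllInstancesStopped: (name) => (name === "temp" ? ["inst-1"] : []),
});
console.log(bridge.markAllInstancesAsStoppedAndDispose("temp")); // [ 'inst-1' ]
```

The point of the long name is purely defensive: a future caller wiring this into a reconnect path would have to type out the destructive behavior, which is much harder to do by accident than calling a generic `stop()`.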
---
What you describe makes sense to me; it just seems that the lifecycle to which you are attaching the effect of stopping all instances is not well understood (I agree we should better understand it, and remove code that is not called in practice). Attaching the clean-up to the disconnect lifecycle also requires it to always be called, so that we don't leak/forget about workspaces.

It seems more general and less complicated to clean up instances for which we don't know a manager on a regular schedule, instead of on disconnect events.