Implement recovery for Kubernetes/OpenShift infrastructures #5919

sleshchenko · 2017-08-07T09:06:48Z

Kubernetes/OpenShift workspaces are considered stopped when the workspace master is restarted.
Recovery needs to be implemented for Kubernetes/OpenShift workspaces so that they are still considered running after the master restarts.
Recovery should also be adapted to a Rolling Update of the workspace master, so the recovery workflow should look like this:

1. Pod CheServer is running and the Service routes traffic to it. There may be running, starting, or stopping workspaces.
2. Pod CheServer* is starting. After it starts, it should know about all workspaces, and it should be possible to interact with running workspaces (request their servers, stop them).
3. Pod CheServer* is running and the Service routes all traffic to it.
4. CheServer is stopping. It should finish all in-progress workspace operations (starting and stopping workspaces).
5. CheServer* should know about the finished operations and pick up RUNNING workspaces.

sleshchenko · 2017-12-12T10:05:43Z

Depends on #7785

sleshchenko · 2018-03-16T15:33:44Z

Today we have the following status on this issue:

It looks like it is not possible to restore all running workspaces when Tomcat boots by using only the Kubernetes/OpenShift client and checking the objects created on the cluster, which is how recovery is implemented in the Docker infrastructure.

Another proposed approach was lazy recovery, where each workspace is recovered only when a client requests it. In that case a request for the workspace list (GET /api/workspace) would trigger several requests to the K8s/OS cluster and increase response time, so it was decided not to implement it.

So the meta information of running workspaces has to be persisted somewhere (for example, in a database) so that they can be recovered later.

Also, the scope of this issue was extended: the Kubernetes/OpenShift infrastructure must be made ready for Rolling Update (the issue description has been updated). In this case, recovery should be implemented in the following way:

1. Pod CheServer is running and the Service routes traffic to it. There may be running, starting, or stopping workspaces.
2. Pod CheServer* is starting. After it starts, it should know about all workspaces, and it should be possible to interact with running workspaces (request their servers, stop them).
3. Pod CheServer* is running and the Service routes all traffic to it.
4. CheServer is stopping. It should finish all in-progress workspace operations (starting and stopping workspaces).
5. CheServer* should know about the finished operations and pick up RUNNING workspaces.

Since two Che Server instances may be running at the same time, reworking the infrastructure is not enough, because the Workspace API has its own local cache. So the Workspace API should be reworked to use either a local cache or a persistent/distributed one, depending on configuration.

More details about the Workspace API and Kubernetes/OpenShift changes will be described soon.

sleshchenko · 2018-03-21T09:52:59Z

During a Rolling Update there will be a period of time when two instances of Che Server are running.
So these instances, and the data they hold in memory, need to be synchronized somehow.

Kubernetes/OpenShift infrastructure changes

It is proposed to implement OpenShift Recovery in the following way:

The OpenShift infrastructure persists meta information of Runtimes that are active (starting, running, stopping).
Meta information includes:

    - namespace
    - machines[]
         - machineName
         - podName
         - containerName
         - attributes
         - servers[]
                - url
                - status
                - attributes

The OpenShift infrastructure fetches persisted Runtimes while evaluating active runtimes: https://github.com/eclipse/che/blob/master/wsmaster/che-core-api-workspace/src/main/java/org/eclipse/che/api/workspace/server/spi/RuntimeInfrastructure.java#L77
The OpenShift context will use persisted Runtimes to recover active ones.
OpenShiftRuntimes flush their state (machines, servers) to the persistent layer.

In this manner, the OpenShift infrastructures on the old Che Server Pod and on the updated one will be synchronized.
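The persisted meta information and the flush/fetch cycle described above could be sketched roughly as follows. This is only an illustration: the class names, the `workspaceId` key, the runtime-level `status` field, and the in-memory store are all assumptions made for this example and are not taken from the Che code base. A real persistent layer would be backed by a database.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** One server of a machine: url, status, attributes (names are hypothetical). */
final class ServerState {
  final String url;
  final String status; // plain string here; the real status type is not given in this issue
  final Map<String, String> attributes;

  ServerState(String url, String status, Map<String, String> attributes) {
    this.url = url;
    this.status = status;
    this.attributes = Collections.unmodifiableMap(new HashMap<>(attributes));
  }
}

/** One machine of a runtime, mapped to a pod and a container. */
final class MachineState {
  final String machineName;
  final String podName;
  final String containerName;
  final Map<String, String> attributes;
  final List<ServerState> servers;

  MachineState(String machineName, String podName, String containerName,
               Map<String, String> attributes, List<ServerState> servers) {
    this.machineName = machineName;
    this.podName = podName;
    this.containerName = containerName;
    this.attributes = Collections.unmodifiableMap(new HashMap<>(attributes));
    this.servers = Collections.unmodifiableList(new ArrayList<>(servers));
  }
}

/** Persisted state of one active runtime. */
final class RuntimeState {
  final String workspaceId; // assumption: used as the lookup key
  final String namespace;
  final String status;      // assumption: "STARTING", "RUNNING" or "STOPPING"
  final List<MachineState> machines;

  RuntimeState(String workspaceId, String namespace, String status,
               List<MachineState> machines) {
    this.workspaceId = workspaceId;
    this.namespace = namespace;
    this.status = status;
    this.machines = Collections.unmodifiableList(new ArrayList<>(machines));
  }
}

/** Hypothetical stand-in for the persistent layer shared by both Che Servers. */
final class InMemoryRuntimeStateStore {
  private final Map<String, RuntimeState> byWorkspace = new ConcurrentHashMap<>();

  /** Called by a runtime whenever its machines or servers change. */
  void flush(RuntimeState state) {
    byWorkspace.put(state.workspaceId, state);
  }

  /** Called once a runtime is fully stopped and no longer active. */
  void remove(String workspaceId) {
    byWorkspace.remove(workspaceId);
  }

  /** Called by a starting Che Server to recover the active runtimes. */
  List<RuntimeState> findActive() {
    return new ArrayList<>(byWorkspace.values());
  }
}
```

In this sketch the old Che Server would call `flush` on every change, and the updated Che Server would call `findActive` at startup to rebuild its view of the active runtimes.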

One more thing that must be handled properly is server readiness probes. It should not cause any issues if two Che Servers run server checks on RUNNING runtimes. But only one Che Server should perform the initial server checks on STARTING runtimes; the other Che Server should launch its own server checks only when the runtimes become RUNNING.
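The probe rule above can be expressed as a small decision function. This is a minimal sketch under stated assumptions: the `ProbePolicy` class and the `startedByThisInstance` flag are hypothetical, and how an instance actually learns that it started a runtime is not specified in this issue.

```java
/** Statuses of an active runtime, as listed in this issue. */
enum RuntimeStatus { STARTING, RUNNING, STOPPING }

/** Hypothetical helper that decides whether this Che Server should run server checks. */
final class ProbePolicy {
  private ProbePolicy() {}

  static boolean shouldLaunchServerChecks(RuntimeStatus status, boolean startedByThisInstance) {
    switch (status) {
      case RUNNING:
        // Checks on RUNNING runtimes are harmless, so both instances may run them.
        return true;
      case STARTING:
        // Only the instance that started the runtime performs the initial checks.
        return startedByThisInstance;
      default:
        // Assumption: a STOPPING runtime needs no readiness checks.
        return false;
    }
  }
}
```

A recovering instance would call this again when a runtime moves from STARTING to RUNNING, at which point it gets `true` and can launch its own checks.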

Workspace API changes

The Workspace API should also be patched a bit. The workspace status cache in WorkspaceRuntimes has to be synchronized between Che Server instances. It looks like a distributed cache without persistence is enough, because the infrastructure will recover all persisted runtimes after Che Server starts.
This part can be done as a separate issue, #9206.
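One way to make the status cache switchable between local and distributed storage is to hide it behind a small interface. The sketch below is an assumption about the shape of such an abstraction, not the API that was eventually implemented: `WorkspaceStatusCache` and `MapBackedStatusCache` are hypothetical names. The local variant wraps a `ConcurrentHashMap`; a distributed variant could pass in any replicated `ConcurrentMap` implementation instead.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

enum WorkspaceStatus { STARTING, RUNNING, STOPPING, STOPPED }

/** Hypothetical abstraction over the workspace status cache. */
interface WorkspaceStatusCache {
  /** Returns the previous status, or null if there was none. */
  WorkspaceStatus putIfAbsent(String workspaceId, WorkspaceStatus status);

  /** Atomically moves a workspace from one status to another. */
  boolean replace(String workspaceId, WorkspaceStatus expected, WorkspaceStatus next);

  /** Returns the current status, or null if the workspace is not cached. */
  WorkspaceStatus get(String workspaceId);

  /** Removes the entry and returns the removed status, or null. */
  WorkspaceStatus remove(String workspaceId);
}

/** Delegates to any ConcurrentMap, local or replicated. */
final class MapBackedStatusCache implements WorkspaceStatusCache {
  private final ConcurrentMap<String, WorkspaceStatus> delegate;

  MapBackedStatusCache(ConcurrentMap<String, WorkspaceStatus> delegate) {
    this.delegate = delegate;
  }

  /** Convenience factory for the single-instance (local) configuration. */
  static MapBackedStatusCache local() {
    return new MapBackedStatusCache(new ConcurrentHashMap<String, WorkspaceStatus>());
  }

  @Override
  public WorkspaceStatus putIfAbsent(String workspaceId, WorkspaceStatus status) {
    return delegate.putIfAbsent(workspaceId, status);
  }

  @Override
  public boolean replace(String workspaceId, WorkspaceStatus expected, WorkspaceStatus next) {
    return delegate.replace(workspaceId, expected, next);
  }

  @Override
  public WorkspaceStatus get(String workspaceId) {
    return delegate.get(workspaceId);
  }

  @Override
  public WorkspaceStatus remove(String workspaceId) {
    return delegate.remove(workspaceId);
  }
}
```

The compare-and-set style `replace` matters with two instances: if both try to move the same workspace from RUNNING to STOPPING, only one call succeeds.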

Also, to avoid forcing users to reload the page, JSON RPC subscribers need to be synchronized between instances (and maybe persisted).

During shutdown, Che Server should have enough time to finish all workspace-related operations, such as STARTING or STOPPING workspaces.
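A simple way to give a stopping Che Server that time is to count in-progress operations and block shutdown until the count drops to zero or a timeout expires. The following is a sketch only: `InProgressOperations` is a hypothetical helper, and the timeout would need to fit within whatever grace period the cluster gives the pod.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical tracker of workspace start/stop operations that are still running. */
final class InProgressOperations {
  private int active;

  /** Called when a workspace start or stop begins. */
  synchronized void begin() {
    active++;
  }

  /** Called when a workspace start or stop finishes, successfully or not. */
  synchronized void end() {
    if (active == 0) {
      throw new IllegalStateException("no operation in progress");
    }
    active--;
    if (active == 0) {
      notifyAll(); // wake up a thread waiting in awaitNone
    }
  }

  /**
   * Blocks until no operation is in progress or the timeout expires.
   * Returns true if all operations finished, false on timeout or interruption.
   */
  synchronized boolean awaitNone(long timeoutMillis) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    try {
      while (active > 0) {
        long remainingNanos = deadline - System.nanoTime();
        if (remainingNanos <= 0) {
          return false;
        }
        TimeUnit.NANOSECONDS.timedWait(this, remainingNanos);
      }
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for the caller
      return false;
    }
  }
}
```

A shutdown hook would stop accepting new start/stop requests first and then call `awaitNone` before letting the process exit.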

The feature that stops all workspaces before Che Server stops (Workspace service termination) should be disabled.

Some aspects of Rolling Update and OpenShift Runtimes recovery may be missing, but I hope this information shows the plan for how OpenShift recovery is going to be implemented.

sleshchenko · 2018-04-03T07:53:45Z

Created one more separate task that should be done before the Kubernetes/OpenShift recovery functionality can be used. It is about adapting WorkspaceServiceTermination: #9317

sleshchenko added kind/task Internal things, technical debt, and to-do tasks to be performed. team/platform labels Aug 7, 2017

skabashnyuk mentioned this issue Aug 7, 2017

OpenShift infrastructure implementation of SPI #5098

Closed

26 tasks

benoitf changed the title ~~Implement recovery for OpenShift infrastructure~~ [SPI] Implement recovery for OpenShift infrastructure Sep 15, 2017

benoitf added the target/branch Indicates that a PR will be merged into a branch other than master. label Sep 15, 2017

l0rd mentioned this issue Sep 19, 2017

CHE-287 Investigate how to update Che using RollingUpdate strategy redhat-developer/rh-che#122

Closed

akorneta self-assigned this Sep 20, 2017

garagatyi changed the title ~~[SPI] Implement recovery for OpenShift infrastructure~~ Implement recovery for OpenShift infrastructure Nov 21, 2017

garagatyi added the target/che6 label Nov 21, 2017

garagatyi mentioned this issue Jan 15, 2018

Make it possible to run multiple che-server in parallel #7662

Closed

gorkem mentioned this issue Feb 12, 2018

Hot update for wsmaster #8547

Closed

19 tasks

sleshchenko self-assigned this Feb 15, 2018

skabashnyuk mentioned this issue Mar 1, 2018

Platform-2018-03-27 (Sprint: 146) #8971

Closed

12 tasks

skabashnyuk added the sprint/current label Mar 7, 2018

sleshchenko added the status/in-progress This issue has been taken by an engineer and is under active development. label Mar 16, 2018

sleshchenko changed the title ~~Implement recovery for OpenShift infrastructure~~ Implement recovery for Kubernetes/OpenShift infrastructures Mar 16, 2018

sleshchenko added kind/enhancement A feature request - must adhere to the feature request template. and removed kind/task Internal things, technical debt, and to-do tasks to be performed. labels Mar 23, 2018

skabashnyuk mentioned this issue Mar 23, 2018

Platform-2018-04-17 (Sprint: 147) #9199

Closed

9 tasks

This was referenced Mar 23, 2018

Add an ability to use distributed cache for storing workspace statuses in WorkspaceRuntimes #9206

Closed

Implemented recovery functionality for Kubernetes/OpenShift infrastructures #9301

Merged

sleshchenko mentioned this issue Apr 5, 2018

Improve abstract InternalRuntime to be more flexible for recovery functionality #9345

Merged

sleshchenko removed the target/branch Indicates that a PR will be merged into a branch other than master. label Apr 11, 2018

sleshchenko closed this as completed Apr 12, 2018

sleshchenko removed the status/in-progress This issue has been taken by an engineer and is under active development. label Apr 12, 2018

skabashnyuk removed the sprint/current label Apr 18, 2018
