
Computational Backend: management and autoscaling of external clusters #4159

Closed · 2 tasks done · Tracked by #950
sanderegg opened this issue Apr 25, 2023 · 4 comments


sanderegg commented Apr 25, 2023

solution 3 below was selected

Tasks

  1. 10 of 10 tasks done (a:autoscaling, a:dask-service) · sanderegg
  2. 10 of 10 tasks done (a:dask-service, a:director-v2) · sanderegg

To fulfill the requirements of s4l:web, the computational backend needs to be refactored.

In summary, these are the user requirements:

  • BringYourOwnLicense: off
  • User connects to oSparc
  • User starts s4l, prepares a simulation, and wants to run it
  • User needs to run a computational service on a computer with C CPUs, R RAM, and G GPUs of type T with V VRAM (see the sketch after this list)
  • User shall select the computer in the oSparc UI; the system will make it available by acquiring appropriate computer instances in the cloud (AWS) and connecting them to the computational backend
  • User shall run the computational service
  • The system shall remove the computer instance once the computational service has run
  • User shall pay for the use of that computer for the H hours his service ran
  • User shall be able to select appropriate computer instances via the oSparc API
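
As an illustration, here is a minimal sketch of how such a resource requirement could be expressed with Dask's worker-resources mechanism. All names, addresses, and resource labels below are hypothetical and not an existing oSparc convention:

```python
from dask.distributed import Client

def run_simulation(inputs: dict) -> str:
    # placeholder for the actual computational service
    return f"simulated {inputs['mesh']}"

# hypothetical scheduler address
client = Client("tcp://dask-scheduler:8786")

# the resource labels must match what the workers were started with,
# e.g. `dask-worker ... --resources "CPU=8,GPU=1,RAM=64e9"`
future = client.submit(
    run_simulation,
    {"mesh": "head.msh"},
    resources={"CPU": 8, "GPU": 1, "RAM": 64e9},
)
print(future.result())
```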

Open points

  • Do we keep the shared computational backend as well, e.g. where the usual templates run?
  • Are organisations going to manage their own access to computers? For example, as the administrator of organisation X, I am OK with having at most 20 computers for my users to share. Is that a use case?
  • When do we create the additional computers? Only on demand, i.e. when a job is asked to run? That will require some form of autoscaling.

Possible implementations

1 Dask-scheduler to rule them all

  • when a job is submitted to the dask-scheduler, director-v2 always passes the list of workers to use (e.g. client.submit(..., workers=["worker-1", "worker-2"])), so a user-dedicated computer is only used by that user (see the sketch after this list)
  • oSparc creates a computer in the cloud, connects it to the default dask-scheduler, and somehow makes it available to the computational backend for that user/group
  • some kind of autoscaling in the cluster that checks what types of machines are necessary to run the jobs
  • collection of metrics for billing
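
A hedged sketch of such a pinned submission using the distributed Client; the scheduler address and worker names are hypothetical:

```python
from dask.distributed import Client

def job(x: int) -> int:
    return x * 2

# hypothetical scheduler address
client = Client("tcp://dask-scheduler:8786")

# pin the task to the user's dedicated machines; allow_other_workers=False
# (the default) prevents the scheduler from spilling onto shared workers
future = client.submit(
    job,
    21,
    workers=["worker-1", "worker-2"],
    allow_other_workers=False,
)
print(future.result())
```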

+/-

  • might be simpler to implement
  • scales only until the dask-scheduler or the docker swarm gives out, as potentially many computers are connected together
  • strain on the main docker swarm might affect the rest of oSparc
  • questionable scaling of the dask-scheduler (can it be scaled inside the same swarm? ChatGPT says yes but the docs say otherwise...)
  • every user will load the same dask-scheduler
  • if the dask-scheduler dies, everyone loses their jobs

2 Separate dask-scheduler to rule them all

  • move the current default dask-scheduler/dask-workers from the main swarm to a separate cluster (already feasible)
  • everything else works like in 1.

+/-

  • no strain on the main docker swarm
  • scales only until the dask-scheduler or the docker swarm gives out, as potentially many computers are connected together
  • questionable scaling of the dask-scheduler (can it be scaled inside the same swarm? ChatGPT says yes but the docs say otherwise...)
  • every user will load the same dask-scheduler
  • if the dask-scheduler dies, everyone loses their jobs

3 dask-scheduler per user/group

  • a separate cluster is created per user/group (with a tiny/cheap machine for the scheduler/gateway?)
  • some kind of API controllable by oSparc might be necessary (for selecting machine types, collecting metrics/billing, switching the whole cluster on/off?); see the dask-gateway sketch after this list
  • some kind of autoscaling in the cluster that checks what types of machines are necessary to run the jobs and provides them on demand (might be easier as it only serves one or a few groups)
  • collection of metrics for billing (might be easier since it is per group?)
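
For illustration, a sketch of what driving such a per-user/group cluster through dask-gateway could look like. The gateway URL is hypothetical, and whether dask-gateway is used at all is an open point below:

```python
from dask_gateway import Gateway

# hypothetical per-user/group gateway endpoint
gateway = Gateway("http://dask-gateway.user-42.example.com")

# director-v2 would create/scale the cluster on the user's behalf
cluster = gateway.new_cluster()
cluster.scale(2)  # or cluster.adapt(...) for on-demand workers

client = cluster.get_client()
# worker information like this could feed the metrics/billing collection
print(list(client.scheduler_info()["workers"].keys()))

client.close()
cluster.shutdown()
```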

+/-

  • scales only until the dask-scheduler or the docker swarm gives out, as potentially many computers are connected together; but this should happen later, as only one user/group uses that dask-scheduler/swarm
  • only the owning user/group loads the dask-scheduler
  • if the dask-scheduler dies, only that user/group loses their jobs
  • more complex, but might scale much better
  • whether to use the dask-gateway for this is an open question

4 dask-scheduler per computer type

  • a separate cluster is created per machine type (e.g. computers with 4 GPUs of type T)
  • autoscaling that adds/removes machines based on the load of the scheduler or the needs of the gateway (see the adaptive-scaling sketch after this list)
  • modification of the dask-gateway would be necessary if it is used
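
A sketch of the adaptive-scaling side, assuming dask-gateway is used. The endpoint is hypothetical, and the gateway backend would still need the modification mentioned above to translate scale requests into starting/stopping EC2 instances:

```python
from dask_gateway import Gateway

# hypothetical endpoint for the cluster serving one machine type,
# e.g. instances with 4 GPUs of type T
gateway = Gateway("http://dask-gateway.gpu-4xT.example.com")
cluster = gateway.new_cluster()

# dask's adaptive mode requests more/fewer workers based on the
# scheduler's own load estimate; the backend decides how to provision them
cluster.adapt(minimum=0, maximum=20)

client = cluster.get_client()
```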

+/-

  • maybe a more complex metrics/billing process, as several users from different groups will share these machines
  • autoscaling might be simpler than in 3, but collecting metrics more complex
  • using the dask-gateway here with different users is probably desirable, both to separate costs and to prevent a dask-scheduler crash from affecting everyone, since each user would get a separate scheduler
  • handling of autoscaling with dask-gateway might be more complex in the end
@sanderegg added the a:director-v2, a:dask-service, a:frontend and a:webserver labels Apr 25, 2023
@sanderegg self-assigned this Apr 25, 2023
@sanderegg changed the title Computational Backend: deep changes → Computational Backend: deep refactoring Apr 25, 2023

mguidon commented Apr 25, 2023

  • The user will also need to be able to choose among EC2 instance types when running via the API
  • Selection of EC2 instances will also be required for the "curated" list of dynamic services (jsmash, iSeg, ...)
  • I would vote for an option 3.5, namely assigning one dask-gateway to a collection of groups (alphabetically? a-f, g-l, m-r, ...), or having one or two dask-gateways with homogeneous EC2 instance types
  • I would have one gateway with static resources (no autoscaling) for the "osparc part", or keep it as it is now
  • I think a limit on EC2 instances per user is not necessary; exceeding costs should instead trigger a generic warning email to the user

sanderegg commented

After discussion, solution 3 is preferred.

@sanderegg changed the title Computational Backend: deep refactoring → Computational Backend: autoscale external clusters Jul 21, 2023
@sanderegg changed the title Computational Backend: autoscale external clusters → Computational Backend: creation and autoscaling of external clusters Jul 21, 2023
@sanderegg changed the title Computational Backend: creation and autoscaling of external clusters → Computational Backend: management and autoscaling of external clusters Jul 21, 2023
@matusdrobuliak66 added this to the Sundae milestone Jul 24, 2023
pcrespov commented

@sanderegg Yes, I also prefer option 3. I think the user/group focus is key; now I would even say wallet focus, so that the monitoring for billing is simplified. It should support a hybrid cluster (i.e. different types of computers).

sanderegg commented

The solution currently implemented is:
1 scheduler per user/wallet
Up-scaling is still in progress but has been moved to a separate issue, so we can close this one.
