
Computational Backend: management and autoscaling of external clusters #4159

Closed · 2 tasks done · Tracked by #950
sanderegg opened this issue Apr 25, 2023 · 4 comments


sanderegg commented Apr 25, 2023

solution 3 below was selected

Tasks

  1. 10 of 10 tasks done (a:autoscaling, a:dask-service) · sanderegg
  2. 10 of 10 tasks done (a:dask-service, a:director-v2) · sanderegg

To fulfill the requirements of s4l:web, the computational backend needs to be refactored.

In summary, these are the user requirements:

  • BringYourOwnLicense: off
  • User connects to oSparc
  • User starts s4l, prepares a simulation, and wants to run it
  • User needs to run a computational service on a computer with C CPUs, R RAM, and G GPUs of type T with V VRAM (see the sketch after this list)
  • User shall select the computer in the oSparc UI; the system will make it available by acquiring appropriate computer instances in the cloud (AWS) and connecting them to the computational backend
  • User shall run the computational service
  • The system shall remove the computer instance once the computational service has run
  • User shall pay for the use of that computer for the H hours his service ran
  • User shall be able to select appropriate computer instances via the oSparc API
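
As an illustration, here is a minimal sketch of how such a resource requirement could be expressed with Dask's worker-resources mechanism. All names, addresses, and resource labels below are hypothetical and not an existing oSparc convention:

```python
from dask.distributed import Client

def run_simulation(inputs: dict) -> str:
    # placeholder for the actual computational service
    return f"simulated {inputs['mesh']}"

# hypothetical scheduler address
client = Client("tcp://dask-scheduler:8786")

# the resource labels must match what the workers were started with,
# e.g. `dask-worker ... --resources "CPU=8,GPU=1,RAM=64e9"`
future = client.submit(
    run_simulation,
    {"mesh": "head.msh"},
    resources={"CPU": 8, "GPU": 1, "RAM": 64e9},
)
print(future.result())
```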

Open points

  • Do we keep the shared computational backend as well, e.g. where the usual templates run?
  • Are organisations going to manage their own access to computers? For example, as the administrator of organisation X, I am OK with having at most 20 computers for my users to share. Is that a use case?
  • When do we create the additional computers? Only on demand, i.e. when a job is asked to run? That will require some form of autoscaling.

Possible implementations

1 Dask-scheduler to rule them all

  • when a job is submitted to the dask-scheduler, director-v2 always passes the list of workers to use (e.g. client.submit(..., workers=["worker-1", "worker-2"])), so a user-dedicated computer is only used by that user (see the sketch after this list)
  • oSparc creates a computer in the cloud, connects it to the default dask-scheduler, and somehow makes it available to the computational backend for that user/group
  • some kind of autoscaling in the cluster that checks what types of machines are necessary to run the jobs
  • collection of metrics for billing
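
A hedged sketch of such a pinned submission using the distributed Client; the scheduler address and worker names are hypothetical:

```python
from dask.distributed import Client

def job(x: int) -> int:
    return x * 2

# hypothetical scheduler address
client = Client("tcp://dask-scheduler:8786")

# pin the task to the user's dedicated machines; allow_other_workers=False
# (the default) prevents the scheduler from spilling onto shared workers
future = client.submit(
    job,
    21,
    workers=["worker-1", "worker-2"],
    allow_other_workers=False,
)
print(future.result())
```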

+/-

  • might be simpler to implement
  • scales only until the dask-scheduler or the docker swarm gives out, as potentially many computers are connected together
  • strain on the main docker swarm might affect the rest of oSparc
  • questionable scaling of the dask-scheduler (can it be scaled inside the same swarm? ChatGPT says yes but the docs say otherwise...)
  • every user will load the same dask-scheduler
  • if the dask-scheduler dies, everyone loses their jobs

2 Separate dask-scheduler to rule them all

  • move the current default dask-scheduler/dask-workers from the main swarm to a separate cluster (already feasible)
  • everything else works like in 1.

+/-

  • no strain on the main docker swarm
  • scales only until the dask-scheduler or the docker swarm gives out, as potentially many computers are connected together
  • questionable scaling of the dask-scheduler (can it be scaled inside the same swarm? ChatGPT says yes but the docs say otherwise...)
  • every user will load the same dask-scheduler
  • if the dask-scheduler dies, everyone loses their jobs

3 dask-scheduler per user/group

  • a separate cluster is created per user/group (with a tiny/cheap machine for the scheduler/gateway?)
  • some kind of API controllable by oSparc might be necessary (for selecting machine types, collecting metrics/billing, switching the whole cluster on/off?); see the dask-gateway sketch after this list
  • some kind of autoscaling in the cluster that checks what types of machines are necessary to run the jobs and provides them on demand (might be easier as it only serves one or a few groups)
  • collection of metrics for billing (might be easier since it is per group?)
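
For illustration, a sketch of what driving such a per-user/group cluster through dask-gateway could look like. The gateway URL is hypothetical, and whether dask-gateway is used at all is an open point below:

```python
from dask_gateway import Gateway

# hypothetical per-user/group gateway endpoint
gateway = Gateway("http://dask-gateway.user-42.example.com")

# director-v2 would create/scale the cluster on the user's behalf
cluster = gateway.new_cluster()
cluster.scale(2)  # or cluster.adapt(...) for on-demand workers

client = cluster.get_client()
# worker information like this could feed the metrics/billing collection
print(list(client.scheduler_info()["workers"].keys()))

client.close()
cluster.shutdown()
```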

+/-

  • scales only until the dask-scheduler or the docker swarm gives out, as potentially many computers are connected together; but this should happen later, as only one user/group uses that dask-scheduler/swarm
  • only the owning user/group loads the dask-scheduler
  • if the dask-scheduler dies, only that user/group loses their jobs
  • more complex, but might scale much better
  • whether to use the dask-gateway for this is an open question

4 dask-scheduler per computer type

  • a separate cluster is created per machine type (e.g. computers with 4 GPUs of type T)
  • autoscaling that adds/removes machines based on the load of the scheduler or the needs of the gateway (see the adaptive-scaling sketch after this list)
  • modification of the dask-gateway would be necessary if it is used
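
A sketch of the adaptive-scaling side, assuming dask-gateway is used. The endpoint is hypothetical, and the gateway backend would still need the modification mentioned above to translate scale requests into starting/stopping EC2 instances:

```python
from dask_gateway import Gateway

# hypothetical endpoint for the cluster serving one machine type,
# e.g. instances with 4 GPUs of type T
gateway = Gateway("http://dask-gateway.gpu-4xT.example.com")
cluster = gateway.new_cluster()

# dask's adaptive mode requests more/fewer workers based on the
# scheduler's own load estimate; the backend decides how to provision them
cluster.adapt(minimum=0, maximum=20)

client = cluster.get_client()
```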

+/-

  • maybe a more complex metrics/billing process, as several users from different groups will share these machines
  • autoscaling might be simpler than in 3, but collecting metrics more complex
  • using the dask-gateway here with different users is probably desirable, both to separate costs and to prevent a dask-scheduler crash from affecting everyone, since each user would get a separate scheduler
  • handling of autoscaling with dask-gateway might be more complex in the end
@sanderegg added the a:director-v2, a:dask-service, a:frontend and a:webserver labels Apr 25, 2023
@sanderegg self-assigned this Apr 25, 2023
@sanderegg changed the title Computational Backend: deep changes → Computational Backend: deep refactoring Apr 25, 2023

mguidon commented Apr 25, 2023

  • The user will also need to be able to choose among EC2 instance types when running via the API
  • Selection of EC2 instances will also be required for the "curated" list of dynamic services (jsmash, iSeg, ...)
  • I would vote for an option 3.5, namely assigning one dask-gateway to a collection of groups (alphabetically? a-f, g-l, m-r, ...), or having one or two dask-gateways with homogeneous EC2 instance types
  • I would have one gateway with static resources (no autoscaling) for the "osparc part", or keep it as it is now
  • I think a limit on EC2 instances per user is not necessary; exceeding costs should instead trigger a generic warning email to the user

sanderegg commented

After discussion, solution 3 is preferred.

@sanderegg changed the title Computational Backend: deep refactoring → Computational Backend: autoscale external clusters Jul 21, 2023
@sanderegg changed the title Computational Backend: autoscale external clusters → Computational Backend: creation and autoscaling of external clusters Jul 21, 2023
@sanderegg changed the title Computational Backend: creation and autoscaling of external clusters → Computational Backend: management and autoscaling of external clusters Jul 21, 2023
@matusdrobuliak66 added this to the Sundae milestone Jul 24, 2023
pcrespov commented

@sanderegg Yes, I also prefer option 3. I think the user/group focus is key; now I would even say wallet focus, so that the monitoring for billing is simplified. It should support a hybrid cluster (i.e. different types of computers).

sanderegg commented

The solution currently implemented is:
1 scheduler per user/wallet
Up-scaling is still in progress but has been moved to a separate issue, so we can close this one.
