
Scheduler Logic redesign #1001

Open
DiegoTavares opened this issue Jul 27, 2021 · 6 comments
Labels
enhancement Improvement to an existing feature

Comments

@DiegoTavares
Collaborator

Opening an issue to start drafting a proposal for the new scheduler logic, as discussed in the last TSC meeting (Jul 21).

Problems with the current design

  • The scheduling logic depends on the database to keep all instances of Cuebot on the same page, which can become a scalability issue: increasing the number of Cuebots also increases the load on the database, eventually slowing down queries.
  • The current design relies on multiple threadpools with manually configured sizes and limits. Finding the optimal number of threads for each threadpool can be tricky, as the time required to execute each type of task changes with how fast the database queries are, which in turn depends on the number of jobs and their status. A better solution would calibrate itself based on load, or avoid per-threadpool calibration altogether by using some form of back-pressure logic.
  • The current design allows multiple instances to try to book the same job for multiple hosts, only failing when the database identifies a conflict at the dispatch step, meaning all computation up to that point was a waste of CPU time.

Proposal

TBD

DiegoTavares added the enhancement label on Jul 27, 2021
@splhack
Contributor

splhack commented Jul 27, 2021

@splhack
Contributor

splhack commented Aug 1, 2021

In our environment, no matter how many Cuebot instances there are, the frame launching speed is about 8 frames per second. It could be faster if we could avoid this error by improving the scheduling 🙂

frame reservation error, dispatchProcToJob failed to book next frame, com.imageworks.spcue.dispatcher.FrameReservationException: the frame ... was updated by another thread.

An idea. Introduce a new frame state, SCHEDULING or something like that.

  1. findNextDispatchFrames
    • Update the frame state to SCHEDULING and set the current timestamp on ts_updated where frame.str_state='WAITING', or where a certain amount of time has passed since frame.ts_updated with frame.str_state='SCHEDULING', to reclaim stray frames left behind by a Cuebot crash.
    • At the same time, retrieve the updated frames with a RETURNING clause (a sketch follows the list).
  2. Schedule the frame!
    • the frame ... was updated by another thread won't happen because findNextDispatchFrames is atomic.
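
A minimal sketch of what that atomic claim could look like, assuming a PostgreSQL backend and the frame / str_state / ts_updated names used above (the real OpenCue schema, query layer, and the 5-minute reclaim timeout are assumptions); psycopg2 is used purely for illustration:

```python
# Hypothetical claim query: flip WAITING frames to SCHEDULING and return the
# claimed rows in a single atomic statement. Frames stuck in SCHEDULING for
# too long (e.g. after a Cuebot crash) are reclaimed as well.
import psycopg2

CLAIM_FRAMES_SQL = """
UPDATE frame
SET str_state = 'SCHEDULING',
    ts_updated = now()
WHERE pk_frame IN (
    SELECT pk_frame
    FROM frame
    WHERE str_state = 'WAITING'
       OR (str_state = 'SCHEDULING'
           AND ts_updated < now() - interval '5 minutes')
    ORDER BY ts_updated
    LIMIT %s
    FOR UPDATE SKIP LOCKED  -- concurrent Cuebots skip rows already being claimed
)
RETURNING pk_frame;
"""

def find_next_dispatch_frames(conn, limit=10):
    """Claim up to `limit` frames; each frame is handed to exactly one caller."""
    with conn.cursor() as cur:
        cur.execute(CLAIM_FRAMES_SQL, (limit,))
        claimed = [row[0] for row in cur.fetchall()]
    conn.commit()
    return claimed
```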

@splhack
Contributor

splhack commented Nov 30, 2021

Summarized an experimental optimization and the theory in #1069

To solve the scalability issues in #1012 and #1069, my hunch is that we need some sort of central scheduler process (or one of the Cuebot instances could play that role).

Possible Logic

  • Cuebots notify the central scheduler process when
    • Cuebot received a job submission or a job priority change (group change)
      • The central scheduler process inserts/moves the job to the sorted job list
    • Cuebot dispatched a frame
      • The central scheduler process updates the job CPU/GPU utilization
    • Cuebot received a frame complete message or Cuebot stopped a frame
      • The central scheduler process updates the job CPU/GPU utilization
      • Also remove the job from the sorted job list if it finished
  • The central scheduler process can also reconstruct the sorted job list and the CPU/GPU utilization from scratch.
  • Cuebots get a frame from the central scheduler process for dispatch (a sketch of this event handling follows the list).
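
A minimal in-memory sketch of the event handling described above; all names (JobEntry, CentralScheduler, the callback signatures) are hypothetical, and persistence/transport are left out:

```python
# In-memory sketch of the central scheduler state; no persistence, no RPC.
from dataclasses import dataclass

@dataclass
class JobEntry:
    job_id: str
    priority: int
    ts_started: float
    cpu_in_use: int = 0
    gpu_in_use: int = 0

class CentralScheduler:
    def __init__(self):
        self.jobs = {}  # job_id -> JobEntry

    # Cuebot received a job submission or a priority (group) change.
    def on_job_submitted(self, job_id, priority, ts_started):
        self.jobs[job_id] = JobEntry(job_id, priority, ts_started)

    # Cuebot dispatched a frame: bump the job's CPU/GPU utilization.
    def on_frame_dispatched(self, job_id, cores, gpus):
        job = self.jobs[job_id]
        job.cpu_in_use += cores
        job.gpu_in_use += gpus

    # Frame completed or was stopped; drop the job once it is finished.
    def on_frame_complete(self, job_id, cores, gpus, job_finished=False):
        job = self.jobs[job_id]
        job.cpu_in_use -= cores
        job.gpu_in_use -= gpus
        if job_finished:
            del self.jobs[job_id]

    # Cuebots ask which job to dispatch the next frame from:
    # highest priority first, oldest submission breaks ties.
    def next_job(self):
        if not self.jobs:
            return None
        return min(self.jobs.values(), key=lambda j: (-j.priority, j.ts_started))
```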

Maybe LevelDB is one of the best solutions here. Two maps (a key-layout sketch follows the list):

  1. Sorted job list
    • Key: the combination of int_priority + ts_started (maybe a ULID), or a random number for the current round-robin scheduling
    • Value: Job UUID, group/job/layer CPU/GPU utilization
  2. Job UUID to key map
    • Key: Job UUID
    • Value: the key of the corresponding sorted job list entry
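
A rough sketch of those two maps using the plyvel LevelDB binding; the key layout, JSON value encoding, and map prefixes are assumptions for illustration:

```python
# Sketch of the two LevelDB maps with the plyvel binding; key layout and
# JSON value encoding are illustrative only.
import json
import struct
import time
import uuid

import plyvel

db = plyvel.DB('/tmp/scheduler-db', create_if_missing=True)
sorted_jobs = db.prefixed_db(b'sorted_jobs:')  # sort key -> job info
job_index = db.prefixed_db(b'job_index:')      # job UUID -> sort key

def sort_key(priority: int, ts_started: float) -> bytes:
    # Big-endian packing keeps lexicographic byte order equal to numeric order.
    return struct.pack('>IQ', priority, int(ts_started * 1000))

def add_job(job_id: str, priority: int, utilization: dict):
    key = sort_key(priority, time.time())
    value = json.dumps({'job_id': job_id, 'utilization': utilization}).encode()
    sorted_jobs.put(key, value)
    job_index.put(job_id.encode(), key)

def remove_job(job_id: str):
    key = job_index.get(job_id.encode())
    if key is not None:
        sorted_jobs.delete(key)
        job_index.delete(job_id.encode())

def highest_priority_job():
    # Reverse iteration starts at the largest key, i.e. the highest priority.
    for _, value in sorted_jobs.iterator(reverse=True):
        return json.loads(value)
    return None

add_job(str(uuid.uuid4()), priority=100, utilization={'cpu': 0, 'gpu': 0})
print(highest_priority_job())
```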

@DiegoTavares
Collaborator Author

I like the idea of a central scheduler process. We're currently evaluating Redis Streams as an option to handle not only the job queue but also the HostReports. I will update this issue as soon as we have more to share.

@oliviascarfone
Contributor

Proposal - High level overview

Central Scheduler Design Logic

Use Redis Streams for incoming HostReports and for dispatching jobs. Redis Streams provide a persistent store of ordered events and can carry multiple key/value fields per event.

This approach will decouple the processing of HostReports from the dispatch of jobs. Redis Streams with consumer groups guarantee that each message is given to a different consumer (the same message will not reach multiple consumers within the same group). This addresses the current flaw where Cuebot instances will assign jobs that have already been dispatched by other Cuebot instances. There will be two types of streams: one in which RQD publishes HostReports that are consumed by Cuebot, and another where Cuebot publishes available jobs and RQD instances consume them. In the latter case, Cuebot will periodically query the database to get the list of jobs that are available for processing.

Logic for Host Reports Queue

  • RQD acts as the producer and sends HostReports to a dedicated Redis Stream

    • At least once semantics (dropping is less critical, as host reports are sent on an interval)
  • All Cuebot instances are added to the same Consumer Group and are listening for incoming messages.

  • HostReports are then stored in the database

  • In RQD: create the connection to the Redis server in the RqCore module

  • In Cuebot: create a RedisConsumer class that is initialized at application start-up, connects to the Redis server, and awaits incoming messages (a sketch follows this list)
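
A rough sketch of the HostReport stream using redis-py consumer groups; the stream, group, and field names are made up for illustration, and the database write is reduced to a placeholder comment:

```python
# Sketch of the HostReport stream with redis-py consumer groups.
import json
import socket

import redis

r = redis.Redis(host='localhost', port=6379)
STREAM = 'hostreports'
GROUP = 'cuebots'

# RQD side: publish a host report to the stream.
def publish_host_report(report: dict):
    r.xadd(STREAM, {'report': json.dumps(report)})

# Cuebot side: every instance joins the same consumer group, so each
# report is delivered to exactly one Cuebot instance.
def consume_host_reports(consumer_name=socket.gethostname()):
    try:
        r.xgroup_create(STREAM, GROUP, id='$', mkstream=True)
    except redis.exceptions.ResponseError:
        pass  # the group already exists
    while True:
        entries = r.xreadgroup(GROUP, consumer_name, {STREAM: '>'},
                               count=10, block=5000)
        for _stream, messages in entries:
            for msg_id, fields in messages:
                report = json.loads(fields[b'report'])
                # ... store the report in the database here ...
                r.xack(STREAM, GROUP, msg_id)
```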

Logic for Job Queue

  • Cuebot acts as the producer. Using a service like ZooKeeper to elect one instance as the leader, the leader polls the database on an interval for available jobs and publishes them, highest priority first, to the Redis Stream dedicated to pending jobs

  • Limit the number of Cuebot instances accessing the database directly to one (only the elected leader accesses the db)

  • RQD, as the consumer, gets the highest-priority job and determines whether it can run on that host. RQD will acknowledge the job if it can be run; otherwise the job remains pending for other hosts to check (a sketch follows this list).

    • Another idea: configure consumer groups based on RQD host characteristics and dispatch jobs to specific consumer groups based on job requirements
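
A rough sketch of the pending-jobs stream under the same assumptions (redis-py, hypothetical stream/group/field names): the leader publishes jobs ordered by priority, and an RQD host only acknowledges a job it can actually run.

```python
# Sketch of the pending-jobs stream; assumes the consumer group was created
# as in the previous sketch (XGROUP CREATE ... MKSTREAM).
import json

import redis

r = redis.Redis(host='localhost', port=6379)
JOB_STREAM = 'pending_jobs'
RQD_GROUP = 'rqd_hosts'

# Leader Cuebot: publish the polled jobs, highest priority first.
def publish_pending_jobs(jobs):
    for job in sorted(jobs, key=lambda j: j['priority'], reverse=True):
        r.xadd(JOB_STREAM, {'job': json.dumps(job)})

def can_run(job, host_resources):
    # Hypothetical requirements check: enough free cores and GPUs on this host.
    return (host_resources['cores'] >= job.get('min_cores', 1)
            and host_resources['gpus'] >= job.get('min_gpus', 0))

# RQD: read a job and acknowledge it only if this host can run it; an
# unacknowledged job stays pending and can later be claimed by another
# host (e.g. via XAUTOCLAIM after an idle timeout).
def try_claim_job(host_name, host_resources):
    entries = r.xreadgroup(RQD_GROUP, host_name, {JOB_STREAM: '>'},
                           count=1, block=1000)
    for _stream, messages in entries:
        for msg_id, fields in messages:
            job = json.loads(fields[b'job'])
            if can_run(job, host_resources):
                r.xack(JOB_STREAM, RQD_GROUP, msg_id)
                return job
    return None
```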

@thunders82

thunders82 commented Apr 1, 2022

Hi,

It's nice to see you are looking into the scheduler logic redesign.
I shared my opinion about the way GPU nodes are handled in #991 and was kindly pointed to this thread by @splhack.

Currently, if a GPU node is not in use by any GPU job, it will not accept any CPU job, which is a waste of resources. I was wondering if it would be possible to implement logic along these lines:

Priority to GPU tasks on GPU nodes:

  • A CPU job should be assigned to a GPU node only if no more CPU nodes are available and there is no GPU job in the queue.
  • If a CPU job is running on a GPU node and no other GPU node is available --> kill the CPU job and retry it on a different node to make room for the GPU job waiting in the queue (see the sketch below the list).
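
A minimal sketch of that policy as two decision helpers; every name here is hypothetical and the host/job objects are assumed to exist elsewhere:

```python
# Hypothetical helpers for the suggested policy.
def pick_node_for_cpu_job(cpu_nodes_free, gpu_nodes_free, gpu_jobs_waiting):
    """A CPU job spills onto a GPU node only when no CPU node is free
    and no GPU job is waiting for that capacity."""
    if cpu_nodes_free:
        return cpu_nodes_free[0]
    if gpu_nodes_free and not gpu_jobs_waiting:
        return gpu_nodes_free[0]
    return None

def maybe_preempt_for_gpu_job(gpu_nodes_free, gpu_nodes_running_cpu_jobs):
    """If every GPU node is busy and one of them is running a CPU job,
    kill that CPU job so it can be retried on another node."""
    if not gpu_nodes_free and gpu_nodes_running_cpu_jobs:
        victim = gpu_nodes_running_cpu_jobs[0]
        victim.kill_cpu_job(retry_elsewhere=True)  # hypothetical host API
        return victim
    return None
```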

What do you think?

Thank you
