
Scheduler Logic redesign #1001

Open
DiegoTavares opened this issue Jul 27, 2021 · 6 comments
Labels
enhancement Improvement to an existing feature

Comments

@DiegoTavares
Collaborator

Opening an issue to start drafting a proposal for the new scheduler logic, as discussed in the last TSC meeting (Jul 21).

Problems with the current design

  • The scheduling logic depends on the database to keep all instances of Cuebot on the same page, which can become a scalability issue: increasing the number of Cuebots also increases the load on the database, eventually slowing down queries.
  • The current design relies on multiple threadpools with manually configured sizes and limits. Finding the optimal number of threads for each threadpool can be tricky, as the time required to execute each type of task changes with how fast the database queries are, which in turn depends on the number of jobs and their status. A better solution would calibrate itself based on load, or avoid per-threadpool calibration altogether by using some form of back-pressure logic.
  • The current design allows multiple instances to try to book the same job for multiple hosts, only failing when the database identifies a conflict at the dispatch step, meaning all computation up to that point was a waste of CPU time.

Proposal

TBD

DiegoTavares added the enhancement label on Jul 27, 2021
@splhack
Contributor

splhack commented Jul 27, 2021

@splhack
Contributor

splhack commented Aug 1, 2021

In our environment, no matter how many Cuebot instances there are, the frame launching speed is about 8 frames per second. It could be faster if we could avoid this error by improving the scheduling 🙂

frame reservation error, dispatchProcToJob failed to book next frame, com.imageworks.spcue.dispatcher.FrameReservationException: the frame ... was updated by another thread.

An idea. Introduce a new frame state, SCHEDULING or something like that.

  1. findNextDispatchFrames
    • Update the frame state to SCHEDULING and set the current timestamp on ts_updated where frame.str_state='WAITING', or where a certain amount of time has passed since frame.ts_updated with frame.str_state='SCHEDULING', to reclaim stray frames left behind by a Cuebot crash.
    • At the same time, retrieve the updated frames with a RETURNING clause (a sketch follows the list).
  2. Schedule the frame!
    • the frame ... was updated by another thread won't happen because findNextDispatchFrames is atomic.
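
A minimal sketch of what that atomic claim could look like, assuming a PostgreSQL backend and the frame / str_state / ts_updated names used above (the real OpenCue schema, query layer, and the 5-minute reclaim timeout are assumptions); psycopg2 is used purely for illustration:

```python
# Hypothetical claim query: flip WAITING frames to SCHEDULING and return the
# claimed rows in a single atomic statement. Frames stuck in SCHEDULING for
# too long (e.g. after a Cuebot crash) are reclaimed as well.
import psycopg2

CLAIM_FRAMES_SQL = """
UPDATE frame
SET str_state = 'SCHEDULING',
    ts_updated = now()
WHERE pk_frame IN (
    SELECT pk_frame
    FROM frame
    WHERE str_state = 'WAITING'
       OR (str_state = 'SCHEDULING'
           AND ts_updated < now() - interval '5 minutes')
    ORDER BY ts_updated
    LIMIT %s
    FOR UPDATE SKIP LOCKED  -- concurrent Cuebots skip rows already being claimed
)
RETURNING pk_frame;
"""

def find_next_dispatch_frames(conn, limit=10):
    """Claim up to `limit` frames; each frame is handed to exactly one caller."""
    with conn.cursor() as cur:
        cur.execute(CLAIM_FRAMES_SQL, (limit,))
        claimed = [row[0] for row in cur.fetchall()]
    conn.commit()
    return claimed
```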

@splhack
Contributor

splhack commented Nov 30, 2021

Summarized an experimental optimization and the theory in #1069

To solve the scalability issues in #1012 and #1069, my hunch is that we need some sort of central scheduler process (or one of the Cuebot instances could play that role).

Possible Logic

  • Cuebots notify the central scheduler process when
    • Cuebot received a job submission or a job priority change (group change)
      • The central scheduler process inserts/moves the job to the sorted job list
    • Cuebot dispatched a frame
      • The central scheduler process updates the job CPU/GPU utilization
    • Cuebot received a frame complete message or Cuebot stopped a frame
      • The central scheduler process updates the job CPU/GPU utilization
      • Also remove the job from the sorted job list if it finished
  • The central scheduler process can also reconstruct the sorted job list and the CPU/GPU utilization from scratch.
  • Cuebots get a frame from the central scheduler process for dispatch (a sketch of this event handling follows the list).
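
A minimal in-memory sketch of the event handling described above; all names (JobEntry, CentralScheduler, the callback signatures) are hypothetical, and persistence/transport are left out:

```python
# In-memory sketch of the central scheduler state; no persistence, no RPC.
from dataclasses import dataclass

@dataclass
class JobEntry:
    job_id: str
    priority: int
    ts_started: float
    cpu_in_use: int = 0
    gpu_in_use: int = 0

class CentralScheduler:
    def __init__(self):
        self.jobs = {}  # job_id -> JobEntry

    # Cuebot received a job submission or a priority (group) change.
    def on_job_submitted(self, job_id, priority, ts_started):
        self.jobs[job_id] = JobEntry(job_id, priority, ts_started)

    # Cuebot dispatched a frame: bump the job's CPU/GPU utilization.
    def on_frame_dispatched(self, job_id, cores, gpus):
        job = self.jobs[job_id]
        job.cpu_in_use += cores
        job.gpu_in_use += gpus

    # Frame completed or was stopped; drop the job once it is finished.
    def on_frame_complete(self, job_id, cores, gpus, job_finished=False):
        job = self.jobs[job_id]
        job.cpu_in_use -= cores
        job.gpu_in_use -= gpus
        if job_finished:
            del self.jobs[job_id]

    # Cuebots ask which job to dispatch the next frame from:
    # highest priority first, oldest submission breaks ties.
    def next_job(self):
        if not self.jobs:
            return None
        return min(self.jobs.values(), key=lambda j: (-j.priority, j.ts_started))
```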

Maybe LevelDB is one of the best solutions here. Two maps (a key-layout sketch follows the list):

  1. Sorted job list
    • Key: the combination of int_priority + ts_started (maybe a ULID), or a random number for the current round-robin scheduling
    • Value: Job UUID, group/job/layer CPU/GPU utilization
  2. Job UUID to key map
    • Key: Job UUID
    • Value: the key of the corresponding sorted job list entry
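
A rough sketch of those two maps using the plyvel LevelDB binding; the key layout, JSON value encoding, and map prefixes are assumptions for illustration:

```python
# Sketch of the two LevelDB maps with the plyvel binding; key layout and
# JSON value encoding are illustrative only.
import json
import struct
import time
import uuid

import plyvel

db = plyvel.DB('/tmp/scheduler-db', create_if_missing=True)
sorted_jobs = db.prefixed_db(b'sorted_jobs:')  # sort key -> job info
job_index = db.prefixed_db(b'job_index:')      # job UUID -> sort key

def sort_key(priority: int, ts_started: float) -> bytes:
    # Big-endian packing keeps lexicographic byte order equal to numeric order.
    return struct.pack('>IQ', priority, int(ts_started * 1000))

def add_job(job_id: str, priority: int, utilization: dict):
    key = sort_key(priority, time.time())
    value = json.dumps({'job_id': job_id, 'utilization': utilization}).encode()
    sorted_jobs.put(key, value)
    job_index.put(job_id.encode(), key)

def remove_job(job_id: str):
    key = job_index.get(job_id.encode())
    if key is not None:
        sorted_jobs.delete(key)
        job_index.delete(job_id.encode())

def highest_priority_job():
    # Reverse iteration starts at the largest key, i.e. the highest priority.
    for _, value in sorted_jobs.iterator(reverse=True):
        return json.loads(value)
    return None

add_job(str(uuid.uuid4()), priority=100, utilization={'cpu': 0, 'gpu': 0})
print(highest_priority_job())
```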

@DiegoTavares
Collaborator Author

I like the idea of a central scheduler process. We're currently evaluating Redis Streams as an option to handle not only the job queue but also the HostReports. I will update this issue as soon as we have more to share.

@oliviascarfone
Contributor

Proposal - High level overview

Central Scheduler Design Logic

Use Redis Streams for incoming HostReports and for dispatching jobs. Redis Streams provide a persistent store of ordered events and can carry multiple key/value fields per event.

This approach will decouple the processing of HostReports from the dispatch of jobs. Redis Streams with consumer groups guarantee that each message is given to a different consumer (the same message will not reach multiple consumers within the same group). This addresses the current flaw where Cuebot instances will assign jobs that have already been dispatched by other Cuebot instances. There will be two types of streams: one in which RQD publishes HostReports that are consumed by Cuebot, and another where Cuebot publishes available jobs and RQD instances consume them. In the latter case, Cuebot will periodically query the database to get the list of jobs that are available for processing.

Logic for Host Reports Queue

  • RQD acts as the producer and sends HostReports to a dedicated Redis Stream

    • At least once semantics (dropping is less critical, as host reports are sent on an interval)
  • All Cuebot instances are added to the same Consumer Group and are listening for incoming messages.

  • HostReports are then stored in the database

  • In RQD: create the connection to the Redis server in the RqCore module

  • In Cuebot: create a RedisConsumer class that is initialized at application start-up, connects to the Redis server, and awaits incoming messages (a sketch follows this list)
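
A rough sketch of the HostReport stream using redis-py consumer groups; the stream, group, and field names are made up for illustration, and the database write is reduced to a placeholder comment:

```python
# Sketch of the HostReport stream with redis-py consumer groups.
import json
import socket

import redis

r = redis.Redis(host='localhost', port=6379)
STREAM = 'hostreports'
GROUP = 'cuebots'

# RQD side: publish a host report to the stream.
def publish_host_report(report: dict):
    r.xadd(STREAM, {'report': json.dumps(report)})

# Cuebot side: every instance joins the same consumer group, so each
# report is delivered to exactly one Cuebot instance.
def consume_host_reports(consumer_name=socket.gethostname()):
    try:
        r.xgroup_create(STREAM, GROUP, id='$', mkstream=True)
    except redis.exceptions.ResponseError:
        pass  # the group already exists
    while True:
        entries = r.xreadgroup(GROUP, consumer_name, {STREAM: '>'},
                               count=10, block=5000)
        for _stream, messages in entries:
            for msg_id, fields in messages:
                report = json.loads(fields[b'report'])
                # ... store the report in the database here ...
                r.xack(STREAM, GROUP, msg_id)
```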

Logic for Job Queue

  • Cuebot acts as the producer. Using a service like ZooKeeper to elect one instance as the leader, the leader polls the database on an interval for available jobs and publishes them, highest priority first, to the Redis Stream dedicated to pending jobs

  • Limit the number of Cuebot instances accessing the database directly to one (only the elected leader accesses the db)

  • RQD, as the consumer, gets the highest-priority job and determines whether it can run on that host. RQD will acknowledge the job if it can be run; otherwise the job remains pending for other hosts to check (a sketch follows this list).

    • Another idea: configure consumer groups based on RQD host characteristics and dispatch jobs to specific consumer groups based on job requirements
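
A rough sketch of the pending-jobs stream under the same assumptions (redis-py, hypothetical stream/group/field names): the leader publishes jobs ordered by priority, and an RQD host only acknowledges a job it can actually run.

```python
# Sketch of the pending-jobs stream; assumes the consumer group was created
# as in the previous sketch (XGROUP CREATE ... MKSTREAM).
import json

import redis

r = redis.Redis(host='localhost', port=6379)
JOB_STREAM = 'pending_jobs'
RQD_GROUP = 'rqd_hosts'

# Leader Cuebot: publish the polled jobs, highest priority first.
def publish_pending_jobs(jobs):
    for job in sorted(jobs, key=lambda j: j['priority'], reverse=True):
        r.xadd(JOB_STREAM, {'job': json.dumps(job)})

def can_run(job, host_resources):
    # Hypothetical requirements check: enough free cores and GPUs on this host.
    return (host_resources['cores'] >= job.get('min_cores', 1)
            and host_resources['gpus'] >= job.get('min_gpus', 0))

# RQD: read a job and acknowledge it only if this host can run it; an
# unacknowledged job stays pending and can later be claimed by another
# host (e.g. via XAUTOCLAIM after an idle timeout).
def try_claim_job(host_name, host_resources):
    entries = r.xreadgroup(RQD_GROUP, host_name, {JOB_STREAM: '>'},
                           count=1, block=1000)
    for _stream, messages in entries:
        for msg_id, fields in messages:
            job = json.loads(fields[b'job'])
            if can_run(job, host_resources):
                r.xack(JOB_STREAM, RQD_GROUP, msg_id)
                return job
    return None
```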

@thunders82

thunders82 commented Apr 1, 2022

Hi,

It's nice to see you are looking into the scheduler logic redesign.
I shared my opinion about the way GPU nodes are handled in #991 and was kindly pointed to this thread by @splhack.

Currently, if a GPU node is not in use by any GPU job, it will not accept any CPU job, which is a waste of resources. I was wondering if it would be possible to implement logic along these lines:

Priority to GPU tasks on GPU nodes:

  • A CPU job should be assigned to a GPU node only if no more CPU nodes are available and there is no GPU job in the queue.
  • If a CPU job is running on a GPU node and no other GPU node is available --> kill the CPU job and retry it on a different node to make room for the GPU job waiting in the queue (see the sketch below the list).
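
A minimal sketch of that policy as two decision helpers; every name here is hypothetical and the host/job objects are assumed to exist elsewhere:

```python
# Hypothetical helpers for the suggested policy.
def pick_node_for_cpu_job(cpu_nodes_free, gpu_nodes_free, gpu_jobs_waiting):
    """A CPU job spills onto a GPU node only when no CPU node is free
    and no GPU job is waiting for that capacity."""
    if cpu_nodes_free:
        return cpu_nodes_free[0]
    if gpu_nodes_free and not gpu_jobs_waiting:
        return gpu_nodes_free[0]
    return None

def maybe_preempt_for_gpu_job(gpu_nodes_free, gpu_nodes_running_cpu_jobs):
    """If every GPU node is busy and one of them is running a CPU job,
    kill that CPU job so it can be retried on another node."""
    if not gpu_nodes_free and gpu_nodes_running_cpu_jobs:
        victim = gpu_nodes_running_cpu_jobs[0]
        victim.kill_cpu_job(retry_elsewhere=True)  # hypothetical host API
        return victim
    return None
```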

What do you think?

Thank you
