Scheduler Logic redesign #1001
Comments
|
In our environment, no matter how many Cuebot instances there are, the frame launch rate is about 8 frames per second. It could be faster if we avoided this error by improving the scheduling 🙂
An idea. Introduce a new frame state,
|
I summarized an experimental optimization and the theory behind it in #1069. To solve the scalability issues in #1012 and #1069, my hunch is that we need some sort of central scheduler process (or one of the Cuebot instances could act as one).
Possible Logic
LevelDB may be one of the best solutions. Two maps.
|
I like the idea of a central scheduler process. We're currently evaluating Redis Streams as an option to handle not only the job queue but also the HostReports. Will update this issue as soon as we have more to share.
Proposal - High level overview
Central Scheduler Design Logic
Use Redis Streams for incoming HostReports and for dispatching jobs. Redis Streams support persistent storage and ordered events, and each event can carry multiple key/value pairs. This approach decouples the processing of HostReports from the dispatch of jobs.
Redis Streams with consumer groups guarantee that each message is delivered to only one consumer in the group (the same message will not reach multiple consumers within the same group). This addresses the current flaw where Cuebot instances assign jobs that have already been dispatched to other Cuebot instances.
There will be two types of streams: one in which RQD publishes HostReports that are consumed by Cuebot, and another in which Cuebot publishes available jobs and RQDs consume them. In the latter case, Cuebot will periodically query the database to get a list of jobs that are available for processing.
Logic for Host Reports Queue
Logic for Job Queue
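To make the consumer-group guarantee above concrete, here is a minimal in-memory sketch of the delivery semantics the proposal relies on. It is not Redis and not Cuebot code: the `ConsumerGroup` class, the consumer names `rqd-a`/`rqd-b`, and the frame names are all hypothetical, chosen only for illustration. The three methods loosely mirror what the Redis commands `XADD`, `XREADGROUP`, and `XACK` do for a single group.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ConsumerGroup:
    """Toy model of one stream read through one consumer group (assumption: not real Redis)."""

    entries: List[Tuple[int, dict]] = field(default_factory=list)  # ordered (id, fields)
    next_index: int = 0  # index of the first entry not yet delivered to any consumer
    pending: Dict[int, str] = field(default_factory=dict)  # entry id -> consumer holding it

    def add(self, fields: dict) -> int:
        """Append an entry and return its id (roughly what XADD does)."""
        entry_id = len(self.entries) + 1
        self.entries.append((entry_id, fields))
        return entry_id

    def read(self, consumer: str) -> Optional[Tuple[int, dict]]:
        """Hand the next undelivered entry to exactly one consumer (roughly XREADGROUP)."""
        if self.next_index >= len(self.entries):
            return None
        entry = self.entries[self.next_index]
        self.next_index += 1
        self.pending[entry[0]] = consumer  # tracked until acknowledged
        return entry

    def ack(self, entry_id: int) -> bool:
        """Mark an entry as processed (roughly XACK); False if it was not pending."""
        return self.pending.pop(entry_id, None) is not None


# Hypothetical usage: three frames published, two consumers competing for them.
jobs = ConsumerGroup()
for frame in ("frame-0001", "frame-0002", "frame-0003"):
    jobs.add({"frame": frame})

delivered = {}
for consumer in ("rqd-a", "rqd-b", "rqd-a", "rqd-b"):
    entry = jobs.read(consumer)
    if entry is not None:
        delivered[entry[0]] = consumer

print(delivered)
```

Each entry ends up with exactly one owner, which is the property that would keep two consumers from picking up the same job. Entries stay in `pending` until acknowledged, so a real design would also need a policy for work whose consumer dies before acknowledging it.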
|
Hi, it's nice to see you are looking into the scheduler logic redesign. Currently, a GPU host whose GPU is not in use by any GPU job will not accept any CPU job, which wastes resources. I was wondering whether it would be possible to implement logic similar to the following, giving priority to GPU tasks on GPU nodes:
What do you think? Thank you
Opening an issue to start drafting a proposal for the new scheduler logic, as discussed in the last TSC meeting (Jul 21).
Problems with the current design
Proposal
TBD