Hi @lipari: I am looking more closely into the scheduler code that may be on the critical path of job throughput as part of #183, and I came across this code.
I may have read the code wrong, but it seems our backfill plugin can perform a full resource tree search many times while trying to reserve the nodes that will be de-allocated soonest. My guess is that this may not scale as well as we would like, in particular if the reservation is large and a relatively large number of small jobs are currently running.
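To make the scaling concern concrete, here is a toy sketch, not the actual resrc or backfill plugin code. Every name in it (`Node`, `free_at`, `reserve_by_repeated_search`, `reserve_by_single_pass`) is hypothetical, and it assumes each node runs one small job with a distinct completion time. It only counts how many tree nodes get visited when the whole tree is re-searched once per job completion time, compared with a single pass that collects the completion times once.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical resource-tree vertex; leaves stand in for compute nodes."""
    name: str
    free_at: int = 0  # time at which the leaf becomes free (0 = idle now)
    children: list = field(default_factory=list)


def walk(node, visits):
    """Yield every leaf under `node`, counting each vertex visited in visits[0]."""
    visits[0] += 1
    if not node.children:
        yield node
    for child in node.children:
        yield from walk(child, visits)


def reserve_by_repeated_search(root, needed):
    """Re-walk the full tree at each completion time until `needed` leaves are free."""
    visits = [0]
    completion_times = sorted({leaf.free_at for leaf in walk(root, visits)})
    for t in completion_times:
        free = [leaf for leaf in walk(root, visits) if leaf.free_at <= t]
        if len(free) >= needed:
            return t, visits[0]
    return None, visits[0]


def reserve_by_single_pass(root, needed):
    """Walk the tree once; the needed-th smallest free time is the reservation time."""
    visits = [0]
    times = sorted(leaf.free_at for leaf in walk(root, visits))
    if len(times) < needed:
        return None, visits[0]
    return times[needed - 1], visits[0]


def build_cluster(racks, nodes_per_rack):
    """Build cluster -> racks -> nodes, each node busy until a distinct time."""
    root = Node("cluster")
    t = 0
    for r in range(racks):
        rack = Node(f"rack{r}")
        for n in range(nodes_per_rack):
            t += 1
            rack.children.append(Node(f"node{r}-{n}", free_at=t))
        root.children.append(rack)
    return root


if __name__ == "__main__":
    cluster = build_cluster(racks=4, nodes_per_rack=4)
    print(reserve_by_repeated_search(cluster, needed=16))
    print(reserve_by_single_pass(cluster, needed=16))
```

In this toy model both functions pick the same reservation time, but the repeated search visits the tree once per distinct completion time, so its cost grows with (tree size) x (number of running jobs), while the single pass stays proportional to the tree size. Whether the real plugin behaves this way is exactly what would need to be measured.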
You and I also discussed other ways to improve tree traversal and lookup. Please regard this issue as a handle to discuss issues relevant to resrc performance and scalability.
Just to be clear, I haven't quantified the impact of this on overall job throughput, and I am not asking you to take on any immediate work.
FYI: PR #274 has landed; it will be one of the core performance-guaranteeing layers for this problem. The early findings for the scheduler-driven aggregate update scheme are documented in #269.