[9.0] Simplify and speed up the SiteDirector #7110
src/DIRAC/WorkloadManagementSystem/Service/PilotManagerHandler.py
This is not always the case. Apart from our certification system(s), some production installations don't have that either.
We might discuss this for DiracX, not for DIRAC.
@deprecated("Use addPilotTQRef")
This deprecation was introduced for v9.
Ah yes indeed.
How should we handle that?
Above you said:
The name of the method should then be changed to export_addPilotReference, and taskQueueID would not be needed anymore. But really it should be better to then deprecate the old method name (could be done in v8r0).
Should I make a PR targeting v8.0 to introduce addPilotReferences and deprecate addPilotTQReference (which would be removed from v9.0)?
In this case addPilotTQRef would also be removed from v9.0.
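To make the proposal concrete, here is a minimal sketch of what "introduce the new name, deprecate the old one" could look like. It is not the actual PilotManagerHandler code: the `deprecated` decorator below is a stdlib stand-in, the class `PilotManagerSketch` is hypothetical, and the method signatures and the `{"OK": ..., "Value": ...}` return shape are assumptions made for illustration.

```python
import functools
import warnings


def deprecated(reason):
    """Stand-in deprecation decorator (assumption): warn, then call through."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated: {reason}",
                DeprecationWarning,
                stacklevel=2,
            )
            return func(*args, **kwargs)

        return wrapper

    return decorator


class PilotManagerSketch:
    """Toy handler used only to illustrate the renaming (hypothetical)."""

    def __init__(self):
        self.references = []

    def export_addPilotReferences(self, pilotRefs):
        """New entry point: no taskQueueID argument."""
        self.references.extend(pilotRefs)
        return {"OK": True, "Value": len(self.references)}

    @deprecated("Use addPilotReferences")
    def export_addPilotTQReference(self, pilotRefs, taskQueueID):
        """Old entry point kept for one release; taskQueueID is accepted but ignored."""
        return self.export_addPilotReferences(pilotRefs)
```

With this shape, existing callers of the old name keep working during the transition and only see a DeprecationWarning, while new callers can drop the taskQueueID argument right away.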
The deprecation was introduced in #7157 because of the DB simplifications and the removal of the DN. Maybe what you are suggesting would work.
I created #7358
This PR aims at removing the extra logic checking the number of waiting jobs and pilots before submitting new pilots.
This logic does not work well in some cases:
Here are the main changes:
- introduction of submission policies:
  - AggressiveFilling: will just fill up the available slots, no matter whether there are waiting jobs (good for sites that are always processing jobs)
  - WaitingSupportedJobs: will only fill up some of the available slots, according to the number of waiting jobs. It does not take into account the number of already submitted pilots for these jobs (might result in some unused pilots but should not be a big issue, as it should target sites that are just occasionally used).
- parallel submission: submission of pilots can now be done in parallel on the different CEs.
- parallel accounting in the monitoring phase: this was done sequentially before, and it was taking a lot of time.
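As a rough illustration of the two policies and of the parallel submission, the sketch below computes a pilot count per CE and submits to all CEs concurrently with a thread pool. It is not the SiteDirector implementation: the function names, the `queues` layout and the `submit_to_ce` placeholder are hypothetical, and capping WaitingSupportedJobs at one pilot per waiting job is an assumption about what "according to the number of waiting jobs" means.

```python
from concurrent.futures import ThreadPoolExecutor


def aggressive_filling(available_slots, waiting_jobs):
    """Fill every available slot, whether or not jobs are waiting."""
    return available_slots


def waiting_supported_jobs(available_slots, waiting_jobs):
    """Assumed rule: at most one pilot per waiting job, capped by the free slots."""
    return min(available_slots, waiting_jobs)


# Policy names come from the PR description; the mapping itself is illustrative.
POLICIES = {
    "AggressiveFilling": aggressive_filling,
    "WaitingSupportedJobs": waiting_supported_jobs,
}


def submit_to_ce(ce_name, n_pilots):
    """Placeholder for a real CE submission; returns fake pilot references."""
    return [f"{ce_name}-pilot-{i}" for i in range(n_pilots)]


def submit_all(queues, max_workers=4):
    """Submit to every CE concurrently.

    `queues` maps a CE name to (policy name, available slots, waiting jobs).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            ce: pool.submit(submit_to_ce, ce, POLICIES[policy](slots, waiting))
            for ce, (policy, slots, waiting) in queues.items()
        }
        # Collect results per CE; an exception in one submission surfaces here.
        return {ce: future.result() for ce, future in futures.items()}
```

For example, `submit_all({"ce1": ("AggressiveFilling", 5, 0), "ce2": ("WaitingSupportedJobs", 5, 2)})` would request five pilots on `ce1` and two on `ce2`, with both submissions running in separate threads.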
Other suggestions to discuss:
- PilotSubmitter and PilotMonitor (could be merged with PilotStatusAgent):
  - PilotSubmitter and PilotMonitor if one of each is not enough for every site (this will surely be the case for LHCb).
- SoftFilling: would check the N last Running (for some time), Done pilots and would get the average number of currentJobID. Then it would adapt the number of pilots to submit according to the value (e.g. for a given queue, 80% of the N pilots have a currentJobID, which means we have too many pilots, so only 80% of the MAX waiting pilots should be in the queue). This would be an intermediate policy that would fit sites generally processing some jobs, but not enough to entirely fill up the queues with pilots.

BEGINRELEASENOTES
*WorkloadManagement
CHANGE: simplify and speed up the SiteDirector
ENDRELEASENOTES
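
The SoftFilling arithmetic suggested above can be sketched as follows. This is only one possible reading of the proposal: the function name, the pilot dictionaries and the fallback when there is no history are all assumptions, and only the `currentJobID` field name comes from the description.

```python
def soft_filling_target(last_pilots, max_waiting_pilots):
    """Scale the waiting-pilot cap by the share of recent pilots that ran a job.

    `last_pilots` is a list of dicts describing the N last pilots; a pilot
    counts as used when it has a truthy "currentJobID".
    """
    if not last_pilots:
        # Assumption: with no history, fall back to the full cap.
        return max_waiting_pilots
    used = sum(1 for pilot in last_pilots if pilot.get("currentJobID"))
    # Integer arithmetic keeps the result a whole number of pilots (rounded down).
    return (max_waiting_pilots * used) // len(last_pilots)
```

With 10 recent pilots of which 8 have a `currentJobID`, and a cap of 50 waiting pilots, the function returns 40, which matches the 80% example in the description.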