[9.0] Simplify and speed up the SiteDirector #7110

Merged
merged 5 commits into from
Jan 12, 2024

Conversation

aldbr
Contributor

@aldbr aldbr commented Jul 18, 2023

This PR removes the extra logic that checks the number of waiting jobs and pilots before submitting new pilots.
This logic does not work well in some cases:

  • sites are always processing jobs (this is the case of LHCb, not sure it is the case for all VOs)
  • multiple site directors run in parallel (two of them might submit pilots at the same time for the same task queue, which means the logic of "preventing one from submitting new pilots if another has already submitted the number needed to process the waiting jobs" is not useful)

Here are the main changes:

  • introduction of submission policies:

    • AggressiveFilling: will just fill up the available slots, no matter whether there are waiting jobs (good for sites that are always processing jobs)
    • WaitingSupportedJobs: will only fill up some of the available slots, according to the number of waiting jobs. It does not take into account the number of already submitted pilots for these jobs (might result in some unused pilots but should not be a big issue, as it should target sites that are just occasionally used).
  • parallel submission: submission of pilots can now be done in parallel on the different CEs.

  • parallel accounting in the monitoring phase: this was done sequentially before, and it was taking a lot of time.
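The two policies and the parallel submission step could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `SubmissionPolicy` interface, the `pilots_to_submit` method, the `submit_in_parallel` helper and the `submit` callback are all hypothetical names, and capping `WaitingSupportedJobs` at `min(available_slots, waiting_jobs)` is an assumption about what "some of the available slots" means.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor


class SubmissionPolicy(ABC):
    """Hypothetical interface: decides how many pilots to submit to one queue."""

    @abstractmethod
    def pilots_to_submit(self, available_slots: int, waiting_jobs: int) -> int:
        ...


class AggressiveFilling(SubmissionPolicy):
    """Fill every available slot, whether or not jobs are waiting."""

    def pilots_to_submit(self, available_slots: int, waiting_jobs: int) -> int:
        return max(available_slots, 0)


class WaitingSupportedJobs(SubmissionPolicy):
    """Fill slots only up to the number of waiting jobs (assumed cap).

    Pilots already submitted for those jobs are deliberately ignored,
    so a few pilots may end up unused.
    """

    def pilots_to_submit(self, available_slots: int, waiting_jobs: int) -> int:
        return max(min(available_slots, waiting_jobs), 0)


def submit_in_parallel(policy, queues, submit, max_workers=4):
    """Apply the policy to every CE queue concurrently.

    queues: mapping of queue name -> (available_slots, waiting_jobs)
    submit: hypothetical callable (queue_name, n_pilots) doing the real submission
    Returns a mapping of queue name -> number of pilots requested.
    """

    def _submit_one(item):
        name, (slots, waiting) = item
        count = policy.pilots_to_submit(slots, waiting)
        if count > 0:
            submit(name, count)
        return name, count

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(_submit_one, queues.items()))
```

A thread pool is used here because submission to a CE is mostly I/O bound; whether the PR itself uses threads is not stated in this conversation.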

Other suggestions to discuss:

  • splitting the submission and the monitoring operations into two agents PilotSubmitter and PilotMonitor (could be merged with PilotStatusAgent):
    • Pros: both operations would be executed quickly (neither would hamper the other).
    • Cons: might result in multiple instances of both PilotSubmitter and PilotMonitor if one of each is not enough for every site (this will surely be the case for LHCb).
  • adding an intermediate submission policy SoftFilling: would check the N last Running (for some time) and Done pilots and would get the average number of currentJobID. Then it would adapt the number of pilots to submit according to that value (e.g. for a given queue, 80% of the N pilots have a currentJobID, which means we have too many pilots, so only 80% of the MAX waiting pilots should be in the queue). This would be an intermediate policy that would fit sites that generally process some jobs, but not enough to entirely fill up the queues with pilots.
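The SoftFilling idea can be illustrated with a small sketch of the scaling rule in the example above. The function name, the dictionary layout of a pilot record, and the fallback used when there is no pilot history are all assumptions made for illustration; only the `currentJobID` field and the "fraction of MAX waiting pilots" rule come from the description.

```python
def soft_filling_target(recent_pilots, max_waiting_pilots):
    """Scale the waiting-pilot cap by the share of recent pilots that got a job.

    recent_pilots: the N most recent Running/Done pilots, each a dict that may
                   carry a truthy "currentJobID" (assumed record layout)
    max_waiting_pilots: the configured MAX waiting pilots for the queue
    """
    if not recent_pilots:
        # Assumption: with no history, fall back to the configured maximum.
        return max_waiting_pilots

    with_job = sum(1 for pilot in recent_pilots if pilot.get("currentJobID"))
    # Integer arithmetic avoids float rounding; the result is rounded down.
    return with_job * max_waiting_pilots // len(recent_pilots)


# 8 of the last 10 pilots matched a job, so 80% of a cap of 50 is kept.
history = [{"currentJobID": 100 + i} for i in range(8)] + [{}, {}]
print(soft_filling_target(history, 50))  # → 40
```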

BEGINRELEASENOTES
*WorkloadManagement
CHANGE: simplify and speed up the SiteDirector
ENDRELEASENOTES

@fstagni
Contributor

fstagni commented Jul 24, 2023

  • sites are always processing jobs (this is the case of LHCb, not sure it is the case for all VOs)

This is not always the case. Apart from our certification system(s), some production installations don't have that either.

@fstagni
Contributor

fstagni commented Jul 24, 2023

splitting the submission and the monitoring operations into two agents PilotSubmitter and PilotMonitor (could be merged with PilotStatusAgent):

  • Pros: both operations would be executed quickly (neither would hamper the other).
  • Cons: might result in multiple instances of both PilotSubmitter and PilotMonitor if one of each is not enough for every site (this will surely be the case for LHCb).

We might discuss this for DiracX, not for DIRAC.

@aldbr aldbr force-pushed the v8.1_FEAT_SDLogicChange branch 4 times, most recently from f4c9c19 to 5c87e00 Compare September 22, 2023 06:52
@aldbr aldbr marked this pull request as ready for review September 22, 2023 06:55
@aldbr aldbr requested a review from atsareg as a code owner September 22, 2023 06:55
@aldbr aldbr force-pushed the v8.1_FEAT_SDLogicChange branch from 5c87e00 to 00cc3e7 Compare September 22, 2023 13:55
@fstagni fstagni self-requested a review October 5, 2023 08:33
@fstagni fstagni changed the title [8.1] Simplify and speed up the SiteDirector [9.0] Simplify and speed up the SiteDirector Oct 31, 2023
@aldbr aldbr force-pushed the v8.1_FEAT_SDLogicChange branch 6 times, most recently from 55c4016 to 78d2301 Compare November 16, 2023 10:28
@aldbr aldbr requested a review from andresailer as a code owner November 16, 2023 10:28

@deprecated("Use addPilotTQRef")
Contributor

This deprecation was introduced for v9.

Contributor Author

@aldbr aldbr Dec 7, 2023

Ah yes indeed.
How should we handle that?

Above you said:

The name of the method should then be changed to export_addPilotReference, and taskQueueID would not be needed anymore. But really it should be better to then deprecate the old method name (could be done in v8r0).

Should I make a PR targeting v8.0 to introduce addPilotReferences and deprecate addPilotTQReference (which would be removed from v9.0)?
In this case addPilotTQRef would also be removed from v9.0.

Contributor

The deprecation was introduced in #7157 because of the DB simplifications and the removal of DN. Maybe what you are suggesting would work.

Contributor Author

I created #7358

@aldbr aldbr force-pushed the v8.1_FEAT_SDLogicChange branch from 372788c to f9cd518 Compare December 8, 2023 16:22
@fstagni fstagni merged commit b4a2e9a into DIRACGrid:integration Jan 12, 2024
23 checks passed
@DIRACGridBot DIRACGridBot added the sweep:ignore Prevent sweeping from being ran for this PR label Jan 12, 2024