Auto-reschedule jobs waiting for long time? #7802

Open
fstagni opened this issue Sep 23, 2024 · 1 comment

fstagni commented Sep 23, 2024

What you are pointing out is one of the (several) conditions under which we can create jobs that stay in "Waiting" status for a potentially long time, maybe "forever". There are at least 2 connected, unavoidable cases:

  1. At time x the job goes to "Waiting", after getting its replicas. At time x+y (before any matching attempt) the site's RunningLimit is set to 0, and all further matching attempts will fail.
  2. Jobs without input data do not check their replicas. A user/bot can ask to run at a specific site whose RunningLimit is 0, with or without implementing your proposal.
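
To make case 1 concrete, here is a toy model of the race. It is only a sketch: the `Job` class, the `can_match` helper and the two dictionaries are hypothetical stand-ins and not the actual DIRAC matching code; only the idea of a per-site RunningLimit comes from the discussion above.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical, minimal job record (not a DIRAC class)."""
    job_id: int
    site: str
    status: str = "Waiting"


def can_match(job, running_limits, running_counts):
    """Return True if the job's site still has a free running slot."""
    limit = running_limits.get(job.site)
    if limit is None:  # assumption: no entry means no limit for this site
        return True
    return running_counts.get(job.site, 0) < limit


running_limits = {"SiteA": 10}   # hypothetical per-site RunningLimit values
running_counts = {"SiteA": 3}    # hypothetical number of jobs running now
job = Job(job_id=1, site="SiteA")

# Time x: the job enters "Waiting" and could still be matched.
matchable_at_x = can_match(job, running_limits, running_counts)

# Time x+y: the limit drops to 0 before any matching attempt happens.
running_limits["SiteA"] = 0
matchable_later = can_match(job, running_limits, running_counts)
```

Nothing in this model ever changes `job.status`, so the job stays in "Waiting" even though it can no longer be matched, which is the situation described in case 1.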

The list could go on but, long story short, there is no way to fully avoid creating jobs that will wait for a "long" time.

I also do not much like the getReplicasForJobs checks.

Another possibility is to reset jobs that have been in "Waiting" for a long time (because conditions such as the allowed replicas might have changed in the meantime; that is why the JobWrapper calls getReplicasForJobs again). Would that be a bad idea? Did we by chance already think of that in the past? -- cc @atsareg
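
A rough sketch of how such reset candidates could be selected. It assumes each job is a plain dictionary with hypothetical `JobID`, `Status` and `LastUpdateTime` keys, and that the threshold is an arbitrary choice; it does not call any DIRAC API and does not perform the reset itself.

```python
from datetime import datetime, timedelta


def find_long_waiting(jobs, now, threshold):
    """Return the IDs of jobs that have been "Waiting" longer than threshold.

    jobs: iterable of dicts with hypothetical keys "JobID", "Status"
          and "LastUpdateTime" (a datetime).
    """
    return [
        job["JobID"]
        for job in jobs
        if job["Status"] == "Waiting"
        and now - job["LastUpdateTime"] > threshold
    ]


now = datetime(2024, 9, 23, 12, 0)
jobs = [
    {"JobID": 1, "Status": "Waiting", "LastUpdateTime": now - timedelta(days=5)},
    {"JobID": 2, "Status": "Waiting", "LastUpdateTime": now - timedelta(hours=2)},
    {"JobID": 3, "Status": "Running", "LastUpdateTime": now - timedelta(days=9)},
]

# Assumed threshold of 3 days; only job 1 is both "Waiting" and old enough.
candidates = find_long_waiting(jobs, now, threshold=timedelta(days=3))
```

Whether these candidates should then be reset automatically is exactly the open question.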

Originally posted by @fstagni in #7735 (comment)

@fstagni fstagni added the WMS label Sep 23, 2024
@fstagni fstagni added this to the v8.0 milestone Sep 23, 2024

fstagni commented Oct 3, 2024

From discussion: monitoring jobs in "Waiting" status (for both long and short periods) is useful, but we should not take immediate actions such as a reset.
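
As an illustration of what monitoring without action could look like, the sketch below only counts "Waiting" jobs per age bucket. The bucket edges, the labels and the input format (a list of timestamps at which each job entered "Waiting") are assumptions made for the example, not an existing DIRAC monitoring interface.

```python
import bisect
from datetime import datetime, timedelta

# Assumed bucket boundaries; tune them to whatever "short" and "long" mean.
BUCKET_EDGES = [timedelta(hours=1), timedelta(days=1), timedelta(days=7)]
BUCKET_LABELS = ["<1h", "1h-1d", "1d-7d", ">7d"]


def waiting_age_histogram(waiting_since, now):
    """Count waiting jobs per age bucket, without touching the jobs."""
    counts = {label: 0 for label in BUCKET_LABELS}
    for since in waiting_since:
        # bisect_right puts an age exactly on an edge into the older bucket
        index = bisect.bisect_right(BUCKET_EDGES, now - since)
        counts[BUCKET_LABELS[index]] += 1
    return counts


now = datetime(2024, 10, 3, 12, 0)
waiting_since = [
    now - timedelta(minutes=5),
    now - timedelta(hours=3),
    now - timedelta(days=2),
    now - timedelta(days=10),
]
histogram = waiting_age_histogram(waiting_since, now)
```

Counts like these can be plotted or alerted on, leaving any decision about the jobs to an operator.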

@fstagni fstagni self-assigned this Oct 16, 2024