change how we handle user sandbox #7461
Comments
Wa and I had a quick chat about this. We had a simple concern: how should we deal with HTTP errors when downloading sandboxes from S3 directly into the schedd? With this option we would have a task that is properly submitted, with a running dagman_bootstrap, but that fails to retrieve the sandbox and therefore cannot submit jobs to the vanilla universe. How long do we keep trying to download the sandbox? How many attempts do we make? Should we put the dagman on hold if it fails for more than 1d, or should we just kill/remove it? I am sorry, I have more questions than good proposals at the moment, but in general I like the proposal: we should try to avoid keeping files where they are not 100% needed.
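To make the open questions concrete, here is a minimal sketch of what a bounded retry policy on the schedd could look like. The attempt limit, backoff, URL, and function name are assumptions for illustration, not an existing CRABServer implementation:

```python
import time
import urllib.error
import urllib.request

MAX_ATTEMPTS = 5        # assumption: give up after 5 tries
BACKOFF_SECONDS = 60    # assumption: linear backoff between tries

def fetch_sandbox(sandbox_url, destination="sandbox.tar.gz"):
    """Download the user sandbox from S3, retrying on HTTP/network errors."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            urllib.request.urlretrieve(sandbox_url, destination)
            return destination
        except urllib.error.URLError as exc:
            if attempt == MAX_ATTEMPTS:
                # here the bootstrap would have to decide: abort the task
                # (so crab status reports "failed to bootstrap") or put the
                # dagman on hold; that policy choice is exactly the open question
                raise RuntimeError(
                    "sandbox download failed after %d attempts: %s" % (attempt, exc)
                )
            time.sleep(BACKOFF_SECONDS * attempt)
```

Whatever values we pick, the key design choice is whether the final failure aborts the task (user resubmits) or holds the dagman for manual intervention.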
There is no clear-cut answer to those (valid) concerns. But keep in mind that the dagman bootstrap process already makes a few calls to CRAB REST, so nothing new on that side; talking with S3 will be a new dependency. I do not know why the original developers decided to send the sandbox via condor file spooling, but it is possible that in the original implementation there was no communication from the scheduler to CRAB REST. We have a lot of such situations: original decisions stuck around even when the original motivations were no longer valid, different people took different decisions for similar things, something was done to mitigate the then-problem-of-the-day which eventually got solved otherwise, etc. Very few of those decisions have been documented. So we are free to take the decision we think is best, and we will have to live with the consequences. Note on failures in bootstrap: currently, if something goes wrong in the dagman bootstrap and related steps (e.g. it cannot talk to REST), everything is aborted, crab status reports "task failed to bootstrap", and the user submits again. If things go horribly wrong, the bootstrap does not manage to abort and the task simply gets stuck in "waiting to bootstrap".
This happened again yesterday: https://mattermost.web.cern.ch/cms-o-and-c/pl/9m9bcm3dnfbrj8h8i7t9id45no
Let's start by keeping
I am looking at this issue today.
Once #6544 is done, the user sandbox will no longer be downloaded to the TW tmp disk. No further action is needed.
Action item of the postmortem "CRAB TaskWorker went down after too many task submissions".
Maybe there was a special reason in the past, but currently we may simply download sandbox.tar.gz from S3 in the scheduler, e.g. as part of https://github.com/dmwm/CRABServer/blob/master/scripts/AdjustSites.py

Problem:

Things which we can do:
- debug subdirectory

Caveats:
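As a rough illustration of the proposal (not the actual AdjustSites.py code), the bootstrap step on the scheduler could fetch the tarball along these lines; the presigned/REST-cache URL and the function name are assumptions:

```python
import shutil
import urllib.request

def download_sandbox_on_schedd(sandbox_url, dest="sandbox.tar.gz"):
    """Stream sandbox.tar.gz from S3 into the task spool directory on the schedd."""
    # sandbox_url would be an S3 (or CRAB REST cache) GET URL for this task's sandbox
    with urllib.request.urlopen(sandbox_url, timeout=300) as response:
        with open(dest, "wb") as out:
            shutil.copyfileobj(response, out)
    return dest
```

This would replace shipping the sandbox via condor file spooling, so the file only ever lives in S3 and in the schedd spool directory where it is actually needed.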