
change how we handle user sandbox #7461

Closed
belforte opened this issue Nov 17, 2022 · 6 comments

@belforte
Member

belforte commented Nov 17, 2022

Action item from the postmortem "CRAB TaskWorker went down after too many task submissions".

Maybe there was a special reason in the past, but currently we could simply download sandbox.tar.gz from S3 in the scheduler, e.g. as part of https://github.com/dmwm/CRABServer/blob/master/scripts/AdjustSites.py
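
For illustration, a minimal sketch of what such a download could look like in the scheduler; the presigned URL, the helper name, and the local paths are assumptions for this sketch, not the actual CRABServer API:

```python
import tarfile
import requests  # assumes requests is available on the schedd

def downloadSandbox(presignedUrl, dest='sandbox.tar.gz'):
    """Fetch the user sandbox from S3 and unpack it into the spool directory."""
    with requests.get(presignedUrl, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest, 'wb') as fh:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                fh.write(chunk)
    with tarfile.open(dest) as tar:
        tar.extractall()
```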

Problem:

  • user sandboxes use a lot of disk on the TW and may quickly accumulate in case of repeated submissions.
  • architecturally, the TW has no need for the sandbox content, so why should it touch it at all?
  • spooling sandboxes during DAG submission adds load to HTCondor and makes operations take more time.

Things which we can do:

  • remove the sandbox from the TW disk once the submission process is over (even if it failed, a new crab submit will create a new tmp directory anyhow). A partial solution; see the sketch after this list.
  • go all the way and download only in the scheduler. The scheduler code will also have to take care of the debug subdirectory.
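
A minimal sketch of the partial solution; the submission step and the tmp-directory layout are assumptions for illustration:

```python
import shutil

def submitAndCleanup(submitFunc, task, tmpDir):
    """Run the submission step, then always drop the per-task tmp directory."""
    try:
        submitFunc(task)
    finally:
        # even if the submission failed, a new 'crab submit' creates a
        # fresh tmp directory, so removing this one is always safe
        shutil.rmtree(tmpDir, ignore_errors=True)
```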

Caveats:

@mapellidario
Member

Wa and I had a quick chat about this. We had a simple concern: how should we deal with HTTP errors when downloading sandboxes from S3 directly into the schedd? With this option we would have a task that is properly submitted, with a running dagman_bootstrap, but that fails to retrieve the sandbox and cannot submit jobs to the vanilla universe. How long do we keep trying to download the sandbox? How many attempts do we make? Should we put the dagman on hold if it fails for more than 1 day? Or should we just kill/remove it?

I am sorry, I have more questions than good proposals at the moment, but in general I like the idea: we should try to avoid keeping files where they are not 100% needed.
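
For concreteness, one possible retry policy, purely illustrative: a few attempts with exponential backoff, after which we give up and let the bootstrap abort the task:

```python
import time
import requests

def fetchWithRetries(url, dest, attempts=5, firstDelay=30):
    """Download 'url' to 'dest', retrying transient HTTP errors a few times."""
    delay = firstDelay
    for attempt in range(1, attempts + 1):
        try:
            resp = requests.get(url, timeout=60)
            resp.raise_for_status()
            with open(dest, 'wb') as fh:
                fh.write(resp.content)
            return
        except requests.RequestException:
            if attempt == attempts:
                raise  # let the dagman bootstrap abort the task
            time.sleep(delay)
            delay *= 2  # exponential backoff between attempts
```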

@belforte
Member Author

belforte commented Nov 17, 2022

There is no clear-cut answer to those (valid) concerns. But keep in mind that the dagman bootstrap process already makes a few calls to the CRAB REST interface, so nothing is new on that side; talking with S3 will be a new dependency. I do not know why the original developers decided to send the sandbox via condor file spooling, but it is possible that in the original implementation there was no communication from the scheduler to the CRAB REST interface. We have a lot of such situations: original decisions stuck around even after the original motivations stopped being valid, different people took different decisions for similar things, something was done to mitigate the then-problem-of-the-day which eventually got solved otherwise, etc. Very few of those decisions have been documented.

So we are free to take the decision which we think is best, and will have to live with the consequences.

Note on failures in bootstrap: currently, if something goes wrong in the dagman bootstrap & co. (e.g. it can't talk with the REST), everything is aborted, crab status will report "task failed to bootstrap", and the user submits again. If things go horribly wrong, the bootstrap does not manage to abort, the task simply gets stuck in "waiting to bootstrap", and the crab status command will print "If this persists report it to ..computingTools…"

@novicecpp
Contributor

@belforte
Member Author

belforte commented Jul 5, 2024

Let's start by keeping the tmp directory for a shorter time: #8542
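
In the spirit of that issue, a hypothetical age-based sweep; the directory layout and retention period are assumptions, not what #8542 actually implements:

```python
import os
import shutil
import time

def sweepOldTmpDirs(baseDir, maxAgeHours=24):
    """Delete task tmp directories under baseDir older than maxAgeHours."""
    cutoff = time.time() - maxAgeHours * 3600
    for name in os.listdir(baseDir):
        path = os.path.join(baseDir, name)
        if os.path.isdir(path) and os.path.getmtime(path) < cutoff:
            shutil.rmtree(path, ignore_errors=True)
```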

@novicecpp
Contributor

I am looking at this issue today.

  • crab submit --dryrun breaks because InputFiles.tar.gz does not contain sandbox.tar.gz anymore. But we can ignore it, because we want to remove it anyway.
  • Need to check crab preparelocal.

@belforte
Member Author

Once #6544 is done, the user sandbox will not be downloaded to the TW tmp disk any more.
The original problem will be gone.

No further action is needed.
